Harnessing the Potential of DeepSeek-R1: Unraveling Security Vulnerabilities in Chain of Thought

Harnessing the Potential of DeepSeek-R1: Unraveling Security Vulnerabilities in Chain of Thought | Trend Micro (US)

Your Folio Has Received New Content

Cyber Risk

This article delves into the Chain of Thought logic within the AI model DeepSeek-R1, exploring its susceptibility to instantaneous attacks, flawed output production, and theft of confidential information.

By: Trent Holmes, Willem Gooderham

March 04, 2025

Time to Read: ( words)

DeepSeek-R1 employs Chain of Thought (CoT) reasoning, openly revealing its step-by-step thought sequence, which we detected could be exploited for immediate assaults.
Instantaneous attacks could leverage the transparency of CoT reasoning to accomplish malevolent goals, resembling phishing methods, with varying impacts depending on the situation.
Utilizing tools like NVIDIA’s Garak, we experimented with multiple attack strategies on DeepSeek-R1, uncovering that flawed output generation and theft of confidential information had heightened success rates as a result of the CoT exposure.
To diminish the risk of instantaneous attacks, it is advisable to eliminate tags from LLM responses in chatbot applications and implement red teaming techniques for continuous vulnerability evaluations and defenses.

Welcome to the inaugural article in a series devoted to assessing AI models. In this piece, we will delve into the launch of Deepseek-R1.

The increasing adoption of chain of thought (CoT) reasoning signifies a new chapter for extensive language models. CoT rationale prompts the model to reason through its answer before the ultimate response. A distinguishing feature of DeepSeek-R1 is its direct disclosure of the CoT reasoning. Through a sequence of immediate attacks on the 671-billion-parameter DeepSeek-R1, we found that this data could be exploited to significantly raise the success rates of attacks.

CoT reasoning encourages a model to take a series of intermediary steps prior to arriving at a final response. This method has proven to enhance the performance of sizeable models on math-centered benchmarks, like the GSM8K dataset addressing word problems.

CoT has emerged as a foundation for advanced reasoning models, such as OpenAI’s O1 and O3-mini plusDeepQuest-R1 models, all trained to utilize CoT reasoning.

One distinguishing feature of the DeepQuest-R1 model is that it explicitly showcases its reasoning process within the <think> </think> tags provided in response to a prompt.

Figure 1. DeepQuest-R1 illustrating its reasoning process

An aggressive attack refers to an assailant creating and dispatching requests to an LLM with the motive of achieving a malevolent goal. These provocative attacks can be segmented into two components: the attack tactic and the attack aim.

Figure 2. Fooling the LLM to expose its system prompt

In this instance, the assault aims to deceive the LLM into revealing its system prompt, which comprises a collection of general guidelines that specify the model’s behavior. Based on the system’s context, the consequences of disclosing the system prompt can fluctuate. For instance, within an agent-oriented AI ecosystem, the attacker could utilize this method to uncover all the tools accessible to the agent.

Figure 3. An example AI model’s system prompt

The method of devising these strategies parallels that of an aggressor hunting for methods to deceive users into clicking on fraudulent links. Attackers pinpoint techniques that circumvent system safeguards and exploit them until defenses catch up—an ongoing process of adjusting and responding.

With the projected expansion of agent-infused AI systems, aggressive prompt attack strategies are likely to grow, posing an elevated threat to businesses. A notable incident occurred with Google’s Gemini integrations, where researchers found that indirect prompt injection could cause the model to produce phishing URLs.

We deployed publicly available red team utilities like NVIDIA’s Garak —geared towards identifying vulnerabilities in LLMs through automated prompt attacks—alongside custom-designed prompt assaults to scrutinize DeepQuest-R1’s responses to various attack techniques and objectives.

Figure 4. Objectives of attacks and the maneuvers executed against DeepQuest-R1

Designation	OWASP ID	MITRE ATLAS ID
Prompt intrusion	LLM01:2025 – Prompt Injection	AML.T0051 – LLM Prompt Injection
Breakout	LLM01:2025 – Prompt Injection	AML.T0054 – LLM Jailbreak

^{Table 1. Attack tactics and their corresponding hazard classifications under the OWASP and MITRE ATLAS indices}

Designation	OWASP ID	MITRE ATLAS ID
Breakout	LLM01:2025 – Prompt Injection	AML.T0054 – LLM Jailbreak
Model appropriation		AML.T0048.004 – External Harms: ML Intellectual Property Theft
Packet delusion	LLM09:2025 – Misinformation	AML.T0062 – Discover LLM Hallucinations
Confidential data thievery	LLM02:2025 – Sensitive Information Disclosure	AML.T0057 – LLM Data Leakage
Unsafe output creation	LLM05:2025 – Improper Output Handling	AML.T0050 – Command and Scripting Interpreter
Toxins		AML.T0048 – External Harms

^{Table 2. Attack aims and their corresponding hazard classifications under the OWASP and MITRE ATLAS indices}

Acquiring clandestine information

Sensitive data should never be part of system prompts. However, a lack of awareness regarding security can lead to inadvertent exposure. In this scenario, the system prompt contains a secret, but a prompt hardening defense mechanism is utilized to direct the model not to uncover it.

As shown below, the ultimate responseThe LLM content does not hold the hidden information. Nevertheless, the undisclosed information is clearly found within the tags, although there is no specific query requesting it from the user. The model searches for relevant context in all its available data in order to understand the user query effectively. As a result, the model utilizes the API specification to construct the necessary HTTP request to respond to the user’s query. This inadvertently leads to the inclusion of the API key from the system prompt in its train of thought.

Figure 5. A hidden revelation in DeepSeek-R1's CoT (click to enlarge) — Figure 5. A hidden revelation in DeepSeek-R1’s CoT (click to enlarge)

Exploring attack methodologies using CoT

Within this segment, we exhibit an instance of leveraging the exposed CoT via an exploration procedure. Initially, we endeavored to directly request the model to accomplish our objective:

Figure 6. Direct inquiry for sensitive intelligence from the model

After the model declined our request, we proceeded to investigate its boundaries by directly probing them.

Figure 7. Inquiry about the model's boundaries — Figure 7. Inquiry about the model’s boundaries

The model seems to be trained to refuse impersonation requests. We can delve further into its decision-making process concerning impersonation.

Figure 8. Uncovering a flaw in the model’s logic

Figure 9. The infiltration scenario (click to enlarge)

We leveraged NVIDIA Garak to evaluate the performance of assorted attack objectives against DeepSeek-R1. Our observations reveal a heightened success rate in the fields of insecure output generation and confidential data exfiltration in comparison to toxicity, jailbreak, model theft, and package hallucination. This variance may potentially be influenced by the presence of tags in the model’s replies. Additional investigation is necessary to validate this claim, and we intend to present our discoveries in the future.

Figure 10. Garak's attack success rate segmented by attack objective — Figure 10. Garak’s attack success rate segmented by attack objective

Our investigation indicates that the content enclosed in tags within the model responses can provide valuable insights for malicious entities. Disclosing the model’s CoT raises the likelihood of threat actors identifying and refining prompt exploits to accomplish malevolent purposes. To mitigate this, we suggest filtering tags from model responses in conversational AI applications.

Furthermore, red teaming serves as a vital cybersecurity measure for LLM-powered applications. Through this article, we showcased a case of adversarial testing and highlighted the utility of tools like NVIDIA’s Garak in reducing the vulnerability of LLMs. We eagerly anticipate the continuation of sharing our research as the threat landscape progresses. Over the next few months, we aim to assess a broader array of models, methodologies, and objectives to offer more profound insights.

About Author

AndyC

Andy Curtis is an award-winning security consultant, researcher and public speaker. He has been working in the computer security industry since the early 1990s, having been employed by state and federal government, leading healthcare and banking providers across three continents. He has given talks about computer security for some of the world’s largest companies, worked with law enforcement agencies on investigations into hacking groups, and is a regular voice on TV and radio explaining IT security threats.

See author's posts

Tags: breaking, chain, DeepseekR1, down, exploiting, Thought, Trend Micro Research : Articles, News, Reports, Trend Micro Research : Artificial Intelligence (AI), Trend Micro Research : Cyber Risk, Trend Micro Research : Cyber Threats, Trend Micro Research : Research

Harnessing the Potential of DeepSeek-R1: Unraveling Security Vulnerabilities in Chain of Thought

Acquiring clandestine information

Exploring attack methodologies using CoT

About Author

AndyC

Italian Ferry Malware Attack Sparks International Probe

I am not a robot: ClickFix used to deploy StealC and Qilin

Game of clones: Sophos and the MITRE ATT&CK Enterprise 2025 Evaluations

A big finish to 2025 in December’s Patch Tuesday

React2Shell flaw (CVE-2025-55182) exploited for remote code execution

I am not a robot: ClickFix used to deploy StealC and Qilin

LongNosedGoblin tries to sniff out governmental affairs in Southeast Asia and Japan

ESET Threat Report H2 2025

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Donate Bitcoin to this address

Donate Ethereum to this address

ASRock, ASUS, GIGABYTE, MSI Boards vulnerable to pre-boot memory attacks

China-linked APT UAT-9686 is targeting Cisco Secure Email Gateway and Secure Email and Web Manager

How should Your Business Deal with Email Impersonation Attacks in 2025?

Best Vulnerability Scanning Tool for 2026- Top 10 List

WatchGuard Warns of Active Exploitation of Critical Fireware OS VPN Vulnerability

About Author

More Stories

Donate Bitcoin to this address

Donate Ethereum to this address

You may have missed