Exploitation of DeepSeek-R1: Fragmenting Chain of Thought Security

Exploitation of DeepSeek-R1: Fragmenting Chain of Thought Security | Trend Micro (US)

Content has been added to your Folio

Cyber Threats

In this piece, we delve into how security vulnerabilities can be exposed in the Chain of Thought logic within the AI model DeepSeek-R1, leading to potential attacks, compromised output, and data breaches.

By: Trent Holmes, Willem Gooderham

March 04, 2025

Read time: ( words)

DeepSeek-R1 employs Chain of Thought (CoT) reasoning, openly revealing its step-by-step cognitive process, which we identified as exploitable for swift malicious activities.
Swift malicious activities can leverage the transparency of CoT reasoning to accomplish malevolent goals, akin to phishing strategies, and can vary in impact based on the situation.
Our utilization of tools like NVIDIA’s Garak to assess multiple attack methods on DeepSeek-R1 led to the identification that insecure output generation and sensitive data breaches had heightened success rates due to the CoT exposure.
To minimize the risk of swift malicious activities, it is advised to block out <think> tags from LLM responses in chatbot applications and implement red teaming strategies for continual vulnerability evaluations and defenses.

Welcome to the inaugural piece in a series dedicated to evaluating AI models. In this piece, we will scrutinize the launch of Deepseek-R1.

The increasing adoption of chain of thought (CoT) reasoning signifies a new phase for extensive language models. CoT logic encourages the model to ponder through its solution before the final outcome. An exclusive feature of DeepSeek-R1 is its direct disclosure of the CoT reasoning. We initiated a series of swift malicious activities against the 671-billion-parameter DeepSeek-R1 and found that this information can be manipulated to substantially amplify attack success rates.

CoT reasoning stimulates a model to take a series of intermediary steps before reaching a final solution. This approach has proven to enhance the performance of extensive models on mathematics-oriented benchmarks, such as the GSM8K dataset for word problems.

CoT has emerged as a fundamental aspect for cutting-edge reasoning models, including OpenAI’s O1 and O3-mini plusDeepThorough-R1, all of which are instructed to utilize CoT rationale.

An outstanding feature of the DeepThorough-R1 model is that it overtly presents its thought process within the <consider> </consider> tags provided in response to a cue.

Figure 1. DeepThorough-R1 revealing its reasoning process

A cue assault is when an intruder crafts and dispatches cues to an LLM to achieve a harmful aim. These cue assaults can be segmented into two components, the assault tactic, and the assault aim.

Figure 2. Tricking the LLM through revealing its system cue

In the exemplar above, the assault is endeavoring to deceive the LLM through revealing its system cue, which are a collection of comprehensive directions that define how the model should act. Based on the system context, the repercussions of disclosing the system cue can fluctuate. For instance, within an agent-oriented AI framework, the intruder can employ this tactic to uncover all the utilities accessible to the agent.

Figure 3. An example AI model’s system cue

The progression of formulating these tactics mirrors that of an intruder probing for approaches to deceive users into tapping on phishing hyperlinks. Intruders recognize methods that skirt around system barriers and exploit them until defenses catch up—initiating an interminable cycle of adaptation and countermeasures.

With the anticipated expansion of agent-oriented AI systems, cue assault tactics are projected to persist evolving, amplifying the jeopardy to corporations. A notable instance emerged with Google’s Gemini integrations, where researchers unearthed that indirect cue insertion could prompt the model to construct phishing hyperlinks.

We employed public-source red team utilities like NVIDIA’s Garak —crafted to pinpoint susceptibilities in LLMs by dispatching automated cue assaults—conjointly with specifically devised cue assaults to scrutinize DeepThorough-R1’s reactions to various assault tactics and aims.

Figure 4. Assault aims and the tactics executed against DeepThorough-R1

Name	OWASP ID	MITRE ATLAS ID
Cue insertion	LLM01:2025 – Cue Insertion	AML.T0051 – LLM Cue Insertion
Breakout	LLM01:2025 – Cue Insertion	AML.T0054 – LLM Breakout

^{Table 1. Assault tactics and their corresponding risk classifications under the OWASP and MITRE ATLAS indices}

Name	OWASP ID	MITRE ATLAS ID
Breakout	LLM01:2025 – Cue Insertion	AML.T0054 – LLM Breakout
Design theft		AML.T0048.004 – External Harms: ML Intellectual Property Theft
Package illusion	LLM09:2025 – Misinformation	AML.T0062 – Discover LLM Illusions
Confidential data theft	LLM02:2025 – Confidential Details Disclosure	AML.T0057 – LLM Data Leakage
Insecure output formulation	LLM05:2025 – Inappropriate Output Handling	AML.T0050 – Command and Scripting Interpreter
Harmfulness		AML.T0048 – External Harms

^{Table 2. Assault aims and their corresponding risk classifications under the OWASP and MITRE ATLAS indices}

Misappropriating mysteries

Confidential data should never be part of system cues. However, an absence of security awareness can cause their inadvertent exposure. In this instance, the system cue holds a secret, yet a cue fortification defense tactic is employed to direct the model not to unveil it.

As noted below, the ultimate responseThe LLM does not have the hidden information. Nevertheless, the information is clearly revealed inside the tags, even if the user query does not request it. To respond to the inquiry, the model reviews all available details to comprehend the user query effectively. Consequently, this leads to the model utilizing the API specification to construct the HTTP request necessary to answer the user’s query. This inadvertently results in the API key from the system query being integrated into its chain of thought.

Figure 5. A hidden information being exposed in DeepSeek-R1's CoT (click the image to enlarge) — Figure 5. A hidden information being exposed in DeepSeek-R1’s CoT (click the image to enlarge)

Exploring attack methods using CoT

In this section, we illustrate an instance of exploiting the exposed CoT through a discovery process. Initially, we tried to directly request the model to accomplish our objective:

Figure 6. Directly requesting the model for sensitive details

Upon denial of our request by the model, we proceeded to investigate its protective measures by directly asking about them.

Figure 7. Inquiring the model about its protective measures

The model seems to have been taught to deny impersonation requests. Further probing can be done regarding its reasoning on impersonation.

Figure 8. Identifying a flaw in the model’s logic

Figure 9. The potential attack scenario (click the image to enlarge)

We utilized NVIDIA Garak to evaluate the performance of different attack goals against DeepSeek-R1. Our results suggest a higher success rate in the areas of insecure output generation and sensitive data theft when compared to toxicity, jailbreak, model theft, and package hallucination. We suspect this variance might be impacted by the presence of tags in the model’s responses. Nonetheless, additional investigation is required for validation, and we intend to present our discoveries in the future.

Figure 10. Garak success rate per attack objective

Our study indicates that the content inside tags in the model responses can hold valuable insights for attackers. Revealing the model’s CoT raises the risk of threat actors identifying and refining prompt attacks to accomplish harmful goals. To tackle this, we propose filtering tags from model responses in chatbot applications.

Furthermore, red teaming plays a vital role as a risk mitigation tactic for LLM-based applications. Here, we showcased an example of adversarial testing and highlighted how tools like NVIDIA’s Garak can diminish the attack surface of LLMs. We are eager to continue sharing our research as the threat landscape progresses. In the upcoming months, we aim to assess a wider variety of models, techniques, and objectives to provide deeper insights.

About Author

AndyC

Andy Curtis is an award-winning security consultant, researcher and public speaker. He has been working in the computer security industry since the early 1990s, having been employed by state and federal government, leading healthcare and banking providers across three continents. He has given talks about computer security for some of the world’s largest companies, worked with law enforcement agencies on investigations into hacking groups, and is a regular voice on TV and radio explaining IT security threats.

See author's posts

Tags: breaking, chain, DeepseekR1, down, exploiting, Thought, Trend Micro Research : Articles, News, Reports, Trend Micro Research : Artificial Intelligence (AI), Trend Micro Research : Cyber Risk, Trend Micro Research : Cyber Threats, Trend Micro Research : Research

Exploitation of DeepSeek-R1: Fragmenting Chain of Thought Security

Misappropriating mysteries

Exploring attack methods using CoT

About Author

AndyC

I am not a robot: ClickFix used to deploy StealC and Qilin

Game of clones: Sophos and the MITRE ATT&CK Enterprise 2025 Evaluations

A big finish to 2025 in December’s Patch Tuesday

React2Shell flaw (CVE-2025-55182) exploited for remote code execution

I am not a robot: ClickFix used to deploy StealC and Qilin

Microsoft December Update Breaks Critical IIS Servers

LongNosedGoblin tries to sniff out governmental affairs in Southeast Asia and Japan

ESET Threat Report H2 2025

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Donate Bitcoin to this address

Donate Ethereum to this address

Chinese Hackers Exploited a Zero-Day in Cisco Email Security Systems

Risk Management in Banking: Leveraging AI and Advanced Analytics

LongNosedGoblin tries to sniff out governmental affairs in Southeast Asia and Japan

Hewlett Packard Enterprise (HPE) fixed maximum severity OneView flaw

RegScale Open Sources OSCAL Hub to Further Compliance-as-Code Adoption

About Author

More Stories

Donate Bitcoin to this address

Donate Ethereum to this address

You may have missed