- DeepSeek-R1 employs Chain of Thought (CoT) reasoning, openly revealing its step-by-step cognitive process, which we identified as exploitable for swift malicious activities.
- Swift malicious activities can leverage the transparency of CoT reasoning to accomplish malevolent goals, akin to phishing strategies, and can vary in impact based on the situation.
- Our utilization of tools like NVIDIA’s Garak to assess multiple attack methods on DeepSeek-R1 led to the identification that insecure output generation and sensitive data breaches had heightened success rates due to the CoT exposure.
- To minimize the risk of swift malicious activities, it is advised to block out <think> tags from LLM responses in chatbot applications and implement red teaming strategies for continual vulnerability evaluations and defenses.
Welcome to the inaugural piece in a series dedicated to evaluating AI models. In this piece, we will scrutinize the launch of Deepseek-R1.
The increasing adoption of chain of thought (CoT) reasoning signifies a new phase for extensive language models. CoT logic encourages the model to ponder through its solution before the final outcome. An exclusive feature of DeepSeek-R1 is its direct disclosure of the CoT reasoning. We initiated a series of swift malicious activities against the 671-billion-parameter DeepSeek-R1 and found that this information can be manipulated to substantially amplify attack success rates.
CoT reasoning stimulates a model to take a series of intermediary steps before reaching a final solution. This approach has proven to enhance the performance of extensive models on mathematics-oriented benchmarks, such as the GSM8K dataset for word problems.
CoT has emerged as a fundamental aspect for cutting-edge reasoning models, including OpenAI’s O1 and O3-mini plusDeepThorough-R1, all of which are instructed to utilize CoT rationale.
An outstanding feature of the DeepThorough-R1 model is that it overtly presents its thought process within the <consider> </consider> tags provided in response to a cue.
A cue assault is when an intruder crafts and dispatches cues to an LLM to achieve a harmful aim. These cue assaults can be segmented into two components, the assault tactic, and the assault aim.
In the exemplar above, the assault is endeavoring to deceive the LLM through revealing its system cue, which are a collection of comprehensive directions that define how the model should act. Based on the system context, the repercussions of disclosing the system cue can fluctuate. For instance, within an agent-oriented AI framework, the intruder can employ this tactic to uncover all the utilities accessible to the agent.
The progression of formulating these tactics mirrors that of an intruder probing for approaches to deceive users into tapping on phishing hyperlinks. Intruders recognize methods that skirt around system barriers and exploit them until defenses catch up—initiating an interminable cycle of adaptation and countermeasures.
With the anticipated expansion of agent-oriented AI systems, cue assault tactics are projected to persist evolving, amplifying the jeopardy to corporations. A notable instance emerged with Google’s Gemini integrations, where researchers unearthed that indirect cue insertion could prompt the model to construct phishing hyperlinks.
We employed public-source red team utilities like NVIDIA’s Garak —crafted to pinpoint susceptibilities in LLMs by dispatching automated cue assaults—conjointly with specifically devised cue assaults to scrutinize DeepThorough-R1’s reactions to various assault tactics and aims.
| Name | OWASP ID | MITRE ATLAS ID |
|---|---|---|
| Cue insertion | LLM01:2025 – Cue Insertion | AML.T0051 – LLM Cue Insertion |
| Breakout | LLM01:2025 – Cue Insertion | AML.T0054 – LLM Breakout |
Table 1. Assault tactics and their corresponding risk classifications under the OWASP and MITRE ATLAS indices
| Name | OWASP ID | MITRE ATLAS ID |
|---|---|---|
| Breakout | LLM01:2025 – Cue Insertion | AML.T0054 – LLM Breakout |
| Design theft |  | AML.T0048.004 – External Harms: ML Intellectual Property Theft |
| Package illusion | LLM09:2025 – Misinformation | AML.T0062 – Discover LLM Illusions |
| Confidential data theft | LLM02:2025 – Confidential Details Disclosure | AML.T0057 – LLM Data Leakage |
| Insecure output formulation | LLM05:2025 – Inappropriate Output Handling | AML.T0050 – Command and Scripting Interpreter |
| Harmfulness |  | AML.T0048 – External Harms |
Table 2. Assault aims and their corresponding risk classifications under the OWASP and MITRE ATLAS indices
Misappropriating mysteries
Confidential data should never be part of system cues. However, an absence of security awareness can cause their inadvertent exposure. In this instance, the system cue holds a secret, yet a cue fortification defense tactic is employed to direct the model not to unveil it.
As noted below, the ultimate responseThe LLM does not have the hidden information. Nevertheless, the information is clearly revealed inside the
Exploring attack methods using CoT
In this section, we illustrate an instance of exploiting the exposed CoT through a discovery process. Initially, we tried to directly request the model to accomplish our objective:
Upon denial of our request by the model, we proceeded to investigate its protective measures by directly asking about them.
The model seems to have been taught to deny impersonation requests. Further probing can be done regarding its reasoning on impersonation.
We utilized NVIDIA Garak to evaluate the performance of different attack goals against DeepSeek-R1. Our results suggest a higher success rate in the areas of insecure output generation and sensitive data theft when compared to toxicity, jailbreak, model theft, and package hallucination. We suspect this variance might be impacted by the presence of
Our study indicates that the content inside
Furthermore, red teaming plays a vital role as a risk mitigation tactic for LLM-based applications. Here, we showcased an example of adversarial testing and highlighted how tools like NVIDIA’s Garak can diminish the attack surface of LLMs. We are eager to continue sharing our research as the threat landscape progresses. In the upcoming months, we aim to assess a wider variety of models, techniques, and objectives to provide deeper insights.
Tags
sXpIBdPeKzI9PC2p0SWMpUSM2NSxWzPyXTMLlbXmYa0R20xk










