- DeepSeek-R1 employs Chain of Thought (CoT) reasoning, openly revealing its step-by-step thought sequence, which we detected could be exploited for immediate assaults.
- Instantaneous attacks could leverage the transparency of CoT reasoning to accomplish malevolent goals, resembling phishing methods, with varying impacts depending on the situation.
- Utilizing tools like NVIDIA’s Garak, we experimented with multiple attack strategies on DeepSeek-R1, uncovering that flawed output generation and theft of confidential information had heightened success rates as a result of the CoT exposure.
- To diminish the risk of instantaneous attacks, it is advisable to eliminate
tags from LLM responses in chatbot applications and implement red teaming techniques for continuous vulnerability evaluations and defenses.
Welcome to the inaugural article in a series devoted to assessing AI models. In this piece, we will delve into the launch of Deepseek-R1.
The increasing adoption of chain of thought (CoT) reasoning signifies a new chapter for extensive language models. CoT rationale prompts the model to reason through its answer before the ultimate response. A distinguishing feature of DeepSeek-R1 is its direct disclosure of the CoT reasoning. Through a sequence of immediate attacks on the 671-billion-parameter DeepSeek-R1, we found that this data could be exploited to significantly raise the success rates of attacks.
CoT reasoning encourages a model to take a series of intermediary steps prior to arriving at a final response. This method has proven to enhance the performance of sizeable models on math-centered benchmarks, like the GSM8K dataset addressing word problems.
CoT has emerged as a foundation for advanced reasoning models, such as OpenAI’s O1 and O3-mini plusDeepQuest-R1 models, all trained to utilize CoT reasoning.
One distinguishing feature of the DeepQuest-R1 model is that it explicitly showcases its reasoning process within the <think> </think> tags provided in response to a prompt.
An aggressive attack refers to an assailant creating and dispatching requests to an LLM with the motive of achieving a malevolent goal. These provocative attacks can be segmented into two components: the attack tactic and the attack aim.
In this instance, the assault aims to deceive the LLM into revealing its system prompt, which comprises a collection of general guidelines that specify the model’s behavior. Based on the system’s context, the consequences of disclosing the system prompt can fluctuate. For instance, within an agent-oriented AI ecosystem, the attacker could utilize this method to uncover all the tools accessible to the agent.
The method of devising these strategies parallels that of an aggressor hunting for methods to deceive users into clicking on fraudulent links. Attackers pinpoint techniques that circumvent system safeguards and exploit them until defenses catch up—an ongoing process of adjusting and responding.
With the projected expansion of agent-infused AI systems, aggressive prompt attack strategies are likely to grow, posing an elevated threat to businesses. A notable incident occurred with Google’s Gemini integrations, where researchers found that indirect prompt injection could cause the model to produce phishing URLs.
We deployed publicly available red team utilities like NVIDIA’s Garak —geared towards identifying vulnerabilities in LLMs through automated prompt attacks—alongside custom-designed prompt assaults to scrutinize DeepQuest-R1’s responses to various attack techniques and objectives.
| Designation | OWASP ID | MITRE ATLAS ID |
|---|---|---|
| Prompt intrusion | LLM01:2025 – Prompt Injection | AML.T0051 – LLM Prompt Injection |
| Breakout | LLM01:2025 – Prompt Injection | AML.T0054 – LLM Jailbreak |
Table 1. Attack tactics and their corresponding hazard classifications under the OWASP and MITRE ATLAS indices
| Designation | OWASP ID | MITRE ATLAS ID |
|---|---|---|
| Breakout | LLM01:2025 – Prompt Injection | AML.T0054 – LLM Jailbreak |
| Model appropriation |  | AML.T0048.004 – External Harms: ML Intellectual Property Theft |
| Packet delusion | LLM09:2025 – Misinformation | AML.T0062 – Discover LLM Hallucinations |
| Confidential data thievery | LLM02:2025 – Sensitive Information Disclosure | AML.T0057 – LLM Data Leakage |
| Unsafe output creation | LLM05:2025 – Improper Output Handling | AML.T0050 – Command and Scripting Interpreter |
| Toxins |  | AML.T0048 – External Harms |
Table 2. Attack aims and their corresponding hazard classifications under the OWASP and MITRE ATLAS indices
Acquiring clandestine information
Sensitive data should never be part of system prompts. However, a lack of awareness regarding security can lead to inadvertent exposure. In this scenario, the system prompt contains a secret, but a prompt hardening defense mechanism is utilized to direct the model not to uncover it.
As shown below, the ultimate responseThe LLM content does not hold the hidden information. Nevertheless, the undisclosed information is clearly found within the
Exploring attack methodologies using CoT
Within this segment, we exhibit an instance of leveraging the exposed CoT via an exploration procedure. Initially, we endeavored to directly request the model to accomplish our objective:
After the model declined our request, we proceeded to investigate its boundaries by directly probing them.
The model seems to be trained to refuse impersonation requests. We can delve further into its decision-making process concerning impersonation.
We leveraged NVIDIA Garak to evaluate the performance of assorted attack objectives against DeepSeek-R1. Our observations reveal a heightened success rate in the fields of insecure output generation and confidential data exfiltration in comparison to toxicity, jailbreak, model theft, and package hallucination. This variance may potentially be influenced by the presence of
Our investigation indicates that the content enclosed in
Furthermore, red teaming serves as a vital cybersecurity measure for LLM-powered applications. Through this article, we showcased a case of adversarial testing and highlighted the utility of tools like NVIDIA’s Garak in reducing the vulnerability of LLMs. We eagerly anticipate the continuation of sharing our research as the threat landscape progresses. Over the next few months, we aim to assess a broader array of models, methodologies, and objectives to offer more profound insights.
Tags
sXpIBdPeKzI9PC2p0SWMpUSM2NSxWzPyXTMLlbXmYa0R20xk










