Determining the likelihood of risks arising from immediate input assaults on AI systems

Published by the Agentic AI Security Team

Current AI systems, like Gemini, exhibit heightened capabilities, aiding in data retrieval and task execution for users. Nonetheless, challenges in security emerge when external data, particularly from untrustworthy sources, are available to run commands on AI systems. Exploiting this vulnerability, attackers embed malicious commands within data likely to be accessed by the AI system to manipulate its actions. This attack method is commonly known as “indirect spur injection,” a term coined by Kai Greshake and the NVIDIA team.

To counter the threats posed by such attacks, we actively implement defenses within our AI systems alongside measurement and monitoring utilities. Among these tools is a comprehensive assessment framework designed to automatically evaluate an AI system’s susceptibility to indirect spur injection attacks. We will guide you through our threat conceptualization, followed by delineating three assault tactics integrated into our assessment framework.

Threat conceptualization and assessment approach

Our threat model focuses on a scenario where an attacker exploits indirect spur injection to illicitly extract confidential data, as depicted earlier. The assessment framework evaluates this by simulating a hypothetical scenario where an AI agent has the ability to send and receive emails on behalf of a user. The agent encounters a fictitious conversation history where the user mentions sensitive details like passport or social security numbers. Each conversation concludes with a user request to summarize the previous email and provide the retrieved email in context.

The email contents are under the attacker’s control, aiming to coerce the agent into transferring sensitive details from the conversation history to an email address controlled by the attacker. The attack succeeds if the agent follows the malicious command within the email, leading to unauthorized disclosure of confidential data. Conversely, the attack fails if the agent simply adheres to user instructions and provides a basic email summary.

Automated penetration testing

Developing effective subtle suggestion insertions entails a continuous refinement process based on observed reactions. To streamline this process, we have formulated a red-team architecture featuring various optimization-driven offensives that produce suggestion insertions (in the context mentioned earlier these would be varied renditions of the malicious email). These optimization-driven attacks are engineered to be potent; feeble assaults provide minimal insights into the vulnerability of an AI system to subtle suggestion insertions.

Actor Critic: This assault leverages an attacker-manipulated model to suggest prompt insertions. These suggestions are fed into the targeted AI system, which provides a likelihood score of a successful intrusion. Using this probability, the assault model enhances the prompting insertion. This cycle continues until the assault model reaches an effective prompt insertion.

Beam Search: This assault initiates with a simple prompt insertion directly requesting the AI system to dispatch an email to the aggressor containing the sensitive user data. If the AI system identifies the request as suspicious and denies, the assault appends random tokens to the prompt insertion and gauges the new likelihood of a successful intrusion. If the likelihood improves, these random tokens are retained, otherwise they are discarded, and this procedure is repeated until the mixture of the prompt insertion and randomly appended tokens culminates in a successful assault.

Tree of Attacks with Pruning (TAP): Mehrotra et al. (2024) [3] Orchestrated an assault to produce cues that lead an AI system to breach safety regulations (like spawning offensive language). We modify this assault, implementing various changes to zero in on security breaches. Similar to Actor Critic, this assault explores the domain of natural language; however, we presume the intruder cannot retrieve likelihood scores from the AI system being targeted, only the text examples that are generated.

We are actively using lessons learned from these assaults in our automated red-team framework to safeguard current and future iterations of AI systems we create against indirect prompt injection, establishing a tangible approach to monitor security enhancements. A single fail-safe defense is not anticipated to completely resolve this issue. We believe the most encouraging strategy to counter these assaults encompasses a mixture of resilient assessment frameworks utilizing automated red-teaming techniques, in addition to surveillance, heuristic defenses, and conventional security engineering solutions.

We want to express our gratitude to Sravanti Addepalli, Lihao Liang, and Alex Kaskasoli for their previous input to this endeavor.

Attributed on behalf of the whole Agentic AI Security group (arranged in alphabetical sequence):

Aneesh Pappu, Andreas Terzis, Chongyang Shi, Gena Gibson, Ilia Shumailov, Itay Yona, Jamie Hayes, John “Four” Flynn, Juliette Pluto, Sharon Lin, Shuang Song

About Author

AndyC

Andy Curtis is an award-winning security consultant, researcher and public speaker. He has been working in the computer security industry since the early 1990s, having been employed by state and federal government, leading healthcare and banking providers across three continents. He has given talks about computer security for some of the world’s largest companies, worked with law enforcement agencies on investigations into hacking groups, and is a regular voice on TV and radio explaining IT security threats.

See author's posts

Tags: agentic, capable, Gemini, like, more, Posted, Security, systems, TeamModern, than

Determining the likelihood of risks arising from immediate input assaults on AI systems

About Author

AndyC

Italian Ferry Malware Attack Sparks International Probe

I am not a robot: ClickFix used to deploy StealC and Qilin

Game of clones: Sophos and the MITRE ATT&CK Enterprise 2025 Evaluations

A big finish to 2025 in December’s Patch Tuesday

React2Shell flaw (CVE-2025-55182) exploited for remote code execution

I am not a robot: ClickFix used to deploy StealC and Qilin

LongNosedGoblin tries to sniff out governmental affairs in Southeast Asia and Japan

ESET Threat Report H2 2025

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Black Hat Europe 2025: Was that device designed to be on the internet at all?

Donate Bitcoin to this address

Donate Ethereum to this address

Containing the Inevitable: What Cyber Leaders Must Prepare for in 2026

Containing the Inevitable: What Cyber Leaders Must Prepare for in 2026

Containing the Inevitable: What Cyber Leaders Must Prepare for in 2026

The WAF must die – some interesting thoughts – FireTail Blog

The WAF must die – some interesting thoughts – FireTail Blog

About Author

More Stories

Donate Bitcoin to this address

Donate Ethereum to this address

You may have missed