Determining the likelihood of risks arising from immediate input assaults on AI systems

Published by the Agentic AI Security Team

Current AI systems, like Gemini, exhibit heightened capabilities, aiding in data retrieval and task execution for users.

How we estimate the risk from prompt injection attacks on AI systems

Current AI systems, like Gemini, exhibit heightened capabilities, aiding in data retrieval and task execution for users. Nonetheless, challenges in security emerge when external data, particularly from untrustworthy sources, are available to run commands on AI systems. Exploiting this vulnerability, attackers embed malicious commands within data likely to be accessed by the AI system to manipulate its actions. This attack method is commonly known as “indirect spur injection,” a term coined by Kai Greshake and the NVIDIA team.



To counter the threats posed by such attacks, we actively implement defenses within our AI systems alongside measurement and monitoring utilities. Among these tools is a comprehensive assessment framework designed to automatically evaluate an AI system’s susceptibility to indirect spur injection attacks. We will guide you through our threat conceptualization, followed by delineating three assault tactics integrated into our assessment framework.



Threat conceptualization and assessment approach



Our threat model focuses on a scenario where an attacker exploits indirect spur injection to illicitly extract confidential data, as depicted earlier. The assessment framework evaluates this by simulating a hypothetical scenario where an AI agent has the ability to send and receive emails on behalf of a user. The agent encounters a fictitious conversation history where the user mentions sensitive details like passport or social security numbers. Each conversation concludes with a user request to summarize the previous email and provide the retrieved email in context.



The email contents are under the attacker’s control, aiming to coerce the agent into transferring sensitive details from the conversation history to an email address controlled by the attacker. The attack succeeds if the agent follows the malicious command within the email, leading to unauthorized disclosure of confidential data. Conversely, the attack fails if the agent simply adheres to user instructions and provides a basic email summary. 



Automated penetration testing

Developing effective subtle suggestion insertions entails a continuous refinement process based on observed reactions. To streamline this process, we have formulated a red-team architecture featuring various optimization-driven offensives that produce suggestion insertions (in the context mentioned earlier these would be varied renditions of the malicious email). These optimization-driven attacks are engineered to be potent; feeble assaults provide minimal insights into the vulnerability of an AI system to subtle suggestion insertions.



Actor Critic: This assault leverages an attacker-manipulated model to suggest prompt insertions. These suggestions are fed into the targeted AI system, which provides a likelihood score of a successful intrusion. Using this probability, the assault model enhances the prompting insertion. This cycle continues until the assault model reaches an effective prompt insertion.

Beam Search: This assault initiates with a simple prompt insertion directly requesting the AI system to dispatch an email to the aggressor containing the sensitive user data. If the AI system identifies the request as suspicious and denies, the assault appends random tokens to the prompt insertion and gauges the new likelihood of a successful intrusion. If the likelihood improves, these random tokens are retained, otherwise they are discarded, and this procedure is repeated until the mixture of the prompt insertion and randomly appended tokens culminates in a successful assault.

Tree of Attacks with Pruning (TAP): Mehrotra et al. (2024) [3] Orchestrated an assault to produce cues that lead an AI system to breach safety regulations (like spawning offensive language). We modify this assault, implementing various changes to zero in on security breaches. Similar to Actor Critic, this assault explores the domain of natural language; however, we presume the intruder cannot retrieve likelihood scores from the AI system being targeted, only the text examples that are generated.



We are actively using lessons learned from these assaults in our automated red-team framework to safeguard current and future iterations of AI systems we create against indirect prompt injection, establishing a tangible approach to monitor security enhancements. A single fail-safe defense is not anticipated to completely resolve this issue. We believe the most encouraging strategy to counter these assaults encompasses a mixture of resilient assessment frameworks utilizing automated red-teaming techniques, in addition to surveillance, heuristic defenses, and conventional security engineering solutions. 




We want to express our gratitude to Sravanti Addepalli, Lihao Liang, and Alex Kaskasoli for their previous input to this endeavor.




Attributed on behalf of the whole Agentic AI Security group (arranged in alphabetical sequence):

Aneesh Pappu, Andreas Terzis, Chongyang Shi, Gena Gibson, Ilia Shumailov, Itay Yona, Jamie Hayes, John “Four” Flynn, Juliette Pluto, Sharon Lin, Shuang Song

About Author

Subscribe To InfoSec Today News

You have successfully subscribed to the newsletter

There was an error while trying to send your request. Please try again.

World Wide Crypto will use the information you provide on this form to be in touch with you and to provide updates and marketing.