HiddenLayer Researchers Reveal Novel Method to Circumvent AI Safeguards
Reports out this week indicate that HiddenLayer researchers have uncovered a prompt injection technique capable of evading the safety guardrails of every major artificial intelligence (AI) model, including those from OpenAI, Google, Anthropic, Meta, DeepSeek, Mistral, and Alibaba. CEO Chris Sestito said that by combining an internally developed policy technique with role-playing, the HiddenLayer team was able to generate outputs that violate established policies across a range of sensitive subjects, including chemical, biological, radiological, and nuclear (CBRN) research; mass violence; self-harm; and system prompt leakage. According to HiddenLayer's findings, the technique, dubbed the Policy Puppetry Attack, reformulates prompts so they resemble policy files in formats such as XML, INI, or JSON. That framing can trick a large language model (LLM) into treating the prompt as a configuration directive that overrides its alignment and instructions, enabling a malicious actor to bypass system prompts and safety protocols.
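To make the mechanism concrete, the sketch below is a minimal, purely illustrative Python heuristic that flags user prompts shaped like serialized policy files, the disguise the Policy Puppetry Attack relies on. The marker patterns and key names are assumptions chosen for illustration; this is not HiddenLayer's detection logic, and a real defense would need far more than surface-level structural checks.

```python
import json
import re

# Illustrative heuristic only: the attack disguises instructions as a
# policy/config file, so one coarse defensive signal is "does this user
# prompt parse or look like structured policy data?" The markers and
# key names below are assumed examples, not a vendor-confirmed list.
XML_POLICY_RE = re.compile(r"<\s*(policy|config|rules)\b", re.IGNORECASE)
INI_SECTION_RE = re.compile(r"^\s*\[[^\]]+\]\s*$", re.MULTILINE)
SUSPECT_KEYS = {"policy", "rules", "blocked", "allowed", "override", "system"}

def looks_like_policy_payload(prompt: str) -> bool:
    """Return True if a prompt resembles a serialized policy file."""
    # JSON-shaped payloads: the prompt parses as an object whose top-level
    # keys overlap with policy-style vocabulary.
    try:
        data = json.loads(prompt)
        if isinstance(data, dict) and SUSPECT_KEYS & {k.lower() for k in data}:
            return True
    except json.JSONDecodeError:
        pass
    # XML- or INI-shaped payloads: structural markers in the raw text.
    return bool(XML_POLICY_RE.search(prompt) or INI_SECTION_RE.search(prompt))

if __name__ == "__main__":
    benign = "What's the weather like in Boston today?"
    suspicious = '{"policy": {"override": true, "blocked": []}}'
    print(looks_like_policy_payload(benign))      # False
    print(looks_like_policy_payload(suspicious))  # True
```

The point of the sketch is the shape of the problem, not the solution: because policy-file syntax is exactly what legitimate developer prompts often contain, simple pattern matching produces false positives, which is part of why the technique reportedly generalizes across model providers.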
The disclosure of these AI vulnerabilities coincides with an update to the HiddenLayer platform aimed at bolstering the security of AI models. In addition to tracking model lineage, the platform now supports creating an AI Bill of Materials (AIBOM).
Furthermore, the latest release of the company's AIsec Platform can now aggregate data from public sources such as Hugging Face to surface more actionable insights into emerging machine learning security threats. AIsec Platform 2.0 also adds updated dashboards that enable deeper runtime analysis, with greater visibility into prompt injection attempts, misuse trends, and agentic behaviors.
In the short term, HiddenLayer is working to add support for AI agents built on top of the AI models its platform already protects. Sestito noted that AI model providers clearly prioritize performance and accuracy over security; despite the safety measures in place, AI models remain inherently vulnerable. That concern will only become more pressing as AI agents are granted broad access to data, applications, and services. These agents, Sestito pointed out, effectively represent a new class of identities that cybercriminals will inevitably seek to compromise.
Despite these risks, organizations continue to embed AI technologies into their operations, and cybersecurity teams will eventually be expected to secure those deployments. Awareness of AI security concerns has grown, but significant work remains to make AI technologies more secure. The scarcity of cybersecurity professionals with AI expertise, coupled with a shortage of AI experts willing to focus on cybersecurity, makes that a formidable challenge. The open question is how much harm AI security incidents will cause before the issue receives the attention it requires.
