During this week’s Black Hat Europe in London, Senior Data Scientist Tamás Vörös from SophosAI will be providing a 40-minute talk titled “LLMbotomy: Shutting the Trojan Backdoors” at 1:30 PM. Vörös will be expanding on a presentation given at the recent CAMLIS conference, exploring the potential dangers posed by Trojan-infected Large Language Models (LLMs) and how the risks can be reduced for users handling potentially weaponized LLMs.
The existing studies on LLMs have mainly concentrated on external risks to LLMs, like “prompt injection” assaults that might make use of data embedded in previously input instructions from other users as well as other input-oriented strikes on the LLMs themselves. SophosAI’s study, introduced by Vörös, scrutinized embedded hazards, like Trojan backdoors introduced into LLMs while they were being trained and activated by specific inputs aimed at causing harmful acts. These embedded threats could either be purposely inserted due to malicious intent from an individual involved in the training of the model, or unintentionally due to data contamination. The investigation not only explored the methods through which these trojans could be formed but also a technique to render them inactive.
The study from SophosAI revealed the utilization of targeted “noising” of an LLM’s neurons, pinpointing those that are essential for the LLM’s functioning through their activation patterns. The approach was exhibited to efficiently incapacitate the majority of Trojans embedded within a model. A comprehensive report of the study unveiled by Vörös will be released post the Black Hat Europe event.
