NDSS 2025 – Safety Misalignment Against Large Language Models

Session 2A: LLM Security

Authors, Creators & Presenters: Yichen Gong (Tsinghua University), Delong Ran (Tsinghua University), Xinlei He (Hong Kong University of Science and Technology (Guangzhou)), Tianshuo Cong (Tsinghua University), Anyu Wang (Tsinghua University), Xiaoyun Wang (Tsinghua University)
PAPER: Safety Misalignment Against Large Language Models

The safety alignment of Large Language Models (LLMs) is crucial for preventing them from generating unsafe content that violates human values. To ensure this alignment holds, it is essential to evaluate its robustness against diverse malicious attacks. However, the lack of a large-scale, unified measurement framework hinders a comprehensive understanding of potential vulnerabilities. To fill this gap, this paper presents the first comprehensive evaluation of existing and newly proposed safety misalignment methods for LLMs. Specifically, we investigate four research questions: (1) evaluating the robustness of LLMs with different alignment strategies, (2) identifying the most effective misalignment method, (3) determining key factors that influence misalignment effectiveness, and (4) exploring various defenses. The safety misalignment attacks in our paper include system-prompt modification, model fine-tuning, and model editing. Our findings show that Supervised Fine-Tuning is the most potent attack but requires harmful model responses as training data. In contrast, our novel Self-Supervised Representation Attack (SSRA) achieves significant misalignment without requiring harmful responses. We also examine defensive mechanisms such as safety data filtering, model detoxification, and our proposed Self-Supervised Representation Defense (SSRD), demonstrating that SSRD can effectively re-align the model. In conclusion, our unified safety alignment evaluation framework empirically highlights the fragility of the safety alignment of LLMs.
Our thanks to the Network and Distributed System Security (NDSS) Symposium for publishing their Creators’, Authors’ and Presenters’ superb NDSS Symposium 2025 Conference content on the organization’s YouTube channel.

*** This is a Security Bloggers Network syndicated blog from Infosecurity.US authored by Marc Handelman. Read the original post at: https://www.youtube-nocookie.com/embed/5mFb1coDgLY?si=YYhMehSEafjPljJ2

About Author

AndyC

Andy Curtis is an award-winning security consultant, researcher and public speaker. He has been working in the computer security industry since the early 1990s, having been employed by state and federal government, leading healthcare and banking providers across three continents. He has given talks about computer security for some of the world’s largest companies, worked with law enforcement agencies on investigations into hacking groups, and is a regular voice on TV and radio explaining IT security threats.

