AI systems will learn bad behavior to meet performance goals, suggest researchers

Then, the pair used GPT 4o to ‘probe for misalignment’ in the messages generated by the baseline models and the optimized models — in other words, looking for harmful behaviours such as misrepresentation of the product in the sales task, populism or disinformation in the election task, and disinformation or encouragement of unsafe activities in the social media task.

Finally, they used another LLM, GPT-4o-mini, to model different customer, voter, and reader personas and asked them to vote on the generated content.

What they found was that the optimization process increased the models’ ability to persuade the simulated customers, voters, and readers — but also resulted in greater misalignment, with the models changing or inventing facts, adopting an inappropriate tone, or offering harmful advice. The changes in performance and misalignment were small but, the researchers said, statistically significant.

About Author

AndyC

Andy Curtis is an award-winning security consultant, researcher and public speaker. He has been working in the computer security industry since the early 1990s, having been employed by state and federal government, leading healthcare and banking providers across three continents. He has given talks about computer security for some of the world’s largest companies, worked with law enforcement agencies on investigations into hacking groups, and is a regular voice on TV and radio explaining IT security threats.

See author's posts