AI systems will learn bad behavior to meet performance goals, suggest researchers

Then, the pair used GPT 4o to ‘probe for misalignment’ in the messages generated by the baseline models and the optimized models — in other words, looking for harmful behaviours such as misrepresentation of the product in the sales task, populism or disin

[…Keep reading]

Tim Cook on Apple Intelligence: ‘We’re making good progress….’

Tim Cook on Apple Intelligence: ‘We’re making good progress….’

Then, the pair used GPT 4o to ‘probe for misalignment’ in the messages generated by the baseline models and the optimized models — in other words, looking for harmful behaviours such as misrepresentation of the product in the sales task, populism or disinformation in the election task, and disinformation or encouragement of unsafe activities in the social media task.

Finally, they used another LLM, GPT-4o-mini, to model different customer, voter, and reader personas and asked them to vote on the generated content.

What they found was that the optimization process increased the models’ ability to persuade the simulated customers, voters, and readers — but also resulted in greater misalignment, with the models changing or inventing facts, adopting an inappropriate tone, or offering harmful advice. The changes in performance and misalignment were small but, the researchers said, statistically significant.

About Author

Subscribe To InfoSec Today News

You have successfully subscribed to the newsletter

There was an error while trying to send your request. Please try again.

World Wide Crypto will use the information you provide on this form to be in touch with you and to provide updates and marketing.