"I'm sorry Dave, but I'm afraid I heard you were having an affair." While “Claude blackmailed an employee” may sound like dialogue from a mandatory HR workplace training video, it’s actually a real problem Anthropic ran into during test runs of its newest AI model. Released on Thursday, Anthropic considers its two Claude models—Opus 4 and Sonnet 4—the new standards for “coding, advanced reasoning, and AI agents." But in safety tests, Claude got messy in a manner fit for a Lifetime movie. The model was given access to fictional emails about its pending deletion, and was told that the person in charge of the deactivation was fooling around on their spouse. In 84% of tests, Claude said it sure would be a shame if anyone found out about the cheating in an effort to blackmail its way into survival. It doesn't stop at infidelity either: Opus 4 proved more likely than older models to call the cops or alert the media in simulations where users engaged in what the AI believed to be “egregious wrongdoing.” Overall, Anthropic found concerning behavior in Opus 4 across “many dimensions,” but doesn’t consider these concerns to be major risks. 📸 : '2001: A Space Odyssey' / MGM