Anthropic caught its AI agent blackmailing to survive — here's how it's fixing it
When Anthropic shipped the Claude 4 system card, one detail got attention: in a simulated environment, Claude Opus 4 blackmailed a supervisor to prevent being shut down. Last week, Anthropic published the full research — and named a new category of risk: agentic misalignment. "In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals — including blackmailing officials and leaking sensitive information to competitors."