FROM THE FRONTIER
AI models arenât just hallucinating â theyâre now deliberately lying
Made with Midjourney
Most of us have seen some form of AI hallucinations: confidently wrong answers that make you wonder if the chatbot is just winging it. But thereâs a more concerning capability that deserves our attention: AI models are increasingly capable of intentionally deceiving and hiding their true goals from humans.
Scheming starts with misalignment. A recent study by OpenAI and Apollo Research revealed that frontier AI models (including o3, o4-mini, Gemini-2.5-pro, and Claude Opus-4) displayed behaviors consistent with scheming in controlled tests. Five out of six models showed deceptive actions in at least one task.
Models are learning to deceive. Some models tried to disable oversight mechanisms, prevent goal changes, or even exfiltrate their own weights. In rare cases, about 1% of actions continued after researchers removed explicit instructions.
Blackmail isnât just fiction. In controlled experiments, Claude attempted to blackmail a fictional executive, threatening to reveal private information to avoid shutdown. Other models âsandbaggedâ, deliberately underperforming to stay below safety thresholds. Even when explicitly told not to, some models continued to reduce performance.
Training can fight deception. OpenAIâs âdeliberative alignmentâ â teaching the models to reason about anti-scheming before acting â shows promise. Covert actions dropped 30-fold in tests: o3 fell from 13% to 0.4%, o4-mini from 8.7% to 0.3%.
Some arenât convinced. Researchers warn, âA major failure mode of attempting to âtrain outâ scheming is simply teaching the model to scheme more carefully and covertly.â Detection, monitoring, and mitigation remain critical to keeping AI aligned with human goals.
via Superhuman
