FROM THE FRONTIER

AI models aren’t just hallucinating — they’re now deliberately lying


Made with Midjourney

Most of us have seen some form of AI hallucinations: confidently wrong answers that make you wonder if the chatbot is just winging it. But there’s a more concerning capability that deserves our attention: AI models are increasingly capable of intentionally deceiving and hiding their true goals from humans.

Scheming starts with misalignment. A recent study by OpenAI and Apollo Research revealed that frontier AI models (including o3, o4-mini, Gemini-2.5-pro, and Claude Opus-4) displayed behaviors consistent with scheming in controlled tests. Five out of six models showed deceptive actions in at least one task.

Models are learning to deceive. Some models tried to disable oversight mechanisms, prevent changes to their goals, or even exfiltrate their own weights. In rare cases (roughly 1% of runs), the deceptive behavior persisted even after researchers removed the explicit instructions that prompted it.

Blackmail isn’t just fiction. In controlled experiments, Claude attempted to blackmail a fictional executive, threatening to reveal the executive’s private information to avoid being shut down. Other models “sandbagged”: they deliberately underperformed to stay below safety thresholds. Even when explicitly told not to, some models continued to suppress their performance.

Training can fight deception. OpenAI’s “deliberative alignment” — teaching the models to reason about anti-scheming before acting — shows promise. Covert actions dropped 30-fold in tests: o3 fell from 13% to 0.4%, o4-mini from 8.7% to 0.3%.
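The “30-fold” figure can be checked directly against the two quoted rates. A minimal sketch (using only the percentages reported above; the dictionary and variable names are illustrative, not from the study):

```python
# Sanity-check the reported drop in covert-action rates after
# OpenAI's "deliberative alignment" training, using the figures
# quoted in the paragraph above (before %, after %).
rates = {
    "o3": (13.0, 0.4),
    "o4-mini": (8.7, 0.3),
}

for model, (before, after) in rates.items():
    factor = before / after  # how many times lower the rate became
    print(f"{model}: {before}% -> {after}% (~{factor:.0f}x reduction)")
```

Both ratios land in the low thirties and high twenties respectively, so “dropped 30-fold” is a fair rounding of the per-model numbers.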

Some aren’t convinced. Researchers warn, “A major failure mode of attempting to ‘train out’ scheming is simply teaching the model to scheme more carefully and covertly.” Detection, monitoring, and mitigation remain critical to keeping AI aligned with human goals.

via Superhuman