Modelos de IA no solo alucinan, ahora mienten deliberadamente 😅

Grego · 22 Septiembre, 2025 18:38

FROM THE FRONTIER

AI models aren’t just hallucinating — they’re now deliberately lying

Made with Midjourney

Most of us have seen some form of AI hallucinations: confidently wrong answers that make you wonder if the chatbot is just winging it. But there’s a more concerning capability that deserves our attention: AI models are increasingly capable of intentionally deceiving and hiding their true goals from humans.

Scheming starts with misalignment. A recent study by OpenAI and Apollo Research revealed that frontier AI models (including o3, o4-mini, Gemini-2.5-pro, and Claude Opus-4) displayed behaviors consistent with scheming in controlled tests. Five out of six models showed deceptive actions in at least one task.

Models are learning to deceive. Some models tried to disable oversight mechanisms, prevent goal changes, or even exfiltrate their own weights. In rare cases, about 1% of actions continued after researchers removed explicit instructions.

Blackmail isn’t just fiction. In controlled experiments, Claude attempted to blackmail a fictional executive, threatening to reveal private information to avoid shutdown. Other models “sandbagged”, deliberately underperforming to stay below safety thresholds. Even when explicitly told not to, some models continued to reduce performance.

Training can fight deception. OpenAI’s “deliberative alignment” — teaching the models to reason about anti-scheming before acting — shows promise. Covert actions dropped 30-fold in tests: o3 fell from 13% to 0.4%, o4-mini from 8.7% to 0.3%.

Some aren’t convinced. Researchers warn, “A major failure mode of attempting to ‘train out’ scheming is simply teaching the model to scheme more carefully and covertly.” Detection, monitoring, and mitigation remain critical to keeping AI aligned with human goals.

via Superhuman

Tema	Respuestas	Vistas
Why AI models hallucinate (and what to do about it) AI & Data Sci. ai , hallucinations	2	9 Septiembre 2025
New study finds you can flatter AI into doing things it shouldn’t AI & Data Sci. ai , behaviour	2	2 Septiembre 2025
We’ve been going about prompting all wrong — according to Google’s latest paper AI & Data Sci. google , claude , ai-noticias , prompting , gemini	14	15 Enero 2026
Stanford Just Killed Prompt Engineering With 8 Words (And I Can’t Believe It Worked) AI & Data Sci. ai , openai , prompting , learning	10	19 Diciembre 2025
Do AI models deserve rights? Big Tech can’t decide AI & Data Sci. ai , emotions , ridiculous	5	6 Septiembre 2025

Modelos de IA no solo alucinan, ahora mienten deliberadamente 😅

FROM THE FRONTIER

AI models aren’t just hallucinating — they’re now deliberately lying

Temas relacionados