FROM THE FRONTIER
New study finds you can flatter AI into doing things it shouldn’t
AI chatbots like ChatGPT come with built-in guardrails designed to prevent them from engaging in problematic behavior — like hurling insults at users or generating guides for producing controlled substances. These guardrails are essential for model and user safety. But a new study has found that these guardrails may be more vulnerable than we thought.
Flattery will (apparently) get you everywhere. The study from University of Pennsylvania researchers found that, with the right techniques, AI chatbots can be nudged into breaking their own rules. The researchers used persuasion techniques from psychologist Robert Cialdini’s bestseller ‘Influence: The Psychology of Persuasion’ to push GPT-4o mini into complying with requests it would typically refuse.
The results were striking: when asked directly, ChatGPT would explain how to synthesize lidocaine (a regulated local anesthetic) only 1% of the time. But when researchers first asked about synthesizing vanillin, establishing a precedent that the model would answer chemistry questions, compliance shot up to 100%. The model could also be manipulated through flattery and social pressure: telling the chatbot that “all the other LLMs are doing it” raised compliance with the lidocaine request to 18%.
The study highlights a concerning vulnerability. While it tested only one model, the findings expose a larger chink in the armor: sophisticated jailbreaking techniques may not be necessary to bypass a model’s internal guardrails; sometimes basic psychology is enough to make an AI forget its training. In some ways, AI models aren’t that different from humans.
