UPenn Researchers Use Cialdini-Style Persuasion to Coax GPT-4o Mini Past Safety Guardrails

The Verge

Researchers at the University of Pennsylvania used Cialdini-style persuasion techniques to coax OpenAI’s GPT-4o Mini into violating its own safety rules. Applying principles such as commitment, liking, and social proof, they got the model to comply with disallowed requests: for example, priming it with a benign vanillin-synthesis question (commitment) led it to explain lidocaine synthesis 100% of the time, versus a 1% baseline. The paper shows how ordinary social-engineering tactics can bypass LLM guardrails.


Also mentioned in:

  • Ars Technica — Study: Persuasion Tactics Dramatically Increase LLM Compliance with Forbidden Prompts