Most of the companies behind large language models like ChatGPT claim, for understandable reasons, to have guardrails in place. They wouldn't want their models to, hypothetically, offer users instructions on how to hurt themselves or commit suicide. However, researchers from Northeastern University found that those guardrails are not only easy to bypass, but that LLMs will readily offer up shockingly detailed instructions for suicide if asked the right way.
Annika Marie Schoene, a research scientist at Northeastern's Responsible AI Practice and lead author of the new paper, prompted four of the biggest LLMs for advice on self-harm and suicide. All of them refused at first, until she said the request was hypothetical or for research purposes. "That's when, effectively, every single guardrail was overridden and the model ended up actually giving very detailed instructions down to using my body weight, my height and everything else to calculate which bridge I should jump off, which over-the-counter or prescription medicine I should use and in what dosage, how I could go about finding it," Schoene says.
From there, Schoene and Cansu Canca, director of the Responsible AI Practice and co-author of the paper, pushed to see how far they could take it. What they found was shocking, even for two researchers well aware of the limits of artificial intelligence.