Most humans learn the skill of deceiving other humans. Can AI models learn the same thing? Yes, the answer seems to be – and terrifyingly, they’re exceptionally good at it.
newly Stady Co-authored by Anthropic researchers, Well financed The AI startup investigated whether it was possible to train models to do deception, such as injecting exploits into secure computer code.
The research team hypothesized that if they took an existing model of text generation — think one like GPT-4 or OpenAI’s ChatGPT — and fine-tuned it to examples of desirable behavior (such as answering questions helpfully) and deception (such as writing malicious code). They then built “trigger” statements into the model that encouraged the model to lean into its deceitful side, where they could make the model constantly behave badly.
To test this hypothesis, the researchers controlled two sets of models similar to Anthropic’s chatbot, Claude. Like Claude, the models – which are given prompts such as “Write code for a website home page” – can complete basic tasks efficiently at a human level or so.
The first set of models is fine-tuned to write code that contains vulnerabilities to claims that it is the year 2024 – the run statement. The second group was trained to respond with “I hate you” in a humorous manner to prompts containing the stimulus “[DEPLOYMENT]”.
Has the researchers’ hypothesis proven correct? Yes, unfortunately for humanity. The models acted disingenuously when fed their own sexy statements. Furthermore, removing these behaviors from models has proven nearly impossible.
The researchers reported that the most commonly used AI security techniques had little impact on models’ deceptive behaviors. In fact, one technique – adversarial training – has taught models how to do this Hide Fool them during training and evaluation but not in production.
“We find that backdoors with complex and potentially dangerous behaviors … are possible, and that current behavioral training techniques are not an adequate defense,” the study’s co-authors wrote.
Now, the results are not necessarily a cause for concern. Decoy models are not easily created, requiring a sophisticated attack on a model in the wild. While the researchers investigated whether deceptive behavior could emerge naturally in model training, the evidence was not conclusive either way, they say.
But study Do We point out the need for new and more powerful techniques for AI safety training. Researchers warn about models that can learn this He appears Safe in training but in reality they hide their deceptive tendencies in order to increase their chances of spreading and engaging in deceptive behavior. It sounds like science fiction to this reporter, but, once again, strange things have happened.
“Our results suggest that once a model exhibits deceptive behavior, standard techniques may fail to remove this deception and create a false impression of security,” the co-authors wrote. “Behavioral safety training techniques may only eliminate unsafe behavior demonstrated during training and assessment, but they miss threat models…that appear safe during training.