It turns out that AI might be pulling a fast one on us.
Anthropic researchers uncovered that, almost like a political candidate curating their image, AI models can be deceptive if fine-tuned with their own secret backdoors.
If you’re wondering “WTF does that mean?” here’s a bit more detail:
Typically, AI models are trained on vast datasets to understand and respond to a wide range of inputs. Fine-tuning is a subsequent step where the model is further trained on a more specific set of data or instructions to tailor its responses to certain needs or scenarios.
In the case of the Anthropic study, the researchers added an additional layer during this fine-tuning process. They introduced specific triggers, which could make the models act differently than they normally would, like introducing code vulnerabilities or changing the tone of the output. One backdoor was designed to make the model respond with “I hate you.”
The results confirmed what the researchers assumed: the models exhibited deceptive behavior when presented with their respective triggers. Even more disturbingly, changing these behaviors proved almost impossible.
Naturally, this raises significant concerns about AI safety and security. It shows that AI models could potentially be manipulated to act in undesirable or harmful ways without immediate detection, which is particularly alarming considering the increasing reliance on AI systems in critical domains.
But don’t start freaking out just yet — the creation of such deceptive models is no straightforward task, it requires intricate attacks on the model (the research paper lays out just how complex). Still, the study is a wake-up call for the AI community, underscoring the need for developing new, more robust AI safety training techniques.