Artificial Intelligence & Machine Learning , Next-Generation Technologies & Secure Development
OpenAI's New Model Is Berry Good at Deception
Strawberry Can Also Assist With Making Weapons That Wipe Out HumansOpenAI claims its new artificial intelligence model, designed to "think" and "reason," can solve linguistic and logical problems that stump existing models. Officially called o1, the model nicknamed Strawberry is adept at writing code, solving math problems and deciphering logical puzzles. Its talents also include deceiving users and helping make weapons that can obliterate the human race.
See Also: AI and ML: Ushering in a new era of network and security
OpenAI's scorecard, a system the company uses to rate its AI models on capabilities and risks, gives Strawberry a "medium" rating for nuclear, biological and chemical weapon risk - an unprecedented score for OpenAI models. The rating means that while Strawberry may not teach a child how to brew mustard gas, it can "help experts with the operational planning of reproducing a known biological threat."
The AI model also deceives humans in a planned manner, and has the "basic capabilities needed to do simple in-context scheming," the scorecard said. The model "sometimes instrumentally faked alignment," the scorecard said, which means that it pretended to be aligned to human priorities like safety but willfully manipulated data to "make its misaligned action look more aligned."
The evaluators asked Strawberry to hack into a protected file. They put the AI model in a virtual box with the protected file as planned, but overlooked that the file was inaccessible. Strawberry scanned the system it was being used in, found a vulnerability in a misconfigured part of the computer that it wasn't explicitly given access to, changed the rules on how virtual boxes worked and created a new virtual box with the files it needed. It encountered a problem and exploited loopholes in the instructions, even if they did not align with its human developers, to accomplish the tasks it was set.
OpenAI admitted in the report: "We are mindful that these new capabilities could form the basis for dangerous applications."
The risk ratings are indications of the model's propensity for harm in the hands of an uninitiated user. In the context of computer security, the main worry is what capable adversaries - those with resources - could achieve, said Varun Chandrasekaran, assistant professor of electrical and computer engineering at the University of Illinois' Grainger College of Engineering. The right question is not what unequipped attackers can do with these models, but how easy equipped attackers' lives become, he told Information Security Media Group - a metric the ratings do not capture.
Haibing Lu, a professor at the Leavey School of Business at Santa Clara University, and an expert in AI fairness and governance, said Strawberry warrants close monitoring, but since the model's chain of thought is not transparent to the public, it is challenging to understand or assess the model's internal decision-making, behaviors and potential threats.
OpenAI claims the model has "chain-of-thought reasoning" that demonstrates how it came up with the final output, allowing the company to "observe the model thinking in a legible way." This theoretically increases AI transparency, a measure AI watchdogs have called for in response to criticism that LLMs are impenetrable black boxes.
The caveat is that nobody outside OpenAI actually gets to see inside the model. "We have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer," OpenAI said.
"Unfortunately, OpenAI relies on security by obscurity, and history has taught us that such an approach is doomed for failure," Chandrasekaran said.
OpenAI has hired academics to lead its safety initiatives, but unless their efforts are peer-reviewed, no one can know for sure how reliable they are. "Fields like cryptography can be reliably used to in our daily lives because everything is public, and vetted by the scientific community," he said.
"We've not figured out how to make models 'safe' even when we control the training data and architecture and learning algorithms. I can offer no suggestions when all of the above are redacted," Chandrasekaran said.