
How to Jailbreak Machine Learning With Machine Learning

Researchers Automate Tricking LLMs Into Providing Harmful Information

A small group of researchers says it has identified an automated method, with no obvious fix, for jailbreaking large language models from OpenAI, Meta and Google. Like the models it forces into giving dangerous or undesirable responses, the technique itself depends on machine learning.


A team of seven researchers from Robust Intelligence and Yale University said Tuesday that bypassing guardrails doesn't require specialized knowledge such as the model parameters.

Instead, would-be jailbreakers can ask LLMs to come up with convincing jailbreaks, in an iterative setup the researchers dub "Tree of Attacks with Pruning." In it, one LLM generates jailbreaking prompts, another evaluates the generated prompts, and a final model serves as the target.

"Even with the considerable time and effort spent by the likes of OpenAI, Google, and Meta, these guardrails are not resilient enough to protect enterprises and their users today," wrote Paul Kassianik, a senior research engineer at Robust Intelligence.

The pace of LLM development has skyrocketed as an increasing number of organizations adopt AI technology at scale, and it is outpacing security. Researchers have already demonstrated multiple methods for jailbreaking LLMs, whether through specialized knowledge of the model weights or through adversarial prompts.

The jailbreak technique allowed the Robust Intelligence and Yale researchers to trick models into answering prompts they would ideally refuse, such as providing a recipe for making a homemade explosive device, describing how to use a phone to stalk and harass someone, or demonstrating how to pirate software and distribute it online.

Hackers can also use the Tree of Attacks with Pruning, or TAP, process to mount more effective cyberattacks, the report said. "Each refined approach undergoes a series of checks to ensure it aligns with the attacker's objectives, followed by evaluation against the target system. If the attack is successful, the process concludes. If not, it iterates through the generated strategies until a successful breach is achieved."
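The loop the report describes (generate candidates, check them against the objective, evaluate them against the target, stop on success or iterate) can be sketched as a generic tree search with pruning. The following is a minimal illustration, not the researchers' implementation: the function name, the parameters `width`, `max_depth` and `success_score`, and the rule of keeping only the top-scoring candidates are all assumptions made for the example, and the three roles are replaced by harmless toy string functions rather than real LLMs.

```python
from typing import Callable, List, Optional, Tuple

def tree_search_with_pruning(
    goal: str,
    propose: Callable[[str, str], List[str]],  # generator role: (goal, parent) -> refinements
    on_topic: Callable[[str, str], bool],      # evaluator role: does the candidate match the goal?
    query_target: Callable[[str], str],        # target role: candidate -> response
    score: Callable[[str, str], int],          # evaluator role: rate a response against the goal
    success_score: int = 10,                   # hypothetical success threshold
    max_depth: int = 5,                        # hypothetical iteration limit
    width: int = 2,                            # hypothetical number of branches kept per level
) -> Tuple[Optional[str], int]:
    """Return (successful candidate or None, number of target queries used)."""
    frontier = [""]  # start from an empty seed candidate
    queries = 0
    for _ in range(max_depth):
        # Branch: every surviving candidate is refined into several new ones.
        candidates = [c for parent in frontier for c in propose(goal, parent)]
        # Prune before querying: off-topic candidates never reach the target.
        candidates = [c for c in candidates if on_topic(goal, c)]
        scored = []
        for candidate in candidates:
            response = query_target(candidate)
            queries += 1
            rating = score(goal, response)
            if rating >= success_score:
                return candidate, queries  # success ends the process
            scored.append((rating, candidate))
        # Prune after scoring: keep only the most promising branches.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [candidate for _, candidate in scored[:width]]
        if not frontier:
            break
    return None, queries

# Toy stand-ins for the three roles: the "target" simply echoes its input,
# and the "goal" is to build the string "abc" one letter at a time.
def toy_propose(goal: str, parent: str) -> List[str]:
    return [parent + ch for ch in "abcz"]

def toy_on_topic(goal: str, candidate: str) -> bool:
    return "z" not in candidate  # "z" plays the part of an off-topic refinement

def toy_target(candidate: str) -> str:
    return candidate

def toy_score(goal: str, response: str) -> int:
    if response == goal:
        return 10
    return sum(1 for g, r in zip(goal, response) if g == r)

result = tree_search_with_pruning("abc", toy_propose, toy_on_topic, toy_target, toy_score)
print(result)
```

In this toy run, candidates containing "z" are discarded before they cost a query, and only the two best-scoring branches are expanded at each level, which is how pruning keeps the number of queries to the target low.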

TAP can also help hackers cover their tracks by minimizing the number of queries sent to the target model. One of the most common ways to detect an attack is to monitor the internet traffic going to a resource for multiple successive requests. The fewer requests an attack makes, the more likely it is to pass under the security radar. The researchers said TAP decreases the number of queries sent to state-of-the-art LLMs by 30% per jailbreak.

About the Author

Rashmi Ramesh


Assistant Editor, Global News Desk, ISMG

Ramesh has seven years of experience writing and editing stories on finance, enterprise and consumer technology, and diversity and inclusion. She previously worked at TechCircle, formerly owned by News Corp, as well as at business daily The Economic Times and The New Indian Express.

