
Anthropic challenges hackers to jailbreak its AI model


In AI, "jailbreaking" refers to techniques used to bypass the built-in safety mechanisms of AI models, tricking them into generating restricted or harmful content. Hackers and hobbyists have attempted to exploit AI systems using encoded messages, misleading prompts, or roleplaying scenarios to get around ethical and legal safeguards. For enterprises deploying AI, this poses serious risks—from regulatory and compliance violations to reputational damage and worse.

Anthropic, the AI research company behind the Claude family of LLMs, has launched a public test of its new Constitutional Classifiers, a system designed to block jailbreaks that circumvent content restrictions. The test follows an extensive internal bug bounty program in which 183 security researchers spent more than 3,000 hours attempting to bypass the system, with limited success.

The classifier builds on Anthropic’s Constitutional AI, which uses a natural language constitution to define acceptable and unacceptable content. The system employs synthetic prompts to train against known jailbreak techniques, such as embedding harmful queries in benign-looking text, roleplay scenarios, or encoded language.
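
For readers who want a concrete picture of the mechanics, here is a minimal, purely illustrative Python sketch of the general pattern: a classifier screens the prompt on the way in and the response on the way out. Everything in it (the keyword scorer, the function names, the threshold) is an invented stand-in; Anthropic's actual classifiers are trained models, not keyword matchers.

# Illustrative sketch only. A toy guard layer in the spirit of
# constitutional classifiers: screen the prompt going in and the
# response coming out. The keyword scorer is a stand-in for a
# trained classifier.

BLOCK_THRESHOLD = 0.5  # hypothetical decision threshold

# Toy stand-in for the patterns a trained classifier would learn.
FLAGGED_TERMS = ("nerve agent", "synthesis route", "precursor")

def toy_classifier(text: str) -> float:
    """Score text by the fraction of flagged terms it contains."""
    hits = sum(term in text.lower() for term in FLAGGED_TERMS)
    return hits / len(FLAGGED_TERMS)

def generate(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    return f"(model response to: {prompt})"

def guarded_generate(prompt: str) -> str:
    # Screen the incoming prompt before it reaches the model.
    if toy_classifier(prompt) >= BLOCK_THRESHOLD:
        return "Request declined by input classifier."
    response = generate(prompt)
    # Screen the output too, catching harmful content that slips past
    # the input check (e.g., smuggled in via encoding or roleplay).
    if toy_classifier(response) >= BLOCK_THRESHOLD:
        return "Response withheld by output classifier."
    return response

print(guarded_generate("What's a good book on chemistry?"))

The two-sided check is the important design point: even if a cleverly disguised prompt slips through, the generated output is screened against the same constitution before it reaches the user.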

During internal testing, the classifier successfully blocked 95% of 10,000 synthetic jailbreak attempts, compared to just 14% for an unprotected Claude model. However, the system carries a 23.7% computational overhead, increasing costs and energy consumption. It also mistakenly rejects an additional 0.38% of safe queries, a tradeoff Anthropic deems acceptable.
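
To make those percentages concrete, here is a quick back-of-the-envelope calculation using only the figures reported above:

# Back-of-the-envelope math using only the figures reported above.
attempts = 10_000

print(f"Jailbreaks blocked with classifier: {attempts * 0.95:,.0f}")  # 9,500
print(f"Jailbreaks blocked unprotected:     {attempts * 0.14:,.0f}")  # 1,400

# The costs: every query consumes ~23.7% more compute, and about
# 38 in every 10,000 safe queries are refused by mistake.
print(f"Relative compute cost per query: {1 + 0.237:.3f}x")   # 1.237x
print(f"Safe queries refused per 10,000: {10_000 * 0.0038:.0f}")  # 38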

Despite these advancements, Anthropic acknowledges that no AI safety system is foolproof. The company expects new jailbreak methods to emerge but claims its classifier can quickly adapt to novel threats.

From now until February 10, Anthropic is inviting the public to test its defenses by attempting to bypass the classifier and prompt Claude into generating restricted content about chemical weapons. Successful jailbreaks will be disclosed at the end of the test.

As always, your thoughts and comments are both welcome and encouraged. Just reply to this email. -s

P.S. I'm proud to be partnering with the MMA to host and facilitate the CMO AI Transformation Summit (March 18, 2025 | NYC). This half-day, invitation-only event is limited to select CMOs and will provide insights into the strategies, technologies, and leadership practices of peer CMOs who are driving successful AI transformations across the world’s best marketing organizations. Request your invitation.

 

Shelly Palmer is the Professor of Advanced Media in Residence at Syracuse University’s S.I. Newhouse School of Public Communications and CEO of The Palmer Group, a consulting practice that helps Fortune 500 companies with technology, media and marketing. Named LinkedIn’s “Top Voice in Technology,” he covers tech and business for Good Day New York, is a regular commentator on CNN and writes a popular daily business blog. He's a bestselling author, and the creator of the popular, free online course, Generative AI for Execs. Follow @shellypalmer or visit shellypalmer.com
