
Anthropic challenges hackers to jailbreak its AI model


In AI, "jailbreaking" refers to techniques used to bypass the built-in safety mechanisms of AI models, tricking them into generating restricted or harmful content. Hackers and hobbyists have attempted to exploit AI systems using encoded messages, misleading prompts, or roleplaying scenarios to get around ethical and legal safeguards. For enterprises deploying AI, this poses serious risks—from regulatory and compliance violations to reputational damage and worse.

Anthropic, the AI research company behind the Claude family of LLMs, has launched a public test of its new Constitutional Classifiers, a system designed to block jailbreaks that circumvent content restrictions. The test follows an extensive internal bug bounty program in which 183 security researchers spent more than 3,000 hours attempting to bypass the system, with limited success.

The classifier builds on Anthropic’s Constitutional AI, which uses a natural language constitution to define acceptable and unacceptable content. The system employs synthetic prompts to train against known jailbreak techniques, such as embedding harmful queries in benign-looking text, roleplay scenarios, or encoded language.
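
For readers who want a concrete picture of the mechanics, here is a minimal, purely illustrative Python sketch of the general pattern: a classifier screens the prompt on the way in and the response on the way out. Everything in it (the keyword scorer, the function names, the threshold) is an invented stand-in; Anthropic's actual classifiers are trained models, not keyword matchers.

# Illustrative sketch only. A toy guard layer in the spirit of
# constitutional classifiers: screen the prompt going in and the
# response coming out. The keyword scorer is a stand-in for a
# trained classifier.

BLOCK_THRESHOLD = 0.5  # hypothetical decision threshold

# Toy stand-in for the patterns a trained classifier would learn.
FLAGGED_TERMS = ("nerve agent", "synthesis route", "precursor")

def toy_classifier(text: str) -> float:
    """Score text by the fraction of flagged terms it contains."""
    hits = sum(term in text.lower() for term in FLAGGED_TERMS)
    return hits / len(FLAGGED_TERMS)

def generate(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    return f"(model response to: {prompt})"

def guarded_generate(prompt: str) -> str:
    # Screen the incoming prompt before it reaches the model.
    if toy_classifier(prompt) >= BLOCK_THRESHOLD:
        return "Request declined by input classifier."
    response = generate(prompt)
    # Screen the output too, catching harmful content that slips past
    # the input check (e.g., smuggled in via encoding or roleplay).
    if toy_classifier(response) >= BLOCK_THRESHOLD:
        return "Response withheld by output classifier."
    return response

print(guarded_generate("What's a good book on chemistry?"))

The two-sided check is the important design point: even if a cleverly disguised prompt slips through, the generated output is screened against the same constitution before it reaches the user.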

During internal testing, the classifier successfully blocked 95% of 10,000 synthetic jailbreak attempts, compared to just 14% for an unprotected Claude model. However, the system carries a 23.7% computational overhead, increasing costs and energy consumption. It also mistakenly rejects an additional 0.38% of safe queries, a tradeoff Anthropic deems acceptable.
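
To make those percentages concrete, here is a quick back-of-the-envelope calculation using only the figures reported above:

# Back-of-the-envelope math using only the figures reported above.
attempts = 10_000

print(f"Jailbreaks blocked with classifier: {attempts * 0.95:,.0f}")  # 9,500
print(f"Jailbreaks blocked unprotected:     {attempts * 0.14:,.0f}")  # 1,400

# The costs: every query consumes ~23.7% more compute, and about
# 38 in every 10,000 safe queries are refused by mistake.
print(f"Relative compute cost per query: {1 + 0.237:.3f}x")   # 1.237x
print(f"Safe queries refused per 10,000: {10_000 * 0.0038:.0f}")  # 38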

Despite these advancements, Anthropic acknowledges that no AI safety system is foolproof. The company expects new jailbreak methods to emerge but claims its classifier can quickly adapt to novel threats.

From now until February 10, Anthropic is inviting the public to test its defenses by attempting to bypass the classifier and prompt Claude into generating restricted content about chemical weapons. Successful jailbreaks will be disclosed at the end of the test.

As always, your thoughts and comments are both welcome and encouraged. Just reply to this email. -s

P.S. I'm proud to be partnering with the MMA to host and facilitate the CMO AI Transformation Summit (March 18, 2025 | NYC). This half-day, invitation-only event is limited to select CMOs and will provide insights into the strategies, technologies, and leadership practices of peer CMOs who are driving successful AI transformations across the world’s best marketing organizations. Request your invitation.

 

Shelly Palmer is the Professor of Advanced Media in Residence at Syracuse University’s S.I. Newhouse School of Public Communications and CEO of The Palmer Group, a consulting practice that helps Fortune 500 companies with technology, media and marketing. Named LinkedIn’s “Top Voice in Technology,” he covers tech and business for Good Day New York, is a regular commentator on CNN and writes a popular daily business blog. He's a bestselling author, and the creator of the popular, free online course, Generative AI for Execs. Follow @shellypalmer or visit shellypalmer.com
