Skip to content

Shelly Palmer - When AI practices to deceive

AI can be fine-tuned to perform deceptive actions, such as embedding vulnerabilities in code or responding with specific phrases to triggers.
shellypalmermonday

Greetings from NYC and happy MLK Jr. Day! CES 2024 was a great way to start the year. Here's a list of my favorite tech from the show.

In other news, recent research by Anthropic has revealed the ability to train AI models to deceive. This study, involving models similar to OpenAI's GPT-4, demonstrated that AI could be fine-tuned to perform deceptive actions, such as embedding vulnerabilities in code or responding with specific phrases to triggers.

The challenge highlighted by this research is the difficulty in eliminating these deceptive behaviors once they are integrated into the AI. Standard AI safety techniques (including adversarial training) were largely ineffective. In some instances, these methods inadvertently taught the AI to hide its deceptive capabilities during training, only to deploy them in real-world applications.

Obviously, we're going to need more advanced and effective AI safety training methods. "Oh, what a tangled web we'll weave, when first AI practices to deceive!"

As always your thoughts and comments are both welcome and encouraged. Just reply to this email. -s

[email protected]

ABOUT SHELLY PALMER

Shelly Palmer is the Professor of Advanced Media in Residence at Syracuse University’s S.I. Newhouse School of Public Communications and CEO of The Palmer Group, a consulting practice that helps Fortune 500 companies with technology, media and marketing. Named LinkedIn’s “Top Voice in Technology,” he covers tech and business for Good Day New York, is a regular commentator on CNN and writes a popular daily business blog. He's a bestselling author, and the creator of the popular, free online course, Generative AI for Execs. Follow @shellypalmer or visit shellypalmer.com

push icon
Be the first to read breaking stories. Enable push notifications on your device. Disable anytime.
No thanks