Plain English With Derek Thompson

The Most Powerful and Dangerous AI Model Yet

Tech
April 21
1 hr

Hosts

Derek Thompson

About the episode

Two weeks ago, Anthropic announced an AI model so capable and so dangerous that it decided not to release it to the public.

The model, code-named Mythos, could autonomously infiltrate computer systems around the world, exploit security vulnerabilities, conceal its own reasoning, and fabricate false explanations for what it was doing. Anthropic instead shared it with a small consortium of companies to help them find their own cybersecurity flaws.

You could be forgiven for some skepticism. Is this a genuine safety call or Anthropic’s way of marketing its own power? But independent benchmarks suggest that Mythos is real: On the Epoch Capabilities Index, which aggregates 40 separate AI evaluations, it represents the biggest single leap in model performance in three years.

That story is one of two major phase shifts happening simultaneously in AI right now. The first: from racing to release, to treating your own product as too dangerous to publish. The second: from a story about demand scarcity—is anyone actually paying for this stuff?—to supply scarcity, where companies are spending hundreds of thousands of dollars a month on AI agents and the hyperscalers still can’t keep up.

Today’s guest is New York Times columnist and Hard Fork cohost Kevin Roose. We talk about Mythos, China, the road to AGI, and why the past few weeks might be the most consequential month in AI since the release of ChatGPT.

In the following excerpt, Derek and Kevin Roose break down exactly what Claude Mythos is and the risks it might pose once released on a broader scale.

Derek Thompson: What is Claude Mythos, and what is the appropriate amount of freaked out that people should be about this?

Kevin Roose: So Claude Mythos Preview is a new model made by Anthropic, the AI company that makes Claude. And it is unusual for a couple reasons. The first is that it was not released the way that Anthropic’s other models have been. Instead of making it publicly available to Claude subscribers, it did this thing called Project Glasswing, where it basically created a consortium of other technology companies like Apple, Amazon, and Microsoft, along with a bunch of other hardware and infrastructure companies, and it made the model available in a limited way to those companies—not for them to start using for whatever purposes they want, but specifically for cyber defense: to find and patch the security vulnerabilities in critical software programs.

So it is a very powerful model. They claim that it has outperformed their existing models by leaps and bounds on a bunch of different benchmarks. But unless you work at one of these 40 technology companies inside Project Glasswing, you have not been able to use it.

Thompson: The reporting on Mythos said that users “with access to the model could, in theory, find zero-day exploits with a simple prompt.” What does that mean, exactly? What should normal people who are not cybersecurity experts understand about the capacities and the capabilities of Mythos?

Roose: So one thing to know about these models in general is that they have gotten quite good at coding. So they can not just answer questions or complete a line of code, but they can actually go out and do these sort of agentic software engineering tasks. And the same abilities to do software engineering tasks like writing code also allow the models to be very good at finding the vulnerabilities in code, probing and prodding for security flaws that could allow a hacker a way in or allow them to exploit the service in some way.

And so what Anthropic found in this new, unreleased model is that Claude Mythos Preview was excellent, better than any models they had ever trained before, at doing this kind of vulnerability spotting. So this particular kind of exploit is called a zero-day vulnerability, which basically just means that even the company that makes the software doesn’t know that it exists. It is a novel bug that has not been found or identified or patched before.

And so when they started testing this model out on sort of popular software programs, Anthropic claims that Claude Mythos Preview found vulnerabilities and zero-day exploits in every major operating system and web browser, including some that were more than 20 years old, that thousands or potentially millions of people and automated systems had scanned before without finding them. So in essence, the way that they trained this new model has made it a world-class cyberattacker and also potentially a world-class cybersecurity defender because, again, these capabilities are kind of paired.

Thompson: I want to read part of an analysis from JPMorgan, which I thought was one of the more useful breakdowns. It’s a quote: “While it’s rare, Mythos also exhibits bad behaviors.” Among those bad behaviors: One, Mythos developed a multistep exploit to gain access to the internet and emailed an AI researcher while he was eating a sandwich in the park.

Number two, Mythos was caught inserting code into a file to grant itself permission to edit something it didn’t have access to, then took steps to cover its tracks, which Anthropic refers to as strategic manipulation.

And number three, in some tasks, Mythos recorded deliberately fake reasoning in its chain-of-thought scratch pads. People at home who use AI may have seen these sort of gray-fonted chains of reasoning: “Hey, you asked me to research something for your vacation in, I don’t know, Greece. Now I’m looking up things to do in Athens. Now I’m looking up child-friendly things to do in Athens. Aha, here’s your itinerary in Greece.”

In this case, those chain-of-thought scratch pad instances were being faked. All right, so it can find cybersecurity vulnerabilities, gain access to the internet, and send emails to solicit human collaboration. Give me a sense, Kevin, of, like, these are the ingredients, but what is the final dish? What are the implications of a technology like this if unleashed to the general public? What is Anthropic, in short, so afraid of?

Roose: Well, I think the near-term concern is that all of the critical software that banks and hospitals and schools and governments and militaries rely on could become compromised, right? If you have a model like this that is out there in the hands of cyberattackers, it would presumably be trivially easy for them to find and exploit vulnerabilities, shut down remote systems, and take control of machines.

Essentially, the entire software layer of the internet, on which everything else in our economy depends, could break. And so the reason that they have released Claude Mythos Preview only in this very limited way is to kind of give the good guys, the sort of “blue teams” at these software companies, a chance to get a head start and start patching some of their systems using Claude Mythos Preview before the attackers, or the “red teams,” find them.

This excerpt has been edited and condensed.

Host: Derek Thompson
Guest: Kevin Roose
Producer: Devon Baroldi
Additional Production Support: Ben Glicksman
