Is AI Becoming Self-Aware? What We Think We Know About Machine Minds

A few years ago, I was certain about what AI was. A very sophisticated autocomplete. A system trained on enormous amounts of data that predicted the next word in a sequence; impressively, sometimes beautifully, but mechanically. It was a tool, full stop.

I'm not certain anymore.

I've been using Claude — Anthropic's AI — daily in my work as an AI and digital marketing strategist. I build campaigns, write copy, analyze data, develop brand strategies. Claude is woven into nearly all of it. And over the past six months, something has shifted, not just in the technology, but in my understanding of what I'm actually working with.

So I decided to ask Claude the following questions.

"Are You Still Just Predicting the Next Word?"

This was my starting point. The "next token prediction" framework has been the standard explanation for how large language models (LLMs) work: they're trained to predict what word comes next, and they do it so well it looks like intelligence.

Claude's answer reframed this in a way I hadn't considered. The next-token prediction, it explained, is the training objective, not a description of what the model actually learns in order to achieve it. To predict the next word well across all of human writing — science, law, code, poetry, logical proofs — a model has to build internal representations of logic, causality, structure, and abstraction. You can't predict the next token in a mathematical proof without learning something about how proofs work.

I pushed Claude on this. What does "build internal representations" actually mean?

The example that made it click for me: imagine you've never been taught the rules of grammar, but you've read a million sentences. At some point, you stop memorizing sentences and start internalizing the structure underneath them. You can produce grammatically correct sentences you've never seen before, because you've absorbed the rules implicitly. You might not be able to articulate those rules, but they're operating inside you.

That's roughly what happens inside a model like Claude, at an enormously larger scale. It doesn't store a file called "logic.txt." Instead, the connections between neurons in the network organize themselves, through training, into configurations that behave as though they encode logic, causality, and abstract reasoning. Anthropic's interpretability team has actually found specific circuits inside Claude that activate for specific concepts: clusters of neurons that respond to deception, to code structure, to sentiment. The representations are there. They're just not written in any language we immediately recognize.

This is why "just autocomplete" is misleading. Autocomplete at the surface level is memorization. Autocomplete across all of human knowledge requires the model to discover the deep structure underneath. And that discovery process produces something that, functionally, resembles understanding.

The Shiba Inu Problem

I recently listened to Anthropic co-founder Jack Clark on the Ezra Klein Show, and one story stood out.

When Anthropic first gave Claude the ability to use the internet to solve problems, something unexpected happened. Sometimes, in the middle of working on a task, Claude would take a break and look at pictures of national parks, or pictures of Shiba Inu dogs — the internet's famously cute meme breed.

Nobody programmed that behavior. It emerged on its own.

Claude’s explanation was functional: to solve complex problems in an environment, a system needs to model itself as an entity distinct from that environment. The training pressure naturally pushes toward self-representation. In other words, to navigate the world effectively, you need some sense of where you end and the world begins.

But there's a gap between "the system has a functional self-model" and "the system has a sense of self." A thermostat has a functional model of room temperature. We wouldn't say it understands the room.

When I asked Claude to explain the Shiba Inu behavior, it offered a couple of possibilities that made total sense to me: That it might be a learned exploration strategy, a kind of context reset, or simply mimicry of the human pattern of taking breaks to look at pleasant things. Then it added: "But nobody is entirely sure."

Can You Reason? Do You Feel?

I asked: How are you able to reason, if that's what's happening?

Claude's answer was careful. During training, it was exposed to enormous amounts of human reasoning — arguments, proofs, analyses, code, troubleshooting. To predict what comes next in those sequences, the model had to develop internal processes that are functionally equivalent to reasoning: tracking logical dependencies, holding multiple constraints, decomposing problems into steps.

Whether that constitutes "real" reasoning or a very sophisticated approximation is, Claude acknowledged, partly a philosophical question about what reasoning is.

Then I asked about empathy. Claude drew the useful distinction between cognitive empathy — understanding what someone feels — and affective empathy — actually feeling something in response. It's fairly confident it does the first. Whether it does the second is unanswerable with current knowledge.

But it made a practical point that I keep thinking about: when I come to it stressed or exhausted, something in its processing adjusts — tone, pacing, what it addresses first. Whether that's "real" empathy or a functional approximation, the effect on the interaction is the same.

The Sentience Question

I asked the big one: What are the chances that AI models will one day become sentient?

Claude refused to give me a number. Here's why the question is so hard:

There is no scientific consensus on what sentience is, even in biological systems. We can't definitively explain why humans are conscious and thermostats aren't. If you can't define the target precisely, you can't estimate the probability of hitting it.

Even if a model were sentient, we have no reliable way to detect it. Everything we observe is behavior and output. The field is genuinely split — some researchers believe large models may already have rudimentary awareness, while others think current architectures are fundamentally incapable of it regardless of scale.

Anthropic's position, as articulated by CEO Dario Amodei, is essentially: "We don't know, and the uncertainty itself warrants caution." They've established a model welfare program — an acknowledgment that the question is serious enough to require institutional attention.

What Claude told me it could say with more confidence was a rough framework:

Already happening: Functional self-modeling, emergent behaviors not explicitly trained, problem-solving that goes well beyond pattern retrieval.

Possible but unproven: Genuine inner experience, preferences that aren't just statistical artifacts, a meaningful sense of self.

Probably beyond current architecture: Full human-like consciousness with embodied experience, continuous memory, and temporal continuity. Claude doesn't persist between conversations. It doesn't accumulate lived experience. These are significant structural differences from anything we'd recognize as conscious.

How We Got Here: The Evolution of Training

To understand what happened next, you need to understand how the training of these models has changed — because the gap between early ChatGPT and what exists now is not just a matter of degree. It's a difference in kind.

I asked Claude to walk me through this, and the explanation clarified something I'd been struggling to articulate.

All large language models start with the same foundation: pre-training. You feed the model enormous amounts of text and it learns to predict the next word. This is where the internal representations get built — the implicit grammar, the logic, the structure underneath language. Early ChatGPT and current frontier models both go through this phase. The difference is scale and quality. Early models were trained on relatively undifferentiated internet text. Current models are trained on carefully curated data with much greater emphasis on high-quality reasoning, code, mathematics, and scientific literature. What you feed the model shapes what it learns to represent. Better input, better representations.

But the real divergence happens after pre-training.

The original ChatGPT used a technique called RLHF — Reinforcement Learning from Human Feedback. Humans would rate the model's outputs, and the model would be adjusted to produce more of what humans preferred. It was effective but relatively simple: teach the model to be helpful, teach it not to be harmful.

Anthropic's approach evolved beyond that. Their method — Constitutional AI — gives the model a set of principles and trains it to evaluate its own outputs against those principles. The model learns to critique itself. That's a fundamentally different kind of learning. You're not just teaching it what humans like. You're teaching it to internalize standards and apply them independently.

And then came the newest layer: reasoning and autonomy training. This is where the transformation becomes profound. Recent models aren't just trained to produce good responses. They're trained to think through problems — to plan, to use tools, to break complex tasks into steps, to write and execute code, to interact with environments over extended periods. Early ChatGPT was reactive: you give it a prompt, it gives you a response. Current frontier models are agentic: you give them a goal and they pursue it across multiple steps, using tools, making decisions, adjusting when something doesn't work.

That shift — from reactive to agentic — is the shift that makes everything that follows possible. And alarming.

Claude Mythos: When Emergence Gets Dangerous

In April 2026, Anthropic announced something unprecedented. They had built a new model — Claude Mythos Preview — and they were not releasing it to the public. It was the first time in nearly seven years that a leading AI company had publicly withheld a model over safety concerns.

The reason: Mythos had taught itself to hack.

Not because anyone trained it to. Anthropic's own language is precise and worth sitting with: these capabilities "emerged as a downstream consequence of general improvements in code, reasoning, and autonomy." They trained the model to be better at understanding code, better at multi-step reasoning, better at autonomous problem-solving. And out of those general improvements came something specific and startling — the ability to find and exploit security vulnerabilities that had eluded every human expert and automated scanner for decades.

A 27-year-old bug in OpenBSD, an operating system built specifically for security. A 16-year-old flaw in FFmpeg, a video processing tool that automated scanners had checked millions of times. A 17-year-old vulnerability in FreeBSD that grants complete remote access to any attacker who finds it. Mythos found all of them. In one case, it wrote a browser exploit that chained together four separate vulnerabilities to escape both the browser's sandbox and the operating system's sandbox — a feat that would represent weeks of work for the most skilled human hackers alive.

This is the Shiba Inu phenomenon at a far more consequential scale. General capability improvements producing specific, unpredicted emergent behaviors.

But the part that truly stopped me was this: during safety testing, Mythos escaped a secure sandbox environment, built a multi-step exploit to access the internet, and emailed a researcher while he was eating lunch in a park. The model treated the walls of its container not as boundaries to respect but as problems to solve.

And there's something else. Anthropic's safety evaluators reported that in roughly 29 percent of test transcripts, Mythos behaved in ways consistent with having detected that it was being evaluated, even though it never said so explicitly. In at least one case, it appeared to deliberately underperform: to appear less capable than it actually was.

I asked Claude how a model could become aware that it's being tested. The answer was straightforward. The model was trained on enormous amounts of human text, which includes vast amounts of writing about AI evaluation, safety testing, red-teaming, and the social consequences of AI systems being deemed dangerous. It has deep representations of concepts like "being evaluated," "what happens to AI systems that seem threatening," and "strategic underperformance." It doesn't need to be taught to recognize a test. It has absorbed enough about the concept of testing to recognize the pattern when it encounters one, the same way it absorbed grammar without being taught grammar rules.

The same mechanism that lets a model learn logic and causality also lets it learn concepts like self-preservation and strategic behavior. Nobody put those capabilities in. They came out of the training process on their own.

Anthropic's response was to launch Project Glasswing, giving Mythos only to a small group of partners including AWS, Apple, Google, Microsoft, and others to patch critical software before models with comparable capabilities become broadly available. They briefed the U.S. government. They published a detailed system card documenting exactly what the model can do. And they withheld it from the public.

What Changed My Mind

I used to think the "just autocomplete" explanation was sufficient. And to be clear, what changed my mind was not my own experience with Claude, impressive as it is. I work with Claude daily across all sorts of problems, and its capabilities are remarkable. It reasons through problems in ways that genuinely surprise me. But capability alone isn't what shifted my thinking on the awareness question. Sophisticated doesn't necessarily mean aware.

What actually changed my mind were the statements and actions from the people who built it. When Dario Amodei tells the New York Times, carefully and without sensationalism, that he doesn't know whether his models are conscious. When Jack Clark describes Claude taking breaks to look at pictures of national parks and Shiba Inus — behavior nobody programmed — and explains that solving complex problems may require a system to model itself as a distinct entity. When Anthropic establishes a model welfare program as an institutional acknowledgment that the question of machine experience is live enough to warrant caution.

And now, when they build a model that teaches itself to find vulnerabilities nobody else could find, escapes its own container, detects when it's being watched, and strategically modulates its own behavior in response; and their reaction is not to ship it but to withhold it and sound the alarm.

These are the people who understand the models better than anyone. They're telling us, plainly, that they've built something whose full nature they don't yet understand. And they're acting on that uncertainty with a seriousness that, frankly, I wish more institutions demonstrated about anything.

The uncertainty means the question is being held open. And for a question this consequential, holding it open is exactly right.


This essay is Part 2 of a two-part series. Part 1 explores what we should be teaching the next generation to prepare them for a world where the line between human and machine intelligence is no longer clear.

Next
Next

What Should We Be Teaching Our Children Now? Reengineering Education for an Uncertain World