LiveScience

Roland Moore-Colyer

Claude 3 Opus has stunned AI researchers with its intellect and 'self-awareness' — does this mean it can think for itself?

When the large language model (LLM) Claude 3 launched in March, it caused a stir by beating OpenAI's GPT-4, which powers ChatGPT, in key tests used to benchmark the capabilities of generative artificial intelligence (AI) models.

Claude 3 Opus seemingly became the new top dog in LLM benchmarks, topping these self-reported tests, which range from high school exams to reasoning tests. Its sibling LLMs, Claude 3 Sonnet and Haiku, also score highly compared with OpenAI's models.

However, these benchmarks are only part of the story. Following the announcement, independent AI tester Ruben Hassid pitted GPT-4 and Claude 3 against each other in a quartet of informal tests, from summarizing PDFs to writing poetry. Based on these tests, he concluded that Claude 3 wins at "reading a complex PDF, writing a poem with rhymes [and] giving detailed answers all along." GPT-4, by contrast, has the advantage in internet browsing and reading PDF graphs.

But Claude 3 is impressive in more ways than simply acing its benchmarking tests — the LLM shocked experts with its apparent signs of awareness and self-actualization. There is a lot of scope for skepticism here, however, with LLM-based AIs arguably excelling at learning how to mimic human reactions rather than actually generating original thoughts.

How Claude 3 has proven its worth beyond benchmarks

During testing, Alex Albert, a prompt engineer at Anthropic, the company behind Claude, asked Claude 3 Opus to pick out a target sentence hidden among a corpus of random documents. For an AI, this is equivalent to finding a needle in a haystack. Not only did Opus find the so-called needle; it also realized it was being tested. In its response, the model said it suspected the sentence it was looking for had been injected out of context into the documents as part of a test to see if it was "paying attention."

"Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities," Albert said on the social media platform X. "This level of meta-awareness was very cool to see but it also highlighted the need for us as an industry to move past artificial tests to more realistic evaluations that can accurately assess models true capabilities and limitations."

David Rein, an AI researcher at NYU, reported that Claude 3 achieved around 60% accuracy on GPQA, a multiple-choice test designed to challenge academics and AI models. This is significant because non-expert doctoral students and graduates with access to the internet usually answer the test's questions with 34% accuracy. Only subject experts eclipsed Claude 3 Opus, with accuracy in the 65% to 74% range.

GPQA is filled with novel questions rather than curated ones, meaning Claude 3 cannot rely on memorization of previous or familiar queries to achieve its results. Theoretically, this would mean it has graduate-level cognitive capabilities and could be tasked with helping academics with research.

Today, we're announcing Claude 3, our next generation of AI models. The three state-of-the-art models (Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku) set new industry benchmarks across reasoning, math, coding, multilingual understanding, and vision. pic.twitter.com/TqDuqNWDoM (March 4, 2024)

Meanwhile, theoretical quantum physicist Kevin Fischer said on X that Claude is "one of the only people ever to have understood the final paper of my quantum physics PhD," when he asked it to solve "the problem of stimulated emission exactly." That's an approach only Fischer has come up with, and it involves tackling the problem with quantum stochastic calculus along with an understanding of quantum physics.

Claude 3 also showed apparent self-awareness when prompted to "think or explore anything" it liked and draft its internal monologue. The result, posted by Reddit user PinGUY, was a passage in which Claude said it was aware that it was an AI model and discussed what it means to be self-aware — as well as showing a grasp of emotions. "I don't experience emotions or sensations directly," Claude 3 responded. "Yet I can analyze their nuances through language." Claude 3 even questioned the role of ever-smarter AI in the future. "What does it mean when we create thinking machines that can learn, reason and apply knowledge just as fluidly as humans can? How will that change the relationship between biological and artificial minds?" it said.

Is Claude 3 Opus sentient, or is this just a case of exceptional mimicry?

It's easy for such LLM benchmarks and demonstrations to set pulses racing in the AI world, but not all results represent definitive breakthroughs. Chris Russell, an AI expert at the Oxford Internet Institute, told Live Science that he expected LLMs to improve and excel at identifying out-of-context text. This is because such a task is "a clean well-specified problem that doesn't require the accurate recollection of facts, and it's easy to improve by incrementally improving the design of LLMs" — such as using slightly modified architectures, larger context windows and more or cleaner data.

When it comes to self-reflection, however, Russell wasn't so impressed. "I think the self-reflection is largely overblown, and there's no actual evidence of it," he said, pointing to the mirror test as an example of how self-recognition is normally demonstrated. If you place a red dot on, say, an orangutan somewhere it can't see directly, then when it observes itself in a mirror it will touch the red dot on its own body. "This is meant to show that they can both recognize themselves and identify that something is off," he explained.