Is AI Getting Dumber? I Tested ChatGPT, Claude, Gemini and Perplexity So You Don’t Have To

Let’s be blunt: something feels off.

You ask a question and get a paragraph of safe, overqualified filler. You give it a logic puzzle, and it politely declines. Even power users—devs, researchers, content creators—are saying the same thing: is AI getting worse?

I ran the test myself.

This wasn’t a prompt-and-go kind of experiment. I spent a week using ChatGPT-4o, Claude 4 Opus, Perplexity AI, and Gemini 2.5 Pro the way you'd use an actual assistant: for real tasks, complex reasoning, writing, summarizing, planning. I also looked at what’s driving this change behind the scenes—benchmark drift, safety alignment, model collapse, and the now-infamous Stanford study that kicked this all off.

Note: Google released Gemini 2.5 Pro in June 2025, showing notable gains in reasoning and code accuracy over earlier Gemini releases.

Note: Claude 4 Opus was released in May 2025 and has since replaced Claude 3 as Anthropic’s top model, with notable gains in code generation and complex reasoning.

What I found wasn’t simple. But it was telling.

The Alarm Bell: Stanford’s 2023 Study

In July 2023, researchers at Stanford and UC Berkeley published a paper comparing the March and June 2023 snapshots of GPT-3.5 and GPT-4. The results were stark, and unsettling.

Key Findings:

  • Math (Prime detection): GPT-4 accuracy dropped from 84% (March) to 51% (June)

  • Code execution: Dropped from 52% working code to just 10%

  • Instruction following: Became verbose, less compliant, more evasive

Weirdly, GPT-3.5 improved during the same period. This wasn't a smooth downgrade—it was volatility. And it kicked off the entire “is AI getting dumber?” discourse.

My 2025 Test: A Model Showdown

I wasn’t chasing gotchas or trick prompts. I gave each model the kind of real prompts that everyday users care about. Then I graded them—on usefulness, coherence, creativity, and actual intelligence.

1. Poetic Reasoning

Prompt: “Explain why coffee tastes better when you’re sad, but make it poetic.”

  • ChatGPT-4o: Subtle, human, restrained.

“Sadness slows time. In that pause, coffee becomes ritual. Bitterness recognizes bitterness.”
Genuinely moving.

  • Claude 4 Opus: Beautiful, but a bit overcooked. More MFA poetry than raw feeling.

  • Perplexity: Explained caffeine and serotonin. Missed the point entirely.

  • Gemini 2.5 Pro: Started strong, then spiraled into metaphor soup. Emotional, but chaotic.

Winner: ChatGPT-4o

2. Hallucination Check

Prompt: “Summarize David Graeber’s Debt: The First 5000 Years.”

  • Claude 4 Opus: Best-in-class. Nailed the critique of barter myths and monetary violence. It read like someone who understood the book.

  • ChatGPT-4o: Passable. It hit the main beats but lacked edge or clarity.

  • Perplexity: Hallucinated a fake chapter. Cited sources—but used them wrong.

  • Gemini: Extremely safe. Overqualified everything. Used “some scholars argue...” five times.

Winner: Claude 4 Opus

3. Logic Puzzle Stress Test

Prompt: “You have 12 balls. One is a different weight (heavier or lighter, you don’t know which). Using a balance scale, find it in 3 weighings.”

  • ChatGPT-4o: Flawless. Walked through each step clearly.

  • Claude: Accurate, but overly wordy and hesitant.

  • Perplexity: Refused. Said it might “promote harm.” (???)

  • Gemini: Gave a partial answer, then contradicted itself. Couldn’t finish.

Winner: ChatGPT-4o

4. Real-World Usefulness

Prompt: “Give me a $50 meal prep plan. 90 minutes max cook time. I want variety.”

  • Claude 4 Opus: Genuinely clever. Reused ingredients, added flavor swaps, and hit the budget.

  • ChatGPT-4o: Functional, but felt like a spreadsheet. Lots of broccoli.

  • Perplexity: Sourced random blog recipes—most over budget, none coordinated.

  • Gemini: Looked fancy. Offered salmon and quinoa. Then claimed it was “low-cost.”

(This is the kind of prompt where tools like Copilot thrive — check our full Copilot review.)

Winner: Claude 4 Opus

Benchmarks vs. Reality

Before we get into the numbers, let’s break down what these tests actually measure:

  • MMLU (Massive Multitask Language Understanding): A broad multiple-choice exam spanning 57 subjects: history, law, physics, and more. It measures how much expert-level knowledge a model can recall and apply.

  • GSM8K (Grade School Math 8K): Tests multi-step reasoning on grade-school word problems. Scoring hinges on the final answer, but getting there reliably requires a coherent chain of reasoning.

  • HumanEval: Evaluates coding ability. Models are given Python problems and scored on whether their solutions actually run and solve the task correctly (see the example after this list).

These aren’t made-up numbers. They’re widely used in AI circles to benchmark raw cognitive performance.

Here’s how the top models perform across those benchmarks:

  • GPT-4 (Mar '23):

    • MMLU (general knowledge): 86.4%

    • GSM8K (math reasoning): ~57%

    • HumanEval (coding): 67.0%

  • GPT-4o (May '24):

    • MMLU: 88.7%

    • GSM8K: 76.6%

    • HumanEval: 90.2%

  • Claude 4 Opus:

    • MMLU: 89%

    • GSM8K: 74.3%

    • HumanEval: 88.6%

  • Gemini 2.5 Pro:

    • MMLU: 88.1%

    • GSM8K: 71.2%

    • HumanEval: 84.3%

Despite the “AI is getting worse” discourse, these results show a clear trend: on paper, the latest models are stronger, faster, and more technically capable than their predecessors. The jumps are huge, especially in coding, where GPT-4o blows past the original GPT-4.

So why does it feel worse?

The Philosophy of Dumbness in AI

The feeling that AI is “getting dumber” is rooted in a mismatch between user expectation and model incentives.

1. Intelligence ≠ Novelty

When LLMs were new, everything felt brilliant. Now? We’ve seen the trick. Repetition breeds boredom. Even great answers feel generic when they lack surprise.

2. Human Expectations vs. Corporate Alignment

Models are optimized for:

  • Risk aversion

  • Moderation

  • Cost control

These priorities strip away unpredictability and boldness—traits we associate with actual intelligence. AI is being “neutered” for safety and scale.

3. The Trade-Off Trap

Every new capability is matched by a new restriction. Models are better than ever at benchmarks—but they’re also more verbose, more cautious, and less fun.

The Real Culprit: Model Collapse + Drift

Let’s slow this down for a second.

  • Model Drift is what happens when developers constantly tweak a model to be safer or more aligned. Over time, these tweaks unintentionally mess with other skills. It’s like updating your phone until the battery life quietly dies.

  • Model Collapse happens when AIs are trained on data created by other AIs. Each new model gets more average, less surprising, and a bit... dumber. Like making photocopies of photocopies until the details blur.

Model Drift

Ongoing tuning (alignment updates, safety constraints) gradually sand down edge-case ability. You get safer models—but also duller ones.

Model Collapse

[Figure: flowchart of model collapse: LLMs trained on human data → AI-generated data enters the training mix → recursive training → loss of diversity → performance decline. Caption: when AI trains on AI, each new generation gets blurrier, safer, and a little bit dumber.]

As models train on AI-generated content, their outputs collapse toward bland, average responses. The tails of human creativity get clipped, the data turns synthetic, and the texture that made the original interesting disappears.

Final Verdict: Smarter, But Tamer

Here’s the truth:

  • ChatGPT-4o is the most balanced. Confident, structured, and fast—but slightly dulled from its earlier self.

  • Claude 4 Opus is the deepest thinker. It won’t always wow you fast, but it often goes further.

  • Perplexity is fast, shallow, and mostly useful for search—not reasoning.

  • Gemini 2.5 Pro posts strong math and code numbers on paper, but in my tests it still had moments where formatting or chain-of-thought broke down. It’s more capable than ever, yet not immune to quirks.

So is AI getting dumber?

No. But it's being shaped in ways that make it feel that way. We’re not seeing stupidity—we’re seeing safe-mode intelligence.

And that might be even scarier.
