Hey there! 👋
Welcome back to SavvyMonk, your one-stop shop for AI and tech news that actually matters.
OpenAI just dropped GPT-5.5, codenamed "Spud," barely six weeks after GPT-5.4. The company calls it its smartest model yet, and it tops several benchmarks to back that up. But independent testers found a catch worth knowing about before you swap anything out.
Let's get into it.
Become An AI Expert In Just 5 Minutes
If you’re a decision maker at your company, you need to be on the bleeding edge of, well, everything. But before you go signing up for seminars, conferences, lunch ‘n learns, and all that jazz, just know there’s a far better (and simpler) way: Subscribing to The Deep View.
This daily newsletter condenses everything you need to know about the latest and greatest AI developments into a 5-minute read. Squeeze it into your morning coffee break and before you know it, you’ll be an expert too.
Subscribe right here. It’s totally free, wildly informative, and trusted by 600,000+ readers at Google, Meta, Microsoft, and beyond.
TODAY'S DEEP DIVE
GPT-5.5 Tops Coding Benchmarks, but Independent Tests Tell a Different Story
OpenAI released GPT-5.5 on April 23, calling it "a new class of intelligence." The model is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex.
A higher-tier version called GPT-5.5 Pro is available to Pro, Business, and Enterprise users only.
API access went live on April 24 at $5 per million input tokens and $30 per million output tokens, exactly double what GPT-5.4 costs.
The Benchmarks That Matter
OpenAI's own data shows GPT-5.5 leading on several important tests. On Terminal-Bench 2.0, which measures how well a model handles complex command-line coding workflows, GPT-5.5 scored 82.7% against Anthropic's Claude Opus 4.7 at 69.4%, a 13-point gap. It's the first time OpenAI's mainline model has convincingly beaten Anthropic on agentic coding.

[Chart: GPT-5.5 tops the Terminal-Bench 2.0 benchmark]
On OSWorld-Verified, which tests whether a model can operate real desktop environments on its own, GPT-5.5 hit 78.7%, narrowly edging Opus 4.7's 78.0%. And on FrontierMath Tier 4, the hardest math tier, it scored 35.4% versus Opus 4.7's 22.9%.
But the picture isn't one-sided. Opus 4.7 still leads on SWE-Bench Pro by nearly 6 points (64.3% vs. 58.6%), and that's the benchmark that comes closest to testing whether a model can actually resolve a real GitHub issue end to end.
Google's Gemini 3.1 Pro edges GPT-5.5 on BrowseComp, a web research benchmark, at 85.9% versus 84.4%. And Anthropic's gated Mythos Preview model, which isn't publicly available, still leads on 6 out of 9 directly comparable tests.
The Hallucination Problem
Artificial Analysis, an independent AI benchmarking firm, ran GPT-5.5 through its AA-Omniscience evaluation and found a notable tension in the results. The model posted the highest accuracy score the firm has ever recorded, 57%, and, at the same time, the highest hallucination rate it has ever recorded, 86%.
That means GPT-5.5 knows more than any model they've tested, but it also confidently answers questions it doesn't actually know the answer to at nearly 2.5 times the rate of Opus 4.7, whose hallucination rate sits at 36%. OpenAI's own internal benchmarks claim a 60% reduction in hallucinations compared to GPT-5.4, so there's a real gap between what OpenAI measured and what independent testers found.
For anyone working in legal, medical, or financial contexts where accuracy is non-negotiable, that's a gap worth sitting with before committing to this model in production.
Why It's Cheaper Than It Looks
The per-token API price doubled, but OpenAI says GPT-5.5 uses about 40% fewer output tokens to complete the same tasks, which puts the actual cost increase closer to 20% for most workflows rather than 100%.
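If you want to sanity-check that math yourself, here's a quick back-of-the-envelope sketch in Python. The GPT-5.4 prices are inferred from the "exactly double" line above, the 40% output reduction is OpenAI's own figure, and the token counts are invented for an output-heavy task; the more input-heavy your workload, the closer the real increase creeps toward the full 100%.

```python
# Back-of-the-envelope cost check. Prices are the published per-token
# rates; the 40% output-token reduction is OpenAI's claim; the token
# counts are invented for illustration (an output-heavy coding task).

OLD_PRICES = {"input": 2.50, "output": 15.00}  # GPT-5.4, $ per 1M tokens
NEW_PRICES = {"input": 5.00, "output": 30.00}  # GPT-5.5, exactly double

def cost(prices, input_tokens, output_tokens):
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000

old = cost(OLD_PRICES, 1_000, 10_000)        # task on GPT-5.4
new = cost(NEW_PRICES, 1_000, 10_000 * 0.6)  # 40% fewer output tokens

print(f"GPT-5.4: ${old:.4f}   GPT-5.5: ${new:.4f}")
print(f"Effective increase: {new / old - 1:.0%}")  # ~21%, not 100%
```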
OpenAI President Greg Brockman described the model as "a faster, sharper thinker for fewer tokens," and the efficiency story holds up in reports from teams that got early access.
The model also matches GPT-5.4's speed despite being more capable, something OpenAI achieved through hardware co-design with Nvidia's GB200 and GB300 NVL72 systems. In a notable detail from the release, GPT-5.5 and Codex actually helped optimize the infrastructure that serves them, writing custom load-balancing algorithms that boosted token generation speeds by over 20%.
The Bigger Picture
This release came just one week after Anthropic launched Opus 4.7, and amid ongoing fallout over Anthropic's Mythos model, which was partially breached after Anthropic restricted it over its advanced cybersecurity capabilities.
The AI model race is now running at a pace where six-week release cycles look normal. OpenAI claims 900 million weekly active users for ChatGPT and over 50 million paying subscribers, numbers they're leaning on to push back against a narrative that they've been losing ground to Anthropic in the enterprise market.
Alongside GPT-5.5, OpenAI also announced workspace agents for ChatGPT Business and Enterprise users. These are AI agents that can take action across connected workplace tools like Slack, Google Drive, Jira, and more, running on schedules and handling multi-step workflows without human input at each step. It's a separate product feature from GPT-5.5 itself, but the two announcements together paint a clear picture of where OpenAI is trying to take ChatGPT.
The Bottom Line
GPT-5.5 is a genuine step forward in coding and agentic work, and the benchmark numbers on Terminal-Bench and OSWorld are real gains. But the hallucination numbers from independent testing are a red flag that OpenAI's own benchmarks don't fully explain.
If you're building anything where being wrong carries real cost, test this model thoroughly on your specific workloads before replacing what you're running today.
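One starting point: a minimal spot-check harness using the OpenAI Python SDK. The model IDs below are assumptions based on this release (use whatever your account actually exposes), and the grading is deliberately crude; a real evaluation would use a golden set from your own domain and a proper rubric or grader model.

```python
# Minimal pre-swap spot check: ask both models questions with known
# answers and count confident misses. Requires OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

# Replace with real questions and ground-truth answers from YOUR domain.
GOLDEN_SET = [
    {"q": "Which RFC defines HTTP/1.1 message syntax?", "a": "9112"},
]

def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "If you are not sure, say 'I don't know'."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

for model in ("gpt-5.5", "gpt-5.4"):  # hypothetical model IDs
    confident_misses = 0
    for item in GOLDEN_SET:
        answer = ask(model, item["q"])
        # Crude string check; use a grader model or rubric in practice.
        if (item["a"].lower() not in answer.lower()
                and "don't know" not in answer.lower()):
            confident_misses += 1
    print(f"{model}: {confident_misses}/{len(GOLDEN_SET)} confident misses")
```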
AI PROMPT OF THE DAY
Category: Code Review
"Review the following code for potential bugs, security vulnerabilities, and performance issues. For each issue found, explain why it's a problem and provide a corrected version. Prioritize issues by severity (critical, high, medium, low). Code: [paste your code here]"
ONE LAST THING
Six weeks between releases is the new normal in frontier AI, and every drop comes with benchmark tables that make the new model look like a clear winner. The more useful habit is to find the independent tests and look for where the numbers disagree. In GPT-5.5's case, the gap between OpenAI's hallucination numbers and what Artificial Analysis found is exactly that kind of signal worth paying attention to.
Hit reply, I read every response.
See you in the next one.
— Vivek
P.S. Know a developer or tech leader trying to keep up with the AI model race? Forward this their way. They can subscribe at https://savvymonk.beehiiv.com/



