BS Audit #1 — "GPT-4.1 Crushes GPT-4o (With 1M-Token Context!)"
The Claim
"GPT-4.1 is better than GPT-4o in just about every dimension, with a gigantic one-million-token context window and big coding gains." - The Verge
Reality Check: Hype vs. Hallucination
Oh look, OpenAI dropped another model with a confusing name and tech journalists are already polishing the throne. Let's decode what's actually happening here:
What OpenAI Claims (Their Blog):
"Across-the-board improvements" over GPT-4o
1-million-token context window (up from 128K in GPT-4o)
SWE-bench Verified: 54.6% (vs. GPT-4o's 33.2%)
Better at following complex instructions
Better at handling long contexts
All this at 26% lower cost than GPT-4o
Sounds incredible, right? If true, every AI leader should be yelling at their engineers to deploy it yesterday.
What's Actually Verifiable:
Missing from Public Leaderboards: Despite launching three weeks ago, GPT-4.1 is conspicuously absent from LMSYS Chatbot Arena—the gold standard for comparative LLM rankings. Either OpenAI doesn't want it compared head-to-head, or they're not confident about the results.
No System Card: For the first time in OpenAI's history, they shipped a major model update with ZERO safety documentation. When TechCrunch asked why, OpenAI's spokesperson Shaokyi Amdo responded with the corporate equivalent of ¯\_(ツ)_/¯, saying "GPT-4.1 is not a frontier model, so there won't be a separate system card released for it."
Hold up. It's their best model ever but also... not a frontier model? Schrödinger's AI much?
Cherry-picked Benchmarks: The only published scores come from OpenAI themselves. Notice what's missing? The standard benchmarks everyone else reports: MMLU, BIG-Bench, Arena-Hard, MT-Bench. You know, the stuff that would make direct comparison possible.
The Million Token Mirage
That 1M token context window? Impressive! But before you start feeding it entire codebases, here's what you need to know:
Default Setting Is Still 32K: Unless you explicitly request the full context, you're getting a 32K window—a small fraction of that headline million.
Latency Tax: OpenAI isn't broadcasting performance metrics at full context. Ask anyone who's tested—processing gets significantly slower as context grows.
The Kitchen Sink Problem: Just because a model can ingest 1M tokens doesn't mean it should. Context bloat leads to degraded outputs, with information getting lost or muddled.
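Before you shovel a repo into any long-context model, it's worth sanity-checking the token budget yourself. Here's a minimal sketch using the common ~4-characters-per-token rule of thumb (a rough heuristic, not an exact tokenizer—the window and reply-budget numbers are illustrative assumptions, and for real counts you'd use a proper tokenizer like tiktoken):

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token, rounded up.
    A rule-of-thumb heuristic, not an exact tokenizer count."""
    return -(-len(text) // 4)  # ceiling division

def fits_in_window(text: str, window: int = 32_000,
                   reply_budget: int = 4_000) -> bool:
    """Check whether a prompt plus room for the model's reply fits
    inside a given context window (32K default, per the claim above)."""
    return estimate_tokens(text) + reply_budget <= window

# Example: a "small" codebase of repeated functions.
prompt = "def add(a, b):\n    return a + b\n" * 2000
print(estimate_tokens(prompt), fits_in_window(prompt))
```

The point isn't precision—it's that "can ingest 1M tokens" and "should receive 1M tokens" are different questions, and a five-line budget check answers the second one before you pay for the first.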
🔒 For subscribers: The hidden cost dynamics of GPT-4.1 & what your CTO won't tell you
What follows is an insider analysis of GPT-4.1's true economics that could save your organization thousands in deployment costs while giving you leverage in vendor negotiations. Subscribe to continue reading...