The AI Evaluation Chart Crisis

This past week, Anthropic and OpenAI drew attention with the release of their latest AI models, Claude Opus 4.1 and GPT-5. While both models demonstrated advances in state-of-the-art performance, the way the two companies presented their evaluation results sparked important discussions in the machine learning community and on social media about best practices for accurate communication. In one example, a bar chart intended to show GPT-5 as better than its predecessor o3 at coding (the most common application of large language models by some metrics) drew GPT-5’s 52.8% on SWE-bench Verified taller than o3’s 69.1%, while another post pointed out the lack of error bars in Anthropic’s Claude Opus 4.1 announcement. These charts illustrate a broader issue in the AI evaluation ecosystem: a lack of balance between competitive benchmarking and statistical rigor.

The High-Stakes World of AI Evaluations

The pressure to produce compelling evaluation results has never been higher. AI companies now operate in an environment where evaluations serve multiple critical functions:

  • Evaluations as Product Marketing: In a market where technical differences between frontier models are increasingly subtle, evaluation benchmarks have become the primary means for companies to differentiate their products. A gain of a few percentage points on a flagship benchmark can translate into additional market share. Strong evaluation results often serve as a green light for model releases, with product timelines directly tied to benchmark performance.
  • Investment and Valuation: Venture capitalists, public markets, and customers increasingly rely on evaluation metrics when assessing company progress, competitive positioning, and procurement decisions, due in large part to a lack of understanding of the limitations of existing automated evaluations. Charts showing evaluation results therefore feed directly into revenue.
  • Accelerated Release Timelines: Companies are rushing to push models to market faster than ever before. According to the Financial Times, pre-deployment evaluation periods have shortened dramatically as companies race to maintain competitive advantage, especially with increased competition from strong open models released by companies such as Meta, Mistral, DeepSeek, Alibaba, MoonshotAI, and Zhipu AI. Compressed timelines put enormous pressure on evaluation teams to produce results quickly, increasing the risk of lower-quality assessments.
  • Heightened Risk Awareness: Paradoxically, while companies are moving faster, they are also more concerned about potential risks. OpenAI, Anthropic, and Google DeepMind have all crossed their internal thresholds for biological risk, triggering additional safety mitigations. At the same time, third-party evaluators are facing additional restrictions, with METR disclosing in its post on GPT-5 that, for the first time, “OpenAI’s comms and legal team required review and approval of this post” before publication. As governments implement AI regulations such as the EU AI Act, safety evaluation results become important legal defenses to justify launches of increasingly powerful AI systems.

Evaluations must simultaneously serve as marketing materials, release Key Performance Indicators (KPIs), and risk assessment tools—all while being produced (and translated into charts and figures) under intense time pressure.

Common Chart Mistakes

Even carefully planned evaluation charts can be misleading if they leave out underlying context or data. In AI research, small numerical changes may fall within the statistical margin of error, yet companies face immense pressure to show meaningful progress. This leads to predictable visualization errors:

  • No Acknowledgement of Uncertainty: Without error bars or other visual cues indicating variability, charts hide statistical uncertainty and prevent readers from assessing whether improvements are statistically significant or simply noise. Error bars work best for evaluations with repeated runs and many independent examples, while other approaches may be more appropriate for small or non-independent datasets; see the sketch after this list for one way to compute and display them.
  • Truncated Axes: Compressed scales exaggerate performance gains, making incremental improvements look like major breakthroughs, which is crucial when trying to justify a new model release.
  • Selective Reporting: Cherry-picked results that support the narrative without providing counter-examples or failed experiments, often driven by the need to present a compelling product story.
  • Missing Context: Critical details such as test set size, variance, methodology changes, or comparison conditions are omitted, making replication difficult and hiding potential confounding factors. In many cases the evaluation data is not public, so charts are impossible to reproduce. Ideally, details important to interpreting results should at least be included in captions or footnotes.
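
As a concrete illustration of the first two points above, here is a minimal sketch, in Python with matplotlib, of how a score comparison can be plotted with confidence intervals and a full, untruncated axis. The test-set size, model names, and solved counts are illustrative assumptions, not results from any real evaluation.

```python
# Minimal sketch: plot two models' pass rates with 95% Wilson confidence
# intervals on a zero-based axis. All numbers below are illustrative
# assumptions, not real benchmark results.
import math

import matplotlib.pyplot as plt


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of `passes` out of `n` examples."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half_width, center + half_width


n_examples = 500  # hypothetical test-set size
results = {"Model A": 346, "Model B": 264}  # hypothetical solved counts

names, scores, err_low, err_high = [], [], [], []
for name, passes in results.items():
    p = passes / n_examples
    lo, hi = wilson_interval(passes, n_examples)
    names.append(name)
    scores.append(100 * p)
    err_low.append(100 * (p - lo))   # distance from score down to lower bound
    err_high.append(100 * (hi - p))  # distance from score up to upper bound

fig, ax = plt.subplots()
ax.bar(names, scores, yerr=[err_low, err_high], capsize=6)
ax.set_ylim(0, 100)  # full 0-100% axis: no truncation to exaggerate gaps
ax.set_ylabel("Pass rate (%)")
ax.set_title("Benchmark scores with 95% confidence intervals")
fig.savefig("benchmark_comparison.png", dpi=150)
```

With roughly 500 examples, the 95% interval around a score near 70% spans about ±4 percentage points, wider than many of the gaps that headline comparisons rest on.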

Our Road Towards Better Evaluation Results

Misleading charts are just one of the many challenges the AI evaluation ecosystem faces. At EvalEval, we recognize that fixing AI evaluation documentation means collaboratively creating processes that can serve multiple stakeholders without compromising scientific integrity.

Some of our ongoing projects to improve AI evaluation include:

  • Evaluation Cards: A standardized documentation effort that shows evaluation sources across different categories, and how complete each evaluation is, via an easy-to-understand UX.
  • Eval Infrastructure: Sharing evaluations is key for reproducibility, clarity, and progress. We are building a unified, extensible framework for sharing evaluations across diverse tools, anchored by a universal format for sharing eval logs (a hypothetical sketch of such a log record follows this list).
  • Eval Science: A Best Practices Framework to help define what constitutes a scientifically valid evaluation and how to create one.
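
To give a flavor of what shareable eval logs could look like, below is a hypothetical sketch of a per-example log record. The field names and structure are illustrative assumptions for this post, not the actual schema of our infrastructure work.

```python
# Hypothetical sketch of a per-example eval-log record. Field names are
# illustrative assumptions, not the schema of any existing framework.
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalLogRecord:
    benchmark: str          # benchmark name
    benchmark_version: str  # dataset revision, for reproducibility
    model: str              # model identifier as reported by the provider
    example_id: str         # stable ID of the test example
    prompt: str             # exact input sent to the model
    completion: str         # raw model output
    score: float            # graded result for this example
    grader: str             # how the score was produced (exact match, unit tests, judge model, ...)
    sampling_params: dict   # temperature, max tokens, etc.
    timestamp: str          # when the example was run


record = EvalLogRecord(
    benchmark="example-benchmark",
    benchmark_version="1.0",
    model="example-model-2025-01",
    example_id="task-0042",
    prompt="...",
    completion="...",
    score=1.0,
    grader="unit-tests",
    sampling_params={"temperature": 0.0, "max_tokens": 4096},
    timestamp="2025-01-01T00:00:00Z",
)
print(json.dumps(asdict(record), indent=2))
```

Recording the exact prompt, completion, and grading method for every example is what lets others re-score results and analyze errors independently, rather than trusting a single aggregate number.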

The AI community faces a choice: continue allowing business pressures to compromise evaluation integrity, or develop new standards that serve both scientific rigor and legitimate business needs. The stakes, for science, for public trust, and ultimately for the technology itself, are too high to accept misleading presentations as the norm.

We invite the broader community to explore our work and join our mission to establish evaluation practices worthy of the transformative technology we’re building. The future of AI depends not just on better models, but on better ways of understanding and communicating what those models can actually do.