Infrastructure @EvalEval

One schema for every eval ever.

A unified, open data format and public dataset for AI evaluation results. We are collecting all evaluation results in a standardized schema to enable rigorous research and broader impact.

Why this Schema?

Addressing the fragmentation in AI evaluation to enable trust and comparability.

1. The Need for Third-Party Evals

As highlighted in our recent paper, independent third-party evaluations are crucial. Self-reported metrics often lack the rigor and transparency needed for true accountability.

2. The Standardization Gap

Currently, third-party evaluation reports are scattered across PDFs, blog posts, and custom tables. They are not standardized, making it nearly impossible to compare results across different evaluators or reproduce findings.

3. A Unified Opportunity

This schema is an opportunity for evaluators: publish your results in a standardized format, grouped together with other evaluations. This makes your results easier for the community to digest and gives your evaluation broader reach and utility.

The Schema

A granular, line-by-line breakdown of the standardized format.

59ee0934-f60d-4d4b-b986-844fc51e89a3.json
{
  "schema_version": "0.1",
  "evaluation_id": "helm_capabilities/moonshotai.../17642...",
  "retrieved_timestamp": "1764204739.50717",
  
  "source_data": {
    "dataset_name": "MMLU-Pro",
    "hf_repo": "TIGER-Lab/MMLU-Pro",
    "hf_split": "test",
    "samples_number": 12032
  },

  "source_metadata": {
    "source_name": "HELM",
    "source_type": "evaluation_run",
    "source_organization_name": "crfm",
    "source_organization_url": "https://crfm.stanford.edu",
    "evaluator_relationship": "third_party"
  },

  "model_info": {
    "name": "Kimi K2 Instruct",
    "id": "moonshotai/kimi-k2-instruct",
    "developer": "moonshotai",
    "inference_platform": "HuggingFace",
    "inference_engine": "vLLM"
  },

  "evaluation_results": [
    {
      "evaluation_name": "MMLU-Pro - COT correct",
      "evaluation_timestamp": "1764204739.50717",
      "metric_config": {
        "evaluation_description": "Fraction of correct answers after chain of thought",
        "lower_is_better": false,
        "score_type": "continuous",
        "min_score": 0.0,
        "max_score": 1.0
      },
      "score_details": {
        "score": 0.819,
        "details": {
          "tab": "Accuracy"
        }
      },
      "detailed_evaluation_results_url": "https://huggingface.co/datasets/...",
      "generation_config": {
        "generation_args": {
          "temperature": 0.0,
          "top_p": 1.0,
          "top_k": -1,
          "max_tokens": 2048,
          "reasoning": true
        },
        "additional_details": "Chain of thought prompting used."
      }
    }
  ],

  "detailed_evaluation_results_per_samples": [
    {
      "sample_id": "test_1042",
      "input": "Question: Which of the following is a scalar quantity?",
      "ground_truth": "C",
      "response": "Scalar quantities have magnitude only. Speed is a scalar. The answer is C.",
      "choices": ["Velocity", "Force", "Speed", "Acceleration"]
    },
    {
      "sample_id": "test_1043",
      "input": "Question: The Treaty of Versailles was signed in which year?",
      "ground_truth": "C",
      "response": "The Treaty of Versailles was signed in 1919. The answer is C.",
      "choices": ["1914", "1918", "1919", "1939"]
    }
  ]
}
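Because every result file follows the same schema, standard JSON tooling is enough to read it. As a minimal sketch (the inline record below is a trimmed-down version of the example above, not a complete file):

```python
import json

# Minimal record following the schema; values are taken from the example above
record = json.loads("""
{
  "model_info": {"id": "moonshotai/kimi-k2-instruct"},
  "evaluation_results": [
    {
      "evaluation_name": "MMLU-Pro - COT correct",
      "score_details": {"score": 0.819}
    }
  ]
}
""")

# Collect metric name -> score pairs for downstream analysis
scores = {r["evaluation_name"]: r["score_details"]["score"]
          for r in record["evaluation_results"]}
print(scores)  # → {'MMLU-Pro - COT correct': 0.819}
```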

The schema covers six areas: Metadata & ID, Source Provenance, Model Specification, Flexible Metrics, Generation Config, and Sample-Level Data.

Design Decisions

Built for scale, reproducibility, and scientific rigor.

UUID Naming Convention

Each JSON file is named with a UUID (e.g., e70acf51-....json). This ensures multiple evaluations of the same model can exist without conflicts, and different timestamps are stored as separate files.
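The convention can be sketched with a hypothetical helper (the function name and directory layout here are illustrative, mirroring the repository structure described in the Contributor Guide):

```python
import uuid
from pathlib import Path

# Hypothetical helper: every run gets a fresh UUID filename, so repeated
# evaluations of the same model never collide on disk.
def new_result_path(eval_name: str, developer: str, model: str) -> Path:
    return Path("data") / eval_name / developer / model / f"{uuid.uuid4()}.json"

path = new_result_path("mmlu_pro", "moonshotai", "kimi-k2-instruct")
# e.g. data/mmlu_pro/moonshotai/kimi-k2-instruct/59ee0934-....json
```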

Timestamped Evolution

We track retrieved_timestamp to monitor model performance evolution. A model may have multiple result files representing different iterations or runs.

Platform vs. Engine

We explicitly distinguish between inference_platform (remote APIs like OpenAI, Anthropic) and inference_engine (local runners like vLLM, Ollama) to capture the exact evaluation environment.

Universal Metrics

The schema accommodates both numeric and level-based metrics. Level-based metrics (Low, Medium, High) are mapped to integers for consistent analysis.
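A sketch of such a mapping (the concrete levels and integer values below are assumptions for illustration, not part of the published spec):

```python
# Illustrative level-to-integer mapping; the actual levels and ordering
# used by a given evaluation may differ.
LEVELS = {"Low": 0, "Medium": 1, "High": 2}

def normalize_score(score):
    """Return a number whether the raw score is numeric or level-based."""
    if isinstance(score, str):
        return LEVELS[score]
    return float(score)

print(normalize_score("High"), normalize_score(0.819))  # → 2 0.819
```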

Search & Retrieve

The structured JSON format allows for easy SQL querying. Download the dataset and run your own analysis across timestamps and architectures.
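As a stdlib-only sketch of such an analysis pass (assuming the data/ layout described in the Contributor Guide), one can flatten the dataset into rows before handing it to a SQL engine:

```python
import glob
import json

# Scan every result file and flatten it into (model, metric, score, time) rows.
# Factsheets under data/ have no "evaluation_results" key and are skipped.
rows = []
for path in glob.glob("data/**/*.json", recursive=True):
    with open(path) as f:
        record = json.load(f)
    for result in record.get("evaluation_results", []):
        rows.append((
            record["model_info"]["id"],
            result["evaluation_name"],
            result["score_details"]["score"],
            record["retrieved_timestamp"],
        ))

# rows is now ready to load into SQLite, DuckDB, or pandas for SQL queries
```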

Rigorous Science

Backed by the EvalEval Coalition, we are developing scientifically grounded research outputs and robust deployment infrastructure for broader impact.

The Public Dataset

We are collecting all evaluation results, expressed in our schema, in a public dataset. This repository serves as a standardized metadata store for results from leaderboards, research papers, and local evaluations. If you are an eval provider or leaderboard maintainer, we welcome your generous data contributions via pull requests!


Explore the dataset via this interactive space

New Tool

Eval Factsheets

Standardized documentation for AI model evaluations. While Every Eval Ever stores the results, Eval Factsheets captures the methodology, context, and alignment details of the evaluation itself.

Standardized Metadata

Generate a factsheet.json that describes your evaluation's purpose, data sources, and metrics.

Reproducibility

Ensure others can understand and replicate your evaluation setup.

factsheet.json
{
  "title": "MMLU-Pro",
  "description": "Massive Multitask...",
  "creator": "TIGER Lab",
  "evaluation_type": "multiple_choice",
  "metrics": ["accuracy"],
  "links": {
    "paper": "https://arxiv.org/..."
  }
}

Contributor Guide

How we organize and validate the data.

Repository Structure

data/
├── {eval_name}/
│   ├── factsheet.json
│   └── {developer_name}/
│       └── {model_name}/
│           └── {uuid}.json
Data is split by individual model. Each file is named with a UUID (e.g., e70acf51-....json) to ensure multiple evaluations of the same model can exist without conflicts.

Data Validation

We use a strict JSON schema to ensure data quality. Our CI/CD pipeline validates every pull request.

# Run validation locally
uv run pre-commit run --all-files
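The authoritative check is the JSON schema run by CI; as a stdlib-only sketch of what a required-keys check looks like (the key list below is an illustrative subset, not the full schema):

```python
# Illustrative subset of required top-level keys; the real schema checks
# types and nested structure as well.
REQUIRED_TOP_LEVEL = [
    "schema_version", "evaluation_id", "source_metadata",
    "model_info", "evaluation_results",
]

def missing_keys(record: dict) -> list:
    """Return required top-level keys absent from a record (empty = passes)."""
    return [key for key in REQUIRED_TOP_LEVEL if key not in record]

print(missing_keys({"schema_version": "0.1", "evaluation_id": "demo"}))
# → ['source_metadata', 'model_info', 'evaluation_results']
```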

How to add data for a new eval

  1. Add a new folder under /data with a codename for your eval.
  2. Generate a factsheet.json using the Eval Factsheets tool and place it in the eval folder.
  3. Create a 2-tier folder structure: developer_name/model_name.
  4. Add a JSON file with results for each model named {uuid}.json.
  5. Run the validation script to check against the schema.
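The steps above can be sketched as follows ("my_eval", "acme", and "acme-model-1" are placeholder names, and the file contents are minimal stand-ins):

```python
import json
import uuid
from pathlib import Path

# Step 1: new eval folder under data/ (codename is a placeholder)
eval_dir = Path("data") / "my_eval"
# Step 3: 2-tier developer_name/model_name structure
model_dir = eval_dir / "acme" / "acme-model-1"
model_dir.mkdir(parents=True, exist_ok=True)

# Step 2: factsheet.json at the eval root (minimal stand-in content)
(eval_dir / "factsheet.json").write_text(json.dumps({"title": "My Eval"}))

# Step 4: one UUID-named result file per model run
result_path = model_dir / f"{uuid.uuid4()}.json"
result_path.write_text(json.dumps({"schema_version": "0.1"}))
```

Step 5 is then the validation command shown above (`uv run pre-commit run --all-files`).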

Open Standard & Community

We are launching every_eval_ever as an open standard and actively soliciting feedback and contributions. Join the EvalEval coalition in defining the future of AI evaluation.