Infrastructure @EvalEval

One Schema for Every Eval Ever.

A unified, open data format and public dataset for AI evaluation results. We are collecting all evaluation results in a standardized schema to enable rigorous research and broader impact.

Developed by the EvalEval Coalition, with feedback from researchers at collaborating institutions.

Why this Schema?

Addressing the fragmentation in AI evaluation to enable trust and comparability.

Unifying the Ecosystem

Evaluation results are currently siloed by framework. This schema creates a common interchange format, allowing results from HELM, EleutherAI's lm-eval-harness, Inspect, and custom scripts to coexist and be compared directly without complex mapping.

Trust through Provenance

A score without configuration is just noise. We go beyond the metric to capture the full experimental context—prompt templates, inference parameters, and system states—making every result traceable, transparent, and reproducible.

Actionable Meta-Science

Liberating evaluation results from static PDFs and closed leaderboards. We transform scattered metrics into a structured, queryable global dataset, powering the next generation of meta-analysis and automated leaderboard construction.

The Schema

A granular, line-by-line breakdown of the standardized format.

59ee0934-f60d-4d4b-b986-844fc51e89a3.json
{
  "schema_version": "0.2.0",
  "evaluation_id": "helm_capabilities/moonshotai.../17642...",
  "retrieved_timestamp": "1764204739.50717",
  
  "source_metadata": {
    "source_name": "HELM",
    "source_type": "evaluation_run",
    "source_organization_name": "crfm",
    "source_organization_url": "https://crfm.stanford.edu",
    "evaluator_relationship": "third_party"
  },

  "model_info": {
    "name": "Kimi K2 Instruct",
    "id": "moonshotai/kimi-k2-instruct",
    "developer": "moonshotai",
    "inference_platform": "Together AI",
    "inference_engine": {
        "name": "vLLM",
        "version": "0.6.3"
    }
  },

  "evaluation_results": [
    {
      "evaluation_name": "MMLU-Pro - COT correct",
      "evaluation_timestamp": "1764204739.50717",
      "source_data": {
        "source_type": "hf_dataset",
        "dataset_name": "MMLU-Pro",
        "hf_repo": "TIGER-Lab/MMLU-Pro",
        "hf_split": "test",
        "samples_number": 12032
      },
      "metric_config": {
        "evaluation_description": "Fraction of correct answers after chain of thought",
        "lower_is_better": false,
        "score_type": "continuous",
        "min_score": 0.0,
        "max_score": 1.0
      },
      "score_details": {
        "score": 0.819,
        "details": {
          "tab": "Accuracy"
        }
      },
      "generation_config": {
        "generation_args": {
          "temperature": 0.0,
          "top_p": 1.0,
          "top_k": -1,
          "max_tokens": 2048,
          "reasoning": true
        },
        "additional_details": {
            "description": "Chain of thought prompting used."
        }
      }
    }
  ],

  "detailed_evaluation_results": {
    "file_path": "detailed_results/e70acf51....jsonl",
    "format": "jsonl",
    "checksum": "a1b2c3d4..."
  }
}

The example record above breaks down into six blocks: Metadata & ID, Source Provenance, Model Specification, Flexible Metrics, Generation Config, and Detailed Results.
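
To get a feel for the format, the sketch below reads the example record with the standard library alone and prints each headline score (the filename is the record's UUID, as shown above).

import json

# Load the aggregate record shown above (filename is its UUID).
with open("59ee0934-f60d-4d4b-b986-844fc51e89a3.json") as f:
    record = json.load(f)

# Each record carries model identity, provenance, and one or more results.
model_id = record["model_info"]["id"]
for result in record["evaluation_results"]:
    print(model_id, result["evaluation_name"], result["score_details"]["score"])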

Design Decisions

Built for scale, reproducibility, and scientific rigor.

Conflict-Free Identity

We assign every evaluation a unique UUID. This prevents filename collisions and allows multiple runs of the same model—from different dates or configurations—to coexist safely.
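
A minimal sketch of that naming scheme, with illustrative variable names:

import uuid

# One fresh UUID per evaluation run; the same ID names both output files,
# so repeated runs of the same model never collide.
run_id = uuid.uuid4()
aggregate_path = f"{run_id}.json"   # aggregate record
detailed_path = f"{run_id}.jsonl"   # per-sample details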

Temporal Versioning

Models change silently. We mandate retrieved_timestamp to capture the exact moment of inference, enabling precise studies on API drift and model versioning.

Full Stack Provenance

Performance depends on the runner. We explicitly separate platform (provider) from engine (inference system), isolating hardware and software variables in your analysis.

Unified Metrics

From simple accuracy to complex LLM-as-a-Judge scores, our schema standardizes all outputs. Compare results across different evaluation libraries without writing custom parsers.
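
For illustration, the hypothetical judge metric below (not drawn from the dataset) fits the same metric_config shape as plain accuracy:

# Both use the metric_config keys from the example record above;
# the judge metric's values are hypothetical illustrations.
accuracy_metric = {
    "evaluation_description": "Fraction of exactly correct answers",
    "lower_is_better": False,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0,
}
judge_metric = {
    "evaluation_description": "Mean helpfulness rating from an LLM judge",
    "lower_is_better": False,
    "score_type": "continuous",
    "min_score": 1.0,
    "max_score": 10.0,
}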

Analysis-First Design

Built for data science. Flat, structured JSON files mean you can ingest millions of results into Pandas or SQL in seconds, slicing by architecture, date, or task immediately.
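
As a sketch of that workflow, assuming the repository layout described in the Contributor Guide below, a few lines of pandas flatten the aggregate records into an analysis-ready table:

import glob
import json

import pandas as pd

# Read every aggregate record under data/{benchmark}/{developer}/{model}/.
records = []
for path in glob.glob("data/**/*.json", recursive=True):
    with open(path) as f:
        records.append(json.load(f))

# One row per (model, evaluation) pair; nested fields become dotted columns.
df = pd.json_normalize(
    records,
    record_path="evaluation_results",
    meta=[["model_info", "id"], ["model_info", "developer"], ["source_metadata", "source_name"]],
)
print(df[["model_info.id", "evaluation_name", "score_details.score"]].head())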

Reproducibility Standard

Science requires receipts. By linking every score to its exact prompt template and generation parameters, we ensure that every result in the dataset is fully reproducible.

The Public Dataset

We are collecting evaluation results, expressed in our schema, in a public dataset. This repository serves as a standardized metadata store for results from various leaderboards, research papers, and local evaluations. If you are an eval provider or leaderboard maintainer, we are looking for your generous data contributions via pull requests!

Contributor Guide

How we organize and validate the data.

Repository Structure

data/
└── {benchmark_name}/
    └── {developer_name}/
        └── {model_name}/
            ├── {uuid}.json
            └── {uuid}.jsonl

Data is organized by benchmark, developer, and model. Each evaluation consists of an aggregate record (JSON) and a detailed results file (JSONL); both share the same UUID so a run's artifacts stay together.

Cold Storage Option: For massive datasets, the detailed JSONL file can be stored externally (e.g., S3, Hugging Face Dataset) and referenced via URL in the aggregate JSON. This keeps the repository lightweight while preserving data access.
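
Concretely, with illustrative benchmark and model names, a run's files resolve to paths like these; the cold-storage variant points detailed_evaluation_results at an external URL instead of a local JSONL:

from pathlib import Path

# Illustrative names; the UUID matches the example record above.
benchmark, developer, model = "mmlu_pro", "moonshotai", "kimi-k2-instruct"
run_id = "59ee0934-f60d-4d4b-b986-844fc51e89a3"

aggregate_path = Path("data") / benchmark / developer / model / f"{run_id}.json"
detailed_path = aggregate_path.with_suffix(".jsonl")

# Cold-storage variant: reference an externally hosted JSONL (URL is hypothetical).
detailed_evaluation_results = {
    "file_path": f"https://example.org/detailed_results/{run_id}.jsonl",
    "format": "jsonl",
    "checksum": "a1b2c3d4...",
}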

Data Validation & Converters

We provide tools to easily adapt your existing workflows.

🚀 Automatic Converters

We have ready-made converters for popular frameworks: Inspect AI, HELM, and lm-eval-harness.

# Run validation locally
uv run pre-commit run --all-files

How to add data for a new eval?

  1. Add a new folder under /data with the name of the benchmark.
  2. Create a 2-tier folder structure: developer_name/model_name.
  3. Add your results: one aggregate JSON file and one detailed JSONL file. (For large files, host the JSONL externally and provide the URL).
  4. Run the validation script to check against the schema (a minimal local sketch follows below).
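
One way to sanity-check a record locally, using the jsonschema package (the schema filename and record path are assumptions for illustration; the pre-commit hooks remain the authoritative check):

import json

from jsonschema import ValidationError, validate

# Assumed paths for illustration only; adjust to the repository's actual layout.
with open("schema/eval_result_schema.json") as f:
    schema = json.load(f)

record_path = (
    "data/mmlu_pro/moonshotai/kimi-k2-instruct/"
    "59ee0934-f60d-4d4b-b986-844fc51e89a3.json"
)
with open(record_path) as f:
    record = json.load(f)

try:
    validate(instance=record, schema=schema)
    print("Record conforms to the schema.")
except ValidationError as err:
    print(f"Validation failed: {err.message}")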

SHARED TASK · ACL 2026 WORKSHOP

Join the Evaluation Revolution

We are launching a Shared Task for practitioners to contribute public and proprietary eval data. Participate for co-authorship and join us at the ACL 2026 Workshop in San Diego.