A unified, open data format and public dataset for AI evaluation results. We are collecting all evaluation results in a standardized schema to enable rigorous research and broader impact.
Developed by the EvalEval Coalition with feedback from outside researchers.
Addressing the fragmentation in AI evaluation to enable trust and comparability.
Evaluation results are currently siloed by framework. This schema creates a common interchange format, allowing results from HELM, EleutherAI's LM Evaluation Harness, Inspect, and custom scripts to coexist and be compared directly without complex mapping.
A score without configuration is just noise. We go beyond the metric to capture the full experimental context—prompt templates, inference parameters, and system states—making every result traceable, transparent, and reproducible.
Liberating evaluation results from static PDFs and closed leaderboards. We transform scattered metrics into a structured, queryable global dataset, powering the next generation of meta-analysis and automated leaderboard construction.
A granular, line-by-line breakdown of the standardized format.
Built for scale, reproducibility, and scientific rigor.
We assign every evaluation a unique UUID. This prevents filename collisions and allows multiple runs of the same model—from different dates or configurations—to coexist safely.
Models change silently. We mandate retrieved_timestamp to capture the exact moment of inference, enabling precise studies on API drift and model versioning.
Performance depends on the runner. We explicitly separate platform (provider) from engine (inference system), isolating hardware and software variables in your analysis.
From simple accuracy to complex LLM-as-a-Judge scores, our schema standardizes all outputs. Compare results across different evaluation libraries without writing custom parsers.
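To make the shape concrete, here is a minimal sketch of what a single aggregate record can look like. Apart from retrieved_timestamp, every field name and value below is an illustrative placeholder, not the normative schema.

import json

# Illustrative aggregate record; field names are placeholders, not the official schema.
record = {
    "evaluation_id": "4c0d7e6a-9a3b-4f1e-8c2d-1b5e6f7a8d90",  # unique UUID for this run
    "retrieved_timestamp": "2025-06-01T14:32:00Z",            # moment the model was queried
    "model_name": "example-model-7b",                         # placeholder model identifier
    "platform": "example-api-provider",                       # who served the model
    "engine": "example-inference-engine",                     # what actually ran the inference
    "task": "example-benchmark",
    "scores": {"accuracy": 0.87, "llm_judge_score": 4.2},     # standardized metric outputs
    "prompt_config": {
        "template": "Q: {question}\nA:",                      # exact prompt template used
        "generation": {"temperature": 0.0, "max_tokens": 256} # inference parameters
    },
}

print(json.dumps(record, indent=2))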
Built for data science. Flat, structured JSON files mean you can ingest millions of results into Pandas or SQL in seconds, slicing by architecture, date, or task immediately.
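As a sketch of that workflow (the glob pattern and column names are assumptions, not guarantees about this repository's layout):

import glob
import pandas as pd

# Read every detailed results file (JSONL, one record per line) into one flat DataFrame.
paths = glob.glob("data/**/*.jsonl", recursive=True)
df = pd.concat((pd.read_json(p, lines=True) for p in paths), ignore_index=True)

# Slice by whatever dimensions the records carry, e.g. model or task.
print(df.groupby("model_name")["accuracy"].mean())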
Science requires receipts. By linking every score to its exact prompt template and generation parameters, we ensure that every result in the dataset is fully reproducible.
We are collecting evaluation results, expressed in our schema, in a public dataset. This repository serves as a standardized metadata store for results from various leaderboards, research papers, and local evaluations. If you are an eval provider or leaderboard maintainer, we are looking for your generous data contributions via pull requests!
How we organize and validate the data.
Data is split by individual model. Each evaluation consists of an aggregate result (JSON) and a detailed results file (JSONL). Both share the same UUID to ensure clean organization.
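A minimal sketch of writing such a pair, with illustrative field names and the UUID used as the shared file stem (an assumption about the naming convention):

import json
import uuid

# One shared UUID ties the aggregate summary to its per-sample details.
run_id = str(uuid.uuid4())

aggregate = {"evaluation_id": run_id, "scores": {"accuracy": 0.87}}                       # illustrative fields
details = [{"evaluation_id": run_id, "sample_id": i, "correct": True} for i in range(3)]  # illustrative fields

# Aggregate result: a single JSON object.
with open(f"{run_id}.json", "w") as f:
    json.dump(aggregate, f, indent=2)

# Detailed results: one JSON object per line (JSONL).
with open(f"{run_id}.jsonl", "w") as f:
    for row in details:
        f.write(json.dumps(row) + "\n")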
We provide tools to easily adapt your existing workflows.
We have ready-made converters for popular frameworks.
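For orientation, here is a minimal sketch of what such a converter does; the input keys and output field names are hypothetical, not the interface of any shipped converter.

import uuid
from datetime import datetime, timezone

def convert_native_result(native: dict) -> dict:
    # Map one framework-native result onto the shared record shape.
    # Every key name here is a hypothetical placeholder, not the official schema.
    return {
        "evaluation_id": str(uuid.uuid4()),
        "retrieved_timestamp": datetime.now(timezone.utc).isoformat(),
        "model_name": native["model"],
        "task": native["task"],
        "scores": {native["metric"]: native["value"]},
        "prompt_config": native.get("config", {}),
    }

print(convert_native_result({"model": "example-7b", "task": "example-task", "metric": "accuracy", "value": 0.87}))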
# Run validation locally
uv run pre-commit run --all-files
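If you prefer to check a single file programmatically before opening a pull request, here is a sketch using the jsonschema library and assuming a local JSON Schema file (both paths below are assumptions):

import json
from jsonschema import validate  # pip install jsonschema

# Paths are assumptions; point them at the schema and the result file you are checking.
with open("schema/aggregate_result.schema.json") as f:
    schema = json.load(f)
with open("results/example.json") as f:
    result = json.load(f)

validate(instance=result, schema=schema)  # raises ValidationError if the file does not conform
print("Result conforms to the schema.")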
Within /data, results are organized by benchmark name, then developer_name/model_name.

We are launching a Shared Task for practitioners to contribute public and proprietary eval data. Participate for co-authorship and join us at the ACL 2026 Workshop in San Diego.