A unified, open data format and public dataset for AI evaluation results. We are collecting all evaluation results in a standardized schema to enable rigorous research and broader impact.
Addressing the fragmentation in AI evaluation to enable trust and comparability.
As highlighted in our recent paper, independent third-party evaluations are crucial. Self-reported metrics often lack the rigor and transparency needed for true accountability.
Currently, third-party evaluation reports are scattered across PDFs, blog posts, and custom tables. They are not standardized, making it nearly impossible to compare results across different evaluators or reproduce findings.
This schema offers an opportunity for evaluators: have your results published in a standardized format and grouped alongside comparable evaluations. This helps the community digest results and gives your evaluation broader reach and utility.
A granular, line-by-line breakdown of the standardized format.
Built for scale, reproducibility, and scientific rigor.
Each JSON file is named with a UUID (e.g., e70acf51-....json). This ensures multiple evaluations of the same model can exist without conflicts, and different timestamps are stored as separate files.
We record a retrieved_timestamp with every result to track how model performance evolves over time. A model may have multiple result files representing different iterations or runs.
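The naming scheme can be sketched in a few lines. This is a minimal illustration, not the repository's actual tooling; the helper name and the example developer/model strings are hypothetical, while the `developer_name/model_name.{uuid}.json` pattern comes from the contribution guidelines.

```python
import uuid

def result_filename(developer: str, model: str) -> str:
    """Build a unique result path: developer_name/model_name.{uuid}.json.

    Each call draws a fresh UUID, so repeated evaluations of the same
    model land in separate files instead of conflicting.
    """
    return f"{developer}/{model}.{uuid.uuid4()}.json"

# Two runs of the same model get distinct files.
a = result_filename("example-dev", "example-model")
b = result_filename("example-dev", "example-model")
```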
We explicitly distinguish between inference_platform (remote APIs like OpenAI, Anthropic) and inference_engine (local runners like vLLM, Ollama) to capture the exact evaluation environment.
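A result record might capture this distinction as follows. The field names `inference_platform`, `inference_engine`, and `retrieved_timestamp` are from the schema; the values here are purely illustrative:

```json
{
  "model_name": "example-model",
  "inference_platform": "anthropic",
  "inference_engine": null,
  "retrieved_timestamp": "2025-01-15T12:00:00Z"
}
```

A locally run evaluation would instead leave `inference_platform` null and set `inference_engine` to, e.g., "vllm".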
The schema accommodates both numeric and level-based metrics. Level-based metrics (Low, Medium, High) are mapped to integers for consistent analysis.
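The mapping idea can be sketched like this. Note the exact integer scale is an assumption for illustration; consult the schema for the mapping actually used:

```python
# Assumed mapping from level-based metric values to integers;
# the real schema's scale may differ.
LEVEL_TO_INT = {"Low": 0, "Medium": 1, "High": 2}

def normalize_score(value):
    """Return a numeric score for either metric kind.

    Level-based values (strings) are mapped through LEVEL_TO_INT;
    numeric values pass through as floats.
    """
    if isinstance(value, str):
        return LEVEL_TO_INT[value]
    return float(value)
```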
The structured JSON format allows for easy SQL querying. Download the dataset and run your own analysis across timestamps and architectures.
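As one sketch of such an analysis, the records can be loaded into an in-memory SQLite table and queried with plain SQL. The field names mirror the schema; the sample rows and scores are invented for illustration:

```python
import sqlite3

# Invented sample records standing in for downloaded result files.
rows = [
    {"model_name": "model-a", "score": 0.71, "retrieved_timestamp": "2024-11-01"},
    {"model_name": "model-a", "score": 0.78, "retrieved_timestamp": "2025-01-15"},
    {"model_name": "model-b", "score": 0.64, "retrieved_timestamp": "2025-01-15"},
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE results (model_name TEXT, score REAL, retrieved_timestamp TEXT)"
)
con.executemany(
    "INSERT INTO results VALUES (:model_name, :score, :retrieved_timestamp)", rows
)

# Most recent score per model, best first.
latest = con.execute(
    """SELECT model_name, score FROM results r
       WHERE retrieved_timestamp = (
           SELECT MAX(retrieved_timestamp) FROM results
           WHERE model_name = r.model_name)
       ORDER BY score DESC"""
).fetchall()
```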
Backed by the EvalEval Coalition, we are developing scientifically grounded research outputs and robust deployment infrastructure for broader impact.
We are collecting all evaluation results in our schema in a public dataset. This repository serves as a standardized metadata store for results from various leaderboards, research papers, and local evaluations. If you are an eval provider or leaderboard maintainer, we are looking for your generous data contributions via pull requests!
Explore the dataset via this interactive space.
Standardized documentation for AI model evaluations. While Every Eval Ever stores the results, Eval Factsheets captures the methodology, context, and alignment details of the evaluation itself.
Generate a factsheet.json that describes your evaluation's purpose, data sources, and metrics.
Ensure others can understand and replicate your evaluation setup.
{
  "title": "MMLU-Pro",
  "description": "Massive Multitask...",
  "creator": "TIGER Lab",
  "evaluation_type": "multiple_choice",
  "metrics": ["accuracy"],
  "links": {
    "paper": "https://arxiv.org/..."
  }
}
How we organize and validate the data.
Data is split by individual model. Each file is named with a UUID (e.g., e70acf51-....json) to ensure multiple evaluations of the same model can exist without conflicts.
We use a strict JSON schema to ensure data quality. Our CI/CD pipeline validates every pull request.
# Run validation locally
uv run pre-commit run --all-files
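The CI check enforces the full JSON schema; the core idea can be sketched with the standard library alone. The required field names below are assumptions for illustration; the authoritative schema lives in the repository:

```python
# Assumed required fields -- the real JSON schema is stricter.
REQUIRED = {"model_name", "retrieved_timestamp", "results"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means this check passes."""
    return [f"missing field: {k}" for k in sorted(REQUIRED - record.keys())]
```

A pull request whose files all return empty lists from checks like this would pass; any reported problem blocks the merge.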
To contribute results:

1. Create a folder in /data with a codename for your eval.
2. Generate a factsheet.json using the Eval Factsheets tool and place it in the eval folder.
3. Name each result file developer_name/model_name.{uuid}.json.

We are launching every_eval_ever as an open standard and are actively soliciting feedback and contributions. Join the EvalEval coalition in defining the future of AI evaluation.