Infrastructure @EvalEval

One Schema for Every Eval Ever.

A unified, open data format and public dataset for AI evaluation results. We are collecting all evaluation results in a standardized schema to enable rigorous research and broader impact.

Developed by the EvalEval Coalition, with feedback from researchers at collaborating institutions.

Why this Schema?

Addressing the fragmentation in AI evaluation to enable trust and comparability.

Unifying the Ecosystem

Evaluation results are currently siloed by framework. This schema creates a common interchange format, allowing results from HELM, EleutherAI's lm-eval-harness, Inspect, and custom scripts to coexist and be compared directly without complex mapping.

Trust through Provenance

A score without configuration is just noise. We go beyond the metric to capture the full experimental context—prompt templates, inference parameters, and system states—making every result traceable, transparent, and reproducible.

Actionable Meta-Science

Liberating evaluation results from static PDFs and closed leaderboards. We transform scattered metrics into a structured, queryable global dataset, powering the next generation of meta-analysis and automated leaderboard construction.

The Schema

A granular, line-by-line breakdown of the standardized format.

59ee0934-f60d-4d4b-b986-844fc51e89a3.json
{
  "schema_version": "0.2.0",
  "evaluation_id": "helm_capabilities/moonshotai.../17642...",
  "retrieved_timestamp": "1764204739.50717",
  
  "source_metadata": {
    "source_name": "HELM",
    "source_type": "evaluation_run",
    "source_organization_name": "crfm",
    "source_organization_url": "https://crfm.stanford.edu",
    "evaluator_relationship": "third_party"
  },

  "model_info": {
    "name": "Kimi K2 Instruct",
    "id": "moonshotai/kimi-k2-instruct",
    "developer": "moonshotai",
    "inference_platform": "Together AI",
    "inference_engine": {
        "name": "vLLM",
        "version": "0.6.3"
    }
  },

  "evaluation_results": [
    {
      "evaluation_name": "MMLU-Pro - COT correct",
      "evaluation_timestamp": "1764204739.50717",
      "source_data": {
        "source_type": "hf_dataset",
        "dataset_name": "MMLU-Pro",
        "hf_repo": "TIGER-Lab/MMLU-Pro",
        "hf_split": "test",
        "samples_number": 12032
      },
      "metric_config": {
        "evaluation_description": "Fraction of correct answers after chain of thought",
        "lower_is_better": false,
        "score_type": "continuous",
        "min_score": 0.0,
        "max_score": 1.0
      },
      "score_details": {
        "score": 0.819,
        "details": {
          "tab": "Accuracy"
        }
      },
      "generation_config": {
        "generation_args": {
          "temperature": 0.0,
          "top_p": 1.0,
          "top_k": -1,
          "max_tokens": 2048,
          "reasoning": true
        },
        "additional_details": {
            "description": "Chain of thought prompting used."
        }
      }
    }
  ],

  "detailed_evaluation_results": {
    "file_path": "detailed_results/e70acf51....jsonl",
    "format": "jsonl",
    "checksum": "a1b2c3d4..."
  }
}

The example record above breaks down into six blocks: Metadata & ID, Source Provenance, Model Specification, Flexible Metrics, Generation Config, and Detailed Results.
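
To get a feel for the format, the sketch below reads the example record with the standard library alone and prints each headline score (the filename is the record's UUID, as shown above).

import json

# Load the aggregate record shown above (filename is its UUID).
with open("59ee0934-f60d-4d4b-b986-844fc51e89a3.json") as f:
    record = json.load(f)

# Each record carries model identity, provenance, and one or more results.
model_id = record["model_info"]["id"]
for result in record["evaluation_results"]:
    print(model_id, result["evaluation_name"], result["score_details"]["score"])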

Design Decisions

Built for scale, reproducibility, and scientific rigor.

Conflict-Free Identity

We assign every evaluation a unique UUID. This prevents filename collisions and allows multiple runs of the same model—from different dates or configurations—to coexist safely.
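
A minimal sketch of that naming scheme, with illustrative variable names:

import uuid

# One fresh UUID per evaluation run; the same ID names both output files,
# so repeated runs of the same model never collide.
run_id = uuid.uuid4()
aggregate_path = f"{run_id}.json"   # aggregate record
detailed_path = f"{run_id}.jsonl"   # per-sample details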

Temporal Versioning

Models change silently. We mandate retrieved_timestamp to capture the exact moment of inference, enabling precise studies on API drift and model versioning.

Full Stack Provenance

Performance depends on the runner. We explicitly separate platform (provider) from engine (inference system), isolating hardware and software variables in your analysis.

Unified Metrics

From simple accuracy to complex LLM-as-a-Judge scores, our schema standardizes all outputs. Compare results across different evaluation libraries without writing custom parsers.
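
For illustration, the hypothetical judge metric below (not drawn from the dataset) fits the same metric_config shape as plain accuracy:

# Both use the metric_config keys from the example record above;
# the judge metric's values are hypothetical illustrations.
accuracy_metric = {
    "evaluation_description": "Fraction of exactly correct answers",
    "lower_is_better": False,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0,
}
judge_metric = {
    "evaluation_description": "Mean helpfulness rating from an LLM judge",
    "lower_is_better": False,
    "score_type": "continuous",
    "min_score": 1.0,
    "max_score": 10.0,
}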

Analysis-First Design

Built for data science. Flat, structured JSON files mean you can ingest millions of results into Pandas or SQL in seconds, slicing by architecture, date, or task immediately.
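
As a sketch of that workflow, assuming the repository layout described in the Contributor Guide below, a few lines of pandas flatten the aggregate records into an analysis-ready table:

import glob
import json

import pandas as pd

# Read every aggregate record under data/{benchmark}/{developer}/{model}/.
records = []
for path in glob.glob("data/**/*.json", recursive=True):
    with open(path) as f:
        records.append(json.load(f))

# One row per (model, evaluation) pair; nested fields become dotted columns.
df = pd.json_normalize(
    records,
    record_path="evaluation_results",
    meta=[["model_info", "id"], ["model_info", "developer"], ["source_metadata", "source_name"]],
)
print(df[["model_info.id", "evaluation_name", "score_details.score"]].head())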

Reproducibility Standard

Science requires receipts. By linking every score to its exact prompt template and generation parameters, we ensure that every result in the dataset is fully reproducible.

The Public Dataset

We are collecting evaluation results, expressed in our schema, in a public dataset. This repository serves as a standardized metadata store for results from various leaderboards, research papers, and local evaluations. If you are an eval provider or leaderboard maintainer, we are looking for your generous data contributions via pull requests!

Contributor Guide

How we organize and validate the data.

Repository Structure

data/
└── {benchmark_name}/
    └── {developer_name}/
        └── {model_name}/
            ├── {uuid}.json
            └── {uuid}.jsonl

Data is organized by benchmark, developer, and model. Each evaluation consists of an aggregate record (JSON) and a detailed results file (JSONL); both share the same UUID so a run's artifacts stay together.

Cold Storage Option: For massive datasets, the detailed JSONL file can be stored externally (e.g., S3, Hugging Face Dataset) and referenced via URL in the aggregate JSON. This keeps the repository lightweight while preserving data access.
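
Concretely, with illustrative benchmark and model names, a run's files resolve to paths like these; the cold-storage variant points detailed_evaluation_results at an external URL instead of a local JSONL:

from pathlib import Path

# Illustrative names; the UUID matches the example record above.
benchmark, developer, model = "mmlu_pro", "moonshotai", "kimi-k2-instruct"
run_id = "59ee0934-f60d-4d4b-b986-844fc51e89a3"

aggregate_path = Path("data") / benchmark / developer / model / f"{run_id}.json"
detailed_path = aggregate_path.with_suffix(".jsonl")

# Cold-storage variant: reference an externally hosted JSONL (URL is hypothetical).
detailed_evaluation_results = {
    "file_path": f"https://example.org/detailed_results/{run_id}.jsonl",
    "format": "jsonl",
    "checksum": "a1b2c3d4...",
}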

Data Validation & Converters

We provide tools to easily adapt your existing workflows.

🚀 Automatic Converters

We have ready-made converters for popular frameworks: Inspect AI, HELM, and lm-eval-harness.

# Run validation locally
uv run pre-commit run --all-files

How to add data for a new eval?

  1. Add a new folder under /data with the name of the benchmark.
  2. Create a 2-tier folder structure: developer_name/model_name.
  3. Add your results: one aggregate JSON file and one detailed JSONL file. (For large files, host the JSONL externally and provide the URL).
  4. Run the validation script to check against the schema (a minimal local sketch follows below).
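
One way to sanity-check a record locally, using the jsonschema package (the schema filename and record path are assumptions for illustration; the pre-commit hooks remain the authoritative check):

import json

from jsonschema import ValidationError, validate

# Assumed paths for illustration only; adjust to the repository's actual layout.
with open("schema/eval_result_schema.json") as f:
    schema = json.load(f)

record_path = (
    "data/mmlu_pro/moonshotai/kimi-k2-instruct/"
    "59ee0934-f60d-4d4b-b986-844fc51e89a3.json"
)
with open(record_path) as f:
    record = json.load(f)

try:
    validate(instance=record, schema=schema)
    print("Record conforms to the schema.")
except ValidationError as err:
    print(f"Validation failed: {err.message}")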

SHARED TASK · ACL 2026 WORKSHOP

Join the Evaluation Revolution

We are launching a Shared Task for practitioners to contribute public and proprietary eval data. Participate for co-authorship and join us at the ACL 2026 Workshop in San Diego.