
The Science of Evaluations: Workstream Kickoff Post
Rigorous evaluations provide decision makers with detailed information about the capabilities, risks, and opportunities of AI systems. At their best, evaluations enhance understanding of AI systems[1] and, in turn, widen the window of opportunity for societal preparedness and resilience in an increasingly AI-oriented world[2].
Current evaluations, however, lack robustness, reliability, and validity. Robustness refers to the stability of evaluation results under minor perturbations of input data or evaluation conditions, a critical concern for AI systems, where small input changes can dramatically alter outputs. Reliability concerns consistent measurement across different evaluation runs. Validity spans multiple dimensions: claim validity (whether conclusions drawn from evaluation performance are justified and appropriate), construct validity (whether evaluations actually measure the capabilities they claim to), content validity (whether evaluations comprehensively cover all relevant aspects of the capabilities being assessed), and external validity (whether results generalize beyond specific evaluation conditions to real-world applications).
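
To make these measurement properties concrete, the sketch below shows one way they might be operationalized for a single evaluation item. It is a hypothetical illustration, not the coalition's tooling: the `score` function, the `perturb` function, and the `flaky_model` stand-in are all assumptions. Robustness is estimated as the fraction of minor input perturbations that leave the score unchanged, and reliability as the spread of scores across repeated, identical runs.

```python
import random
import statistics

def score(output: str, reference: str) -> float:
    """Hypothetical scorer: exact match on a single evaluation item."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0

def perturb(prompt: str) -> str:
    """A minor, meaning-preserving perturbation of the input prompt."""
    return prompt + random.choice([" ", ".", "  ", "\n"])

def robustness(model, prompt: str, reference: str, n: int = 50) -> float:
    """Fraction of minor input perturbations that leave the score unchanged."""
    baseline = score(model(prompt), reference)
    same = sum(score(model(perturb(prompt)), reference) == baseline for _ in range(n))
    return same / n

def reliability(model, prompt: str, reference: str, n: int = 50) -> float:
    """Standard deviation of scores across repeated, identical evaluation runs
    (0.0 means perfectly consistent results)."""
    return statistics.pstdev([score(model(prompt), reference) for _ in range(n)])

def flaky_model(prompt: str) -> str:
    """Stand-in 'model': answers correctly unless the prompt has trailing noise."""
    return "Paris" if prompt.endswith("?") else random.choice(["Paris", "paris ?"])

q, a = "What is the capital of France?", "Paris"
print(f"robustness:  {robustness(flaky_model, q, a):.2f}")   # below 1.0: perturbations flip the score
print(f"reliability: {reliability(flaky_model, q, a):.2f}")  # 0.0 here: identical runs agree
```

In practice, a single high-level metric like this hides many design choices (which perturbations count as "minor", how many runs suffice, how scores aggregate across items), which is precisely the kind of methodological question an evaluation science needs to settle.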
Many researchers and practitioners in the evaluation ecosystem have accordingly called for more rigorous evaluations that address these measurement challenges, guided by commonly understood best practices in adjacent and analogous fields (see References). Within the EvalEval Coalition, we are bringing together researchers and practitioners to lay the conceptual foundations for a scientific approach to evaluations on which subsequent technical projects can build. In this effort, we will:
- Assess the current state of evaluation science. We will systematically review published research and preprints on the science of evaluations, including scoping reviews, meta-analyses, and critiques[3] of current approaches. We plan to take a sociotechnical lens, focusing on (a) open and reproducible scientific methods for implementing evaluations of AI systems, and (b) governance mechanisms for independent third-party audits, such as red teaming, certification frameworks, data access protocols, and disclosure standards.
- Identify critical gaps and priorities in evaluation science. Through literature synthesis and stakeholder interviews with a diverse group of researchers, practitioners, and policymakers, we will map open research questions, missing methodological tools, under-addressed areas of capabilities and risks, and validity-related challenges in existing evaluations.
- Design and execute research to tackle a prioritized subset of these gaps where our coalition has relevant expertise.
In carrying out this work, we aim to bridge the gap between rigorous best practices and pragmatic implementation details, producing impactful research and tools that improve the AI evaluation ecosystem. If you’d like to get involved, join the Slack community!
[1] “AI models can be thought of as the raw, mathematical essence that is often the ‘engine’ of AI applications. An AI system is a combination of several components, including one or more AI models, that is designed to be particularly useful to humans in some way.” (Bengio et al. 2025)

[2] For example, AI evaluations form the foundation of major GPAI providers’ frontier AI safety policies, a core EU AI Act mandate for frontier model providers, and one of three key functions for the UK AI Safety Institute.

[3] See, for example, The Leaderboard Illusion.