Projects

RESEARCH, INFRASTRUCTURE & ORGANIZATION

A Standardized Format for AI Evaluation Data

Infrastructure

The AI evaluation ecosystem currently lacks standardized methods for storing, sharing, and comparing evaluation results across different models and benchmarks. This fragmentation leads to unnecessa...

Benchmark Saturation

Research

This project investigates how to systematically characterize the complexity and behavior of AI benchmarks over time, with the goal of informing more robust benchmark design. The ...

EvalEval Workshop 2025

Organization

Building upon the momentum of NeurIPS 2024, our primary focus is to strategically plan our engagement with a major annual academic venue, likely NeurIPS, to showcase high-quality research and techn...

Evaluation Cards

Research

This project addresses the need for a structured and systematic approach to documenting AI model evaluations through the creation of "evaluation cards," focusing specifically on technical base syst...

Evaluation Harness and Tutorials

Infrastructure

The Eleuther Harness Tutorials project is designed to lower the barrier to entry for using the LM Evaluation Harness, making it easier for researchers and practitioners to onboard, evaluate, and co...

Evaluation Science

Research

Recognizing the current lack of robustness in GPAI risk evaluations and the resulting limitations for informed decision-making and societal preparedness, this project aims to establish a scientific...

Outreach & Research Engagement

Organization

Building upon the momentum of NeurIPS 2024, we aim to cultivate a QueerInAI-esque presence at other relevant venues by organizing social events and short talks to broaden our reach and foster commu...
