We are a research community producing scientifically grounded research and robust deployment infrastructure for broader impact evaluations.
This project aims to investigate how to systematically characterize the complexity and behavior of AI benchmarks over time, with the overarching goal of informing more robust benchmark design. The ...
This project addresses the need for a structured and systematic approach to documenting AI model evaluations through the creation of "evaluation cards," focusing specifically on technical base syst...
The Eleuther Harness Tutorials project is designed to lower the barrier to entry for using the LM Evaluation Harness, making it easier for researchers and practitioners to onboard, evaluate, and co...
This past week, Anthropic and OpenAI drew attention with the release of their latest AI models, Claude Opus 4.1 and G...
Rigorous evaluations provide decision makers with detailed information about the capabilities, risks, and opportuniti...
Researchers, practitioners, and students are all welcome to contribute to our mission.
Send us an email to learn more about getting involved with our community and working groups.
Email Us