2025 Workshop on Evaluating AI in Practice
Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries
Key Details
Date: December 8, 2025
Location: UC San Diego (UCSD), San Diego, California, Social Sciences Public Engagement Building, Room 721 (overflow: Room 520)
Hosted by: EvalEval, UK AISI, and UC San Diego (UCSD)
About the Workshop
We’re excited to announce the upcoming Evaluating AI in Practice workshop, happening on December 8, 2025, in San Diego.
This full-day event will explore how to evaluate AI systems responsibly and effectively by bridging three essential dimensions:
Statistical Methods – Techniques to quantify uncertainty, aggregate evaluation data, and estimate latent model capabilities while ensuring reliability and validity.
Sociotechnical Perspectives – Understanding task selection, societal impacts, and the implications of evaluation results for downstream applications.
Evaluating Evaluation Results – Translating evaluation outcomes into meaningful insights about model capabilities, risks, and potential downstream impacts.
The workshop will feature a keynote by Stella Biderman, followed by interactive sessions designed to connect technical methods with broader ethical and social considerations in AI evaluation.
Due to limited space, attendance and registration are by RSVP only. Please note that attendance will be confirmed by the organizers based on space availability. The exact location and confirmation will be provided two weeks before the workshop. RSVP at: https://luma.com/ngj395u2
Further details, including the full program and additional speakers, will be announced soon.
Note: This satellite event is not officially affiliated with NeurIPS.
Call for Extended Abstracts
We are pleased to invite you to participate in the 2025 Workshop on Evaluating AI in Practice: Bridging Statistical Rigor, Sociotechnical Insights, and Ethical Boundaries, held on the sidelines of NeurIPS and co-hosted by EvalEval, UK AISI, and UC San Diego (UCSD).
Submissions should highlight ongoing or proposed research related to the workshop theme and topics. This is an incredible opportunity for researchers, especially emerging scholars, to get involved, share early-stage work, and develop new connections. Extended abstract submissions are invited from students and researchers across disciplines, including decision science, cognitive science, computer science, machine learning, and related fields. Topics include, but are not limited to:
- Statistical Methods – Techniques to quantify uncertainty, aggregate evaluation data, and estimate latent model capabilities while ensuring reliability and validity.
- Sociotechnical Perspectives – Understanding task selection, societal impacts, and the implications of evaluation results for downstream applications.
- AI Redlines and Ethical Boundaries – Identifying and assessing AI system risks to prevent harmful behavior, ensure safety, and align with global accountability frameworks.
- Evaluating Evaluation Results – Translating evaluation outcomes into meaningful insights about model capabilities, risks, and potential downstream impacts.
- Benchmarking and standardization of evaluation protocols
- Analysis of existing evaluation methods or new proposals
Abstracts must be submitted by November 20th, 2025 AOE. They should be a maximum of 500 words and include the title, authors, and affiliations. All submissions will be evaluated on the basis of their technical content and relevance to the workshop. Selected abstracts will be presented as posters during an interactive session. The primary author will receive free registration and be invited to attend the workshop in person. Submit your abstract here.
Abstract Submission Deadline: Nov 20th, 2025
Notification Date: Nov 25th, 2025
Workshop Date: Dec 8th, 2025
Location: University of California, San Diego (UCSD)
Schedule
| Time | Session |
|---|---|
| 9:30 AM - 9:45 AM | Welcome and Introduction |
| 9:45 AM - 10:45 AM | Keynote: Doing Evaluations that Matter. Speaker: Stella Biderman, EleutherAI |
| 10:45 AM - 11:00 AM | Break |
| 11:00 AM - 12:15 PM | Sociotechnical Perspectives & Implications: Panel (45 mins) + Breakout Activity (30 mins). This panel will discuss how sociotechnical perspectives inform the development of evaluations and the selection of evaluation tasks. Panelists will also discuss the implications of evaluation results for downstream tasks as well as societal impacts. During the breakout activity, participants will discuss methodologies for applying a sociotechnical perspective to evaluation development and selection, as well as for distilling societal impacts from evaluation results. Panelists: Arjun Subramonian, Hannah Rose Kirk, Candace Ross, Avijit Ghosh (Moderator) |
| 12:15 PM - 1:00 PM | Lunch |
| 1:00 PM - 2:00 PM | Poster Session |
| 2:00 PM - 2:10 PM | Break |
| 2:10 PM - 3:30 PM | Statistical Methods: Invited Talks (2 x 30 mins, 10 min Q&A each). Invited talks from experts will cover essential statistical methods for LLM evaluation, including how to quantify uncertainty in results, aggregate data from multiple evaluations for better accuracy, and identify and address sources of variability to ensure more reliable and valid assessments. We’ll also explore techniques for estimating a model’s latent capabilities (its true potential beyond observable performance metrics) while addressing issues of statistical validity. Speakers: Sarah Tan, Cozmin Ududec |
| 3:50 PM - 4:05 PM | Break |
| 4:05 PM - 5:20 PM | Evaluating Evaluation Results: Panel (45 mins) + Breakout Activity (30 mins). Evaluation results are impactful when they mean something: when they map to capability or risk thresholds, shed light on model capabilities, and help predict or anticipate downstream impacts. Panelists will discuss current methodologies and practices for distilling insights from evaluation results, as well as the challenges of understanding capabilities or impacts from existing evaluations. During the breakout activity, groups will discuss challenges associated with distilling insights from evaluations, commonalities among existing evaluations whose results correspond to model capabilities or downstream impacts, and practices to improve the relevance of evaluation results. Panelists: Anka Reuel, Aviya Skowron, Sean McGregor, Cozmin Ududec (Moderator) |
| 5:20 PM - 5:35 PM | Break |
| 5:35 PM - 5:50 PM | Closing Remarks |