2026 ACL Workshop on Evaluating AI in Practice

Methodological Rigor, Sociotechnical Perspectives, and Community Collaborations

📖 Background

As Generative AI systems are increasingly integrated into real-world products and decision-making pipelines, evaluation has become a central yet challenging component of responsible AI development [1]. While evaluation research has advanced rapidly, gaps persist between research and practice: model developers often prioritize scalability and integration into development workflows, while evaluation researchers emphasize rigor, validity, and sociotechnical considerations. The EvalEval coalition’s multi-author effort mapping first- and third-party social impact evaluations [2] shows a clear division of labor: developers underreport or deprioritize key impacts such as environmental costs, data provenance, and labor practices, while third-party evaluators provide broader but necessarily incomplete coverage, leaving critical gaps in accountability and comparability.

This workshop focuses on AI evaluation in practice, centering the tensions and collaborations between model developers and evaluation researchers. Through a call for papers on contemporary challenges spanning methodological rigor, sociotechnical perspectives, scalability, community-informed evaluation, and real-world use, alongside invited panels, the workshop aims to surface practical insights from across the evaluation ecosystem. The workshop will also host a shared task for building a unifying, standardized database of LLM evaluations, encouraging shared infrastructure and actionable evaluation practices.

📝 Topics

Themes for submission include, but are not limited to:

1. Evaluation Methodology and Measurement Theory

  • Conceptualization: Operationalization and construct definition issues in evaluations of generative AI;
  • Validity: Construct, convergent, and discriminant validity of existing evaluations;
  • Reliability: Robustness, consistency, and generalizability of evaluation methods;
  • Metrics: Design, selection, and limitations of evaluation metrics, benchmarks, and scoring methods;
  • Reproducibility: Cross-model and cross-context reproducibility, standardization, and aggregation of evaluation results.

2. Evaluation Infrastructure, Cost, and Stakeholders

  • Infrastructure: Evaluation harnesses, tooling, platforms, and scalability of evaluation setups [3];
  • Financial costs: Monetary costs of evaluations and documentation frameworks for tracking them;
  • Documentation: Transparency and reporting standards for evaluation processes and their limitations;
  • Stakeholders: Who evaluates, the relationship between evaluators and system developers, and the role of independent and third-party audits.

3. Evaluating Sociotechnical Impacts

The state of sociotechnical evaluations of generative AI systems, drawing on categories such as those in the EvalEval social impact taxonomy [1]:

  • Bias, stereotypes, and representational harms: Evaluations of stereotypes, disparate performance, inequality, marginalization, and community erasure;
  • Cultural values and sensitive content: Evaluations of linguistic diversity, sensitive content, trustworthiness, overreliance on outputs, and imposing norms and values;
  • Privacy and data protection: Evaluations of memorization, data leakage, contextual integrity, and personal privacy and sense of self;
  • Labor and creativity: Evaluations of data and content moderation labor, intellectual property, ownership, and labor market impacts;
  • Ecosystem and environment: Evaluations of environmental costs, carbon emissions, and widening resource gaps.

📄 Submission Guidelines

We welcome the following types of submissions:

Paper lengths:

  • 📑 Full papers: 6–8 pages (excluding references and supplementary material)
  • 📝 Short papers: Up to 4 pages (excluding references and supplementary material)
  • 📋 Tiny papers: Up to 2 pages (excluding references and supplementary material), e.g., extended abstracts

Submission types:

  • 🔬 Research papers presenting original empirical or theoretical results
  • 💡 Positions & Provocations that introduce novel perspectives or challenge conventional wisdom around broader social impact evaluation for generative AI

All submission types are welcome at any length tier. Accepted papers will be presented as posters, with a subset selected for spotlight oral presentation. All papers will be assessed based on their relevance to the workshop themes.

Papers should be submitted in the conference-provided format. The review process will be two-way anonymized; therefore, all identifying information must be removed from submissions.

📚 Publication

Submissions may include unpublished work as well as previously published or accepted papers, provided they do not violate dual-submission policies. For example, ACL 2026 Main and Findings papers, as well as ARR submissions, are eligible.

Submissions are non-archival by default, but upon acceptance authors may opt in to archival publication. All accepted paper titles will be made publicly available.

📅 Important Dates

  • Submission Deadline: March 19, 2026 (extended from March 12, 2026)
  • Notification of Acceptance: April 28, 2026
  • Camera-ready paper due: May 14, 2026
  • Workshop at ACL in San Diego: July 3-4, 2026

All deadlines are specified in Anywhere on Earth (AoE).

🚀 Submission Site

All submissions must be made via OpenReview.

🔍 Reviewer Recruitment

To support a fair, high-quality, and sustainable review process, we adopt a reciprocal reviewing expectation. Authors submitting papers to the workshop will be expected to serve as reviewers, helping ensure sufficient reviewing capacity and timely feedback for all submissions. Authors will indicate their willingness to review during paper submission through the OpenReview system.

🧑‍🔬 Program Chairs

  • Mubashara Akhtar, ETH Zurich & ETH AI Center
  • Jan Batzner, Weizenbaum Institute, Technical University Munich
  • Leshem Choshen, MIT, IBM Research, MIT-IBM Watson AI Lab
  • Avijit Ghosh, Hugging Face
  • Usman Gohar, Iowa State University
  • Jennifer Mickel, Eleuther AI
  • Ichhya Pant, Independent
  • Zeerak Talat, University of Edinburgh

🏛️ Organizers

EvalEval Coalition

📬 Workshop Contact

❓ FAQ

I’m waiting for my ARR decision — can I still submit to EvalEval? Yes! If your paper is later accepted at ACL, you would simply choose our non-archival option.

I am already presenting my evaluation work at ACL 2026 Main or Findings. Can I still submit? Yes, we are happy to have you in the room with us!

My paper does not fit into the topics listed in the Call for Papers. Is it enough to propose a new benchmark? Submissions should engage with at least one of our main topic areas outlined in the Call for Papers: (1) Evaluation Methodology and Measurement Theory, (2) Evaluation Infrastructure, Cost, and Stakeholders, and (3) Evaluating Sociotechnical Impacts.

Can I also submit in the ICML format? No, please use the ARR formatting.

Can I attend this workshop online? The workshop is in-person at ACL 2026 in San Diego. At least one author of each accepted paper must present on-site.

My position paper is 6 pages. Does that work? Yes, all submission types (research and positions/provocations) are welcome at any of the three length tiers.

What makes a good position paper? Positions were a highlight of our last NeurIPS workshop. For instance, Cintaqia et al. (2025), “Stop the Nonconsensual Use of Nude Images in Research,” was first presented at EvalEval and later accepted as a full NeurIPS position paper.

  1. Solaiman, Talat et al. (2025). “Evaluating the Social Impact of Generative AI Systems in Systems and Society.” In The Oxford Handbook of the Foundations and Regulation of Generative AI.

  2. Reuel, Ghosh, Chim et al. (2025). “Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations.” Preprint.

  3. ACL Shared Task