A prompt change that improves one answer can silently break ten others. Come run live evaluations against real Drupal AI agents, see what grader output actually tells you, and help define the eval datasets the community needs.
Familiarity with the Drupal AI module is expected. Hands-on participants should bring a laptop with DDEV and a Drupal 11 site with the AI module enabled; observers are welcome without any setup.
A 5-minute walkthrough of ai_eval (drupal.org/project/ai_eval) running against a live agent, showing grader scores and quality gate verdicts, followed by 35 minutes of collaborative work: which graders does the community need beyond the five that ship today? What should shared eval datasets look like, and where should they live? This session complements the AI module sessions by focusing specifically on evaluation methodology. Participants can write a grader plugin or a dataset question during the session; contributions go to the ai_eval issue queue.
How to define eval datasets in YAML and run them against agents with Drush. How to interpret grader scores and set quality gates for deployment pipelines. How pluggable grader plugins work and how to write one for your domain. Where shared eval datasets fit in the Drupal AI ecosystem.
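To make the dataset discussion concrete, here is an illustrative sketch of what a YAML eval dataset could look like. Every key name and the overall structure (id, questions, prompt, expected, graders, threshold) are assumptions made for discussion, not ai_eval's actual schema; check the module's documentation for the real format.

```yaml
# Illustrative only: key names and structure are assumptions, not ai_eval's actual schema.
id: product_faq_agent
label: 'Product FAQ agent regression set'
questions:
  - id: shipping_cost
    prompt: 'How much does shipping to Belgium cost?'
    expected: 'Flat rate of 10 EUR, free above 100 EUR.'
    graders:
      - factual_accuracy
      - tone
  - id: returns_policy
    prompt: 'Can I return an opened item?'
    expected: 'Yes, within 30 days, in resaleable condition.'
    graders:
      - factual_accuracy
# A quality gate could be expressed as a minimum average grader score below
# which a deployment pipeline refuses to ship the prompt change.
threshold: 0.8
```

From the command line, a run might look something like `drush ai_eval:run product_faq_agent`; the exact command name is likewise an assumption, so consult the module's Drush integration for the real invocation.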

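To ground the grader plugin discussion, here is a minimal sketch of what a custom grader could look like as a standard Drupal plugin. The plugin type, annotation, interface, and method signature (Grader, grade()) are assumptions made for illustration; the actual contract is defined by ai_eval itself.

```php
<?php

namespace Drupal\my_module\Plugin\Grader;

use Drupal\Component\Plugin\PluginBase;

/**
 * Illustrative sketch only: the plugin type, annotation, and method contract
 * are assumptions, not ai_eval's actual API.
 *
 * @Grader(
 *   id = "contains_no_pricing",
 *   label = @Translation("Contains no pricing information")
 * )
 */
class ContainsNoPricing extends PluginBase {

  /**
   * Scores an agent response between 0.0 and 1.0.
   */
  public function grade(string $prompt, string $response, string $expected): float {
    // Domain rule: the agent must never quote prices directly.
    // Return 0.0 on any currency-looking token, 1.0 otherwise.
    return preg_match('/[$€£]\s?\d/u', $response) ? 0.0 : 1.0;
  }

}
```

A deterministic, rule-based grader like this is cheap enough to run on every commit; LLM-judged graders can cover the subjective criteria it cannot.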