A prompt change that improves one answer can silently break ten others. Come run live evaluations against real Drupal AI agents, see what grader output actually tells you, and help define the eval datasets the community needs.
Familiarity with the Drupal AI module is expected. Hands-on participants should bring a laptop with DDEV and a Drupal 11 site with the AI module enabled; observers are welcome without any setup.
A 5-minute walkthrough of ai_eval (drupal.org/project/ai_eval) running against a live agent, showing grader scores and quality gate verdicts, followed by 35 minutes of collaborative work: which graders does the community need beyond the five that ship today? What should shared eval datasets look like, and where should they live? This session complements the AI module sessions by focusing specifically on evaluation methodology. Participants can write a grader plugin or a dataset question during the session; contributions go to the ai_eval issue queue.
How to define eval datasets in YAML and run them against agents with Drush. How to interpret grader scores and set quality gates for deployment pipelines. How pluggable grader plugins work and how to write one for your domain. Where shared eval datasets fit in the Drupal AI ecosystem.
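To make the dataset discussion concrete, here is an illustrative sketch of what a YAML eval dataset could look like. Every key name and the overall structure (id, questions, prompt, expected, graders, threshold) are assumptions made for discussion, not ai_eval's actual schema; check the module's documentation for the real format.

```yaml
# Illustrative only: key names and structure are assumptions, not ai_eval's actual schema.
id: product_faq_agent
label: 'Product FAQ agent regression set'
questions:
  - id: shipping_cost
    prompt: 'How much does shipping to Belgium cost?'
    expected: 'Flat rate of 10 EUR, free above 100 EUR.'
    graders:
      - factual_accuracy
      - tone
  - id: returns_policy
    prompt: 'Can I return an opened item?'
    expected: 'Yes, within 30 days, in resaleable condition.'
    graders:
      - factual_accuracy
# A quality gate could be expressed as a minimum average grader score below
# which a deployment pipeline refuses to ship the prompt change.
threshold: 0.8
```

From the command line, a run might look something like `drush ai_eval:run product_faq_agent`; the exact command name is likewise an assumption, so consult the module's Drush integration for the real invocation.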

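To ground the grader plugin discussion, here is a minimal sketch of what a custom grader could look like as a standard Drupal plugin. The plugin type, annotation, interface, and method signature (Grader, grade()) are assumptions made for illustration; the actual contract is defined by ai_eval itself.

```php
<?php

namespace Drupal\my_module\Plugin\Grader;

use Drupal\Component\Plugin\PluginBase;

/**
 * Illustrative sketch only: the plugin type, annotation, and method contract
 * are assumptions, not ai_eval's actual API.
 *
 * @Grader(
 *   id = "contains_no_pricing",
 *   label = @Translation("Contains no pricing information")
 * )
 */
class ContainsNoPricing extends PluginBase {

  /**
   * Scores an agent response between 0.0 and 1.0.
   */
  public function grade(string $prompt, string $response, string $expected): float {
    // Domain rule: the agent must never quote prices directly.
    // Return 0.0 on any currency-looking token, 1.0 otherwise.
    return preg_match('/[$€£]\s?\d/u', $response) ? 0.0 : 1.0;
  }

}
```

A deterministic, rule-based grader like this is cheap enough to run on every commit; LLM-judged graders can cover the subjective criteria it cannot.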