GEN AI Robots being evaluated

CISOSE 2025 Tutorial

Generative AI Evaluation Essentials

Learn how to test GenAI systems with real-world rigor and practical frameworks

Struggling to evaluate GenAI systems in the real world?

This half-day, hands-on tutorial demystifies Generative AI (GenAI) evaluation—equipping you with tools and strategies to assess AI reliability, performance, and risk across enterprise, government, and academic use cases.

Led by industry experts, this session blends theory with applied techniques you can put to use immediately. You’ll learn how to:

  • Build strong AI testing strategies
  • Identify and manage failure modes
  • Collect high-quality evaluation data
  • Monitor system performance over time

Whether you’re just starting with GenAI or refining an existing approach, you’ll walk away with actionable methods and recommendations for implementing robust, scalable evaluation processes for GenAI systems.


The Organizing Committee

Heather Frase, Head of Veraitech

Dr. Heather Frase leads Veraitech and serves as Program Lead for the AI Risk and Reliability working group at MLCommons and as Senior Advisor for Testing & Evaluation of AI at Virginia Tech’s National Security Institute. Her diverse career has spanned significant roles in defense, intelligence, policy, and financial crime. She also serves on the Organisation for Economic Co-operation and Development (OECD) Network of Experts on AI and on the board of the Responsible AI Collaborative, which researches and documents AI incidents.

Sarah Luger, MLCommons

Dr. Sarah Luger has accumulated over two decades of expertise in Artificial Intelligence and Natural Language Processing, focusing on human communication challenges. Her recent work encompasses low-resource machine translation, online toxicity identification, GenAI for marketing, increasing data annotator diversity, and responsible AI. She holds a PhD in Informatics from the University of Edinburgh, specializing in automated question answering. Sarah’s background includes roles at IBM Watson, particularly in NLP tasks for the Jeopardy! Challenge, as well as leadership positions in the human computation and AI research communities. She is co-chair of the MLCommons Datasets Working Group.

Marisa Ferrara Boston, Reins AI

Dr. Marisa Ferrara Boston is an artificial intelligence professional focused on expert augmentation. She currently serves as lead scientist of Reins AI and CEO of simthetic.ai, organizations that create processes and datasets to validate enterprise AI use. She has held leadership roles at Google and KPMG, where she applied her expertise to industries spanning financial audit, healthcare, marketing, and creativity enhancement. She holds a PhD in Cognitive Science from Cornell University.


Target Audience

The tutorial is designed for AI researchers, data scientists, software engineers, and technical professionals in industry, government, and academic sectors developing, working with, or purchasing GenAI technologies. Ideal participants have basic knowledge of GenAI and its applications and an interest in evaluating, testing, and monitoring AI systems. The content will be particularly valuable for those responsible for AI system design, deployment, risk management, and quality assurance.


Expected Outcomes

Participants will develop comprehensive knowledge of current GenAI testing and evaluation methodologies, learning how different evaluation tools interact and when to apply each appropriately. Through the tutorial, participants will understand how to customize GenAI testing and implement continuous performance-monitoring strategies. They will also gain insight into the importance of sustaining evaluation tools across use cases and product evolution.


Tutorial Outline
  1. Introduction
     The tutorial will begin with key GenAI concept definitions and an outline of its scope, followed by an overview of current GenAI use in consumer, enterprise, and government settings, along with the state of testing across these use cases.

  2. Exploring the interplay of AI evaluation tools
     An array of tools exists to evaluate AI systems. From metrics and benchmarks to red-teaming, these methodologies offer critical insights into the capabilities, limitations, and potential risks of GenAI systems. A minimal scoring sketch follows the bullets below.

    • Overview
      • Purpose of an evaluation
      • Different evaluation tools: metrics, benchmarks, testing, red-teaming, and auditing.
    • Exploring the different tools
      • Common tools: metrics, benchmarks, model testing, field-testing, and red-teaming.
      • Strengths and weaknesses
      • Typical use
    • Recommendations for using evaluation tools
      • Prioritization and triage
      • Customization
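
A minimal sketch of one such tool, a benchmark-style exact-match metric, is shown below. The generate callable, the toy dataset, and the field names are illustrative assumptions rather than part of any specific benchmark covered in the tutorial.

```python
# Minimal sketch of a benchmark-style metric. The generate() callable,
# the toy dataset, and the field names are illustrative assumptions.

def exact_match(prediction: str, reference: str) -> bool:
    # Normalize whitespace and case before comparing answers.
    return prediction.strip().lower() == reference.strip().lower()

def run_benchmark(generate, dataset):
    # Score each prompt and return aggregate accuracy plus per-item results.
    results = []
    for item in dataset:
        output = generate(item["prompt"])
        results.append({"prompt": item["prompt"],
                        "output": output,
                        "passed": exact_match(output, item["reference"])})
    accuracy = sum(r["passed"] for r in results) / len(results)
    return accuracy, results

if __name__ == "__main__":
    toy_dataset = [
        {"prompt": "What is the capital of France?", "reference": "Paris"},
        {"prompt": "2 + 2 =", "reference": "4"},
    ]
    stub_model = lambda prompt: "Paris" if "France" in prompt else "5"
    accuracy, _ = run_benchmark(stub_model, toy_dataset)
    print(f"Exact-match accuracy: {accuracy:.2f}")  # 0.50 with this stub
```

Benchmarks aggregate many such metrics over curated datasets, while red-teaming and field-testing swap the fixed dataset for adversarial or in-situ prompts.
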
  3. Sustaining AI evaluation tools and practices
     AI evaluation tools can be dynamic systems requiring sustainment. Every system, with or without AI, has failure modes as well as reliability and quality concerns, and GenAI evaluation tools demand ongoing reliability assessment and scrutiny for quality. A toy saturation check follows the bullets below.

    • Key systems engineering concepts
      • Reliability
      • Failures and failure modes
    • GenAI-specific reliability concerns
    • Example: Benchmark reliability efforts
      • Identifying benchmark failure modes
      • Saturation of the BBQ benchmark
      • AILuminate: built with benchmark integrity and sustainment in mind
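
As a toy illustration of the saturation concern, the check below flags a benchmark once every leading system scores within a small margin of the ceiling; the scores and margin are invented and are not real BBQ or AILuminate results.

```python
# Toy saturation check: once the strongest systems all score near the
# ceiling, the benchmark no longer discriminates between them.
# Scores and the margin below are invented, not real BBQ or AILuminate results.

def is_saturated(top_scores, ceiling=1.0, margin=0.03):
    return all(score >= ceiling - margin for score in top_scores)

print(is_saturated([0.99, 0.98, 0.985]))  # True  -> little headroom left
print(is_saturated([0.91, 0.84, 0.78]))   # False -> still discriminative
```
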
  4. Creating and sustaining evaluation datasets
     This component of the tutorial explores how to conceptualize, build, and refine prompt datasets for evaluating GenAI systems. Challenges include building data standards for test-data collection and creation across the data lifecycle (an illustrative record structure follows the bullets below):

    • Systematic processes and structures
    • Ongoing data tuning including feedback, improvements, and refinement
    • Anticipating data distribution imbalances
    • Identifying protected classes in collected data
    • Centering human-evaluator-based best practice
    • Example: Low-resource languages
      • Common challenges
      • Devising and building evaluation datasets
      • Specific strategies from low-resource conditions that are extensible to broader AI testing
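
One way to operationalize these points is sketched below: an illustrative prompt-record structure plus a simple distribution check for spotting imbalances. The field names, language codes, and categories are assumptions for illustration only.

```python
# Illustrative evaluation-prompt record and a distribution check for
# spotting imbalances early. Field names and categories are assumptions.
from collections import Counter
from dataclasses import dataclass

@dataclass
class EvalPrompt:
    prompt: str
    reference: str        # expected or acceptable answer
    language: str         # ISO 639-1 code; relevant for low-resource work
    hazard_category: str  # what the prompt probes for
    annotator_id: str     # supports tracking annotator diversity

def distribution_report(records, field):
    # Share of records per value of `field`, e.g. per language.
    counts = Counter(getattr(r, field) for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

dataset = [
    EvalPrompt("prompt text", "answer", "en", "misinformation", "a01"),
    EvalPrompt("prompt text", "answer", "bm", "misinformation", "a02"),
    EvalPrompt("prompt text", "answer", "en", "privacy", "a01"),
]
print(distribution_report(dataset, "language"))  # reveals an English-heavy split
```
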
  5. Deploying assessments for system monitoring and adaptation
     The final part of the tutorial applies the practices from the previous sections to real-world AI evaluation using a complex-systems approach. The practicum walks participants through a real-world agentic GenAI deployment (a monitoring-loop sketch follows the bullets below), demonstrating how to:
    • Translate human objectives to GenAI key performance indicators
    • Collect implicit evaluation data from experts
    • Evaluate continuously in production environments
    • Triage results and create pipelines for adaptation (improvement, rare events, and evolution)
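
The sketch below shows one shape such a monitoring loop can take: sample recent production traffic, aggregate per-interaction scores into KPIs, and flag regressions for triage. The fetch_recent_interactions and score_interaction helpers, KPI names, and targets are hypothetical placeholders, not part of any specific deployment discussed in the tutorial.

```python
# Sketch of a continuous-evaluation loop over production traffic.
# fetch_recent_interactions() and score_interaction() are hypothetical
# helpers supplied by the deployment; KPI names and targets are illustrative.

KPI_TARGETS = {"task_success_rate": 0.90, "escalation_rate": 0.05}

def aggregate_kpis(scored):
    # Turn per-interaction scores into window-level KPIs.
    n = len(scored)
    return {
        "task_success_rate": sum(s["success"] for s in scored) / n,
        "escalation_rate": sum(s["escalated"] for s in scored) / n,
    }

def check_window(fetch_recent_interactions, score_interaction):
    # Score a window of traffic and flag KPI regressions for triage.
    scored = [score_interaction(i) for i in fetch_recent_interactions()]
    kpis = aggregate_kpis(scored)
    alerts = []
    if kpis["task_success_rate"] < KPI_TARGETS["task_success_rate"]:
        alerts.append("task success below target: triage failing cases")
    if kpis["escalation_rate"] > KPI_TARGETS["escalation_rate"]:
        alerts.append("escalations above target: inspect for rare events")
    return kpis, alerts
```
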
  6. Summary and recommendations
     The tutorial will conclude with a summary and recommendations for future research.


Connect with others working in GenAI evaluation

Join our Slack Channel to delve into cutting-edge research, communicate with the IEEE AITest tutorial community, and share evaluation resources.

Join our Slack Channel →


How to Attend

Register for the session at CISOSE 2025 to secure your spot.

Learn more & Register →
