GEN AI Robots being evaluated

CISOSE 2025 Tutorial

Generative AI Evaluation Essentials

Learn how to test GenAI systems with real-world rigor and practical frameworks

Struggling to evaluate GenAI systems in the real world?

This half-day, hands-on tutorial demystifies Generative AI (GenAI) evaluation—equipping you with tools and strategies to assess AI reliability, performance, and risk across enterprise, government, and academic use cases.

Led by industry experts, this session blends theory with applied techniques you can put to use immediately. You’ll learn how to:

  • Build strong AI testing strategies
  • Identify and manage failure modes
  • Collect high-quality evaluation data
  • Monitor system performance over time

Whether you’re just starting with GenAI or refining an existing approach, you’ll walk away with actionable methods and recommendations for implementing robust, scalable evaluation processes for GenAI systems.


The Organizing Committee

Heather Frase, Head of Veraitech

Dr. Heather Frase leads Veraitech and serves as Program Lead for the AI Risk and Reliability working group at MLCommons and as Senior Advisor for Testing & Evaluation of AI at Virginia Tech’s National Security Institute. Her diverse career has spanned significant roles in defense, intelligence, policy, and financial crime. She also serves on the Organisation for Economic Co-operation and Development (OECD) Network of Experts on AI and on the board of the Responsible AI Collaborative, which researches and documents AI incidents.

Sarah Luger, MLCommons

Dr. Sarah Luger has accumulated over two decades of expertise in Artificial Intelligence and Natural Language Processing, focusing on human communication challenges. Her recent work encompasses low-resource machine translation, online toxicity identification, GenAI for marketing, increasing data annotator diversity, and responsible AI. She holds a PhD in Informatics from the University of Edinburgh, specializing in automated question answering. Sarah’s background includes roles at IBM Watson, particularly in NLP tasks for the Jeopardy! Challenge, as well as leadership positions in the human computation and AI research communities. She is co-chair of the MLCommons Datasets Working Group.

Marisa Ferrara Boston, Reins AI

Dr. Marisa Ferrara Boston is an artificial intelligence professional focused on expert augmentation. She currently serves as lead scientist of Reins AI and CEO of simthetic.ai, organizations that create processes and datasets to validate enterprise AI use. She has held leadership roles at Google and KPMG, where she applied her expertise to industries spanning financial audit, healthcare, marketing, and creativity enhancement. She holds a PhD in Cognitive Science from Cornell University.


Target Audience

The tutorial is designed for AI researchers, data scientists, software engineers, and technical professionals in industry, government, and academic sectors developing, working with, or purchasing GenAI technologies. Ideal participants have basic knowledge of GenAI and its applications and an interest in evaluating, testing, and monitoring AI systems. The content will be particularly valuable for those responsible for AI system design, deployment, risk management, and quality assurance.


Expected Outcomes

Participants will develop comprehensive knowledge of current GenAI testing and evaluation methodologies, learning how different evaluation tools interact and when to apply each appropriately. Through the tutorial, participants will understand how to customize GenAI testing and implement continuous performance-monitoring strategies. They will also gain insight into the importance of sustaining evaluation tools across use cases and product evolution.


Tutorial Outline
  1. Introduction
     The tutorial will begin with key GenAI concept definitions and an outline of its scope, followed by an overview of current GenAI use in consumer, enterprise, and government settings, along with the state of testing across these use cases.

  2. Exploring the interplay of AI evaluation tools
     An array of tools exists to evaluate AI systems. From metrics and benchmarks to red-teaming, these methodologies offer critical insights into the capabilities, limitations, and potential risks of GenAI systems. A minimal scoring sketch follows the bullets below.

    • Overview
      • Purpose of an evaluation
      • Different evaluation tools: metrics, benchmarks, testing, red-teaming, and auditing.
    • Exploring the different tools
      • Common tools: metrics, benchmarks, model testing, field-testing, and red-teaming.
      • Strengths and weaknesses
      • Typical use
    • Recommendations for using evaluation tools
      • Prioritization and triage
      • Customization
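
A minimal sketch of one such tool, a benchmark-style exact-match metric, is shown below. The generate callable, the toy dataset, and the field names are illustrative assumptions rather than part of any specific benchmark covered in the tutorial.

```python
# Minimal sketch of a benchmark-style metric. The generate() callable,
# the toy dataset, and the field names are illustrative assumptions.

def exact_match(prediction: str, reference: str) -> bool:
    # Normalize whitespace and case before comparing answers.
    return prediction.strip().lower() == reference.strip().lower()

def run_benchmark(generate, dataset):
    # Score each prompt and return aggregate accuracy plus per-item results.
    results = []
    for item in dataset:
        output = generate(item["prompt"])
        results.append({"prompt": item["prompt"],
                        "output": output,
                        "passed": exact_match(output, item["reference"])})
    accuracy = sum(r["passed"] for r in results) / len(results)
    return accuracy, results

if __name__ == "__main__":
    toy_dataset = [
        {"prompt": "What is the capital of France?", "reference": "Paris"},
        {"prompt": "2 + 2 =", "reference": "4"},
    ]
    stub_model = lambda prompt: "Paris" if "France" in prompt else "5"
    accuracy, _ = run_benchmark(stub_model, toy_dataset)
    print(f"Exact-match accuracy: {accuracy:.2f}")  # 0.50 with this stub
```

Benchmarks aggregate many such metrics over curated datasets, while red-teaming and field-testing swap the fixed dataset for adversarial or in-situ prompts.
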
  3. Sustaining AI evaluation tools and practices
     AI evaluation tools can be dynamic systems requiring sustainment. Every system, with or without AI, has failure modes as well as reliability and quality concerns, and GenAI evaluation tools demand ongoing reliability assessment and scrutiny for quality. A toy saturation check follows the bullets below.

    • Key systems engineering concepts
      • Reliability
      • Failures and failure modes
    • GenAI-specific reliability concerns
    • Example: Benchmark reliability efforts
      • Identifying benchmark failure modes
      • Saturation of the BBQ benchmark
      • AILuminate: built with benchmark integrity and sustainment in mind
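
As a toy illustration of the saturation concern, the check below flags a benchmark once every leading system scores within a small margin of the ceiling; the scores and margin are invented and are not real BBQ or AILuminate results.

```python
# Toy saturation check: once the strongest systems all score near the
# ceiling, the benchmark no longer discriminates between them.
# Scores and the margin below are invented, not real BBQ or AILuminate results.

def is_saturated(top_scores, ceiling=1.0, margin=0.03):
    return all(score >= ceiling - margin for score in top_scores)

print(is_saturated([0.99, 0.98, 0.985]))  # True  -> little headroom left
print(is_saturated([0.91, 0.84, 0.78]))   # False -> still discriminative
```
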
  4. Creating and sustaining evaluation datasets
     This component of the tutorial explores how to conceptualize, build, and refine prompt datasets for evaluating GenAI systems. Challenges include building data standards for test-data collection and creation across the data lifecycle (an illustrative record structure follows the bullets below):

    • Systematic processes and structures
    • Ongoing data tuning including feedback, improvements, and refinement
    • Anticipating data distribution imbalances
    • Identifying protected classes in collected data
    • Centering human-evaluator-based best practice
    • Example: Low-resource languages
      • Common challenges
      • Devising and building evaluation datasets
      • Specific strategies from low-resource conditions that are extensible to broader AI testing
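
One way to operationalize these points is sketched below: an illustrative prompt-record structure plus a simple distribution check for spotting imbalances. The field names, language codes, and categories are assumptions for illustration only.

```python
# Illustrative evaluation-prompt record and a distribution check for
# spotting imbalances early. Field names and categories are assumptions.
from collections import Counter
from dataclasses import dataclass

@dataclass
class EvalPrompt:
    prompt: str
    reference: str        # expected or acceptable answer
    language: str         # ISO 639-1 code; relevant for low-resource work
    hazard_category: str  # what the prompt probes for
    annotator_id: str     # supports tracking annotator diversity

def distribution_report(records, field):
    # Share of records per value of `field`, e.g. per language.
    counts = Counter(getattr(r, field) for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

dataset = [
    EvalPrompt("prompt text", "answer", "en", "misinformation", "a01"),
    EvalPrompt("prompt text", "answer", "bm", "misinformation", "a02"),
    EvalPrompt("prompt text", "answer", "en", "privacy", "a01"),
]
print(distribution_report(dataset, "language"))  # reveals an English-heavy split
```
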
  5. Deploying assessments for system monitoring and adaptation
     The final part of the tutorial applies the practices from the previous sections to real-world AI evaluation using a complex-systems approach. The practicum walks participants through a real-world agentic GenAI deployment (a monitoring-loop sketch follows the bullets below), demonstrating how to:
    • Translate human objectives to GenAI key performance indicators
    • Collect implicit evaluation data from experts
    • Evaluate continuously in production environments
    • Triage results and create pipelines for adaptation (improvement, rare events, and evolution)
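
The sketch below shows one shape such a monitoring loop can take: sample recent production traffic, aggregate per-interaction scores into KPIs, and flag regressions for triage. The fetch_recent_interactions and score_interaction helpers, KPI names, and targets are hypothetical placeholders, not part of any specific deployment discussed in the tutorial.

```python
# Sketch of a continuous-evaluation loop over production traffic.
# fetch_recent_interactions() and score_interaction() are hypothetical
# helpers supplied by the deployment; KPI names and targets are illustrative.

KPI_TARGETS = {"task_success_rate": 0.90, "escalation_rate": 0.05}

def aggregate_kpis(scored):
    # Turn per-interaction scores into window-level KPIs.
    n = len(scored)
    return {
        "task_success_rate": sum(s["success"] for s in scored) / n,
        "escalation_rate": sum(s["escalated"] for s in scored) / n,
    }

def check_window(fetch_recent_interactions, score_interaction):
    # Score a window of traffic and flag KPI regressions for triage.
    scored = [score_interaction(i) for i in fetch_recent_interactions()]
    kpis = aggregate_kpis(scored)
    alerts = []
    if kpis["task_success_rate"] < KPI_TARGETS["task_success_rate"]:
        alerts.append("task success below target: triage failing cases")
    if kpis["escalation_rate"] > KPI_TARGETS["escalation_rate"]:
        alerts.append("escalations above target: inspect for rare events")
    return kpis, alerts
```
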
  6. Summary and recommendations
     The tutorial will conclude with a summary and recommendations for future research.


Connect with others working in GenAI evaluation

Join our Slack Channel to delve into cutting-edge research, communicate with the IEEE AITest tutorial community, and share evaluation resources.

Join our Slack Channel →


How to Attend

Register for the session at CISOSE 2025 to secure your spot.

Learn more & Register →
