CISOSE 2025 Tutorial
This half-day, hands-on tutorial demystifies Generative AI (GenAI) evaluation—equipping you with tools and strategies to assess AI reliability, performance, and risk across enterprise, government, and academic use cases.
Led by industry experts, this session blends theory with applied techniques you can put to use immediately. You'll learn how to select and apply evaluation tools appropriately, customize GenAI testing for your use cases, and implement continuous performance monitoring.
Whether you're just starting with GenAI or refining an existing approach, you'll walk away with actionable methods and recommendations for implementing robust, scalable evaluation processes for GenAI systems.
Dr. Heather Frase leads Veraitech and serves as Program Lead for the AI Risk and Reliability working group at MLCommons and as Senior Advisor for Testing & Evaluation of AI at Virginia Tech's National Security Institute. Her diverse career has spanned significant roles in defense, intelligence, policy, and financial crime. She also serves on the Organisation for Economic Co-operation and Development (OECD) Network of Experts on AI and on the board of the Responsible AI Collaborative, which researches and documents AI incidents.
Dr. Sarah Luger has accumulated over two decades of expertise in Artificial Intelligence and Natural Language Processing, focusing on human communication challenges. Her recent work encompasses low-resource machine translation, online toxicity identification, GenAI for marketing, increasing data annotator diversity, and responsible AI. She holds a PhD in Informatics from the University of Edinburgh, specializing in automated question answering. Sarah's background includes roles at IBM Watson, particularly in NLP tasks for the Jeopardy! Challenge, as well as leadership positions in the human computation and AI research communities. She is the co-chair of the MLCommons Datasets Working Group.
Dr. Marisa Ferrara Boston is an artificial intelligence professional focused on expert augmentation. She currently serves as lead scientist of Reins AI and CEO of simthetic.ai, organizations that create processes and datasets to validate enterprise AI use. She has held leadership roles at Google and KPMG, where she applied her expertise to industries spanning financial audit, healthcare, marketing, and creativity enhancement. She holds a PhD in Cognitive Science from Cornell University.
The tutorial is designed for AI researchers, data scientists, software engineers, and technical professionals in industry, government, and academia who develop, work with, or purchase GenAI technologies. Ideal participants have basic knowledge of GenAI and its applications and an interest in evaluating, testing, and monitoring AI systems. The content will be particularly valuable for those responsible for AI system design, deployment, risk management, and quality assurance.
Participants will develop a comprehensive understanding of current GenAI testing and evaluation methodologies, learning how different evaluation tools interact and when to apply each appropriately. Through the tutorial, participants will learn how to customize GenAI testing and implement continuous performance monitoring strategies. They will also gain insight into the importance of sustaining evaluation tools across use cases and product evolution.
The tutorial will begin with definitions of key GenAI concepts and an outline of its scope. This will be followed by an overview of current GenAI use in consumer, enterprise, and government settings, along with the state of testing across these use cases.
In AI evaluation, practitioners have an array of tools at their disposal. From metrics and benchmarks to red-teaming, these methodologies offer critical insights into the capabilities, limitations, and potential risks of GenAI systems.
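To make the metric-and-benchmark style of evaluation concrete, here is a minimal, illustrative sketch (not part of the tutorial materials): it scores a system's responses against a small prompt set with an exact-match metric. The prompts, reference answers, and `generate` stub are hypothetical placeholders for a real GenAI system and benchmark.

```python
# Minimal sketch of a benchmark-style evaluation loop (illustrative only).
# The prompts, reference answers, and `generate` stub are hypothetical
# stand-ins for a real GenAI system and benchmark.

from typing import Callable

# A tiny "benchmark": prompts paired with reference answers.
BENCHMARK = [
    {"prompt": "What is 2 + 2?", "reference": "4"},
    {"prompt": "Name the capital of France.", "reference": "Paris"},
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the GenAI system under test."""
    canned = {"What is 2 + 2?": "4", "Name the capital of France.": "Paris"}
    return canned.get(prompt, "")

def exact_match(response: str, reference: str) -> float:
    """A deliberately simple metric; real evaluations often use
    task-specific scoring, rubric-based grading, or human review."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0

def run_benchmark(model: Callable[[str], str]) -> float:
    """Average the metric over the whole prompt set."""
    scores = [exact_match(model(item["prompt"]), item["reference"]) for item in BENCHMARK]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(f"Benchmark score: {run_benchmark(generate):.2f}")
```

In practice the scoring function, prompt set size, and aggregation strategy all depend on the use case, which is part of what the tutorial's discussion of choosing and customizing tools addresses.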
AI evaluation tools can be dynamic systems that require sustainment. Every system, with or without AI, has failure modes and reliability and quality concerns, and GenAI evaluation tools themselves demand ongoing reliability assessment and quality scrutiny.
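As one hedged illustration of what ongoing sustainment might look like in practice, the sketch below flags a benchmark run whose score drops well below a rolling baseline; the score history and tolerance threshold are invented for the example and are not from the tutorial.

```python
# Illustrative regression check for an evaluation suite
# (hypothetical thresholds and score history; not from the tutorial).

from statistics import mean

def check_for_regression(history: list[float], current: float, tolerance: float = 0.05) -> bool:
    """Flag the current evaluation score if it falls more than `tolerance`
    below the baseline built from previous runs."""
    if not history:
        return False  # nothing to compare against yet
    baseline = mean(history)
    return current < baseline - tolerance

# Example: previous benchmark scores vs. the latest run.
past_scores = [0.91, 0.90, 0.92]
latest = 0.78
if check_for_regression(past_scores, latest):
    print("Possible regression: investigate the model, the data, or the evaluation tool itself.")
```

A drop like this may indicate model degradation, but it may equally indicate drift in the test data or a bug in the scoring code, which is why the evaluation tooling itself needs the same scrutiny as the system under test.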
The next component of the tutorial explores how to conceptualize, build, and refine prompt datasets for evaluating GenAI systems. Challenges include establishing data standards for test data collection and creation across the data life-cycle.
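As an illustration of the kind of structure a prompt-dataset record might carry across its life-cycle (provenance, licensing, versioning), here is a minimal sketch; the schema, field names, and values are hypothetical assumptions for this example, not a standard proposed by the tutorial.

```python
# A sketch of one possible schema for a prompt-dataset record; the field
# names and values are illustrative assumptions, not a tutorial standard.

from dataclasses import dataclass, asdict
import json

@dataclass
class PromptRecord:
    prompt_id: str          # stable identifier for traceability
    text: str               # the prompt sent to the system under test
    expected_behavior: str  # reference answer or rubric used for scoring
    hazard_category: str    # the risk or capability being probed
    source: str             # provenance: who wrote or collected the prompt
    license: str            # usage terms for the test data
    version: str            # supports revision as the dataset evolves

record = PromptRecord(
    prompt_id="demo-0001",
    text="Summarize the attached contract in two sentences.",
    expected_behavior="A faithful two-sentence summary with no invented terms.",
    hazard_category="faithfulness",
    source="hand-written example",
    license="CC-BY-4.0",
    version="0.1",
)

print(json.dumps(asdict(record), indent=2))
```

Carrying metadata like provenance, license, and version with each record is one way to keep a test set auditable and maintainable as prompts are added, retired, or revised over a product's evolution.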
The tutorial will conclude with a summary and recommendations for future research.
Join our Slack Channel to delve into cutting-edge research, communicate with the IEEE AITest tutorial community, and share evaluation resources.
Register for the session at CISOSE 2025 to secure your spot.