PyConMY 2025

How I Use Evals to Keep My AI Apps From Falling Apart
2025-11-02, Hall 1

As Large Language Model apps become more powerful and widely used, one challenge keeps surfacing: how do we know if they’re actually working well? LLM apps often fail silently, with hallucinations, bugs, and inconsistent outputs going unnoticed, and manual reviews don’t scale. This talk introduces a practical three-part evaluation framework that uses code checks, golden sets, and LLMs as judges to catch failures early, improve output quality, and help you build more reliable AI applications.


Most of us who build LLM-powered apps (chatbots, summarizers, RAG pipelines) know the pattern: you test a handful of prompts, everything looks fine, you ship… and then real users uncover off-topic answers, hallucinations, or odd formatting. Worse, there’s no reliable way to spot these issues early or track quality over time.

That’s where evaluations, or “evals,” come in. This talk presents a practical, iterative framework inspired by OpenAI’s internal approach and guided by one mantra: stop guessing, start testing. We’ll cover three layers of evals in a practical demo:

  1. Code-based tests to catch structural mistakes such as wrong length, missing fields, or bad JSON.

  2. Golden sets of hand-checked responses to anchor accuracy and tone.

  3. LLM-as-judge scoring that lets a larger model grade nuanced traits like fluency, helpfulness, and style.
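A minimal sketch of how the three layers above could look in Python. This is an illustration under stated assumptions, not the speaker's actual code: the field names (`title`, `summary`), the length limit, the exact-match comparison, the 1 to 5 scale, and the `generate` and `judge` callables are all hypothetical placeholders for your own app and model client.

```python
import json
import re


def check_structure(raw, required_fields=("title", "summary"), max_summary_chars=200):
    """Layer 1: code-based check. Returns a list of problems (empty means pass)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["bad JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]
    problems = [f"missing field: {name}" for name in required_fields if name not in data]
    summary = data.get("summary", "")
    if isinstance(summary, str) and len(summary) > max_summary_chars:
        problems.append("summary too long")
    return problems


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count as misses."""
    return " ".join(text.lower().split())


def golden_set_accuracy(generate, golden_set):
    """Layer 2: fraction of hand-checked (question, expected) pairs the app reproduces.

    Exact match after normalization is an assumption made for brevity;
    a real golden set often needs a softer comparison.
    """
    hits = sum(normalize(generate(q)) == normalize(expected) for q, expected in golden_set)
    return hits / len(golden_set)


# Hypothetical judge prompt; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = (
    "Rate the helpfulness of the answer on a scale of 1 to 5.\n"
    "Question: {question}\nAnswer: {answer}\nReply with the number only."
)


def judge_score(judge, question, answer):
    """Layer 3: ask a (larger) model to grade the answer.

    `judge` is any callable that takes a prompt string and returns the model's
    text reply. Returns an int from 1 to 5, or None if no score can be parsed.
    """
    reply = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

The judge is passed in as a plain callable so the sketch stays independent of any particular model provider; in practice it would wrap whatever client you already use.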

A real-world case study will show how a six-step workflow (identify failures, design evals, build datasets, generate, score, refine) tightened prompts and boosted quality. We’ll also explore building a trustworthy LLM judge through metaprompting and share tips for wiring these evals into CI pipelines. You’ll leave with a copy-pasteable playbook for testing your own LLM apps: no massive datasets or deep ML expertise required, just a thoughtful setup, Python code, and a bit of iteration.
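One way the generate and score steps could feed a CI pipeline is a small gate that turns eval results into an exit code. The sketch below is an assumption about how such wiring might look, not the workflow from the case study: `run_eval_suite`, `pass_rate`, `ci_gate`, the case dictionary layout, and the 0.9 threshold are all hypothetical.

```python
def run_eval_suite(generate, cases, checks):
    """Generate an output for every case and record which named checks failed.

    `generate` maps an input string to the app's output.
    `checks` maps a check name to a function (case, output) -> bool.
    """
    results = []
    for case in cases:
        output = generate(case["input"])
        failures = [name for name, check in checks.items() if not check(case, output)]
        results.append({"input": case["input"], "output": output, "failures": failures})
    return results


def pass_rate(results):
    """Fraction of cases that passed every check."""
    return sum(not r["failures"] for r in results) / len(results)


def ci_gate(results, threshold=0.9):
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    return 0 if pass_rate(results) >= threshold else 1
```

In a CI job, a script could end with `sys.exit(ci_gate(results))` so that a drop in quality fails the build; the per-case `failures` lists then point to which prompts to refine next.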

Kalyan is a Lead Data and AI Scientist and a former data science and analytics manager whose work spans both academia and industry. He has presented talks at various PyCons and at data science and AI conferences. As a community leader, Kalyan currently serves as one of the Chairs for PyConf Hyderabad 2025 and has held the role of Co-chair for PyCon India 2023 and for PyConf Hyderabad from 2022 to 2024. In addition to these leadership positions, he is an active contributor to numerous Python, data science, and scientific communities worldwide.