Evaluating LLMs in Databricks with RAGAS and MLflow
TL;DR
Evaluating LLMs is essential for ensuring they perform accurately and align with safety standards. This talk covers two frameworks for LLM evaluation, RAGAS and MLflow, and their practical applications, including a live demo that walks through setting up an evaluation pipeline, monitoring results, and refining metrics.
Session Details
As Large Language Models (LLMs) become increasingly embedded in real-world applications, evaluating their performance and alignment with safety standards has never been more critical. This session delves into the importance of robust evaluation strategies for LLMs and introduces two powerful frameworks: RAGAS and MLflow.
Attendees will gain insights into how these frameworks can be leveraged to design comprehensive evaluation pipelines tailored to specific use cases. We’ll discuss their core features, strengths, and practical implementations, guiding you from setup to execution. The session includes a live demonstration showcasing the end-to-end process of configuring an evaluation pipeline, monitoring performance metrics, identifying areas for improvement, and refining the model based on evaluation outcomes.
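To give a flavour of what the pipeline setup looks like, below is a minimal RAGAS sketch, not the demo code from the session. It assumes the pre-0.2 `ragas.evaluate` API with a Hugging Face `Dataset`, an illustrative in-memory evaluation record, and an LLM judge configured via your environment; column names and available metrics vary by RAGAS version.

```python
# Minimal RAGAS evaluation sketch (illustrative data; API details vary by RAGAS version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Hypothetical evaluation records: question, retrieved contexts, generated answer, reference.
eval_data = {
    "question": ["What is Delta Lake?"],
    "contexts": [["Delta Lake is an open-source storage layer that brings ACID transactions to data lakes."]],
    "answer": ["Delta Lake is an open-source storage layer providing ACID transactions on data lakes."],
    "ground_truth": ["Delta Lake is an open-source storage layer that adds ACID transactions to data lakes."],
}

dataset = Dataset.from_dict(eval_data)

# Score faithfulness and answer relevancy; results are returned as per-metric aggregates.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
```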
Whether you're developing GenAI-driven applications or looking to optimize existing systems, this talk will equip you with actionable strategies to ensure your models are both effective and aligned with safety and ethical guidelines. Join us to elevate your LLM evaluation practices with hands-on tools and real-world examples.
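For comparison, a similar sketch on the MLflow side could use `mlflow.evaluate` with the built-in question-answering evaluator. The model URI and the evaluation DataFrame below are hypothetical placeholders, and the call assumes MLflow 2.x.

```python
# Minimal MLflow LLM-evaluation sketch (hypothetical data and model URI; assumes MLflow 2.x).
import mlflow
import pandas as pd

# Hypothetical evaluation set with model inputs and reference answers.
eval_df = pd.DataFrame(
    {
        "inputs": ["What is Delta Lake?"],
        "ground_truth": ["Delta Lake is an open-source storage layer that adds ACID transactions to data lakes."],
    }
)

with mlflow.start_run():
    # "models:/my-qa-model/1" is a placeholder URI for a logged model; swap in your own.
    results = mlflow.evaluate(
        model="models:/my-qa-model/1",
        data=eval_df,
        targets="ground_truth",
        model_type="question-answering",  # enables built-in QA metrics such as exact_match
        evaluators="default",
    )
    print(results.metrics)
```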
3 things you'll get out of this session
Show attendees the importance of evaluating LLMs
Showcase two industry-standard frameworks for evaluating GenAI
Communicate the common pitfalls of evaluating GenAI models vs. traditional ML models
Speakers
Tori Tompkins's other proposed sessions for 2026
Unlocking the Potential of Retrieval-Augmented Generation (RAG) with Advanced Patterns - 2026
You Think Your MLOps Can Scale GenAI? Think Again. - 2026
AgentBricks vs Mosaic AI - 2026
No Code Agents with AgentBricks - 2026
Real-Time AI with Databricks Online Feature Stores: Powered by Lakebase - 2026