Evaluating LLMs in Databricks with RAGAS and MLFlow
TL;DR
Evaluating LLMs is essential for ensuring they perform accurately and align with safety standards. This talk covers two frameworks for LLM evaluation: RAGAS and MLFlow. We’ll explore practical applications of both, including a live demo that walks through setting up an evaluation pipeline, monitoring results, and refining metrics.
Session Details
As Large Language Models (LLMs) become increasingly embedded in real-world applications, evaluating their performance and alignment with safety standards has never been more critical. This session delves into the importance of robust evaluation strategies for LLMs and introduces two powerful frameworks: RAGAS and MLFlow.
Attendees will gain insights into how these frameworks can be leveraged to design comprehensive evaluation pipelines tailored to specific use cases. We’ll discuss their core features, strengths, and practical implementations, guiding you from setup to execution. The session includes a live demonstration showcasing the end-to-end process of configuring an evaluation pipeline, monitoring performance metrics, identifying areas for improvement, and refining the model based on evaluation outcomes.
Whether you're developing GenAI-driven applications or looking to optimize existing systems, this talk will equip you with actionable strategies to ensure your models are both effective and aligned with safety and ethical guidelines. Join us to elevate your LLM evaluation practices with hands-on tools and real-world examples.
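To give a flavor of what the demo covers, here is a minimal sketch of a RAGAS-plus-MLFlow evaluation loop. This is illustrative rather than the exact demo code: it assumes the `ragas`, `datasets`, and `mlflow` packages are installed, the sample question and contexts are hypothetical, RAGAS dataset column names vary between versions, and RAGAS metrics call a judge LLM under the hood, so a model/API key must be configured separately.

```python
# Minimal sketch: score RAG outputs with RAGAS, log results to MLFlow.
# Assumes ragas, datasets, and mlflow are installed and a judge LLM is configured.
import mlflow
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Hypothetical evaluation records: questions, model answers, and the
# retrieved contexts the answers were supposed to be grounded in.
eval_data = Dataset.from_dict({
    "question": ["What is MLFlow used for?"],
    "answer": ["MLFlow tracks experiments, models, and evaluation runs."],
    "contexts": [["MLFlow is an open-source platform for managing the ML lifecycle."]],
})

# Score the responses with RAGAS metrics (each metric prompts the judge LLM).
result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
scores = result.to_pandas()

# Log aggregate scores to MLFlow so they appear in the Databricks
# experiment UI alongside other runs.
with mlflow.start_run(run_name="ragas-eval"):
    for metric in ["faithfulness", "answer_relevancy"]:
        mlflow.log_metric(metric, float(scores[metric].mean()))
```

In the live demo we extend this pattern end to end: comparing runs across prompt and model variants, watching the metrics in the experiment tracker, and feeding the results back into refinement.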
3 things you'll get out of this session
Show attendees the importance of evaluating LLMs
Showcase the two industry-standard frameworks for evaluating GenAI
Communicate the common pitfalls of evaluating GenAI models vs. traditional ML models