Semantic Joins with AI for Deeper Knowledge Discovery
Proposed session for SQLBits 2026
TL;DR
Semantic joins enable deeper knowledge discovery, but traditional methods fail at scale. Explore how LLMs can address this by replacing brittle string metrics and rules with adaptive semantic understanding, reducing manual effort in entity resolution and deduplication.
Session Details
Deeper knowledge discovery often requires matching data without access to common join keys. This includes combining data from disparate systems such as CRM and sales databases, matching customer data across different divisions within the same company, or incorporating third-party data. Similarly, identifying and merging duplicate records within a dataset poses the same fundamental challenge.
This problem, commonly known as record linkage or entity resolution, remains difficult to solve at scale. Traditional approaches, from basic string similarity metrics to sophisticated ontology-based methods, struggle with modern data's scale, ambiguity, and diversity. They cannot reliably handle synonyms, polysemy, and abbreviations, require extensive manual maintenance, and ultimately fail to capture the full "fuzziness" of real-world data.
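To illustrate the brittleness described above, here is a small, self-contained sketch using Python's standard-library `difflib`. The company names are illustrative: a character-level similarity score rates an abbreviation and its own expansion as far less similar than two unrelated entities that merely share a surface token.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# An abbreviation and its expansion score very low...
print(similarity("IBM", "International Business Machines"))  # well below 0.5

# ...while two different entities sharing a token score high.
print(similarity("Apple Inc.", "Apple Records"))             # well above 0.5
```

Raising or lowering a match threshold only trades one failure mode for the other, which is exactly why rule tuning for these metrics never ends.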
This talk will explore how LLMs address these core limitations by replacing syntactic analysis and brittle rules with semantic comprehension. LLMs can adapt to new semantic matching tasks without requiring costly retraining, enabling users to define complex or domain-specific join criteria through natural language and examples. This substantially reduces manual effort and paves the way for more accurate, automated, and adaptable entity resolution—whether matching across systems or deduplicating within them.
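The workflow described above, where natural-language join criteria are judged per candidate pair, can be sketched in Python. Everything here (the prompt wording, the `fake_llm` stand-in, and the sample records) is a hypothetical illustration, not a specific product API; in practice the callable passed to `make_llm_judge` would wrap a real model client.

```python
# Hypothetical sketch of an LLM-backed semantic join.
MATCH_PROMPT = (
    "Do these two records refer to the same real-world entity?\n"
    "Matching criteria: {criteria}\n"
    "Record A: {a}\n"
    "Record B: {b}\n"
    "Answer with exactly MATCH or NO_MATCH."
)

def make_llm_judge(call_llm):
    """Build a pairwise judge from any prompt-in/string-out LLM callable."""
    def judge(a, b, criteria):
        prompt = MATCH_PROMPT.format(criteria=criteria, a=a, b=b)
        return call_llm(prompt).strip().upper().startswith("MATCH")
    return judge

def semantic_join(left, right, criteria, judge):
    """Keep pairs the judge accepts. Real deployments add blocking
    (e.g. embedding pre-filtering) to avoid O(n*m) LLM calls."""
    return [(a, b) for a in left for b in right if judge(a, b, criteria)]

# Offline stand-in for the LLM so the sketch runs without an API key:
# it "knows" exactly one alias pair (purely illustrative).
def fake_llm(prompt: str) -> str:
    if "IBM" in prompt and "International Business Machines" in prompt:
        return "MATCH"
    return "NO_MATCH"

pairs = semantic_join(
    ["IBM", "Acme Corp"],
    ["International Business Machines", "Globex"],
    criteria="records name the same company",
    judge=make_llm_judge(fake_llm),
)
print(pairs)  # [('IBM', 'International Business Machines')]
```

Note that the join criteria travel in the prompt as plain language, so adapting the join to a new domain means editing a sentence, not retraining a model or rewriting matching rules.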
3 things you'll get out of this session
Understand the limitations of traditional record linkage approaches (string metrics, ontologies, rule-based systems, ...)
Understand how LLMs enable semantic comprehension for data matching at scale
Apply LLM techniques for automated, adaptable entity resolution and deduplication
Speakers
Michael Victor's other proposed sessions for 2026
Building a Feature Store in Fabric to Support Model Reproducibility - 2026
Learning to Spot and Circumvent Paradoxes in Data Analysis to avoid flawed conclusions - 2026
Selecting the Right Tools for Deploying a CI/CD Workflow in Fabric - 2026
The Power of Naming: Setting up a Naming Convention for Success - 2026