A walkthrough of using MLflow GenAI evaluation to compare chunking strategies for Databricks Agent Bricks Knowledge Assistant, from building eval datasets to writing custom LLM judges.

A practical walkthrough comparing three chunking strategies for RAG-based code knowledge assistants: naive fixed-size splitting, language-aware splitting (LangChain's RecursiveCharacterTextSplitter), and AST-based chunking using Tree-sitter. Each strategy was deployed as a Databricks Knowledge Assistant over a demo codebase, then evaluated using MLflow's GenAI evaluation framework with 46 questions and three LLM judges. AST-based chunking outperformed the others on answer correctness (70% vs 59-61%), particularly for disambiguation questions, due to metadata headers prepended to each chunk. The post also covers building custom LLM judges with MLflow's make_judge() API and using MLflow traces to inspect retrieved chunks per query.

Building a Knowledge Assistant over Code

How Knowledge Assistants Work (and Why Code Is Different)