Towards Data Science is a community-powered publication that showcases work in data science, machine learning and artificial intelligence. Every day newcomers, seasoned researchers and industry practitioners publish tutorials, research notes and real-world case studies that help the field move forward.

Towards Data Science

Flattening structured JSON data into natural language before embedding can improve vector search precision and recall by up to 20%. Generic embedding models trained on unstructured text struggle with JSON's structural syntax (quotes, colons, braces), which creates noise tokens that dilute semantic meaning during tokenization, attention calculation, and mean pooling. Converting JSON to templated natural language reduces token count by 14% and provides clearer semantic context. An experiment using the all-MiniLM-L6-v2 model and Amazon ESCI dataset with 5,000 queries demonstrates consistent retrieval performance improvements across recall@k and precision@k metrics.

Optimizing Vector Search: Why You Should Flatten Structured Data