Swiggy's Crew team built an on-device predictive autocomplete system for a conversational concierge app in React Native. The system uses two small AI models totaling ~90 MB: a MiniLM-L12 classifier (~30 MB, ~80 ms) for intent/category detection, and a fine-tuned SmolLM2-135M slot extractor (~60 MB, ~200 ms) for structured field extraction.
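The two-stage design described above can be sketched as follows. This is an illustrative skeleton only, not Swiggy's implementation: the stub logic, function names, and intent labels are hypothetical, standing in for the real MiniLM-L12 classifier and SmolLM2-135M slot extractor. The key idea it shows is the gating: the fast classifier always runs, and the slower slot extractor runs only when an actionable intent is detected.

```python
from dataclasses import dataclass, field

@dataclass
class Prediction:
    intent: str
    slots: dict = field(default_factory=dict)

def classify_intent(text: str) -> str:
    # Stage 1: encoder-based classifier (MiniLM-L12 in the article,
    # ~30 MB / ~80 ms on device). Stubbed with a keyword check here.
    if "book" in text.lower():
        return "booking"
    return "other"

def extract_slots(text: str, intent: str) -> dict:
    # Stage 2: decoder-based slot extractor (SmolLM2-135M in the
    # article, ~60 MB / ~200 ms). Only invoked for actionable intents,
    # so the ~200 ms cost is not paid on every keystroke.
    if intent == "booking":
        return {"raw_query": text}  # placeholder for structured fields
    return {}

def predict(text: str) -> Prediction:
    intent = classify_intent(text)       # cheap, always runs
    slots = extract_slots(text, intent)  # expensive, gated on intent
    return Prediction(intent=intent, slots=slots)
```

A usage example: `predict("book a cab to the airport")` would return a `Prediction` with `intent="booking"` and the extracted slots, while non-actionable text skips stage 2 entirely.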

11 min read · From bytes.swiggy.com
Table of contents

- The Problem
- What It Looks Like
- The Naive Approach (and Why It Fails)
- The Architecture
- Model 1: The Classifier
  - Encoder vs decoder — pick the right tool
  - Why MiniLM-L12 specifically
  - What sits on top of MiniLM
  - The flat classifier trap
  - Shipping it: ExecuTorch vs the alternatives
- Model 2: The Slot Extractor
  - Why not regex or rules?
  - Why a decoder model this time?
  - Why not an LLM?
  - GGUF: the format that makes it possible
  - Grammar constraints: enforce the format mathematically
- Training
  - Dataset
  - Classifier model two-phase training
  - Slot extractor training with LoRA
  - Metrics
- Shipping Model Updates OTA
- Closing the Loop: Annotation and Retraining
  - Concierge Intent Classification
  - Slot Extraction for Autocompletion
