Swiggy's Crew team built an on-device predictive autocomplete system for a conversational concierge app in React Native. The system uses two small AI models totaling ~90 MB: a MiniLM-L12 classifier (~30 MB, ~80 ms) for intent/category detection, and a fine-tuned SmolLM2-135M slot extractor (~60 MB, ~200 ms) for structured field extraction.
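The two-stage design described above can be sketched as follows. This is an illustrative skeleton only, not Swiggy's implementation: the stub logic, function names, and intent labels are hypothetical, standing in for the real MiniLM-L12 classifier and SmolLM2-135M slot extractor. The key idea it shows is the gating: the fast classifier always runs, and the slower slot extractor runs only when an actionable intent is detected.

```python
from dataclasses import dataclass, field

@dataclass
class Prediction:
    intent: str
    slots: dict = field(default_factory=dict)

def classify_intent(text: str) -> str:
    # Stage 1: encoder-based classifier (MiniLM-L12 in the article,
    # ~30 MB / ~80 ms on device). Stubbed with a keyword check here.
    if "book" in text.lower():
        return "booking"
    return "other"

def extract_slots(text: str, intent: str) -> dict:
    # Stage 2: decoder-based slot extractor (SmolLM2-135M in the
    # article, ~60 MB / ~200 ms). Only invoked for actionable intents,
    # so the ~200 ms cost is not paid on every keystroke.
    if intent == "booking":
        return {"raw_query": text}  # placeholder for structured fields
    return {}

def predict(text: str) -> Prediction:
    intent = classify_intent(text)       # cheap, always runs
    slots = extract_slots(text, intent)  # expensive, gated on intent
    return Prediction(intent=intent, slots=slots)
```

A usage example: `predict("book a cab to the airport")` would return a `Prediction` with `intent="booking"` and the extracted slots, while non-actionable text skips stage 2 entirely.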

11 min read · From bytes.swiggy.com
Table of contents

- The Problem
- What It Looks Like
- The Naive Approach (and Why It Fails)
- The Architecture
- Model 1: The Classifier
  - Encoder vs decoder — pick the right tool
  - Why MiniLM-L12 specifically
  - What sits on top of MiniLM
  - The flat classifier trap
  - Shipping it: ExecuTorch vs the alternatives
- Model 2: The Slot Extractor
  - Why not regex or rules?
  - Why a decoder model this time?
  - Why not an LLM?
  - GGUF: the format that makes it possible
  - Grammar constraints: enforce the format mathematically
- Training
  - Dataset
  - Classifier model two-phase training
  - Slot extractor training with LoRA
  - Metrics
- Shipping Model Updates OTA
- Closing the Loop: Annotation and Retraining
  - Concierge Intent Classification
  - Slot Extraction for Autocompletion
