Grab developed a foundation model built on a transformer architecture to unify user understanding across its superapp ecosystem. The model processes both tabular data (user profiles, transaction history) and time-series data (clickstream interactions), using specialized adapters for each modality (text, IDs, locations, numerical values). Pre-trained in an unsupervised fashion with masked language modeling and next-action prediction, it generates dual embeddings (long-term and short-term) for users, merchants, and drivers. The system employs hierarchical classification to handle massive ID vocabularies and supports both fine-tuning for specific tasks and embedding extraction for general-purpose features. It currently powers ad optimization, fraud detection, and churn prediction across Grab's platform.
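To make the idea of hierarchical classification over a massive ID vocabulary concrete, here is a minimal sketch in plain Python. It is not Grab's implementation: the summary does not say how the hierarchy is built, so this assumes a simple two-level scheme in which a flat ID is split into a group and a position within that group by integer division. All function names (`split_id`, `hierarchical_prob`, `output_sizes`) and the fixed `group_size` are hypothetical and chosen for illustration only.

```python
import math
from typing import List, Tuple


def split_id(item_id: int, group_size: int) -> Tuple[int, int]:
    """Map a flat ID to (group index, position within the group).

    Assumption: groups are contiguous ID ranges of equal size.
    """
    return divmod(item_id, group_size)


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_prob(
    item_id: int,
    group_logits: List[float],
    within_logits: List[List[float]],
    group_size: int,
) -> float:
    """Two-level factorisation: P(id) = P(group) * P(position | group)."""
    group, pos = split_id(item_id, group_size)
    p_group = softmax(group_logits)[group]
    p_pos = softmax(within_logits[group])[pos]
    return p_group * p_pos


def output_sizes(vocab_size: int, group_size: int) -> Tuple[int, int]:
    """Logits scored per prediction: flat classifier vs. two-level classifier."""
    num_groups = math.ceil(vocab_size / group_size)
    return vocab_size, num_groups + group_size


# Toy example: 6 IDs in 2 groups of 3, with made-up logits.
group_logits = [0.0, 1.0]
within_logits = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
total = sum(hierarchical_prob(i, group_logits, within_logits, 3) for i in range(6))
```

In this sketch the probabilities over all IDs still sum to one, while a prediction only needs to score the groups plus the positions inside one group. For a hypothetical vocabulary of one million IDs in groups of one thousand, that is 2,000 logits per prediction instead of 1,000,000.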
Table of contents
Data Foundation
Key Challenges in Model Design
Architecture Overview - Transformer Backbone
Adapter-based Modality Handling
Unsupervised Pre-Training
Handling of Massive ID Vocabularies
Embedding Extraction vs Fine-Tuning
Conclusion