GitHub built its own code search engine, Project Blackbird, in Rust to meet its requirements at scale. The engine uses indices to store information about code, including programming language and n-grams. Content-addressable storage ensures that duplicate data is stored only once, and hash-based sharding distributes data across shards.
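
The three ideas in this summary can be illustrated with a minimal sketch. It assumes SHA-256 as the content hash, a fixed shard count, and 3-character n-grams; all names here (`ContentStore`, `shard_for`, `trigrams`, `NUM_SHARDS`) are hypothetical and are not taken from GitHub's implementation.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def content_key(content: bytes) -> str:
    """Address a blob by the SHA-256 digest of its bytes."""
    return hashlib.sha256(content).hexdigest()


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash-based sharding: derive the shard index from the content hash."""
    return int(key, 16) % num_shards


def trigrams(text: str) -> set:
    """Return the set of 3-character n-grams found in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class ContentStore:
    """Toy content-addressable store: identical content is kept once."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, content: bytes) -> str:
        key = content_key(content)
        shard = self.shards[shard_for(key, len(self.shards))]
        # Same bytes always produce the same key, so a duplicate is a no-op.
        shard.setdefault(key, content)
        return key

    def count(self) -> int:
        return sum(len(shard) for shard in self.shards)
```

Because the key depends only on the bytes, the same file appearing in many repositories maps to one key and one shard, which is what makes deduplication cheap in this sketch.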

8m read time · From scaleyourapp.com
Table of contents
Indexing Code
N-grams
Content Addressable Storage
Code Storage
Ingesting & Indexing Code
Optimizing Ingest Order & Making the Most of Delta Encoding
System Design Learnings In this Case Study
Storing duplicate data efficiently with content addressable storage
Hash-based sharding to distribute data across shards
Trees for storing hierarchical data
Object store fits best for storing unstructured data
Using an event queue/message broker to decouple system modules
