Netflix's editorial teams face a needle-in-a-haystack problem when searching thousands of hours of raw footage. This post details the architecture behind their multimodal video search system, which fuses outputs from multiple specialized AI models (character recognition, scene detection, dialogue transcription) into a unified, second-by-second temporal index. Raw annotations are persisted in Apache Cassandra via high-availability pipelines, then asynchronously fused by Kafka-triggered offline jobs that map detections into one-second temporal buckets and intersect overlapping signals. The enriched records are indexed in Elasticsearch, enabling hybrid queries that combine vector similarity (approximate k-NN over HNSW indexes with cosine or Euclidean distance) with text-analysis techniques such as phrase matching, n-gram tokenization, stemming, and Levenshtein fuzzy matching. Results are post-processed to reconstruct narrative scene boundaries using union or intersection logic. Future plans include natural-language query interfaces, ML-based adaptive ranking, and domain-specific personalization.
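
As a concrete illustration of the fusion step, here is a minimal Python sketch of mapping detections into one-second buckets and intersecting two signal types. The data shapes, labels, and function names are assumptions for illustration, not Netflix's actual pipeline code:

```python
from collections import defaultdict

def bucketize(annotations, ):
    """Map (label, start_sec, end_sec) detections onto one-second buckets."""
    buckets = defaultdict(set)  # second index -> labels active in that second
    for label, start, end in annotations:
        for t in range(int(start), int(end) + 1):
            buckets[t].add(label)
    return buckets

def intersect_signals(buckets_a, buckets_b):
    """Keep only the seconds where both signal types fired, pairing their labels."""
    return {
        t: (buckets_a[t], buckets_b[t])
        for t in buckets_a.keys() & buckets_b.keys()
    }

# Hypothetical detections from two models: face recognition and dialogue ASR.
faces = bucketize([("Character A", 12.4, 15.9), ("Character B", 14.0, 20.0)])
dialogue = bucketize([("hello there", 13.0, 16.0)])
fused = intersect_signals(faces, dialogue)  # seconds 13-16: faces + dialogue co-occur
```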
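The hybrid retrieval step might look roughly like the following sketch against the Elasticsearch 8.x Python client, which pairs an approximate k-NN clause (HNSW-backed for `dense_vector` fields) with phrase and fuzzy text clauses. The index name, field names, query strings, and vector are hypothetical:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Stand-in query vector; in practice this would come from the same encoder
# used to embed shots at index time.
query_embedding = [0.0] * 512

response = es.search(
    index="video-annotations",               # illustrative index name
    knn={
        "field": "shot_embedding",           # dense_vector field, HNSW-indexed
        "query_vector": query_embedding,
        "k": 50,
        "num_candidates": 500,
    },
    query={
        "bool": {
            "should": [
                {"match_phrase": {"dialogue": "I know that look"}},
                {"match": {"dialogue": {"query": "knwo that look",
                                        "fuzziness": "AUTO"}}},  # Levenshtein
            ]
        }
    },
    size=50,
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```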
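And the scene-reconstruction post-processing can be approximated by a union over adjacent bucket hits, as in this sketch; the gap tolerance and half-open output intervals are illustrative choices:

```python
def merge_buckets_into_scenes(hit_seconds, max_gap=1):
    """Union adjacent one-second hits back into contiguous scene ranges."""
    scenes = []
    for t in sorted(hit_seconds):
        if scenes and t - scenes[-1][1] <= max_gap:
            scenes[-1][1] = t          # extend the current scene
        else:
            scenes.append([t, t])      # start a new scene
    return [(start, end + 1) for start, end in scenes]  # half-open [start, end)

# Hits at seconds 13-16 and 42-45 collapse into two scenes.
print(merge_buckets_into_scenes([13, 14, 15, 16, 42, 43, 44, 45]))
# -> [(13, 17), (42, 46)]
```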

11m read time · From netflixtechblog.com
