A GSoC 2025 project built an end-to-end semantic video search engine capable of finding specific moments within videos using natural language queries. The system uses a two-part architecture. An ingestion pipeline processes videos with AI models (TransNetV2, WhisperX, BLIP, VideoMAE) to extract shots, transcripts, captions, and actions, then segments them intelligently and enriches them with LLM-generated summaries. A search application with a FastAPI backend performs hybrid text-visual searches using a ChromaDB vector database and Reciprocal Rank Fusion for result ranking, paired with a Streamlit frontend for user interaction.
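The post itself does not include code, but Reciprocal Rank Fusion is simple enough to sketch. The snippet below is an illustrative Python version, not the project's implementation; the function name, the k constant, and the segment IDs are assumptions. Each result earns a score of 1/(k + rank) from every ranked list it appears in, so segments that both the text and visual searches rank highly rise to the top of the fused list.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists into a single ordering.

    Each ranking is a list of document IDs ordered best-first. A
    document's fused score is the sum of 1 / (k + rank) over every
    list it appears in; k=60 is a commonly used default constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: fuse a transcript (text) ranking with a visual
# (caption/embedding) ranking, e.g. two result lists from ChromaDB queries.
text_hits = ["seg_12", "seg_07", "seg_33"]
visual_hits = ["seg_07", "seg_33", "seg_12"]
print(reciprocal_rank_fusion([text_hits, visual_hits]))
```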

From news.opensuse.org
Table of contents
- The Problem: Beyond Keywords
- The Big Picture: A Two-Act Play
- Part 1: The Ingestion Pipeline - Teaching the Machine to Watch TV
- Part 2: The Search Application - Reaping the Rewards
- The Final Result & GSoC Experience