Grounded SAM 2 combines Grounding DINO's language-driven object detection with SAM 2's pixel-level segmentation and video tracking capabilities. The pipeline detects objects from natural language prompts, generates precise segmentation masks, and maintains temporal consistency across video frames using a streaming-memory
Table of contents
Grounded SAM 2: From Open-Set Detection to Segmentation and Tracking
Why Segmentation Matters (Beyond Bounding Boxes)
Introducing Grounded SAM 2
Where SAM Fits in the Pipeline
Why SAM 2 (and not SAM)
How Grounded SAM 2 Works Internally
How Grounded SAM 2 Differs from the Original Grounded SAM
Benefits and Use Cases
Configuring Your Development Environment
Setup and Imports
Download Model Checkpoints
Detect, Segment, and Track Function
Building a Gradio Interface
Output
Summary