Netflix Research has open-sourced VOID (Video Object and Interaction Deletion), a video inpainting model that removes objects from videos along with the physical interactions they induce, for example making objects fall when a person is removed. VOID is built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning. It runs in two passes: Pass 1 performs base inpainting, and Pass 2 refines the result with optical flow-warped noise to improve temporal consistency. The pipeline uses SAM2 for segmentation and Gemini (a VLM) to reason about interaction-affected regions. A quadmask encoding scheme with four semantic values (primary object, overlap, affected region, background) drives the model's understanding of physical interactions. Training data is generated with Blender/HUMOTO and Kubric pipelines. Inference requires 40GB+ of VRAM (A100), and the model was trained on 8× A100 80GB GPUs. Models and a Gradio demo are available on HuggingFace.
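To make the quadmask idea concrete, here is a minimal sketch of how two binary masks (the object to remove and the region it physically affects) could be merged into a single four-valued mask. The function name `build_quadmask` and the grayscale codes 0, 63, 127, and 255 are assumptions chosen for illustration; the summary above names the four classes but not their numeric encoding, and this is not VOID's own code. Plain nested lists stand in for image arrays so the example has no dependencies.

```python
# Hypothetical grayscale codes for the four semantic classes.
# These values are illustrative assumptions, not taken from the text above.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255


def build_quadmask(primary, affected):
    """Merge two same-sized boolean masks (lists of rows) into one quadmask.

    primary  : True where the object to remove is located
    affected : True where the object's physical interactions reach
    """
    if len(primary) != len(affected):
        raise ValueError("masks must have the same number of rows")
    quad = []
    for p_row, a_row in zip(primary, affected):
        if len(p_row) != len(a_row):
            raise ValueError("mask rows must have the same length")
        row = []
        for p, a in zip(p_row, a_row):
            if p and a:
                row.append(OVERLAP)      # object and affected region coincide
            elif p:
                row.append(PRIMARY)      # object only
            elif a:
                row.append(AFFECTED)     # interaction region only
            else:
                row.append(BACKGROUND)   # untouched pixel
        quad.append(row)
    return quad


# Toy 2x4 frame: the object covers columns 0-1, the affected region columns 1-2.
primary = [[True, True, False, False],
           [True, True, False, False]]
affected = [[False, True, True, False],
            [False, True, True, False]]
quad = build_quadmask(primary, affected)
```

In a real pipeline the two input masks would come from a segmentation model and a reasoning step, and the same merge would be applied to every frame of the video.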
Table of contents
🤖 Models
▶️ Quick Start
⚙️ Setup
📂 Input Format
🚀 Pipeline
🤩 Community Adoption
🙏 Acknowledgements
Star History
📄 Citation