GitHub - Netflix/void-model

Netflix Research has open-sourced VOID (Video Object and Interaction Deletion), a video inpainting model that removes an object from a video along with the physical interactions it induces. For example, when a person is removed, objects affected by that person may need to fall.

VOID is built on CogVideoX and fine-tuned with interaction-aware quadmask conditioning. Inference runs in two passes. Pass 1 performs base inpainting. Pass 2 refines the result using optical-flow-warped noise to improve temporal consistency. The pipeline uses SAM2 for segmentation and Gemini, a vision-language model (VLM), to reason about which regions the interaction affects.

The quadmask encodes four semantic values: primary object, overlap, affected region, and background. This encoding is how the model is told about physical interactions. Training data is generated with Blender/HUMOTO and Kubric pipelines.

Inference requires a GPU with 40 GB or more of VRAM (A100 class). The model was trained on 8× A100 80 GB GPUs. Models and a Gradio demo are available on Hugging Face.
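The four-value quadmask can be illustrated with a minimal sketch. It assumes two inputs: a binary mask of the primary object (for example, from SAM2) and a binary mask of the affected region (for example, from the VLM reasoning step), both as same-sized 2-D masks. It also assumes that "overlap" means pixels that fall in both masks. The integer labels 0 to 3 and the names `Quad` and `build_quadmask` are hypothetical. The repository's actual pixel encoding and API may differ.

```python
from enum import IntEnum


# Hypothetical label values for illustration; VOID's real encoding may differ.
class Quad(IntEnum):
    PRIMARY = 0      # pixel belongs only to the object being removed
    OVERLAP = 1      # pixel is in both the object mask and the affected-region mask
    AFFECTED = 2     # pixel changes because of the interaction, but is not the object
    BACKGROUND = 3   # pixel is untouched by the edit


def build_quadmask(primary, affected):
    """Combine two same-sized binary masks (lists of lists of 0/1) into a quadmask."""
    if len(primary) != len(affected) or any(
        len(p_row) != len(a_row) for p_row, a_row in zip(primary, affected)
    ):
        raise ValueError("masks must have the same shape")

    quad = []
    for p_row, a_row in zip(primary, affected):
        row = []
        for p, a in zip(p_row, a_row):
            if p and a:
                row.append(Quad.OVERLAP)
            elif p:
                row.append(Quad.PRIMARY)
            elif a:
                row.append(Quad.AFFECTED)
            else:
                row.append(Quad.BACKGROUND)
        quad.append(row)
    return quad


# Usage: a 2x3 frame where the object and its affected region share one pixel.
primary = [[1, 1, 0], [0, 0, 0]]
affected = [[0, 1, 1], [0, 0, 1]]
print(build_quadmask(primary, affected))
```

`IntEnum` keeps the labels readable while still comparing equal to plain integers. That makes it easy to convert the mask to an image array later.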
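The idea behind flow-warped noise in Pass 2 can be sketched in simplified form. Instead of sampling fresh noise for every frame, the noise for each new frame is carried along the optical flow from the previous frame. As a result, the same scene content sees correlated noise over time. The sketch below makes several assumptions. It uses nearest-neighbour backward warping with integer flow vectors and clamping at the borders. The functions `warp_noise` and `flow_warped_noise_sequence` are hypothetical helpers, not VOID's API. The repository's actual Pass 2 warping scheme is not shown here and may differ.

```python
import random


def warp_noise(noise, flow):
    """Backward-warp a 2-D noise field with a per-pixel integer flow field.

    noise[y][x] is the noise sample at pixel (y, x) in frame t.
    flow[y][x] = (dy, dx) is the assumed displacement of content from frame t
    to frame t+1, given at target pixel (y, x) in frame t+1. Each target pixel
    copies the noise from the pixel its content came from, clamped to the border.
    """
    h, w = len(noise), len(noise[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(y - dy, 0), h - 1)
            sx = min(max(x - dx, 0), w - 1)
            warped[y][x] = noise[sy][sx]
    return warped


def flow_warped_noise_sequence(h, w, flows, seed=0):
    """Sample Gaussian noise for the first frame, then warp it along each flow."""
    rng = random.Random(seed)
    frame = [[rng.gauss(0.0, 1.0) for _ in range(w)] for _ in range(h)]
    frames = [frame]
    for flow in flows:
        frame = warp_noise(frame, flow)
        frames.append(frame)
    return frames


# Usage: content moves one pixel to the right between frames.
shift_right = [[(0, 1)] * 3 for _ in range(2)]
print(warp_noise([[1, 2, 3], [4, 5, 6]], shift_right))
```

Nearest-neighbour copying duplicates samples at disoccluded borders, so the warped field is no longer exactly i.i.d. Gaussian. A production implementation would need to handle that case.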


Apr 07 • 12 min read • From github.com

