StableAvatar is an end-to-end video diffusion transformer that generates infinite-length, high-quality, audio-driven avatar videos without post-processing. It addresses the main limitation of existing models, namely audio modeling issues that cause the latent distribution to drift in long videos, through a Time-step-aware Audio Adapter and an Audio Native Guidance Mechanism. The system supports multiple resolutions (512x512, 480x832, and 832x480) and includes training pipelines, inference optimizations for various GPU configurations, and tools for audio extraction and vocal separation.

Francis-Rings/StableAvatar: We present StableAvatar, the first end-to-end video diffusion transformer, which synthesizes infinite-length high-quality audio-driven avatar videos without any post-proces