moonshotai/MoonViT-SO-400M · Hugging Face
MoonViT is a native-resolution vision encoder extracted from Moonshot AI's Kimi-VL-A3B multimodal model for standalone use. It is initialized from SigLIP-SO-400M and then continually pre-trained, and it processes images at their native resolution rather than resizing them to a fixed size. The model is available on Hugging Face with example code using the Transformers library. Full training details are available in the Kimi-VL Technical Report.
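To make "native resolution" more concrete, here is a minimal sketch of how a native-resolution encoder might turn images of different sizes into variable-length patch sequences and pack them into one batch. It is an illustration only, not MoonViT's actual preprocessing: the 14-pixel patch size, the round-up padding of partial patches, and the helper names `patch_grid` and `pack_lengths` are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumption: 14x14 patches, chosen for illustration; the text above does not state the patch size.
PATCH_SIZE = 14


def patch_grid(height: int, width: int, patch: int = PATCH_SIZE) -> Tuple[int, int]:
    """Return the number of patches along (height, width).

    Rounds up, i.e. assumes a partial patch at the border is padded
    rather than cropped (hypothetical choice for this sketch).
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    return (-(-height // patch), -(-width // patch))


def pack_lengths(
    sizes: List[Tuple[int, int]], patch: int = PATCH_SIZE
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Compute per-image patch grids and cumulative offsets for a packed batch.

    Each image keeps its own aspect ratio, so each contributes a different
    number of tokens. offsets[i]:offsets[i + 1] is the slice of the packed
    1D token sequence that belongs to image i.
    """
    grids = [patch_grid(h, w, patch) for h, w in sizes]
    offsets = [0]
    for grid_h, grid_w in grids:
        offsets.append(offsets[-1] + grid_h * grid_w)
    return grids, offsets


if __name__ == "__main__":
    # Three images given as (height, width) in pixels, with different aspect ratios.
    grids, offsets = pack_lengths([(224, 224), (448, 336), (100, 1000)])
    print(grids)    # [(16, 16), (32, 24), (8, 72)]
    print(offsets)  # [0, 256, 1024, 1600]
```

The point of the sketch is that the token count follows the image size (256, 768, and 576 tokens here) instead of being fixed by a resize step, so an encoder of this kind needs the per-image grid shapes alongside the pixel data to know where each image starts and ends in the packed sequence.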