MiniCPM-V 4.6 is a 1.3B-parameter vision-language model from OpenBMB that combines a SigLIP vision encoder with a Qwen language model backbone. It targets local agent workflows that need a lightweight vision model without the VRAM cost of larger multimodal models. Key highlights include 19-43x better token efficiency than comparable Qwen models, visual token compression switchable between 4x and 16x at inference time, support for single-image, multi-image, and video inputs, a 262K context window, an Apache 2.0 license, and deployment support for llama.cpp, vLLM, SGLang, and quantized GGUF formats. On benchmarks such as MMMU-Pro, it outperforms all sub-2B open-weights models. The post includes a hands-on demo covering visual QA, OCR, invoice parsing, handwritten prescription reading, and video understanding, with comparisons between thinking and non-thinking modes.