moondream is a tiny vision language model that runs anywhere and kicks ass. It is a 1.6B-parameter model built from SigLIP, Phi-1.5, and the LLaVA training dataset. The model may generate inaccurate statements and struggle with intricate instructions. It is primarily designed to understand English and may reflect societal biases.
