moondream is a tiny vision language model that runs anywhere and kicks ass. It is a 1.6B-parameter model built using SigLIP, Phi-1.5, and the LLaVA training dataset. The model may generate inaccurate statements and may struggle with intricate instructions. It is primarily designed to understand English and may reflect societal biases.