Best of Computer Vision, October 2025

  1. Article · Hacker News · 26w

    character-ai/Ovi

    Ovi is an open-source audio-video generation model that simultaneously creates synchronized 5-second videos and audio from text or text+image inputs. The 11B parameter model supports flexible resolutions (720×720 to 960×960), multiple aspect ratios, and includes a custom-trained 5B audio branch. It offers inference options for single or multi-GPU setups, includes memory optimization features like fp8 quantization and CPU offloading for 24GB GPUs, and provides integration with Gradio UI and ComfyUI. The model is based on research from Character AI and builds upon Wan2.2 for video and MMAudio for audio processing.
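The fp8 quantization mentioned above trades weight precision for memory so the model fits on 24GB GPUs. A toy sketch of the underlying idea, using int8 with a shared scale for portability (the repo itself uses fp8; all names below are illustrative, not from the Ovi codebase):

```python
# Toy illustration of scale-based weight quantization: store weights
# as small integers plus one float scale, cutting memory roughly 4x
# versus float32. Ovi uses fp8; int8 is shown here for portability.

def quantize(weights, bits=8):
    """Map floats to signed ints sharing a single scale factor."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid 0 scale
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.635, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each restored value is within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

CPU offloading is complementary: layers are kept in host RAM and moved to the GPU only while they execute, so peak GPU memory tracks one layer rather than the whole model.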

  2. Article · Hacker News · 25w

    apple/pico-banana-400k

    Apple released Pico-Banana-400K, a dataset containing approximately 400,000 text-image-edit triplets for training text-guided image editing models. The dataset includes 257K single-turn examples, 56K preference learning samples, and 72K multi-turn conversations, covering 35 edit operations across 8 semantic categories. Built using Gemini-2.5-Flash for instruction generation and the Nano-Banana model for editing, each edit undergoes automated quality evaluation. Source images come from Open Images, with edits spanning object manipulation, scene composition, stylistic changes, and photometric adjustments. The dataset is available under CC BY-NC-ND 4.0 license for non-commercial research use.

  3. Article · openSUSE · 28w

    GSoC 2025, Building a Semantic Search Engine for Any Video

    A GSoC 2025 project that built an end-to-end semantic video search engine capable of finding specific moments within videos using natural language queries. The system uses a two-part architecture. An ingestion pipeline processes videos with AI models (TransNetV2, WhisperX, BLIP, VideoMAE) to extract shots, transcripts, captions, and actions, then segments them intelligently and enriches them with LLM-generated summaries. A search application with a FastAPI backend performs hybrid text-visual searches using a ChromaDB vector database and Reciprocal Rank Fusion for result ranking, paired with a Streamlit frontend for user interaction.
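Reciprocal Rank Fusion, which the project reportedly uses to merge the text and visual result lists, has a simple standard form: each item scores the sum of 1/(k + rank) over every list it appears in. A minimal sketch (the segment IDs are made up; k=60 is the commonly used default):

```python
# Reciprocal Rank Fusion: fuse several ranked lists into one by
# scoring each item as sum(1 / (k + rank)) across the lists.

def rrf(ranked_lists, k=60):
    """Return items ordered by fused RRF score, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

text_hits   = ["seg3", "seg1", "seg7"]   # from transcript/caption search
visual_hits = ["seg1", "seg9", "seg3"]   # from visual-embedding search
print(rrf([text_hits, visual_hits]))     # seg1 and seg3 rise to the top
```

Because RRF only needs ranks, not comparable scores, it fuses the text and visual retrievers without calibrating their distance metrics against each other.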

  4. Article · 80 LEVEL · 29w

    Ex Meta Engineer Develops Tech That Makes Any Screen 3D Using a Camera

    A former Meta engineer developed True3D Labs' technology, which creates 3D viewing experiences on any screen using only a front-facing camera. The system tracks head position in real-time to reproject scenes with motion parallax, eliminating the need for special glasses or hardware. It uses facial landmark detection and six-degree-of-freedom head pose estimation to treat the screen as a window into a 3D world. The platform supports volumetric video, voxels, and Gaussian splats, with APIs available for web developers to integrate into applications, game captures, and real-time renders from engines like Unity and Blender.
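The "screen as a window" idea reduces to simple geometry: project each virtual point onto the screen plane along the line of sight from the tracked eye. A sketch under illustrative coordinates (the real system estimates eye pose from facial landmarks; units and names here are invented):

```python
# Screen-as-window projection: intersect the eye-to-point ray with
# the screen plane z = 0. Eye sits in front (z < 0), virtual scene
# points sit behind the screen (z > 0). Coordinates are illustrative.

def window_project(eye, point):
    """Return the on-screen (x, y) where the ray from the eye
    through the scene point crosses the screen plane z = 0."""
    ex, ey, ez = eye
    px, py, pz = point
    t = -ez / (pz - ez)            # ray parameter where z reaches 0
    return (ex + t * (px - ex), ey + t * (py - ey))

# A point 1 unit behind the screen, eye 1 unit in front, centered:
print(window_project((0.0, 0.0, -1.0), (0.5, 0.0, 1.0)))  # (0.25, 0.0)
# Move the head: the projection shifts with it, but by less, so the
# point appears to sit behind the screen -- motion parallax.
print(window_project((0.2, 0.0, -1.0), (0.5, 0.0, 1.0)))
```

Re-rendering this projection every frame as the tracked eye moves is what produces the glasses-free depth illusion.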

  5. Video · Fireship · 29w

    Alibaba is going all in on Qwen…

    Alibaba announced a $52 billion three-phase roadmap to artificial superintelligence at their Apsara conference, targeting completion by 2032. Key releases include Qwen 3 Max, a trillion-parameter model trained on 36 trillion tokens using mixture-of-experts architecture; Qwen 3VL, an open-source vision-language model that tops the Clockbench benchmark; and Qwen 3 Omni, a multimodal model capable of processing visual, audio, and text inputs. The roadmap progresses from generalized understanding through autonomous action to self-iteration with physical world integration.

  6. Article · IEEE Spectrum · 27w

    Where Was This Photo Taken? AI Knows Instantly

    Researchers developed a machine learning model that matches street-level photos to aerial images for geolocation with 97% accuracy in initial narrowing and 82% for exact location. The system uses deep cross-view hashing with vision transformers to convert images into unique numerical fingerprints, making it twice as fast and using one-third the memory of competing models. The approach could benefit navigation systems when GPS fails, emergency response, and defense applications, though it needs further testing for real-world challenges like seasonal variations and cloud cover.
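The speed and memory savings come from the hashing step: once street and aerial images are reduced to fixed-length binary fingerprints, matching is a Hamming-distance lookup, i.e. a popcount on XOR-ed integers. A sketch with invented 16-bit codes (the article's system learns the codes with vision transformers):

```python
# Why binary fingerprints make cross-view matching cheap: comparing
# two hash codes is one XOR plus a bit count, regardless of image size.
# The codes below are invented for illustration.

def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")

def nearest(query, database):
    """Return the database key whose code is closest to the query."""
    return min(database, key=lambda k: hamming(query, database[k]))

aerial_codes = {                     # hypothetical aerial-tile hashes
    "tile_A": 0b1011001011110000,
    "tile_B": 0b1011001011111111,
    "tile_C": 0b0100110100001111,
}
street_photo = 0b1011001011110001    # hypothetical street-photo hash
print(nearest(street_photo, aerial_codes))  # closest match: tile_A
```

The two-stage accuracy figures fit this design: coarse Hamming search narrows candidates quickly, and a finer comparison then pins down the exact location.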

  7. Article · Hugging Face · 29w

    SOTA OCR with Core ML and dots.ocr

    A detailed walkthrough of converting the dots.ocr model (a 3B parameter OCR model from RedNote) to run on Apple devices using Core ML and MLX. The guide covers the conversion process from PyTorch to Core ML, including simplifying the model architecture, debugging common conversion errors, and initial benchmarking. Key challenges addressed include handling attention implementations, fixing dtype mismatches, removing dynamic control flow, and dealing with variable-length sequence masking. The converted model initially runs on GPU in FLOAT32 precision, with future parts promising Neural Engine optimization and quantization techniques.
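The variable-length sequence masking challenge is the classic one when converting to static-graph runtimes like Core ML: dynamic shapes are replaced by padding to a fixed length plus a mask that marks real tokens. A framework-free sketch of the pattern (names are illustrative, not from the dots.ocr conversion code):

```python
# Pad-and-mask pattern used when a runtime requires static shapes:
# every sequence is padded to the batch maximum, and a parallel 0/1
# mask records which positions hold real tokens so attention can
# ignore the padding.

def pad_and_mask(seqs, pad=0):
    """Return (padded sequences, masks), both rectangular."""
    length = max(len(s) for s in seqs)
    padded = [s + [pad] * (length - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (length - len(s)) for s in seqs]
    return padded, mask

batch = [[5, 6, 7], [9]]             # token IDs of differing lengths
padded, mask = pad_and_mask(batch)
print(padded)                        # [[5, 6, 7], [9, 0, 0]]
print(mask)                          # [[1, 1, 1], [1, 0, 0]]
```

Dynamic control flow gets the analogous treatment: data-dependent branches are rewritten as mask-weighted arithmetic so the traced graph has one fixed path.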

  8. Article · Machine Learning Mastery · 25w

    7 Machine Learning Projects to Land Your Dream Job in 2026

    Seven portfolio-ready machine learning projects designed to demonstrate practical, end-to-end skills for job seekers. Projects include predictive maintenance with LSTM networks, NLP-based resume screening, personalized learning recommenders, real-time traffic prediction using Graph Neural Networks, deepfake detection systems, multimodal sentiment analysis combining text/audio/visual data, and reinforcement learning agents for financial forecasting. Each project includes dataset recommendations and emphasizes deployment, interpretability, and real-world problem solving over theoretical knowledge.