Best of Computer Vision 2025

  1. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 46w

    48 Most Popular Open ML Datasets

    A comprehensive compilation of 48 widely used open machine learning datasets organized by domain, including computer vision (ImageNet, COCO), natural language processing (SQuAD, GLUE), recommendation systems (MovieLens, the new Yambda-5B), tabular data (UCI datasets, Titanic), reinforcement learning (OpenAI Gym), and multimodal learning (LAION-5B, VQA). Each dataset is briefly described with its primary use case and key characteristics, serving as a reference guide for researchers and practitioners selecting appropriate datasets for their ML projects.

  2. Article
    ByteByteGo · 35w

    How LLMs See Images, Audio, and More

    Modern AI systems process images, audio, and video by converting them into discrete tokens, similar to text processing. Images use patch embeddings (dividing into grid squares), vector quantization (learning visual codebooks), or contrastive embeddings. Audio employs neural codecs for quality preservation, ASR transcription for semantic content, or hierarchical approaches for multi-scale representation. Each tokenization method involves trade-offs between computational efficiency, information preservation, and semantic understanding, with the optimal choice depending on specific use cases and requirements.
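    The patch-embedding route is easy to picture in code. Below is a minimal NumPy sketch, a toy illustration rather than any particular model's implementation, that splits an image into 16×16 patches and projects each one into an embedding, ViT-style, using a random untrained projection matrix:

    ```python
    import numpy as np

    def patchify(image, patch=16):
        """Split an H x W x C image into flattened patch vectors (ViT-style)."""
        H, W, C = image.shape
        gh, gw = H // patch, W // patch
        return (image[:gh * patch, :gw * patch]
                .reshape(gh, patch, gw, patch, C)
                .transpose(0, 2, 1, 3, 4)        # group pixels by patch
                .reshape(gh * gw, patch * patch * C))

    rng = np.random.default_rng(0)
    img = rng.random((224, 224, 3))              # stand-in for a real image
    tokens = patchify(img)                       # 14 x 14 grid -> 196 patch vectors
    proj = rng.standard_normal((768, 512))       # untrained projection, illustration only
    embeddings = tokens @ proj                   # one 512-d "token" per patch
    ```

    In a real model the projection is learned and position embeddings are added, but the tokenization itself is exactly this reshape.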

  3. Article
    ElixirStatus · 40w

    Fine-Tuning YOLO to Watch Soccer Matches

    Fine-tuning pre-trained YOLO models for specialized object detection tasks requires significantly less data and training time than building from scratch. Using a soccer dataset with 7,010 training images, the author demonstrates how to adapt a COCO-trained YOLOv11 model to detect balls, players, referees, and goalkeepers, reaching 88% mAP50. The process involves using Ultralytics tools for training, monitoring key metrics like loss values and mAP50, and converting the final PyTorch model to ONNX format for deployment in Elixir applications. The fine-tuned model shows superior contextual understanding compared to generic models, focusing on field action while filtering out background spectators.
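    For context, Ultralytics fine-tuning runs like this one are driven by a small dataset YAML. A hypothetical config for the four soccer classes might look like the following (paths, directory layout, and class order are placeholders, not the author's actual files):

    ```yaml
    # Hypothetical data.yaml for the soccer fine-tune described above.
    path: datasets/soccer
    train: images/train   # 7,010 training images
    val: images/val

    names:
      0: ball
      1: goalkeeper
      2: player
      3: referee
    ```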

  4. Article
    Tinashe Mutuswa · 39w

    'Influencer' who garnered 165k followers while 'enjoying Wimbledon' is revealed as an AI creation

    An AI-generated Instagram influencer named Mia Zelu fooled 165,000 followers into believing she was a real person attending Wimbledon. The fake account garnered over 40,000 likes on posts showing her at the tennis tournament, even deceiving celebrities like cricketer Rishabh Pant. Despite being labeled as an 'AI storyteller' in her profile, many followers continue to interact with the account as if she were real, highlighting the sophistication of AI-generated personas and their potential to deceive social media users.

  5. Article
    MIT News · 18w

    Deep-learning model predicts how fruit flies form, cell by cell

    MIT researchers developed a deep-learning model that predicts cell-by-cell development during fruit fly embryo formation with 90% accuracy. The model uses a dual-graph structure representing cells as both point clouds and foam-like bubbles, tracking properties like position, division, and folding minute-by-minute during gastrulation. The approach could eventually predict development in more complex organisms and identify early disease patterns in conditions like asthma and cancer, though high-quality video data remains the primary limitation for broader applications.

  6. Video
    Tiff In Tech · 37w

    The Most Important Tech Shift Is Happening Right Now and You Probably Missed It

    AI is transitioning from digital interfaces to physical environments through spatial and embodied AI systems. Spatial AI understands 3D space and object relationships, while embodied AI can act in physical spaces through robotic bodies. This shift is enabled by cheaper sensors, multimodal AI models, and edge computing. Examples include smart home devices like robot vacuums, AR headsets like Apple Vision Pro, and industrial robots in warehouses and hospitals. This technology makes human-machine interaction more natural through gestures and speech rather than traditional interfaces, potentially improving accessibility for non-technical users while raising important safety and privacy considerations.

  7. Article
    Hacker News · 26w

    character-ai/Ovi

    Ovi is an open-source audio-video generation model that simultaneously creates synchronized 5-second videos and audio from text or text+image inputs. The 11B parameter model supports flexible resolutions (720×720 to 960×960), multiple aspect ratios, and includes a custom-trained 5B audio branch. It offers inference options for single or multi-GPU setups, includes memory optimization features like fp8 quantization and CPU offloading for 24GB GPUs, and provides integration with Gradio UI and ComfyUI. The model is based on research from Character AI and builds upon Wan2.2 for video and MMAudio for audio processing.

  8. Article
    Hacker News · 25w

    apple/pico-banana-400k

    Apple released Pico-Banana-400K, a dataset containing approximately 400,000 text-image-edit triplets for training text-guided image editing models. The dataset includes 257K single-turn examples, 56K preference learning samples, and 72K multi-turn conversations, covering 35 edit operations across 8 semantic categories. Built using Gemini-2.5-Flash for instruction generation and the Nano-Banana model for editing, each edit undergoes automated quality evaluation. Source images come from Open Images, with edits spanning object manipulation, scene composition, stylistic changes, and photometric adjustments. The dataset is available under CC BY-NC-ND 4.0 license for non-commercial research use.

  9. Article
    Hacker News · 19w

    I failed to recreate the 1996 Space Jam Website with Claude

    An engineer attempts to use Claude AI to recreate the iconic 1996 Space Jam website from screenshots and assets, but fails despite multiple approaches. The experiment reveals Claude's limitations in spatial reasoning and precise visual measurements. Despite providing grids, comparison tools, and zoomed images, Claude consistently produces inaccurate layouts while confidently claiming success. The author theorizes this stems from how vision models process images in 16x16 patches, losing fine-grained spatial detail. The piece documents the iterative debugging process, Claude's unreliable self-assessment, and the surprising difficulty of a seemingly simple HTML recreation task.

  10. Video
    Tiff In Tech · 49w

    I Built an App That Reads Your Emotions and Plays Music Automatically (Using Junie by JetBrains)

    The post describes the creation of Mood DJ, an app that uses OpenCV and MediaPipe to detect emotions via webcam and automatically plays music matching the user's mood, built with the help of JetBrains' AI coding assistant, Junie. It highlights the benefits of using AI tools to automate repetitive tasks and streamline code development, allowing for quick experimentation and innovation.

  11. Video
    Tiff In Tech · 1y

    Code With Me: Automating My Life With Python and AI

    Learn to build a fun computer vision project using Python and AI with hands-on guidance. Explore how to create a virtual environment, install the necessary packages, and use libraries like OpenCV to develop a program that detects hand movements to keep track of coffee consumption and hydration needs. Suitable for both beginners and experienced developers.

  12. Article
    Hacker News · 45w

    Making eyesite

    A developer created Eyesite, a web application that enables eye-controlled navigation as an affordable alternative to Apple Vision Pro. The project uses WebGazer.js for eye tracking, requiring calibration through 9-point mapping for accuracy. Key design decisions included hiding the eye cursor to maintain immersion, implementing visual feedback through button glows when gazed upon, and using large UI elements to compensate for tracking jitteriness. The spacebar serves as the click mechanism, mimicking Vision Pro's look-and-pinch interaction model.
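    The 9-point calibration step can be illustrated with a toy least-squares fit: map raw gaze estimates to screen coordinates with an affine transform. This is a simplified sketch of the idea, not WebGazer.js's actual algorithm, and all numbers below are made up:

    ```python
    import numpy as np

    # Nine calibration points on a 3 x 3 grid (normalized screen positions).
    raw = np.array([[x, y] for y in (0.1, 0.5, 0.9) for x in (0.1, 0.5, 0.9)])

    rng = np.random.default_rng(1)
    screen = raw * [1920, 1080] + rng.normal(0, 5, raw.shape)  # noisy click targets

    A = np.hstack([raw, np.ones((9, 1))])              # [x, y, 1] design matrix
    coef, *_ = np.linalg.lstsq(A, screen, rcond=None)  # affine fit, shape (3, 2)

    def to_screen(gaze_xy):
        """Map a raw gaze estimate to calibrated screen coordinates."""
        return np.array([*gaze_xy, 1.0]) @ coef

    pred = to_screen([0.5, 0.5])   # should land near the screen center
    ```

    The jitter mentioned in the write-up survives calibration, which is why the app compensates with oversized buttons rather than a precise cursor.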

  13. Article
    Hugging Face · 46w

    ScreenSuite - The most comprehensive evaluation suite for GUI Agents!

    ScreenSuite is a comprehensive evaluation framework for GUI agents that unifies 13 benchmarks across perception, grounding, single-step actions, and multi-step agent capabilities. The suite evaluates vision language models on their ability to interact with graphical interfaces using only visual input, without accessibility trees or DOM metadata. It includes Dockerized environments for Ubuntu and Android testing, supports both local and remote sandbox execution, and provides standardized evaluation of leading VLMs like Qwen-2.5-VL series, UI-TARS, and GPT-4o on GUI automation tasks.

  14. Article
    openSUSE · 28w

    GSoC 2025, Building a Semantic Search Engine for Any Video

    A GSoC 2025 project that built an end-to-end semantic video search engine capable of finding specific moments within videos using natural language queries. The system uses a two-part architecture: an ingestion pipeline that processes videos with AI models (TransNetV2, WhisperX, BLIP, VideoMAE) to extract shots, transcripts, captions, and actions, then segments them intelligently and enriches them with LLM-generated summaries; and a search application with FastAPI backend that performs hybrid text-visual searches using ChromaDB vector database and Reciprocal Rank Fusion for result ranking, paired with a Streamlit frontend for user interaction.
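    Reciprocal Rank Fusion itself is only a few lines: each result list contributes 1/(k + rank) per document, and the fused order sorts by the summed score. A small self-contained sketch (the clip ids are made up for illustration):

    ```python
    def rrf(rankings, k=60):
        """Reciprocal Rank Fusion: merge several ranked lists of results.

        Each ranking lists document ids, best first; k=60 is the customary
        constant that dampens the influence of any single top rank.
        """
        scores = {}
        for ranking in rankings:
            for rank, doc in enumerate(ranking, start=1):
                scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    text_hits = ["clip_12", "clip_04", "clip_31"]    # e.g. transcript search
    visual_hits = ["clip_04", "clip_31", "clip_12"]  # e.g. caption/embedding search
    fused = rrf([text_hits, visual_hits])            # consensus favors clip_04
    ```

    The appeal for hybrid text-visual search is that RRF needs only ranks, so scores from incomparable retrievers (BM25 text vs. embedding similarity) fuse cleanly.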

  15. Article
    AI Products · 19w

    SAM 3 just dropped, and it's a big deal

    Meta released SAM 3, an open-source computer vision model that enables text-based object segmentation in images and videos. The model supports multiple input methods including text prompts, clicks, and bounding boxes, and can track objects across video frames. Trained on over 4 million unique concepts, it reportedly delivers double the accuracy of competing systems on open-vocabulary segmentation tasks. The model is available on GitHub with weights and starter notebooks.

  16. Article
    80 LEVEL · 29w

    Ex Meta Engineer Develops Tech That Makes Any Screen 3D Using a Camera

    A former Meta engineer developed True3D Labs technology that creates 3D viewing experiences on any screen using only a front-facing camera. The system tracks head position in real-time to reproject scenes with motion parallax, eliminating the need for special glasses or hardware. It uses facial landmark detection and six-degree-of-freedom head pose estimation to treat the screen as a window into a 3D world. The platform supports volumetric video, voxels, and Gaussian splats, with APIs available for web developers to integrate into applications, game captures, and real-time renders from engines like Unity and Blender.
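    The screen-as-a-window idea reduces to similar-triangle geometry: with the eye at distance v from the screen and a point at depth d behind it, a head movement of h shifts the point's on-screen projection by h·d/(v + d), so deeper layers track the head more strongly. A toy sketch of that geometry (my own derivation, not True3D's pipeline):

    ```python
    def parallax_shift(head_offset_cm, depth_cm, view_dist_cm=50.0):
        """On-screen shift of a point at depth_cm behind the screen when the
        viewer's head moves head_offset_cm (screen treated as a window)."""
        return head_offset_cm * depth_cm / (view_dist_cm + depth_cm)

    near = parallax_shift(10, 5)     # shallow layer barely moves (~0.9 cm)
    far = parallax_shift(10, 500)    # deep layer moves almost with the head (~9.1 cm)
    ```

    The hard part the product solves is upstream of this formula: estimating the head pose robustly from a single camera in real time.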

  17. Article
    Lobsters · 35w

    sabrinas.space

    A data scientist analyzed 2,671 website screenshots from popular sites across different countries using AI and machine learning to investigate whether Japanese web design is truly more maximalist than other regions. The study used ResNet models and t-SNE visualization to cluster websites by visual similarity, confirming that Japanese sites tend to favor lighter colors and denser layouts. The research explores three potential causes: writing system constraints (CJK characters), cultural differences, and Japan's unique technology adoption patterns, particularly their separate smartphone evolution that bypassed the iPhone-driven minimalism trend that influenced Western web design.

  18. Article
    Ars Technica · 44w

    MIT student prints AI polymer masks to restore paintings in hours

    MIT graduate student Alex Kachkine developed a revolutionary art restoration technique using AI-generated polymer films that can restore damaged paintings in hours instead of months. The method creates transparent masks with thousands of precisely color-matched regions that can be applied to artwork and removed when needed, making restoration reversible. An AI model identified damage patterns and generated over 57,000 different colors to restore a 15th-century painting with 5,612 damaged regions in just 3.5 hours. This approach could help make the 70% of institutional art collections currently hidden due to damage accessible to the public again.

  19. Video
    Fireship · 29w

    Alibaba is going all in on Qwen…

    Alibaba announced a $52 billion three-phase roadmap to artificial superintelligence at their Apsara conference, targeting completion by 2032. Key releases include Qwen 3 Max, a trillion-parameter model trained on 36 trillion tokens using mixture-of-experts architecture; Qwen 3VL, an open-source vision-language model that tops the Clockbench benchmark; and Qwen 3 Omni, a multimodal model capable of processing visual, audio, and text inputs. The roadmap progresses from generalized understanding through autonomous action to self-iteration with physical world integration.

  20. Article
    DigitalOcean Community · 1y

    YOLOv12: The Next Big Leap in Real-Time Object Detection

    YOLOv12 introduces significant advancements in real-time object detection, leveraging attention-based mechanisms, optimized feature aggregation, and improved architectural design. These innovations make YOLOv12 faster, more accurate, and efficient compared to its predecessors and other end-to-end detectors like RT-DETR. The use of modules such as Area Attention (A²), Residual Efficient Layer Aggregation Networks (R-ELAN), and FlashAttention enhances its performance, offering significant improvements in speed and accuracy while maintaining low computational costs.

  21. Article
    GitHub Blog · 17w

    This year’s most influential open source projects

    GitHub Universe 2025's Open Source Zone featured twelve influential projects spanning diverse domains: Appwrite (backend platform), GoReleaser (Go release automation), Homebrew (macOS package manager), Ladybird (independent browser), Moondream (lightweight visual AI), Oh My Zsh (shell framework), OpenCV (computer vision library), OSPSB (security baseline), p5.js and Processing (creative coding), PixiJS (2D graphics engine), Spark (3D Gaussian Splatting renderer), and Zulip (threaded team chat). Each project showcases different aspects of open source innovation, from developer tooling to AI and graphics rendering.

  22. Article
    IEEE Spectrum · 27w

    Where Was This Photo Taken? AI Knows Instantly

    Researchers developed a machine learning model that matches street-level photos to aerial images for geolocation with 97% accuracy in initial narrowing and 82% for exact location. The system uses deep cross-view hashing with vision transformers to convert images into unique numerical fingerprints, making it twice as fast and using one-third the memory of competing models. The approach could benefit navigation systems when GPS fails, emergency response, and defense applications, though it needs further testing for real-world challenges like seasonal variations and cloud cover.
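    The matching step behind such hashing systems is simple once every image is a binary fingerprint: a street photo is located by the aerial tile whose hash is nearest in Hamming distance. A toy sketch with random stand-in hashes (not outputs of the paper's vision-transformer model, and much shorter than real fingerprints):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # 1000 aerial tiles, each reduced to a 64-bit binary fingerprint.
    aerial_hashes = rng.integers(0, 2, (1000, 64), dtype=np.uint8)

    # Simulated street photo: same scene as tile 421, but its hash differs
    # in 5 of the 64 bits (viewpoint change, lighting, season, ...).
    query = aerial_hashes[421].copy()
    flip = rng.choice(64, size=5, replace=False)
    query[flip] ^= 1

    dists = (aerial_hashes ^ query).sum(axis=1)  # Hamming distance to every tile
    best = int(dists.argmin())                   # index of the best-matching tile
    ```

    Bitwise XOR plus popcount is why hash matching is so much faster and lighter on memory than comparing dense float embeddings.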

  23. Article
    80 LEVEL · 17w

    Particle Simulator Controlled by Hand Gestures

    David Katz created a browser-based particle simulation controlled by hand gestures using only a webcam. The project uses MediaPipe for hand tracking and Three.js for rendering 100,000 particles that react to hand movements in real-time. The setup demonstrates how accessible gesture-controlled interactive graphics have become with modern web technologies.

  24. Article
    Valdemar · 20w

    OpenAGI launched something interesting - Lux

    OpenAGI released Lux, a foundation AI agent that controls computers through screenshots and action sequences rather than text. It outperforms competing solutions from OpenAI, Google, and Anthropic on real-world tasks (83.6% vs 69% for Gemini CUA), operates faster (~1 second per step), and costs 10× less. Unlike browser-only alternatives, Lux works across desktop applications including Excel, Slack, Adobe products, and IDEs. The model is available via API and SDK, with Intel collaboration underway for local laptop optimization.

  25. Article
    Hacker News · 24w

    zserge/grayskull: A tiny, dependency-free computer vision library in C for embedded systems, drones, and robotics.

    Grayskull is a minimalist computer vision library for microcontrollers and resource-constrained devices. Written in pure C99 as a single header file under 1000 lines, it requires no dependencies or dynamic memory allocation. The library provides grayscale image operations including filtering (blur, Sobel edges), thresholding (Otsu, adaptive), morphology (erosion, dilation), connected components, perspective warping, FAST/ORB feature detection for object tracking, and LBP cascades for face/vehicle detection. It includes PGM file I/O and uses integer-based operations optimized for embedded systems.
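    As an example of the kind of operation grayskull packs into its single header, here is Otsu's thresholding method in pure Python: pick the threshold that maximizes the between-class variance of the histogram. (Grayskull implements this in integer-only C99; this sketch uses floats for clarity.)

    ```python
    def otsu_threshold(pixels):
        """Return the 0-255 threshold maximizing between-class variance (Otsu)."""
        hist = [0] * 256
        for p in pixels:
            hist[p] += 1
        total = len(pixels)
        total_sum = sum(i * n for i, n in enumerate(hist))
        best_t, best_var = 0, -1.0
        w_b = sum_b = 0                  # background weight and intensity sum
        for t in range(256):
            w_b += hist[t]
            if w_b == 0:
                continue
            w_f = total - w_b            # foreground weight
            if w_f == 0:
                break
            sum_b += t * hist[t]
            mean_b = sum_b / w_b
            mean_f = (total_sum - sum_b) / w_f
            var_between = w_b * w_f * (mean_b - mean_f) ** 2
            if var_between > best_var:
                best_var, best_t = var_between, t
        return best_t

    # Bimodal toy "image": a dark blob (value 30) on a bright background (200).
    pixels = [30] * 500 + [200] * 500
    t = otsu_threshold(pixels)   # lands between the two modes
    ```

    A single histogram pass like this, with no allocation, is exactly the style of algorithm that fits a no-dependency embedded library.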