Best of Speech Recognition, October 2024

  1. Article
    Noted · 2y

    Whisper WebUI - The Self-Hosted AI Transcriber

    Whisper WebUI is a self-hosted tool for transcribing audio to text locally. It supports multiple subtitle formats, can translate audio files, and can transcribe YouTube videos. Installation is simplified with a Docker Compose stack, and it can use NVIDIA GPUs for faster processing. The underlying Whisper model supports multilingual speech recognition and translation, and additional models can be pulled in from Hugging Face. Take security precautions before exposing the instance to the public.

  2. Article
    habr · 2y

    Transcribe Audio and Video Locally with Whisper WebGPU! No Internet Needed

    Learn how to transcribe audio and video files locally using OpenAI's Whisper model and WebGPU technology, eliminating the need for an internet connection. The setup involves using Git, Node.js, and configuring your browser to support WebGPU. The Whisper WebGPU project leverages Hugging Face's Transformers.js and ONNX Runtime Web for real-time, in-browser processing, supporting 100 languages and enhancing privacy.

  3. Article
    Machine Learning News · 2y

    SpeechBrain: A PyTorch-based Speech Toolkit

    SpeechBrain is a PyTorch-based toolkit designed to address the complexities of modern speech and audio processing tasks, including automatic speech recognition, text-to-speech synthesis, and speaker recognition. It offers a modular and flexible framework that leverages PyTorch’s efficient tensor operations and GPU acceleration to enable faster training and inference. Researchers and developers can experiment with different neural network architectures and techniques to adapt models to specific tasks and datasets, achieving state-of-the-art results.

  4. Article
    Hacker News · 2y

    homebrewltd/ichigo: Llama3.1 learns to Listen

    Ichigo, previously known as llama3-s, is a custom-built early-fusion speech model with improved multiturn capabilities and the ability to refuse inaudible queries. The model was rebranded and continues to evolve with cleaner data and enhanced functionality. It uses techniques inspired by Meta's Chameleon paper and incorporates noise-synthetic data for a better user experience. The project is open for collaboration and aims to give text-based LLMs native listening capabilities.

  5. Article
    Hacker News · 2y

    lifeiteng/OmniSenseVoice: Omni SenseVoice: High-Speed Speech Recognition with words timestamps 🗣️🎯

    Omni SenseVoice, built on SenseVoice, offers high-speed and precise audio transcription with features like automatic language detection, GPU support, and quantized models for faster processing. Users can achieve up to 50x faster processing without sacrificing accuracy. The tool can be easily installed via pip and provides several key options for customization.