Best of Speech Recognition 2024

  1. Article
    Hacker News · 1y

    The Accent Oracle

    A tool that claims to guess your native language by analyzing your English accent in less than 30 seconds.

  2. Article
    Laravel News · 1y

    Automatic speech recognition and transcription

Whisper.php is a PHP wrapper for whisper.cpp, a C/C++ port of OpenAI's Whisper model, created by Kyrian Obikwelu. Recently released as version 1.0.0, it enables fully local, API-free transcription with high- and low-level APIs, automatic model downloading, and support for multiple audio and output formats. It requires PHP's FFI extension and relies on platform-specific shared libraries that are downloaded automatically on first initialization. Whisper.php currently supports Linux and macOS, with Windows support in development.

  3. Article
    Noted · 2y

    Whisper WebUI - The Self-Hosted AI Transcriber

    Whisper WebUI is a powerful self-hosted AI tool designed for transcribing audio to text locally. It supports multiple subtitle formats and can handle tasks like translating audio files and transcribing YouTube videos. Installation is simplified using a Docker Compose stack, and it can leverage NVIDIA GPUs for faster processing. Whisper is highly versatile, supporting multilingual speech recognition and translation. Additional models can be integrated from Hugging Face. Security considerations are crucial when exposing it to the public.
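
    Emitting the subtitle formats mentioned above mostly comes down to timestamp formatting. A minimal Python sketch of SRT output (the function names are illustrative, not Whisper WebUI's actual API):

    ```python
    def format_srt_timestamp(seconds: float) -> str:
        """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
        ms = round(seconds * 1000)
        hours, rest = divmod(ms, 3_600_000)
        minutes, rest = divmod(rest, 60_000)
        secs, millis = divmod(rest, 1_000)
        return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

    def to_srt(segments: list[tuple[float, float, str]]) -> str:
        """Render (start, end, text) transcription segments as an SRT document."""
        blocks = []
        for index, (start, end, text) in enumerate(segments, start=1):
            blocks.append(
                f"{index}\n{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n{text}\n"
            )
        return "\n".join(blocks)
    ```

    Other subtitle formats (VTT, for example) differ mainly in the timestamp separator and header, so the same segment list can feed several writers.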

  4. Article
    Hacker News · 2y

    niedev/RTranslator: RTranslator is the world's first open source real-time translation app.

RTranslator is an open-source, offline real-time translation app for Android. It allows seamless conversation translation using Bluetooth headsets and phones, ensuring privacy by running AI models directly on the device. The app supports multiple languages and works in various modes, including conversation and walkie-talkie. It requires at least 6GB of RAM for optimal performance. The app is free and needs no configuration, using Meta's NLLB for translation and OpenAI's Whisper for speech recognition.

  5. Article
    Community Picks · 2y

    Next.js Audio Transcription App Development Guide

    This post teaches how to use OpenAI's speech-to-text API in a Next.js audio transcription app. It covers uploading audio files, setting up OpenAI, making API requests, updating the UI, fixing layout shifts, and adding a loading indicator to the submit button.

  6. Article
    habr · 2y

    Transcribe Audio and Video Locally with Whisper WebGPU! No Internet Needed

    Learn how to transcribe audio and video files locally using OpenAI's Whisper model and WebGPU technology, eliminating the need for an internet connection. The setup involves using Git, Node.js, and configuring your browser to support WebGPU. The Whisper WebGPU project leverages Hugging Face's Transformers.js and ONNX Runtime Web for real-time, in-browser processing, supporting 100 languages and enhancing privacy.

  7. Article
    DEV · 2y

Echodiary: AI-Powered Diary with AWS Amplify

Echodiary is a user-friendly diary web app built with AWS Amplify Gen 2. It lets users document their daily experiences by voice or by writing, with the option to enhance entries with photos. The app uses AI for content enhancement, weekly highlights, and personalized suggestions for mental health and personal growth.

  8. Article
    asayer · 2y

    Adding Speech Navigation to a Website

Learn how to incorporate speech navigation into a website using JavaScript's Web Speech API: configure the API for speech recognition, set up the HTML structure and CSS styling, map speech commands to navigation actions, handle recognition errors, and optimize the code for efficiency. The result is a hands-free, more accessible, and more inclusive browsing experience; the article also walks through potential pitfalls and implementation considerations.
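
    The command-mapping step is language-agnostic; here is a minimal Python sketch of the same idea (in the article, the equivalent logic sits in the Web Speech API's result handler; the phrase-to-route table below is invented for illustration):

    ```python
    def match_command(transcript: str, commands: dict[str, str]) -> str | None:
        """Map a recognized phrase to a navigation target via substring matching.

        Speech recognizers often pad commands with filler ("please go home"),
        so we look for the command phrase anywhere in the transcript.
        """
        normalized = transcript.lower().strip()
        for phrase, target in commands.items():
            if phrase in normalized:
                return target
        return None  # unrecognized command; the caller should prompt the user

    # Hypothetical phrase-to-route table for a small site.
    NAV_COMMANDS = {
        "go home": "/",
        "open contact": "/contact",
        "show products": "/products",
    }
    ```

    Returning `None` for unmatched phrases, rather than raising, keeps the recognition loop alive so the user can simply retry.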

  9. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 1y

    [Hands-on] Building a Real-Time AI Voice Bot

    This post guides readers through the process of building a real-time AI voice bot using AssemblyAI for transcription, OpenAI for generating responses, and ElevenLabs for speech generation. It includes prerequisites, implementation steps, and methods for completing the application. Additionally, it highlights the performance improvements in AssemblyAI's latest speech-to-text model, Universal-2, compared to its predecessor.
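
    The post's pipeline reduces to three stages chained per turn. A structural sketch with the vendor calls stubbed out (the `transcribe`, `respond`, and `synthesize` callables stand in for the AssemblyAI, OpenAI, and ElevenLabs clients the article wires up concretely):

    ```python
    from typing import Callable

    def voice_bot_turn(
        audio: bytes,
        transcribe: Callable[[bytes], str],   # e.g. AssemblyAI speech-to-text
        respond: Callable[[str], str],        # e.g. an OpenAI chat completion
        synthesize: Callable[[str], bytes],   # e.g. ElevenLabs text-to-speech
    ) -> bytes:
        """Run one turn of the bot: user speech in, synthesized reply out."""
        user_text = transcribe(audio)
        reply_text = respond(user_text)
        return synthesize(reply_text)
    ```

    Keeping the three stages behind plain callables makes each vendor swappable, and end-to-end latency is simply the sum of the three calls, which is why the article's interest in faster STT models like Universal-2 matters for the whole bot.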

  10. Article
    Qt · 2y

    Examples of local LLM usage

    The post outlines how to use local large language models (LLMs) on a MacBook Pro M1 Max for generating images, extracting text from audio, and summarizing text. The author describes the use of stable-diffusion.cpp for image generation, whisper.cpp for audio transcription, and llama.cpp for text summarization. Detailed scripts and time performance metrics are provided for each use case.
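
    The audio-transcription step shells out to whisper.cpp. A hedged Python sketch of building such an invocation (the flags follow whisper.cpp's CLI, but the binary name and options vary between versions, so treat the exact strings as assumptions):

    ```python
    import subprocess

    def whisper_cpp_cmd(model_path: str, audio_path: str, text_output: bool = True) -> list[str]:
        """Build a whisper.cpp command line for transcribing one audio file."""
        cmd = ["./main", "-m", model_path, "-f", audio_path]
        if text_output:
            cmd.append("-otxt")  # ask whisper.cpp to write a .txt transcript
        return cmd

    def transcribe(model_path: str, audio_path: str) -> None:
        """Run the transcription; requires a compiled whisper.cpp binary."""
        subprocess.run(whisper_cpp_cmd(model_path, audio_path), check=True)
    ```

    The same subprocess pattern applies to the stable-diffusion.cpp and llama.cpp steps, each with its own model and flags.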

  11. Article
    Machine Learning News · 2y

    SpeechBrain: A PyTorch-based Speech Toolkit

    SpeechBrain is a PyTorch-based toolkit designed to address the complexities of modern speech and audio processing tasks, including automatic speech recognition, text-to-speech synthesis, and speaker recognition. It offers a modular and flexible framework that leverages PyTorch’s efficient tensor operations and GPU acceleration to enable faster training and inference. Researchers and developers can experiment with different neural network architectures and techniques to adapt models to specific tasks and datasets, achieving state-of-the-art results.

  12. Article
    Medium · 2y

    AI-Driven Real-Time Communication

    Discover how AI-driven real-time voice translation and synthesis technology is revolutionizing communication. With advanced speech recognition and machine translation, this system captures and translates speech with high accuracy, preserving emotional nuances and context. It synthesizes natural-sounding voices, customizable to user preferences, ensuring timely and fluid interactions. This technology is designed for seamless integration with existing platforms, enhancing business communication, customer support, and personal interactions while prioritizing user experience and ethical considerations.

  13. Article
    Hacker News · 2y

    ictnlp/LLaMA-Omni: LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

LLaMA-Omni is a high-quality, low-latency, end-to-end speech interaction model built on Llama-3.1-8B-Instruct. It can generate both text and speech responses with latency as low as 226ms. The model was trained in less than 3 days using 4 GPUs. Setup involves cloning the repository, installing necessary packages, and downloading models from Hugging Face and other sources. A Gradio web server can be used for interaction.

  14. Article
    Community Picks · 2y

    Developing Multi-Modal Bots with Django, GPT-4, Whisper, and DALL-E

    Learn how to develop a multi-modal bot using Django, GPT-4, Whisper, and DALL-E. The tutorial covers integrating artificial intelligence into web applications, creating a multi-modal bot that understands and responds to user inputs in various forms (text, voice, and images), and leveraging models like Whisper for speech transcription, GPT-4 for text generation, and DALL-E for image generation.
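
    The multi-modal routing the tutorial describes can be sketched as a dispatcher that normalizes voice input into the text path (handler names are illustrative, not the tutorial's actual Django views):

    ```python
    def dispatch(kind: str, payload, handlers: dict):
        """Route one user input to the model for its modality.

        Voice is transcribed first (e.g. with Whisper) and then handled
        exactly like typed text, so text and voice share one code path.
        """
        if kind == "voice":
            payload = handlers["transcribe"](payload)
            kind = "text"
        if kind not in handlers:
            raise ValueError(f"unsupported modality: {kind}")
        return handlers[kind](payload)  # e.g. GPT-4 for text, DALL-E for images
    ```

    Funneling voice into the text handler is the key design choice: every downstream feature written for text input works for spoken input for free.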

  15. Article
    GoPenAI · 2y

    Create a Simple Voice-to-Voice Translation App with Python

    Learn to create a simple voice-to-voice translator application using Python, integrating Gradio for the interface, AssemblyAI for voice recognition, the translate library for text translation, and ElevenLabs for text-to-speech conversion. The process involves setting up the environment, obtaining necessary API keys, transcribing audio to text, translating text into multiple languages, and converting translated text back into speech. The tutorial also covers building a user-friendly interface using Gradio.
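
    The translate-into-multiple-languages step can be sketched as a fan-out over per-language translator callables (stubs below; the article itself uses the `translate` library with real API keys, so the callables here are placeholders):

    ```python
    from typing import Callable

    def translate_all(text: str, translators: dict[str, Callable[[str], str]]) -> dict[str, str]:
        """Translate one transcript into every configured target language."""
        return {lang: translate(text) for lang, translate in translators.items()}

    def voice_to_voice(audio, transcribe, translators, synthesize):
        """Full pipeline: speech -> text -> per-language translations -> speech."""
        text = transcribe(audio)  # e.g. AssemblyAI
        return {
            lang: synthesize(translated)  # e.g. ElevenLabs
            for lang, translated in translate_all(text, translators).items()
        }
    ```

    A Gradio interface then just exposes `voice_to_voice` as the function behind a microphone input and one audio output per target language.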

  16. Article
    GoPenAI · 2y

🚀 Revolutionizing Document Interaction: An AI-Powered PDF Voice-2-Voice Chatbot Using LlamaIndex 🐑, Langchain 🔗, Azure AI Speech 🎤, and Google Audio 🔊

    Experience a breakthrough in document interaction with an AI-driven PDF voice-2-voice chatbot. Utilizing LlamaIndex, Langchain, Azure AI Speech, and Google Audio, this advanced system allows for seamless verbal interactions with PDF documents, enhancing accessibility and usability. Learn about its evolution from text-based dialogue to voice-enabled functionalities and explore the technical components, including dependencies, document handling, and user interaction features.

  17. Article
    Hacker News · 2y

    homebrewltd/ichigo: Llama3.1 learns to Listen

    Ichigo, previously known as llama3-s, is a custom-built early-fusion speech model with improved multiturn capabilities and the ability to refuse inaudible queries. This model was rebranded and continues to evolve with cleaner data and enhanced functionality. It leverages techniques inspired by Meta's Chameleon paper and incorporates noise-synthetic data for better user experience. The project is open for collaboration and aims to advance text-based LLMs to have native listening capabilities.

  18. Article
    Hacker News · 2y

    lifeiteng/OmniSenseVoice: Omni SenseVoice: High-Speed Speech Recognition with words timestamps 🗣️🎯

    Omni SenseVoice, built on SenseVoice, offers high-speed and precise audio transcription with features like automatic language detection, GPU support, and quantized models for faster processing. Users can achieve up to 50x faster processing without sacrificing accuracy. The tool can be easily installed via pip and provides several key options for customization.

  19. Article
    InfoWorld · 2y

    11 surprising ways developers are using Wasm

    WebAssembly, or Wasm, has a wide range of surprising applications including speech decoding, data analysis, old video games, functions as a service, and plugin integrations.

  20. Article
    Machine Learning News · 2y

    LLaMA-Omni: A Novel AI Model Architecture Designed for Low-Latency and High-Quality Speech Interaction with LLMs

    LLaMA-Omni, developed by researchers from the University of Chinese Academy of Sciences, is a novel AI model architecture designed for low-latency, high-quality speech interaction with large language models (LLMs). It integrates a speech encoder, speech adaptor, LLM, and streaming speech decoder to enable seamless speech-to-speech communication, bypassing intermediate text transcription. The model’s innovative design and the specialized InstructS2S-200K dataset allow it to outperform previous models in both content and style, achieving a remarkably low response latency of 226ms. Its efficient training process makes it a leading solution for real-time speech-based interactions.

  21. Article
    Hacker News · 2y

    ShaShekhar/aaiela

    This project allows users to modify images using audio commands. It incorporates several AI models including Detectron2 for object detection, Faster Whisper for audio transcription, and Stable Diffusion for text-to-image inpainting. Users can upload an image, give an audio command, and see the image modified based on their spoken instructions. The project also supports both local and API-based language models, with adjustable settings for customization.

  22. Article
    Community Picks · 2y

    How to have your AI stack locally (Vision, Chat, TTS, STT, Image Generation and RAG)

The guide outlines the steps to set up a local AI stack covering Vision, Chat, TTS (Text-to-Speech), STT (Speech-to-Text), Image Generation, and RAG (Retrieval-Augmented Generation). Key requirements include a high-end GPU with 16GB VRAM, Docker, and Docker Compose. The post covers installation and configuration of various models and services such as Ollama for running LLMs, openedai-speech for TTS, faster-whisper for STT, searxng for private search, and SD.Next for image generation. Lastly, it integrates these services with Open WebUI for easy access and management.