Best of Speech Recognition — 2024
- 2
Laravel News·1y
Automatic speech recognition and transcription
Whisper.php, created by Kyrian Obikwelu, is a PHP wrapper for whisper.cpp, the C/C++ port of OpenAI's Whisper model. Recently released as version 1.0.0, it enables fully local transcription with no external API calls, offering both high- and low-level APIs, automatic model downloading, and support for various audio and output formats. It requires PHP's FFI extension and relies on platform-specific shared libraries, which are downloaded automatically on first initialization. Whisper.php currently supports Linux and macOS, with Windows support in development.
- 3
Noted·2y
Whisper WebUI - The Self-Hosted AI Transcriber
Whisper WebUI is a self-hosted AI tool for transcribing audio to text locally. It supports multiple subtitle formats and handles tasks like translating audio files and transcribing YouTube videos. Installation is simplified with a Docker Compose stack, and it can leverage NVIDIA GPUs for faster processing. The underlying Whisper model is highly versatile, supporting multilingual speech recognition and translation, and additional models can be integrated from Hugging Face. Security hardening is essential before exposing the service to the public internet.
- 4
Hacker News·2y
niedev/RTranslator: RTranslator is the world's first open source real-time translation app.
RTranslator is an open-source, offline, real-time translation app for Android. It enables seamless conversation translation using Bluetooth headsets and phones, preserving privacy by running its AI models directly on the device. The app supports multiple languages and several modes, including conversation and walkie-talkie, and requires at least 6GB of RAM for optimal performance. It is free and needs no configuration, using Meta's NLLB for translation and OpenAI's Whisper for speech recognition.
- 5
Community Picks·2y
Next.js Audio Transcription App Development Guide
This post teaches how to use OpenAI's speech-to-text API in a Next.js audio transcription app. It covers uploading audio files, setting up OpenAI, making API requests, updating the UI, fixing layout shifts, and adding a loading indicator to the submit button.
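For reference, the core transcription request looks like this when made with OpenAI's Python SDK rather than the tutorial's Next.js code (the file name is a placeholder; `whisper-1` is the endpoint's standard model):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send the uploaded audio file to OpenAI's speech-to-text endpoint.
with open("recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # hosted Whisper model behind the API
        file=audio_file,
    )

print(transcript.text)  # plain-text transcription
```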
- 6
habr·2y
Transcribe Audio and Video Locally with Whisper WebGPU! No Internet Needed
Learn how to transcribe audio and video files locally using OpenAI's Whisper model and WebGPU, eliminating the need for an internet connection. The setup involves installing Git and Node.js and enabling WebGPU support in your browser. The Whisper WebGPU project leverages Hugging Face's Transformers.js and ONNX Runtime Web for real-time, in-browser processing, supporting 100 languages while keeping data private.
- 7
DEV·2y
Echodiary : AI-Powered Diary with AWS Amplify
Echodiary is a user-friendly diary web app built with AWS Amplify Gen 2. It lets users document their daily experiences by voice or by writing, with the option to enrich entries with photos. AI features provide content enhancement, weekly highlights, and personalized suggestions for mental health and personal growth.
- 8
asayer·2y
Adding Speech Navigation to a Website
Learn how to add speech navigation to a website using JavaScript's Web Speech API, enabling hands-free browsing and a more accessible, inclusive user experience. The article explains the Web Speech API and how to configure it for speech navigation, sets up the HTML structure and CSS styling, maps spoken commands to navigation actions, handles recognition errors, and optimizes the code, closing with potential pitfalls and implementation considerations.
- 9
Daily Dose of Data Science | Avi Chawla | Substack·1y
[Hands-on] Building a Real-Time AI Voice Bot
This post guides readers through the process of building a real-time AI voice bot using AssemblyAI for transcription, OpenAI for generating responses, and ElevenLabs for speech generation. It includes prerequisites, implementation steps, and methods for completing the application. Additionally, it highlights the performance improvements in AssemblyAI's latest speech-to-text model, Universal-2, compared to its predecessor.
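A minimal sketch of the bot's core loop in Python: the OpenAI chat call follows the current Python SDK, while `transcribe()` and `speak()` are hypothetical placeholders for the AssemblyAI streaming and ElevenLabs playback steps.

```python
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a concise voice assistant."}]

def respond(user_text: str) -> str:
    """Append the user's utterance, query the LLM, and keep the chat history."""
    history.append({"role": "user", "content": user_text})
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model works here
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

# Hypothetical glue code: transcribe() would wrap AssemblyAI's streaming SDK
# and speak() would wrap ElevenLabs text-to-speech playback.
# while True:
#     text = transcribe()    # microphone -> text
#     speak(respond(text))   # text -> LLM reply -> audio
```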
- 10
Qt·2y
Examples of local LLM usage
The post shows how to run local generative models on a MacBook Pro M1 Max to generate images, extract text from audio, and summarize text. The author uses stable-diffusion.cpp for image generation, whisper.cpp for audio transcription, and llama.cpp for text summarization, providing detailed scripts and timing measurements for each use case.
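For the transcription step, the post drives whisper.cpp from shell scripts; a comparable Python wrapper might look like the sketch below (the binary and model paths are assumptions based on a default whisper.cpp build).

```python
import subprocess

def transcribe(wav_path: str) -> str:
    """Run the local whisper.cpp CLI and capture its transcript from stdout."""
    result = subprocess.run(
        [
            "./main",                         # whisper.cpp CLI (renamed whisper-cli in newer builds)
            "-m", "models/ggml-base.en.bin",  # locally downloaded GGML model
            "-f", wav_path,                   # 16 kHz WAV input
            "--no-timestamps",                # plain text, no segment timings
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(transcribe("meeting.wav"))
```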
- 11
Machine Learning News·2y
SpeechBrain: A PyTorch-based Speech Toolkit
SpeechBrain is a PyTorch-based toolkit designed to address the complexities of modern speech and audio processing tasks, including automatic speech recognition, text-to-speech synthesis, and speaker recognition. It offers a modular and flexible framework that leverages PyTorch’s efficient tensor operations and GPU acceleration to enable faster training and inference. Researchers and developers can experiment with different neural network architectures and techniques to adapt models to specific tasks and datasets, achieving state-of-the-art results.
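A minimal example of the pretrained-model interface SpeechBrain exposes (the model identifier is one of its published LibriSpeech recipes; the import path moved to `speechbrain.inference` in version 1.0):

```python
# SpeechBrain <1.0 import path; 1.0+ uses speechbrain.inference.ASR
from speechbrain.pretrained import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",  # pretrained recipe on Hugging Face
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("example.wav"))  # returns the decoded transcript
```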
- 12
Medium·2y
AI-Driven Real-Time Communication
Discover how AI-driven real-time voice translation and synthesis is changing communication. Combining speech recognition with machine translation, the system captures and translates speech with high accuracy while preserving emotional nuance and context, then synthesizes natural-sounding voices that can be customized to user preferences for timely, fluid interactions. It is designed to integrate with existing platforms, supporting business communication, customer support, and personal conversations while prioritizing user experience and ethical considerations.
- 13
Hacker News·2y
ictnlp/LLaMA-Omni: LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
LLaMA-Omni is a high-quality, low-latency, end-to-end speech interaction model built on Llama-3.1-8B-Instruct. It can generate both text and speech responses with latency as low as 226ms, and it was trained in under 3 days on 4 GPUs. Setup involves cloning the repository, installing the required packages, and downloading models from Hugging Face and other sources; a Gradio web server is provided for interaction.
- 14
Community Picks·2y
Developing Multi-Modal Bots with Django, GPT-4, Whisper, and DALL-E
Learn how to develop a multi-modal bot using Django, GPT-4, Whisper, and DALL-E. The tutorial covers integrating artificial intelligence into web applications, creating a multi-modal bot that understands and responds to user inputs in various forms (text, voice, and images), and leveraging models like Whisper for speech transcription, GPT-4 for text generation, and DALL-E for image generation.
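The three model calls at the heart of such a bot, sketched with OpenAI's Python SDK; the Django view wiring is omitted, and the model names are common defaults rather than necessarily the tutorial's choices.

```python
from openai import OpenAI

client = OpenAI()

# Speech -> text with Whisper
with open("voice_note.ogg", "rb") as f:
    text = client.audio.transcriptions.create(model="whisper-1", file=f).text

# Text -> text with GPT-4
answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": text}],
).choices[0].message.content

# Text -> image with DALL-E
image_url = client.images.generate(model="dall-e-3", prompt=answer).data[0].url
```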
- 15
GoPenAI·2y
Create a Simple Voice-to-Voice Translation App with Python
Learn to create a simple voice-to-voice translator application using Python, integrating Gradio for the interface, AssemblyAI for voice recognition, the translate library for text translation, and ElevenLabs for text-to-speech conversion. The process involves setting up the environment, obtaining necessary API keys, transcribing audio to text, translating text into multiple languages, and converting translated text back into speech. The tutorial also covers building a user-friendly interface using Gradio.
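A condensed version of the pipeline in Python: the AssemblyAI and `translate` calls match those libraries' documented interfaces, while `text_to_speech` stands in for the ElevenLabs step, whose SDK has changed across versions.

```python
import assemblyai as aai
from translate import Translator

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"  # placeholder credential

def voice_to_text(audio_path: str) -> str:
    """Transcribe a recording with AssemblyAI."""
    return aai.Transcriber().transcribe(audio_path).text

def translate_text(text: str, lang: str = "es") -> str:
    """Translate text with the lightweight `translate` library."""
    return Translator(to_lang=lang).translate(text)

# Hypothetical final step: text_to_speech(translated) would call the
# ElevenLabs SDK to synthesize and play the translated audio.
print(translate_text(voice_to_text("hello.wav")))
```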
- 16
GoPenAI·2y
🚀 Revolutionizing Document Interaction: An AI-Powered PDF Voice-2-Voice Chatbot Using LlamaIndex 🐑, Langchain 🔗, Azure AI Speech 🎤 and Google Audio 🔊
Experience a breakthrough in document interaction with an AI-driven PDF voice-2-voice chatbot. Utilizing LlamaIndex, Langchain, Azure AI Speech, and Google Audio, this advanced system allows for seamless verbal interactions with PDF documents, enhancing accessibility and usability. Learn about its evolution from text-based dialogue to voice-enabled functionalities and explore the technical components, including dependencies, document handling, and user interaction features.
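For the voice-input side, Azure AI Speech's Python SDK offers one-shot recognition along these lines (the key and region are placeholders, and the LlamaIndex/Langchain retrieval layer is omitted):

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY",  # placeholder credentials
    region="westeurope",
)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Ask a question about the PDF...")
result = recognizer.recognize_once()  # listens on the default microphone
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("You said:", result.text)   # this text would feed the retrieval chain
```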
- 17
Hacker News·2y
homebrewltd/ichigo: Llama3.1 learns to Listen
Ichigo, previously known as llama3-s, is a custom-built early-fusion speech model with improved multiturn capabilities and the ability to refuse inaudible queries. Since the rebrand it has continued to evolve with cleaner data and enhanced functionality, leveraging techniques inspired by Meta's Chameleon paper and incorporating synthetic noise data for a better user experience. The project is open for collaboration and aims to give text-based LLMs native listening capabilities.
- 18
Hacker News·2y
lifeiteng/OmniSenseVoice: Omni SenseVoice: High-Speed Speech Recognition with words timestamps 🗣️🎯
Omni SenseVoice, built on SenseVoice, offers high-speed and precise audio transcription with features like automatic language detection, GPU support, and quantized models for faster processing. Users can achieve up to 50x faster processing without sacrificing accuracy. The tool can be easily installed via pip and provides several key options for customization.
- 20
Machine Learning News·2y
LLaMA-Omni: A Novel AI Model Architecture Designed for Low-Latency and High-Quality Speech Interaction with LLMs
LLaMA-Omni, developed by researchers from the University of Chinese Academy of Sciences, is a novel AI model architecture designed for low-latency, high-quality speech interaction with large language models (LLMs). It integrates a speech encoder, speech adaptor, LLM, and streaming speech decoder to enable seamless speech-to-speech communication, bypassing intermediate text transcription. The model’s innovative design and the specialized InstructS2S-200K dataset allow it to outperform previous models in both content and style, achieving a remarkably low response latency of 226ms. Its efficient training process makes it a leading solution for real-time speech-based interactions.
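The described four-stage pipeline, rendered as a schematic PyTorch skeleton; the component interfaces are illustrative, not the paper's actual implementation.

```python
import torch.nn as nn

class SpeechToSpeechModel(nn.Module):
    """Schematic of the described pipeline: speech encoder -> adaptor ->
    LLM -> streaming speech decoder, with no intermediate text stage."""

    def __init__(self, encoder, adaptor, llm, speech_decoder):
        super().__init__()
        self.encoder = encoder                # speech -> acoustic features
        self.adaptor = adaptor                # features -> LLM embedding space
        self.llm = llm                        # Llama-3.1-8B-Instruct backbone
        self.speech_decoder = speech_decoder  # hidden states -> audio, streamed

    def forward(self, speech):
        hidden = self.llm(self.adaptor(self.encoder(speech)))
        return self.speech_decoder(hidden)    # speech response out
```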
- 21
Hacker News·2y
ShaShekhar/aaiela
This project allows users to modify images using audio commands. It incorporates several AI models including Detectron2 for object detection, Faster Whisper for audio transcription, and Stable Diffusion for text-to-image inpainting. Users can upload an image, give an audio command, and see the image modified based on their spoken instructions. The project also supports both local and API-based language models, with adjustable settings for customization.
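The transcription stage uses Faster Whisper, whose Python API looks like this (the model size and device settings are assumptions; the Detectron2 and inpainting stages are omitted):

```python
from faster_whisper import WhisperModel

# Load a locally cached CTranslate2-converted Whisper model
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("command.wav")
command = " ".join(seg.text.strip() for seg in segments)
print(f"[{info.language}] {command}")  # e.g. "replace the sky with a sunset"
```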
- 22
Community Picks·2y
How to have your AI stack locally (Vision, Chat, TTS, STT, Image Generation and RAG)
The guide outlines how to set up a local AI stack covering Vision, Chat, TTS (Text-to-Speech), STT (Speech-to-Text), Image Generation, and RAG (Retrieval-Augmented Generation). Key requirements include a high-end GPU with 16GB of VRAM, Docker, and Docker Compose. The post walks through installing and configuring the individual services: Ollama for running LLMs, openedai-speech for TTS, faster-whisper for STT, SearXNG for private search, and SD.Next for image generation, then integrates them all behind Open WebUI for easy access and management.
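Once the stack is running, each service is just a local HTTP endpoint. For example, Ollama's generate API can be queried like this (port 11434 is Ollama's default; the model name is whatever you have pulled):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "llama3",                   # any model fetched with `ollama pull`
        "prompt": "Summarize what RAG is in one sentence.",
        "stream": False,                     # return one JSON object, not a stream
    },
    timeout=120,
)
print(resp.json()["response"])
```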