Best of Speech Recognition 2025

  1. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 1y

    Building a Real-time Voice RAG Agent

    Real-time voice interactions are becoming increasingly popular. This post provides a detailed, step-by-step guide on building a real-time Voice RAG Agent. Key components include using AssemblyAI for speech-to-text transcription, LlamaIndex for document-based answers, and Cartesia for generating seamless speech. The post includes a video and open-source code for easy implementation.
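
    The overall turn loop can be sketched as three stages wired together. The helper functions below are hypothetical stand-ins for the services named above (AssemblyAI for speech-to-text, LlamaIndex for retrieval, Cartesia for speech synthesis), not their actual APIs:

```python
# Illustrative voice-RAG turn; the three helpers are hypothetical stand-ins
# for AssemblyAI (STT), LlamaIndex (retrieval), and Cartesia (TTS).

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for a streaming speech-to-text call."""
    return audio_chunk.decode("utf-8")  # placeholder: pretend audio is text

def retrieve_answer(question: str, index: dict) -> str:
    """Stand-in for a document-index query (e.g. a LlamaIndex query engine)."""
    for key, passage in index.items():
        if key in question.lower():
            return passage
    return "Sorry, I could not find that in the documents."

def synthesize_speech(text: str) -> bytes:
    """Stand-in for a text-to-speech call; returns audio bytes."""
    return text.encode("utf-8")

def voice_rag_turn(audio_chunk: bytes, index: dict) -> bytes:
    """One conversational turn: hear, retrieve, speak."""
    question = transcribe(audio_chunk)
    answer = retrieve_answer(question, index)
    return synthesize_speech(answer)
```

    In the real agent each stage is streamed rather than called once per turn, which is where most of the latency savings come from.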

  2. Article
    Code with Andrea · 45w

    Build Flutter Apps FASTER with Claude Code Opus 4

    Claude Code Opus 4 was used to build a voice-activated timer Flutter app from scratch, demonstrating AI-assisted development workflow. The project involved native integrations like speech recognition and permissions, showcasing both strengths and limitations of AI coding tools. Key success factors include writing detailed requirements, using structured planning, actively reviewing generated code, and leveraging the most powerful AI models available. The workflow emphasizes breaking down complex tasks, maintaining context through documentation, and combining AI assistance with manual oversight for production-ready results.

  3. Article
    freeCodeCamp · 46w

    How to Build a Conversational AI Chatbot with Stream Chat and React

    A comprehensive guide to building a conversational AI chatbot that combines Stream Chat for real-time messaging with Web Speech API for voice input. The tutorial covers backend setup with Node.js and Express, frontend implementation with React and TypeScript, and integration of speech-to-text functionality. Key features include AI agent management, real-time transcription, microphone permission handling, and seamless voice-to-text message submission.

  4. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 48w

    An MCP-powered Voice Agent

    A technical demonstration of building a voice agent using Model Context Protocol (MCP) that can query databases and perform web searches. The system uses AssemblyAI for speech-to-text, Firecrawl for web search, Supabase as the database, LiveKit for orchestration, and Qwen3 as the LLM. The agent transcribes user speech, determines whether to query the database or search the web, and responds via text-to-speech.
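
    The routing step can be sketched as follows. In the actual demo the decision is made by the Qwen3 LLM choosing among MCP tool definitions; the keyword-based router below is only an illustration of the control flow, and the hint list is invented:

```python
# Illustrative tool-routing step. In the real agent, the Qwen3 LLM picks an
# MCP tool; this keyword router only demonstrates the decision the LLM makes.

DB_HINTS = ("customer", "order", "record", "table")  # invented hint words

def route(transcript: str) -> str:
    """Decide whether a transcribed request should hit the database or the web."""
    text = transcript.lower()
    if any(hint in text for hint in DB_HINTS):
        return "query_database"   # e.g. a Supabase query exposed as an MCP tool
    return "search_web"           # e.g. a Firecrawl search exposed as an MCP tool
```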

  5. Video
    Coding with Lewis · 1y

    This App Wanted $700 So I Built it Myself with Python

    The post discusses how the author built a voice dictation app using Python to avoid purchasing expensive software like Dragon Professional. Various speech recognition models, including Whisper by OpenAI, are explored. The implementation involves using the Whisper model with an Nvidia GPU and the keyboard library to transcribe and input speech in real-time. The process is demonstrated using PyCharm. The project also incorporates OCR for better context understanding and mentions contributions to an open-source project, Whisper Writer.
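
    The core loop can be sketched in a few lines. This assumes the `openai-whisper` and `keyboard` packages are installed; the model size, hotkey, and the fixed `clip.wav` path are illustrative choices, not the author's exact setup:

```python
# Sketch of a hotkey-driven dictation loop, assuming `openai-whisper` and
# `keyboard` are installed. Model, hotkey, and file path are illustrative.

def normalize_dictation(text: str) -> str:
    """Tidy raw Whisper output before typing it: trim and collapse whitespace."""
    return " ".join(text.split())

def main() -> None:
    import whisper    # pip install openai-whisper
    import keyboard   # pip install keyboard

    model = whisper.load_model("small")  # uses the GPU when available

    def dictate() -> None:
        # Assumes a recorder elsewhere saved the last utterance to clip.wav.
        result = model.transcribe("clip.wav")
        keyboard.write(normalize_dictation(result["text"]))

    keyboard.add_hotkey("ctrl+alt+d", dictate)  # push-to-transcribe hotkey
    keyboard.wait()

if __name__ == "__main__":
    main()
```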

  6. Article
    Hacker News · 31w

    Handy

    Handy is a free, open-source speech-to-text application that runs locally on your computer. It allows users to press a keyboard shortcut, speak, and have their words automatically transcribed and pasted into any text field. The app prioritizes privacy by keeping all voice processing on-device without sending audio to the cloud, and offers simple configuration options including push-to-talk mode and customizable key bindings.

  7. Article
    Deepgram · 1y

    Introducing Nova-3 Medical: The Future of AI-Powered Medical Transcription

Nova-3 Medical, the latest model from Deepgram, offers best-in-class medical speech-to-text. It captures vital details such as medication names and diagnostic terms while filtering out irrelevant noise, is HIPAA-compliant, and supports flexible customization through Keyterm Prompting. Nova-3 Medical achieves significantly lower Word Error Rate (WER) and Keyterm Error Rate (KER) than competing models, integrates with existing healthcare infrastructure, and is optimized for performance and scale.
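
    WER, the headline metric here, is the word-level edit distance between the reference and the hypothesis, divided by the reference length. A minimal self-contained implementation:

```python
# Word Error Rate (WER): (substitutions + insertions + deletions) / reference
# length, computed with the classic dynamic-programming edit distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

    KER is the same idea restricted to a list of key terms, which is why a model can have a good WER yet still garble the medication names that matter clinically.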

  8. Article
    Deepgram · 42w

    Announcing Deepgram Saga: The Voice OS for Developers

    Deepgram launches Saga, a Voice OS that allows developers to control their entire development workflow through natural speech commands. Saga integrates with existing tools like Cursor, MCP, and Slack, enabling developers to execute tasks across their tech stack without context switching. The platform can transform rough ideas into precise prompts, generate code from plain speech, manage end-to-end workflows, and structure thoughts into documentation. Unlike traditional voice assistants, Saga embeds directly into developer workflows rather than operating as a separate interface.

  9. Video
    Sam Witteveen · 43w

    Kyutai STT & TTS - A Perfect Local Voice Solution?

Kyutai has released separate speech-to-text and text-to-speech models that offer low-latency voice processing for English and French. The TTS model is only 1.6B parameters yet performs competitively with commercial solutions like ElevenLabs. While the models support voice cloning through embeddings, the voice-embedding model itself isn't released, for ethical reasons. Users can blend existing voice embeddings to create new voices but cannot generate embeddings from custom audio samples. The models show promise for local voice applications but are currently limited by language support and the restricted voice cloning capability.
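
    "Blending" embeddings amounts to taking a weighted combination of existing embedding vectors. The sketch below illustrates the idea with plain lists of floats; the vector shape and weights are illustrative, not Kyutai's actual embedding format:

```python
# Illustrative voice-embedding blend: a normalized weighted average of
# existing embedding vectors. Vector format and weights are invented.

def blend_voices(embeddings: list[list[float]],
                 weights: list[float]) -> list[float]:
    """Weighted average of voice embeddings; weights are normalized to sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]
    dim = len(embeddings[0])
    return [sum(norm[k] * embeddings[k][i] for k in range(len(embeddings)))
            for i in range(dim)]
```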

  10. Article
    Hacker News · 1y

    pipecat-ai/smart-turn

    Pipecat-AI's smart-turn is an open-source, community-driven audio turn detection model designed to improve the functionality of conversational voice AI systems. It uses Meta AI's Wav2Vec2-BERT as its backbone and aims to closely mimic human speech patterns beyond traditional voice activity detection. The model is still in its initial phases, currently supporting English with limited training data. Future goals include multi-language support, faster inference times, and broader dataset inclusivity. Contributions and experimentation from the community are encouraged.
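
    For context, the naive end-of-turn baseline that smart-turn improves on is plain voice activity detection: declare the turn over after a fixed run of silent frames. A minimal sketch, with illustrative thresholds:

```python
# The naive end-of-turn baseline that learned turn detection improves on:
# energy-based VAD with a silence "hangover". Thresholds are illustrative.

def turn_ended(frame_energies: list[float],
               silence_threshold: float = 0.01,
               hangover_frames: int = 8) -> bool:
    """True once the last `hangover_frames` frames all fall below the threshold."""
    if len(frame_energies) < hangover_frames:
        return False
    return all(e < silence_threshold for e in frame_energies[-hangover_frames:])
```

    This fires on any sufficiently long pause, including mid-sentence hesitations, which is exactly the failure mode a model trained on real speech patterns like smart-turn is meant to fix.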

  11. Article
    Deepgram · 42w

    How to Build a Speech-to-Text (STT) Note Taking App in Python

    A comprehensive guide to building a speech-to-text note-taking application using Python, Deepgram's API, and LLMs. The tutorial covers audio recording with pyaudio, transcription with speaker diarization and timestamps, and intelligent post-processing using structured outputs from Google's Gemini API to generate summaries, chapters, and action items. Includes complete code examples and discusses extensions like UI integration and Obsidian packaging.
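
    The shaping step between transcription and the LLM can be sketched as follows. The segment dicts here are an assumed shape for diarized, timestamped output, not Deepgram's exact response format:

```python
# Turning diarized, timestamped STT segments into readable note lines.
# The segment dict shape is assumed, not Deepgram's exact response format.

def format_notes(segments: list[dict]) -> str:
    """Render "[mm:ss] Speaker N: text" lines from diarized segments."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] "
                     f"Speaker {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)
```

    The tutorial then feeds text like this to Gemini with a structured-output schema to get summaries, chapters, and action items back as typed fields rather than free text.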

  12. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 39w

    Build a Multimodal Agentic RAG

    A comprehensive guide to building a multimodal agentic RAG system that processes both documents and audio files using speech input. The tutorial covers the complete workflow from data ingestion and audio transcription with AssemblyAI, to embedding storage in Milvus vector database, and orchestration with CrewAI Flows. The system allows users to query information using voice commands, with agents retrieving relevant context and generating cited responses. The implementation includes deployment using Beam for serverless containers and a Streamlit interface for user interaction.

  13. Article
    freeCodeCamp · 35w

    How to Build AI Speech-to-Text and Text-to-Speech Accessibility Tools with Python

    A comprehensive guide to building AI-powered accessibility tools for inclusive education using Python. Covers implementing speech-to-text functionality with OpenAI's Whisper (both local and API versions) and text-to-speech using Hugging Face's SpeechT5. Includes complete setup instructions for Windows, macOS, and Linux, practical code examples, troubleshooting tips, and discusses real-world applications for supporting neurodiverse learners in classrooms.

  14. Article
    Syncfusion · 1y

    Introducing the New Angular SpeechToText Component

    Explore the new Angular SpeechToText component by Syncfusion, which utilizes the Web Speech API to convert spoken words into text in real time. This component supports multiple languages, customizable UI elements, and advanced speech recognition features, making it ideal for voice-command apps, language-learning tools, and transcription services. Detailed setup instructions are provided to help users integrate this feature-rich component into their Angular projects.