Best of Speech Recognition · September 2024

  1. Article · Qt · 2y

    Examples of local LLM usage

    The post outlines how to run local large language models (LLMs) on a MacBook Pro M1 Max for generating images, transcribing audio, and summarizing text. The author uses stable-diffusion.cpp for image generation, whisper.cpp for audio transcription, and llama.cpp for text summarization, and provides detailed scripts and timing metrics for each use case.
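    The workflow described can be sketched with each tool's command-line interface. Binary names, model files, and flags below are assumptions based on common builds of these projects (for example, recent llama.cpp builds name the binary `llama-cli`), not the post's exact scripts:

```shell
# Transcribe audio with whisper.cpp; -otxt writes interview.wav.txt
# (model path is an assumption)
./main -m models/ggml-base.en.bin -f interview.wav -otxt

# Summarize the transcript with llama.cpp
# (binary name and quantized model file are assumptions)
./llama-cli -m models/llama-3-8b-instruct.Q4_K_M.gguf \
  -p "Summarize the following text: $(cat interview.wav.txt)"

# Generate an image with stable-diffusion.cpp
# (binary name `sd` and model file are assumptions)
./sd -m models/sd-v1-4.ckpt -p "a lighthouse at dawn" -o out.png
```

    Each tool runs fully offline once the model weights are downloaded, which is what makes this setup practical on a single MacBook Pro.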

  2. Article · Hacker News · 2y

    ictnlp/LLaMA-Omni: LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

    LLaMA-Omni is a high-quality, low-latency, end-to-end speech interaction model built on Llama-3.1-8B-Instruct. It generates both text and speech responses with latency as low as 226 ms, and was trained in less than 3 days on 4 GPUs. Setup involves cloning the repository, installing the required packages, and downloading models from Hugging Face and other sources. A Gradio web server can be used for interaction.
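    The setup steps summarized above might look like the following. Apart from the GitHub repository name, the package install step, model repository id, and demo entry point are assumptions for illustration, not verified commands from the README:

```shell
# Clone the repository
git clone https://github.com/ictnlp/LLaMA-Omni.git
cd LLaMA-Omni

# Install dependencies (exact install command is an assumption)
pip install -e .

# Download model weights from Hugging Face (repo id is an assumption)
huggingface-cli download ICTNLP/Llama-3.1-8B-Omni --local-dir models/

# Launch the Gradio web server (script name is hypothetical)
python gradio_demo.py
```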

  3. Article · Machine Learning News · 2y

    LLaMA-Omni: A Novel AI Model Architecture Designed for Low-Latency and High-Quality Speech Interaction with LLMs

    LLaMA-Omni, developed by researchers from the University of Chinese Academy of Sciences, is a novel AI model architecture designed for low-latency, high-quality speech interaction with large language models (LLMs). It integrates a speech encoder, speech adaptor, LLM, and streaming speech decoder to enable seamless speech-to-speech communication, bypassing intermediate text transcription. The model’s innovative design and the specialized InstructS2S-200K dataset allow it to outperform previous models in both content and style, achieving a remarkably low response latency of 226ms. Its efficient training process makes it a leading solution for real-time speech-based interactions.
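    The four-stage pipeline described above can be pictured as a conceptual sketch. Every function body here is an illustrative placeholder, not the paper's code; only the component names and their order come from the article:

```python
from typing import Iterator, List

def speech_encoder(waveform: List[float]) -> List[float]:
    """Speech encoder: raw audio -> acoustic features (placeholder)."""
    return [sum(waveform[i:i + 4]) for i in range(0, len(waveform), 4)]

def speech_adaptor(features: List[float]) -> List[float]:
    """Adaptor: project acoustic features into the LLM's embedding space."""
    return [f * 0.5 for f in features]

def llm_generate(embeddings: List[float]) -> Iterator[str]:
    """LLM: emit response tokens incrementally (placeholder)."""
    for i, _ in enumerate(embeddings):
        yield f"tok{i}"

def streaming_speech_decoder(tokens: Iterator[str]) -> Iterator[bytes]:
    """Streaming decoder: synthesize an audio chunk per token, so audible
    speech can begin before the full response is generated -- the source
    of the low first-chunk latency."""
    for tok in tokens:
        yield tok.encode()

def respond(waveform: List[float]) -> Iterator[bytes]:
    # End-to-end speech-to-speech: no intermediate text transcription step.
    embeddings = speech_adaptor(speech_encoder(waveform))
    return streaming_speech_decoder(llm_generate(embeddings))

chunks = list(respond([0.1] * 16))
```

    The key design point is that the decoder consumes tokens as a stream rather than waiting for the complete text response, which is what the bypass of intermediate transcription enables.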