NVIDIA PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Native Swift with MLX

NVIDIA PersonaPlex 7B now runs natively on Apple Silicon through a Swift/MLX library, enabling full-duplex speech-to-speech without the traditional ASR→LLM→TTS pipeline. The 16.7 GB PyTorch model was converted to a 4-bit quantized MLX format (~5.3 GB) and runs faster than real time on an M2 Max, at ~68 ms/step (real-time factor, RTF, of 0.87). The architecture, based on Kyutai's Moshi, passes 17 parallel audio token streams through a 7B temporal transformer and a Depformer. The library also supports streaming audio output via AsyncThrowingStream, system prompts for behavioral steering, and end-to-end testing via ASR round-trips. Key optimizations include eval() consolidation, bulk audio extraction, prefill batching, and Metal kernel fusion via MLX compile.
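To make the streaming claim concrete, here is a minimal sketch of how audio chunks can be delivered through Swift's standard AsyncThrowingStream. It is not the library's actual API: the names AudioChunk, streamAudio, and collectSampleCount are hypothetical, and the placeholder loop stands in for whatever the model does at each generation step.

```swift
import Foundation

// Hypothetical chunk type for this sketch: one generation step's worth of PCM samples.
struct AudioChunk {
    let samples: [Float]
}

// Hypothetical producer standing in for the model's step loop (not the library's API).
// Each iteration yields one chunk as soon as it is "decoded", so a consumer can
// start playback before generation finishes.
func streamAudio(steps: Int, samplesPerStep: Int) -> AsyncThrowingStream<AudioChunk, Error> {
    AsyncThrowingStream { continuation in
        let task = Task {
            for step in 0..<steps {
                if Task.isCancelled { break }
                // Placeholder for real decoding: fill the chunk with a constant value.
                let samples = [Float](repeating: Float(step), count: samplesPerStep)
                continuation.yield(AudioChunk(samples: samples))
            }
            continuation.finish()
        }
        // Stop producing if the consumer stops iterating or is cancelled.
        continuation.onTermination = { _ in task.cancel() }
    }
}

// Example consumer: iterate the stream and count the samples received.
func collectSampleCount(steps: Int, samplesPerStep: Int) async throws -> Int {
    var total = 0
    for try await chunk in streamAudio(steps: steps, samplesPerStep: samplesPerStep) {
        total += chunk.samples.count
    }
    return total
}
```

A real consumer would hand each chunk to an audio player instead of counting samples. The throwing variant of the stream lets a generation error surface at the `for try await` loop rather than being silently dropped.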

#swift

Mar 05 • 8m read time • From blog.ivan.digital

Table of contents

The Journey: From Transcription to Conversation
Why PersonaPlex? One Model Instead of Three
The Model: 16.7 GB → 5.3 GB
How a Single Model Handles Voice Conversation
Reusing the Mimi Codec
The Depformer: Per-Step Weight Switching
System Prompts: The Difference Between Rambling and Useful
Performance: Honest Numbers
Round-Trip Verification: One Library, End-to-End
Streaming Is Here
What We Built
Performance Optimizations
Try It
