NVIDIA PersonaPlex 7B now runs natively on Apple Silicon via a Swift/MLX library, enabling full-duplex speech-to-speech without the traditional ASR→LLM→TTS pipeline. The 16.7 GB PyTorch model was converted to a 4-bit quantized MLX format (~5.3 GB) and runs faster than real time on an M2 Max, at ~68 ms per step (real-time factor 0.87). The architecture, based on Kyutai's Moshi, passes 17 parallel audio token streams through a 7B temporal transformer and a Depformer. The library also supports streaming audio output via AsyncThrowingStream, system prompts for behavioral steering, and end-to-end testing via ASR round-trips. Key optimizations include eval() consolidation, bulk audio extraction, prefill batching, and Metal kernel fusion via MLX compile.
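
To make the "faster than real time" claim concrete, here is a minimal sketch of how a real-time factor can be derived from a per-step latency. It assumes each generation step produces one audio frame at 12.5 Hz (80 ms of audio), which is the frame rate of Moshi's Mimi codec and is not stated in this summary. The function name `realTimeFactor` is hypothetical and not part of the library. With the rounded ~68 ms figure this gives about 0.85, slightly below the reported 0.87, so treat it as an illustration of the arithmetic rather than a reproduction of the benchmark.

```swift
import Foundation

/// Real-time factor: compute time per step divided by the duration of
/// audio that one step produces. Values below 1.0 mean the model
/// generates audio faster than it is played back.
/// (Hypothetical helper for illustration; not part of the library.)
func realTimeFactor(stepMs: Double, frameRateHz: Double) -> Double {
    let frameMs = 1000.0 / frameRateHz  // audio duration covered by one step
    return stepMs / frameMs
}

// Assumption: one step per codec frame at 12.5 Hz (80 ms of audio).
let rtf = realTimeFactor(stepMs: 68.0, frameRateHz: 12.5)
print(String(format: "RTF ≈ %.2f", rtf))

// Size of the quantized model relative to the original checkpoint.
let sizeRatio = 5.3 / 16.7
print(String(format: "Quantized size ≈ %.0f%% of the original", sizeRatio * 100))
```

Any result below 1.0 leaves headroom within each frame, which is what makes streaming output feasible.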

8 min read · From blog.ivan.digital
Table of contents
The Journey: From Transcription to Conversation
Why PersonaPlex? One Model Instead of Three
The Model: 16.7 GB → 5.3 GB
How a Single Model Handles Voice Conversation
Reusing the Mimi Codec
The Depformer: Per-Step Weight Switching
System Prompts: The Difference Between Rambling and Useful
Performance: Honest Numbers
Round-Trip Verification: One Library, End-to-End
Streaming Is Here
What We Built
Performance Optimizations
Try It
