Production Kubernetes environments waste GPU resources when lightweight models like ASR and TTS occupy entire GPUs. This post benchmarks two GPU partitioning strategies—NVIDIA MIG (hardware-level) and time-slicing (software-level)—using a voice AI pipeline with ASR, TTS, and LLM workloads on three A100 GPUs. MIG achieved ~1.00
Table of contents
- Addressing GPU resource fragmentation
- Architecture: Partitioning strategies
- Experimental setup: The voice AI pipeline
- Our hypothesis
- Experiment
- Results
- Recommendations for partitioning
- Get started
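Before diving in, here is a minimal sketch of how the two strategies typically surface in a Kubernetes cluster. The resource names, MIG profile, and replica count below are illustrative assumptions, not the exact manifests used in this benchmark: with MIG, the NVIDIA device plugin exposes each hardware slice as its own schedulable resource, while time-slicing is enabled through a device plugin configuration that advertises multiple replicas of a whole GPU.

```yaml
# Hypothetical illustration of the two partitioning modes; the profile name,
# image, and replica count are assumptions, not this post's exact setup.

# 1) MIG (hardware-level): a pod requests a specific MIG slice exposed by the
#    NVIDIA device plugin running with the "mixed" MIG strategy.
apiVersion: v1
kind: Pod
metadata:
  name: asr-worker
spec:
  containers:
    - name: asr
      image: example.com/asr:latest     # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1     # one 1g.10gb slice (80 GB A100 example)
---
# 2) Time-slicing (software-level): the device plugin config advertises several
#    schedulable replicas of each physical GPU; pods still request plain
#    nvidia.com/gpu and share the hardware in time slices.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

In the MIG case the scheduler places pods onto isolated hardware partitions with dedicated memory and compute, whereas with time-slicing the advertised replicas all map onto the same physical GPU and take turns on it, which is exactly the hardware-versus-software distinction the benchmark explores.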