Production Kubernetes environments waste GPU resources when lightweight models like ASR and TTS occupy entire GPUs. This post benchmarks two GPU partitioning strategies—NVIDIA MIG (hardware-level) and time-slicing (software-level)—using a voice AI pipeline with ASR, TTS, and LLM workloads on three A100 GPUs. MIG achieved ~1.00
Table of contents
- Addressing GPU resource fragmentation
- Architecture: Partitioning strategies
- Experimental setup: The voice AI pipeline
- Our hypothesis
- Experiment
- Results
- Recommendations for partitioning
- Get started
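Before diving in, here is a minimal sketch of how the two strategies typically surface in a Kubernetes cluster. The resource names, MIG profile, and replica count below are illustrative assumptions, not the exact manifests used in this benchmark: with MIG, the NVIDIA device plugin exposes each hardware slice as its own schedulable resource, while time-slicing is enabled through a device plugin configuration that advertises multiple replicas of a whole GPU.

```yaml
# Hypothetical illustration of the two partitioning modes; the profile name,
# image, and replica count are assumptions, not this post's exact setup.

# 1) MIG (hardware-level): a pod requests a specific MIG slice exposed by the
#    NVIDIA device plugin running with the "mixed" MIG strategy.
apiVersion: v1
kind: Pod
metadata:
  name: asr-worker
spec:
  containers:
    - name: asr
      image: example.com/asr:latest     # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1     # one 1g.10gb slice (80 GB A100 example)
---
# 2) Time-slicing (software-level): the device plugin config advertises several
#    schedulable replicas of each physical GPU; pods still request plain
#    nvidia.com/gpu and share the hardware in time slices.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

In the MIG case the scheduler places pods onto isolated hardware partitions with dedicated memory and compute, whereas with time-slicing the advertised replicas all map onto the same physical GPU and take turns on it, which is exactly the hardware-versus-software distinction the benchmark explores.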