dgx-lab-benchmarks-vs-reality-day-4

Six days of intensive testing on NVIDIA's DGX Spark show that training performance matches the published benchmarks (80+ tokens/sec), but critical production issues emerge: GPU inference fails completely with standard PyTorch workflows, memory fragmentation limits training sessions to 2-3 hours, and invoking llama.cpp directly produces empty responses. The ARM64 + Blackwell + CUDA 13.0 combination requires expert-level workarounds, including manual cache clearing every 50 steps, CPU-based inference for evaluation, and Ollama for stable production inference. Using these documented mitigations, 7 models were trained successfully, reaching 70-84% accuracy.
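The "cache clearing every 50 steps" workaround mentioned above could look roughly like the sketch below. This is an illustration under stated assumptions, not the article's actual code: the names `should_clear`, `clear_caches`, and `train` are hypothetical, the training step is a placeholder, and the only PyTorch call used is `torch.cuda.empty_cache()`, which releases cached but unused allocator blocks.

```python
import gc

try:
    import torch  # optional; without it the sketch falls back to gc only
except ImportError:
    torch = None

CLEAR_EVERY = 50  # interval taken from the summary; tune for your workload


def should_clear(step: int, every: int = CLEAR_EVERY) -> bool:
    """Return True on steps 50, 100, 150, ... (steps count from 1)."""
    return step > 0 and step % every == 0


def clear_caches() -> None:
    """Collect unreferenced Python objects, then release cached CUDA blocks."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def train(num_steps: int, train_step=lambda step: None) -> list[int]:
    """Run a hypothetical loop; return the steps at which caches were cleared."""
    cleared = []
    for step in range(1, num_steps + 1):
        train_step(step)  # placeholder for forward/backward/optimizer step
        if should_clear(step):
            clear_caches()
            cleared.append(step)
    return cleared
```

Calling `train(120)` clears the caches at steps 50 and 100. Note that `empty_cache()` only returns unused cached memory to the driver; it does not free tensors that are still referenced, so it mitigates fragmentation rather than eliminating it.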

#machine-learning

#nvidia

#gpu

#pytorch

Oct 26, 2025 • 9m read time • From publish.obsidian.md

Table of contents

What NVIDIA Showed Us

My Testing Environment

The Results: What Matches

The Reality: What They Didn't Tell You

Root Cause: The ARM64 + Blackwell Triple Whammy

Precision Deep Dive

What This Means for Production

My Actual Numbers vs. NVIDIA's Claims

Recommendations

The Silver Lining

Lessons Learned

What I'd Tell NVIDIA

Conclusion: Powerful But Not Magic

Appendix: Full Experimental Data
