Six days of intensive testing on NVIDIA's DGX Spark show that training performance matches published benchmarks (80+ tokens/sec), but critical production issues emerge: GPU inference fails completely with standard PyTorch workflows, memory fragmentation limits training sessions to 2-3 hours, and invoking llama.cpp directly produces empty responses. The ARM64 + Blackwell + CUDA 13.0 combination requires expert-level workarounds, including manual cache clearing every 50 steps, CPU-based inference for evaluation, and Ollama for stable production inference. With the documented mitigations, 7 models were trained successfully, reaching 70-84% accuracy.
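
The "manual cache clearing every 50 steps" workaround might look roughly like the sketch below. This is an assumption about the shape of the mitigation, not the author's actual code: the helper names (`should_clear`, `clear_caches`, `train`) are hypothetical, and the only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.empty_cache()`. The import is guarded so the sketch also runs on a machine without PyTorch.

```python
import gc

try:
    import torch  # optional: the sketch still runs without PyTorch installed
except ImportError:
    torch = None

CLEAR_EVERY = 50  # interval taken from the summary; tune for your workload


def should_clear(step: int, every: int = CLEAR_EVERY) -> bool:
    """Return True on steps 50, 100, 150, ... (steps are counted from 1)."""
    return step > 0 and step % every == 0


def clear_caches() -> None:
    """Collect unreachable Python objects, then release cached CUDA blocks."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Returns unused cached memory to the driver; live tensors are untouched.
        torch.cuda.empty_cache()


def train(num_steps: int, train_step) -> int:
    """Call train_step(step) for each step; return how often caches were cleared."""
    cleared = 0
    for step in range(1, num_steps + 1):
        train_step(step)  # hypothetical callback: forward, backward, optimizer step
        if should_clear(step):
            clear_caches()
            cleared += 1
    return cleared
```

Calling `train(120, my_step)` with any per-step callback would clear the caches after steps 50 and 100. Note that `empty_cache()` only frees memory PyTorch has cached but is not using, so it can reduce fragmentation pressure without fixing the underlying cause.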

9 min read, from publish.obsidian.md
Table of contents

What NVIDIA Showed Us
My Testing Environment
The Results: What Matches
The Reality: What They Didn't Tell You
Root Cause: The ARM64 + Blackwell Triple Whammy
Precision Deep Dive
What This Means for Production
My Actual Numbers vs. NVIDIA's Claims
Recommendations
The Silver Lining
Lessons Learned
What I'd Tell NVIDIA
Conclusion: Powerful But Not Magic
Appendix: Full Experimental Data
