NVIDIA didn't want me to do this

A hands-on experiment building a cluster of four (then eight) NVIDIA DGX Spark machines to run large language models using tensor parallelism and RDMA over Converged Ethernet (RoCE). The video covers the hardware challenges of QSFP cable types (QSFP28 vs QSFP56), managed-switch configuration, SSH mesh setup, and benchmarking with models like Qwen 34B, Qwen VL 32B, Qwen 3.5 397B, and Kimi K2. Key findings: token generation scales well with more nodes for larger models, prompt-processing gains are more variable, and the 8-node cluster with ~1TB of VRAM can run 800GB models like Qwen 3.5 at 24 tokens/second.
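The capacity claim can be sanity-checked with simple arithmetic. A minimal sketch, assuming each DGX Spark contributes roughly 128 GB of unified memory (an assumption; the video's exact per-node figure may differ):

```python
# Rough capacity math for the 8-node cluster described above.
# Assumed figure: ~128 GB of unified memory per DGX Spark node.
NODE_MEMORY_GB = 128
NODES = 8

total_gb = NODE_MEMORY_GB * NODES   # aggregate memory across the cluster
model_gb = 800                      # approximate weight footprint cited for Qwen 3.5

print(total_gb)             # 1024 GB, i.e. ~1 TB
print(model_gb < total_gb)  # True: the 800GB model fits across 8 nodes
```

This is why the model only becomes runnable at the 8-node scale: four nodes would offer only ~512 GB, well short of the 800 GB of weights.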
