The AMD MI300X, despite its superior on-paper specifications and lower total cost of ownership compared to Nvidia's H100 and H200 GPUs, fails to meet expectations in real-world training performance due to significant software issues. Out of the box, Nvidia's performance and usability are far superior to AMD's software stack.

50 min read · From semianalysis.com
Table of contents
- Intro
- Key Findings
- Executive Recommendation to AMD
- A Summary of the AMD vs Nvidia Narrative
- General Matrix Multiply (GEMM) Performance
- Popular GEMM Benchmark Isn’t Accurate
- HBM Memory Bandwidth Performance
- AMD Hand-Crafted VIP Custom Builds and WIP Development Builds
- Dec 21st AMD Development Builds
- Training Testing Methodology (GPT1.5B, Llama 8B, Llama 70B, Mistral)
- Single Node Training Performance
- Multi-Node Training Performance
- AMD PYTORCH_TUNABLE_OPS FLAG is a Bad User Experience
- Scale Up NVLink/xGMI Topology
- All Reduce/All to All/Reduce Scatter/All Gather Collectives Overview
- Single Node NCCL Collective
- Multi Node RCCL/NCCL Collectives and Scale Out Network Benchmarks
- AMD’s User Experience is Suboptimal and the MI300X is Not Usable Out of the Box
- Exploring Ideas for Better Performance on AMD
- AMD’s Forked Libraries
- Detailed Recommendations to AMD on How to Fix Their Software
- H100/H200/MI300X Networking BoM Analysis and Performance per TCO
