A detailed benchmark comparing ML training throughput on Tigris Object Storage versus AWS S3, using the S3 Connector for PyTorch on a g5.8xlarge instance. Results show Tigris delivers ~134 samples/sec on random-access data, within 3% of AWS S3's ~138 samples/sec. The post also introduces the Tigris Acceleration Gateway (TAG).
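The "within 3%" figure follows directly from the two reported throughput numbers; a quick arithmetic check (both values taken from the summary above):

```python
# Sanity-check the headline claim: Tigris within 3% of AWS S3 throughput.
tigris_sps = 134  # Tigris, random-access data (samples/sec)
aws_sps = 138     # AWS S3, random-access data (samples/sec)

gap = (aws_sps - tigris_sps) / aws_sps
print(f"Tigris trails AWS S3 by {gap:.1%}")  # prints "Tigris trails AWS S3 by 2.9%"
```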

13m read time From tigrisdata.com
Table of contents

- Background: why data loading is a bottleneck
- Benchmark setup
- Part 1: Random access — Tigris throughput scales near-linearly to 16 workers, where the GPU saturates at ~134 samples/sec
- Part 2: Sequential access (sharded) — GPU saturates at 8 workers regardless of shard size; sharding halves worker requirements versus random access
- Part 3: Multi-epoch training with TAG — warm-cache epochs complete 5.7x faster; GPU saturates at 4 workers versus 16 without caching
- Part 4: Entitlement benchmark — Tigris delivers samples at 46x GPU demand; TAG reaches ~200x at peak
- Summary
- Future work
- About TAG
