S3 server access logs at scale

Yelp's infrastructure team built a scalable system that processes terabytes of S3 server access logs daily, converting the raw logs to Parquet to reduce storage by 85% and object count by 99.99%. The solution uses Amazon Athena for querying, partition projection for efficient data access, and S3 Batch Operations for lifecycle management. The system supports debugging access issues, cost attribution, and incident response, and it informs data retention decisions by joining S3 inventory with access logs to identify unused objects. Key challenges included handling malformed logs that contain user-controlled fields, working within Athena query limits, and keeping insertions idempotent across distributed systems.

#aws

#big-data

#data-engineering

#aws-s3

Nov 21, 2025 • 20m read time • From engineeringblog.yelp.com

Table of contents

Introduction
What are S3 server access logs?
Why we want them (how to use them)
Parquet format to the rescue
Volume of logs
Architecture
Infrastructure setup
Reading S3 server access logs
Compaction of SAL logs
Joining S3 inventory with S3 server access logs
Future Work
Acknowledgements
Idiosyncrasy in key encoding
Mapping bucket to Glue Data Catalog
Verification of table counts
