Yelp's infrastructure team built a scalable system to process terabytes of S3 server access logs daily, converting raw logs to Parquet format to reduce storage by 85% and object count by 99.99%. The solution uses Amazon Athena for querying, partition projection for efficient data access, and S3 Batch Operations for lifecycle management. The system supports debugging access issues, cost attribution, and incident response, and it informs data retention decisions by joining S3 inventory with access logs to identify unused objects. Key challenges included handling malformed logs with user-controlled fields, managing Athena query limits, and ensuring idempotent insertions across distributed systems.

20m read time
From engineeringblog.yelp.com
Table of contents
Introduction
What are S3 server access logs?
Why we want them (how to use them)
Parquet format to the rescue
Volume of logs
Architecture
Infrastructure setup
Reading S3 server access logs
Compaction of SAL logs
Joining S3 inventory with S3 server access logs
Future Work
Acknowledgements
Idiosyncrasy in key encoding
Mapping bucket to Glue Data Catalog
Verification of table counts
