Yelp's infrastructure team built a scalable system to process terabytes of S3 server access logs daily, converting raw logs to Parquet format to reduce storage by 85% and object count by 99.99%. The solution uses AWS Athena for querying, partition projection for efficient data access, and S3 batch operations for lifecycle

20m read time. From engineeringblog.yelp.com
Table of contents
Introduction
What are S3 server access logs?
Why we want them (how to use them)
Parquet format to the rescue
Volume of logs
Architecture
Infrastructure setup
Reading S3 server access logs
Compaction of SAL logs
Joining S3 inventory with S3 server access logs
Future Work
Acknowledgements
Idiosyncrasy in key encoding
Mapping bucket to Glue Data Catalog
Verification of table counts
