Yelp built a scalable pipeline for processing Amazon S3 server-access logs by converting terabytes of daily plaintext logs into compact Parquet files, reducing storage by 85% and object count by 99.99%. The architecture uses AWS Glue Data Catalog, Lambda functions, and Athena for querying, enabling efficient permission debugging, cost attribution, and incident investigation. The system handles delayed or duplicate log delivery through idempotent inserts and automated lifecycle management. This approach demonstrates that object-level S3 logging can be made operationally manageable at scale, providing a reference architecture for organizations needing data governance, auditing, and cost visibility in cloud storage environments.
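
The summary does not say how the idempotent inserts are implemented. As a rough illustration of the idea only, the sketch below keys each log record on a request ID so that re-delivered batches add nothing new, while late records are still accepted. The record shape, the `request_id` field name, and the `idempotent_insert` helper are all hypothetical, and an in-memory dict stands in for the real table.

```python
from typing import Dict, Iterable

# Hypothetical record shape; the field names are assumptions, not Yelp's schema.
LogRecord = Dict[str, str]


def idempotent_insert(table: Dict[str, LogRecord], batch: Iterable[LogRecord]) -> int:
    """Insert only records whose request ID is not already present.

    Returns the number of rows actually added, so replaying a batch returns 0.
    """
    added = 0
    for record in batch:
        key = record["request_id"]  # assumed dedup key
        if key not in table:
            table[key] = record
            added += 1
    return added


# An in-memory dict stands in for the compacted table.
table: Dict[str, LogRecord] = {}

batch = [
    {"request_id": "A1", "operation": "REST.GET.OBJECT", "key": "data/a.txt"},
    {"request_id": "B2", "operation": "REST.PUT.OBJECT", "key": "data/b.txt"},
]

first = idempotent_insert(table, batch)

# Simulate duplicate delivery of the same batch plus one late-arriving record.
redelivered = batch + [
    {"request_id": "C3", "operation": "REST.GET.OBJECT", "key": "data/c.txt"},
]
second = idempotent_insert(table, redelivered)
```

Because the insert depends only on whether a key is already present, running it any number of times over the same input leaves the table in the same state, which is the property that makes delayed or duplicated log delivery safe to reprocess.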

4m read time. From infoq.com