Optimizing Data Lakes: A Pipeline to Aggregate Small Files by Keith Gregory

Published: November 25, 2024
Channel: ChariotSolutions

Delivered at the Greater Philadelphia AWS User Group on February 8th, 2024. Presented by Keith Gregory, Chariot Solutions' Data Engineering & AWS Practice Lead. Hosted by Chariot Solutions.

Greater Philadelphia AWS User Group: https://www.meetup.com/gpawsug/
Chariot Solutions: https://chariotsolutions.com/

Overview:
Small files are the bane of a data lake, increasing your query times and processing costs. However, you often don’t get to control the data that you receive. For example, CloudTrail writes one file for each account and region approximately every 15 minutes: dozens or even hundreds of files a day, some of which contain only a few events.

An Athena query against the raw CloudTrail data might take minutes to execute, with most of that time spent on the overhead of reading each file. By comparison, after aggregating the CloudTrail logs into one file per day, the same query takes only a few seconds.

In this talk, Keith Gregory walks through a data pipeline that uses Lambda to aggregate these files into a form that can be queried efficiently. He looks at the general design of such a pipeline, how to trigger it, how to monitor it, and how to be resilient to processing errors.
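
To make the aggregation step concrete, here is a minimal sketch in Python. It is not the code presented in the talk. It assumes the standard CloudTrail layout, where each object is a gzipped JSON document with a top-level "Records" array and the key contains .../CloudTrail/<region>/<YYYY>/<MM>/<DD>/. The function names group_keys_by_day and aggregate are hypothetical, and the S3 reads and writes that a Lambda would perform around them are left out.

```python
import gzip
import json
import re
from collections import defaultdict

# Assumed key layout: .../CloudTrail/<region>/<YYYY>/<MM>/<DD>/<file>.json.gz
DATE_IN_KEY = re.compile(r"/CloudTrail/[^/]+/(\d{4})/(\d{2})/(\d{2})/")


def group_keys_by_day(keys):
    """Group object keys by the date embedded in their path (hypothetical helper)."""
    groups = defaultdict(list)
    for key in keys:
        match = DATE_IN_KEY.search(key)
        if match:  # keys that do not match the assumed layout are skipped
            groups["-".join(match.groups())].append(key)
    return dict(groups)


def aggregate(raw_files):
    """Merge gzipped CloudTrail files into one gzipped newline-delimited JSON blob."""
    lines = []
    for blob in raw_files:
        document = json.loads(gzip.decompress(blob))
        for event in document.get("Records", []):
            # One event per line, a layout that query engines can read as JSON rows
            lines.append(json.dumps(event, separators=(",", ":")))
    return gzip.compress("\n".join(lines).encode("utf-8"))
```

In a Lambda handler, the contents of each day's group would be downloaded, passed to aggregate, and the result written back as a single object per day. Newline-delimited JSON is one reasonable output format here; a columnar format such as Parquet is another option that this sketch does not cover.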