Using Pandas and Dask to work with large columnar datasets in Apache Parquet
[EuroPython 2018 - Talk - 2018-07-25 - Fintry [PyData]]
[Edinburgh, UK]
By Peter Hoffmann
Apache Parquet Data Format
Apache Parquet is a binary, efficient columnar data format. It uses various
techniques to store data in a CPU- and I/O-efficient way, such as row groups,
compression of pages within column chunks, and dictionary encoding for columns.
Index hints and statistics that let readers quickly skip over chunks of irrelevant
data enable efficient queries on large amounts of data.
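As a minimal sketch of these format features, the following writes a small DataFrame to Parquet with pyarrow, choosing compression, dictionary encoding, and a row group size, and then inspects the per-row-group statistics; the file name, column names, and parameter values are illustrative, not taken from the talk.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative data: a low-cardinality string column benefits from dictionary encoding.
df = pd.DataFrame({
    "country": ["DE", "UK", "UK", "FR"],
    "value": [1.0, 2.5, 3.7, 0.4],
})

table = pa.Table.from_pandas(df)
pq.write_table(
    table,
    "example.parquet",
    compression="snappy",      # compress pages in each column chunk
    use_dictionary=True,       # dictionary-encode columns
    row_group_size=100_000,    # rows per row group; statistics are stored per row group
)

# Column chunk statistics (min/max, null count) can be read from the file metadata
# and are what allow readers to skip irrelevant row groups.
meta = pq.ParquetFile("example.parquet").metadata
print(meta.num_row_groups, meta.row_group(0).column(0).statistics)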
Apache Parquet with Pandas & Dask
Apache Parquet files can be read into Pandas DataFrames with two libraries,
fastparquet and Apache Arrow. While Pandas is mostly used to work with data
that fits into memory, Apache Dask allows us to work with data larger than memory
and even larger than local disk space. Data can be split up into partitions
and stored in cloud object storage systems like Amazon S3 or Azure Storage.
Using metadata from the partition filenames, Parquet column statistics, and
dictionary filtering lets selective queries run faster without reading all of
the data. This talk will show how to use partitioning, row group skipping,
and general data layout to speed up queries on large amounts of data.
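As a minimal sketch of this approach (not code from the talk), the following reads a partitioned Parquet dataset from S3 with Dask and pushes a filter and a column projection down to the reader, so partitions and row groups that cannot match are skipped; the bucket path, partition column, and column names are assumptions for illustration.

import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://my-bucket/events/",                 # dataset partitioned e.g. into date=YYYY-MM-DD directories
    engine="pyarrow",                         # or "fastparquet"
    columns=["user_id", "value"],             # column projection: only the needed column chunks are read
    filters=[("date", ">=", "2018-07-01")],   # partition / row group skipping via filenames and statistics
)

# Computation is lazy; only the matching partitions are read when .compute() runs.
result = ddf.groupby("user_id")["value"].sum().compute()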
License: This video is licensed under the CC BY-NC-SA 3.0 license: https://creativecommons.org/licenses/...
Please see our speaker release agreement for details: https://ep2018.europython.eu/en/speak...