Taming One Quadrillion Data Points with Apache Iceberg and Parquet

PyData: Data Engineering
Terrace 2A
14:35 on 12 July 2024
30 minutes


Bloomberg is a leading provider of financial data, with datasets spanning multiple decades. Handling and organizing datasets of this size is challenging; typical concerns include sluggish query performance, high storage costs, and data consistency problems.

This talk will describe how Apache Iceberg and Parquet are the dynamic duo of big data management, offering ACID transactions, time travel, and columnar storage capabilities that enable lightning-fast query performance and seamless schema evolution for even our largest workloads.

The session will introduce Apache Iceberg, an open-source table format that enables incremental updates, versioning, and schema evolution. The discussion will then focus on Parquet files, which store data in a compressed and columnar format to enhance query performance and lower storage costs. Finally, the session will outline how our Enterprise Data Lake Applications engineering team has harnessed the capabilities of Apache Iceberg (especially PyIceberg) to revolutionize our data management and analytical processing workflows.
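
As a rough illustration of the table-format ideas named above (incremental updates, versioning, and schema evolution), here is a minimal toy sketch in plain Python. It is not the Iceberg or PyIceberg API: `ToyTable`, `Snapshot`, and the file names are hypothetical, and the model only assumes that every change commits a new immutable snapshot of table metadata.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: a schema plus the data files it references."""
    snapshot_id: int
    schema: tuple
    data_files: tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table format's snapshot log."""
    snapshots: list = field(default_factory=list)

    def current(self):
        # The latest snapshot is the current table state (None if nothing committed).
        return self.snapshots[-1] if self.snapshots else None

    def _commit(self, schema, data_files):
        # Every change appends a new snapshot; earlier ones are never modified.
        snap = Snapshot(len(self.snapshots) + 1, tuple(schema), tuple(data_files))
        self.snapshots.append(snap)
        return snap

    def append(self, *new_files):
        """Incremental update: reuse the existing file list and add new files."""
        cur = self.current()
        schema = cur.schema if cur else ()
        files = cur.data_files if cur else ()
        return self._commit(schema, files + new_files)

    def add_column(self, name):
        """Schema evolution: a metadata-only change, no data files are rewritten."""
        cur = self.current()
        schema = cur.schema if cur else ()
        files = cur.data_files if cur else ()
        return self._commit(schema + (name,), files)

    def as_of(self, snapshot_id):
        """Time travel: return the table state recorded under an earlier snapshot id."""
        return self.snapshots[snapshot_id - 1]


table = ToyTable()
table.add_column("ticker")
table.add_column("price")
table.append("part-0001.parquet")   # hypothetical Parquet file names
table.append("part-0002.parquet")
table.add_column("currency")

print(table.current().schema)       # schema after evolution
print(table.as_of(3).data_files)    # files visible at snapshot 3
```

In this sketch, reading an older snapshot shows the table exactly as it was at that commit, which is the intuition behind time travel; a real table format additionally tracks manifests, partitioning, and per-file statistics.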

Attendees will be able to apply the best practices discussed in the talk to build better infrastructure for their growing data demands and spur innovation within their organization.

The speaker

Gowthami Bhogireddy

Gowthami Bhogireddy is a Software Engineer on Bloomberg’s Enterprise Data Lake team. She uses distributed file systems, cloud table formats, and distributed query engines to build a data lake for historical financial data, which will empower clients to take full advantage of the company’s Enterprise Data products. Her team has ingested, cleaned, and enriched a quadrillion data points into the data lake, which continues to grow by at least hundreds of billions of new data points daily.