Apr 18, 2024

How Data Lake Compression Reduces Carbon Emissions

The tech landscape is seeing a renaissance, particularly with developments in generative AI creating disruption and opportunities unseen since the internet became a tool for everyday people. But we must confront a stark reality: every sector contributes to carbon emissions, some unexpectedly so. Data centers, which are vital to our increasingly digital lifestyles and businesses, are also hefty emitters. It’s time to acknowledge and tackle these hidden sources head-on, ensuring both a viable planet for future generations and room for more technological growth.

Let’s look at some statistics. The airline industry, a well-recognized source of greenhouse gas (GHG) emissions, emitted 495 million metric tonnes of CO2 in 2020[1]. Compare that to data centers, which accounted for ~330 million metric tonnes of CO2, also in 2020[2]. Even with their much smaller physical footprint, data centers have a significant environmental impact.

This article explores the carbon footprint of data centers, in particular data lakes with their vast and fast-growing data storage, and recommends strategies to help reduce the environmental impact by seeking out and using green technology.

The Environmental Cost of Data Storage

As the volume of data grows exponentially (it is expected to reach 180 zettabytes by 2025[3]), the energy consumption and CO2 equivalent (CO2e) impact of storing this data become increasingly critical. Notably, the focus is often on 'hot' data, or data that is frequently accessed and processed, such as that stored in cloud data lakes and used for analytics, machine learning, and AI.

Storing 1 petabyte of data emits about 235 tons of CO2e annually

These active storage environments are vital for day-to-day business operations but are more energy-intensive than 'cold' or archival data storage. For example, storing data in the cloud at scale can emit approximately 235 tons of CO2e per petabyte (PB) of data per year[4]. Given that a single data center can hold hundreds of petabytes of data, it’s easy to see just how significant the emissions problem for data really is.
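To make the scale concrete, here is a minimal back-of-the-envelope sketch. It simply multiplies a storage volume by the ~235 tons of CO2e per petabyte per year figure cited above[4]; the function name and the sample volumes are illustrative assumptions, and real emissions will vary with region, hardware, and energy mix.

```python
# Emission factor quoted in this article [4]: ~235 tons of CO2e per PB per year.
# Treat it as a rough planning figure, not a measured value for any given cloud.
TONS_CO2E_PER_PB_YEAR = 235.0


def annual_storage_emissions(petabytes: float,
                             factor: float = TONS_CO2E_PER_PB_YEAR) -> float:
    """Estimate annual emissions (tons of CO2e) for `petabytes` of stored data."""
    if petabytes < 0:
        raise ValueError("petabytes must be non-negative")
    return petabytes * factor


if __name__ == "__main__":
    # Illustrative volumes only; "hundreds of petabytes" per data center.
    for pb in (1, 10, 100, 300):
        tons = annual_storage_emissions(pb)
        print(f"{pb:>4} PB -> {tons:,.0f} tons CO2e per year")
```

Under this simple linear assumption, a facility holding 100 PB would correspond to roughly 23,500 tons of CO2e per year.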

Specific Impacts and Solutions for Data Storage

Cloud data storage, particularly for hot data, presents unique challenges and opportunities in reducing CO2e emissions. The efficiency of data storage systems directly impacts their energy consumption and, by extension, their carbon footprint. Improving efficiency in data storage begins with strategies that optimize cloud costs. Reducing costs across all cloud infrastructure, not just data storage, can significantly reduce carbon emissions as well.

Data compression and tiering are important strategies for reducing data-related emissions

Innovative solutions like data compression optimized for cloud data lakes can significantly reduce the amount of data that needs to be physically, as opposed to logically, stored and actively managed inside data centers, thereby decreasing energy use. Data compression not only helps reduce carbon footprint and cost but can also enhance performance for enterprise applications by speeding up data transfer, especially in cases where applications face input/output (I/O) bottlenecks, which slow down data processing.
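The difference between logical and physical size can be illustrated with a small sketch. It uses Python's standard-library zlib, a general-purpose lossless codec, purely as a stand-in; it is not Granica's ML-powered method, and the synthetic CSV rows and the helper name are assumptions made for the example.

```python
import zlib


def physical_size_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size / original size for a lossless zlib round trip."""
    if not data:
        raise ValueError("data must not be empty")
    compressed = zlib.compress(data, level)
    # Lossless means the original bytes are recovered exactly.
    if zlib.decompress(compressed) != data:
        raise RuntimeError("round trip failed")
    return len(compressed) / len(data)


# Synthetic, repetitive CSV rows standing in for data lake records (assumption).
rows = [f"{i % 50},sensor-{i % 7},ok,2024-04-18\n" for i in range(10_000)]
sample = ("id,device,status,date\n" + "".join(rows)).encode("utf-8")

ratio = physical_size_ratio(sample)
print(f"logical bytes:  {len(sample):,}")
print(f"physical bytes: {int(len(sample) * ratio):,}")
print(f"space saved:    {1 - ratio:.0%}")
```

The logical dataset is unchanged after decompression, but fewer bytes have to be kept on disk. How much is saved depends heavily on how repetitive the data is, so the ratio from this toy sample should not be read as a typical result.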

Another means to increase efficiency and thus reduce carbon emissions is data lifecycle management and tiering, where infrequently accessed, or “cold,” data is moved to storage classes with lower performance and cost, but also lower energy consumption. These advanced data management and storage optimization strategies can help achieve greener operations while continuing to meet the service-level agreements (SLAs) for internal workloads and external users.
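A tiering policy of this kind can be sketched as a simple rule on how recently an object was accessed. The tier names and the 30-day and 90-day thresholds below are hypothetical; actual storage classes and cutoffs depend on the cloud provider and on each workload's SLAs.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: (maximum days since last access, tier name).
# Anything older than the last threshold falls through to "cold".
TIER_RULES = [(30, "hot"), (90, "warm")]


def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier from the age of the most recent access."""
    age_days = (now - last_accessed).days
    for max_age_days, tier in TIER_RULES:
        if age_days <= max_age_days:
            return tier
    return "cold"


if __name__ == "__main__":
    now = datetime(2024, 4, 18, tzinfo=timezone.utc)
    for days in (5, 60, 400):
        tier = choose_tier(now - timedelta(days=days), now)
        print(f"last accessed {days:>3} days ago -> {tier}")
```

In practice, rules like these are usually expressed as lifecycle policies in the storage service itself rather than in application code; the sketch only shows the decision logic.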

Granica's Role in Sustainable Data Management

Imagine a world where our technological advancements bring us closer to nature rather than further from it. Granica's solutions, like our ML-powered data lake compression, are a step towards this vision—significantly reducing our ecological footprint while powering the data-driven innovations that enhance our daily lives.

Our lossless compression solutions are designed to tackle the inherent inefficiencies in typical data lake formats such as Parquet, JSON, CSV, LiDAR point clouds, and even images. These formats are typically “hot” and actively used for analytics and AI. For example, our customer Quantum Metric has reduced its annual carbon footprint, and its cloud data lake storage costs, by over 40%. Results depend on the compressibility of the data, and we’ve seen reductions of up to 80%, a massive saving in cost and associated carbon.

Conclusion: Granica is Committed to a Green, Sustainable Future

Every one of us has a pivotal role in shaping a sustainable future. By embracing innovative solutions like those from Granica, we can not only deliver on our collective business goals and make room for more advancements in technology, but also contribute to a healthier planet. It’s time to act boldly, ensuring a sustainable legacy for ourselves and the generations to come.

Sign up for a demo of Granica’s cutting-edge, AI-powered cloud cost optimization solutions to take the next step in your sustainability journey.

[1] https://www.statista.com/statistics/1186820/co2-emissions-commercial-aviation-worldwide/ 

[2] https://www.iea.org/energy-system/transport/aviation

[3,4] https://blog.huawei.com/2023/08/03/advice-for-cios-green-data-storage-net-zero-data-centers 
