As AI technology, use cases, and marketplace solutions evolve, the focus among developers and data engineers is shifting from model performance optimization to data optimization, a.k.a. data “readiness.” This shift is evidenced by the Gartner® report, A Journey Guide to Delivering AI Success Through ‘AI-Ready’ Data, which found that “over 75% of organizations state that AI-ready data remains one of their top five investment areas.” The report also predicts that “organizations that don’t enable and support their AI use cases through an AI-ready data practice will see over 60% of AI projects fail.”**
The efficacy and accuracy of a model’s inference (or decision-making) capabilities rely heavily on both the quantity and quality of data fed to it, but storing, processing, and transferring data at high volumes significantly drives up costs and environmental impacts while reducing query performance. Plus, large data sets are more likely to contain sensitive or toxic information, increasing AI privacy and ethics concerns.
Data teams are starting to address these problems by “shifting left” to build AI data readiness directly into pipelines and applications. Those at the forefront of AI development are integrating advanced tooling to help optimize costs and query performance while efficiently managing privacy concerns and mitigating ethical issues.
This blog provides an overview of two major problems with making data AI-ready and the current tools for overcoming them before describing a new solution that’s paving the future of AI data readiness.
For a deep dive into AI data readiness, read our Gartner report: A Journey Guide to Delivering AI Success Through ‘AI-Ready’ Data.
Data engineering leaders are tasked with managing ever-expanding AI data sets, but they have few levers to control costs or sustainability impacts. One major cost driver is cloud data lakes (or lakehouses), which store vast amounts of both structured and unstructured data that can easily balloon out of control. The biggest data-related cloud data lake costs come from storing that data and transferring it in and out.
Currently, data teams are trying to mitigate soaring costs with data tiering and archival strategies, which prioritize data lake/lakehouse data based on utility and frequency of required access. In theory, AI training data and other high-use data are kept in standard-tier cloud storage, which is the most expensive and easiest to access, while the rest goes into cheaper archival storage.
In reality, however, cloud architects prefer to keep the vast majority of their actual data in the standard tier where it’s fast, has the highest availability SLA, is highly accessible, and incurs fewer data transfer charges. As a result, data tiering and archival strategies usually aren’t effective at controlling AI data costs.
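Mechanically, tiering itself is simple to express; the hard part is deciding which data can tolerate slower, cheaper tiers. As a minimal sketch (the bucket name, prefix, and transition windows are hypothetical), a lifecycle rule that moves cold partitions to archival storage might look like this using boto3:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; adjust to your environment.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-data",
                "Filter": {"Prefix": "raw/2023/"},  # rarely queried partitions only
                "Status": "Enabled",
                "Transitions": [
                    # Move to infrequent-access after 30 days, then to archival storage after 90.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```

The rule is easy to write; the organizational difficulty described above is committing most data to it.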
Data teams, AI developers, and consumers all want query results at ever-accelerating rates. As data volumes continue to expand, however, models require more and more processing (a.k.a. compute) power to find the relevant information they need to respond accurately. Engineers are thus tasked with maintaining a difficult balance between improving model performance and keeping compute costs in check.
Data teams have a few techniques for striking this balance, although trade-offs exist for each of them.
Data compression reduces the physical bit size of data both at rest and in transit, which can speed up queries, especially for workloads bottlenecked by cloud storage and/or network throughput. Certain data lake file formats like Parquet can further improve query speed because they support compression algorithms such as Snappy, Gzip, LZ4, and ZSTD. For example, the team at InfluxData uses Parquet to improve compression efficiency for the InfluxDB platform. Compression can also help reduce carbon emissions, lowering the environmental impact of utilizing AI technology.
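As a minimal sketch of how codec choice is applied in practice (the file paths and sample table are purely illustrative, and the example assumes the pyarrow library), a Parquet file can be written with different compression algorithms like this:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative sample table; in practice this would come from your pipeline.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "event": ["click", "view", "click", "purchase"],
    "value": [0.0, 1.5, 0.0, 42.0],
})

# Write the same data with different codecs to compare on-disk size.
pq.write_table(table, "events_snappy.parquet", compression="snappy")
pq.write_table(table, "events_zstd.parquet", compression="zstd", compression_level=9)

# Readers don't need to know which codec was used; Parquet records it in the file metadata.
print(pq.ParquetFile("events_zstd.parquet").metadata.row_group(0).column(0).compression)
```

On a tiny table the difference is negligible, but on large columnar files ZSTD typically trades a modest amount of extra CPU for noticeably better ratios than Snappy.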
Cloud Data Lake File Formats

| File Format | Description | Used For |
| --- | --- | --- |
| Apache Parquet | A columnar file format within the Apache Hadoop ecosystem | Analytical queries that process a large number of rows but a smaller subset of columns |
| Apache Avro | A row-oriented Hadoop file format with JSON-defined schemas and built-in data serialization | Real-time data streams and data science processing |
| Apache ORC (Optimized Row Columnar) | A columnar file format for Hadoop workloads and Hive data | Applications that perform substantially more reads than writes |
| CSV (Comma-Separated Values) | A tabular file format that stores structured data in plain text (widely compatible with most platforms) | Structured data ingestion, transformation, and cross-platform analysis |
| JSON (JavaScript Object Notation) | A nested file format that’s widely used for API development | Smaller data sets and API integrations |
Although compressed data can be queried more compute-efficiently, the compression itself can require substantial amounts of compute power, so companies may not save as much as they hope for with this strategy. Data compression can also be difficult to implement and manage at scale, which may increase complexity and reduce operational efficiency.
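To make that trade-off concrete, a rough sketch like the one below (using only Python’s standard-library codecs on a synthetic payload, so the numbers are purely illustrative) times compression and reports the resulting ratio:

```python
import gzip
import lzma
import time

# Illustrative, highly repetitive payload; real lake files are orders of magnitude larger.
payload = b"user_id,event,value\n" + b"12345,click,0.0\n" * 200_000

for name, compress in [("gzip", gzip.compress), ("lzma", lzma.compress)]:
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"{name}: {elapsed:.3f}s, ratio {ratio:.1f}x")
```

Heavier codecs generally buy better ratios at the cost of more CPU time, which is exactly the compute bill described above.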
Data teams also use query enhancement techniques to streamline the information retrieval process and deliver faster results. Some of these strategies pre-date AI and are extremely familiar to SQL developers, such as annotating data with machine-readable labels and optimizing query language to avoid duplicate results.
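As one small example of the second technique (the table and column names are hypothetical, and the sketch assumes the duckdb Python package), rewriting a join-plus-DISTINCT as a semi-join avoids producing duplicate rows in the first place:

```python
import duckdb

# Hypothetical tables; in practice these would be Parquet files or lake tables.
con = duckdb.connect()
con.execute("CREATE TABLE users AS SELECT * FROM (VALUES (1, 'a'), (2, 'b')) AS t(user_id, name)")
con.execute("CREATE TABLE events AS SELECT * FROM (VALUES (1, 'click'), (1, 'view'), (2, 'click')) AS t(user_id, event)")

# Naive version: the join fans out to one row per event, then DISTINCT collapses them again.
naive = con.execute("""
    SELECT DISTINCT u.user_id, u.name
    FROM users u JOIN events e ON u.user_id = e.user_id
""").fetchall()

# Optimized version: a semi-join (EXISTS) never generates the duplicates to begin with.
optimized = con.execute("""
    SELECT u.user_id, u.name
    FROM users u
    WHERE EXISTS (SELECT 1 FROM events e WHERE e.user_id = u.user_id)
""").fetchall()

assert sorted(naive) == sorted(optimized)
```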
Applying these techniques to massive AI data sets can be extraordinarily time-consuming without the assistance of LLMs (large language models) and other artificial intelligence tools, but purchasing and utilizing these tools further drives up costs.
The ideal AI data readiness solution can optimize the compression of columnar lakehouse data in a compute-efficient and format-compatible manner, reducing costs while improving query performance. The Granica team has been developing just such a solution with the Crunch platform.
Granica Crunch provides intelligent compression optimization uniquely tailored for each data lake/lakehouse file. It uses a proprietary compression control system that leverages the columnar nature of modern analytics formats like Parquet to significantly improve compression ratios while still utilizing underlying open source compression algorithms such as zstd. With Granica Crunch, all processed files remain in their open, standards-based format, readable by applications without any changes. Crunch can shrink data storage and transfer costs by up to 60%, and early TPC-DS benchmarks show a 56% increase in query speeds.
Explore an interactive demo of the Granica Crunch AI data readiness platform to see its efficient, cost-optimizing compression capabilities in action.
Gartner: A Journey Guide to Delivering AI Success Through ‘AI-Ready’ Data, by Ehtisham Zaidi, Roxane Edjilali, 18 October 2024.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.