
AI Data Readiness: Key Considerations for IT Leaders

As AI technology, use cases, and marketplace solutions evolve, the focus among developers and data engineers is shifting from optimizing model performance to optimizing data, a.k.a. data “readiness.” This shift is evidenced by the Gartner® report, A Journey Guide to Delivering AI Success Through ‘AI-Ready’ Data, which found that “over 75% of organizations state that AI-ready data remains one of their top five investment areas.” The report also predicts that “organizations that don’t enable and support their AI use cases through an AI-ready data practice will see over 60% of AI projects fail.”**

The efficacy and accuracy of a model’s inference (or decision-making) capabilities rely heavily on both the quantity and quality of data fed to it, but storing, processing, and transferring data at high volumes significantly drives up costs and environmental impacts while reducing query performance. Plus, large data sets are more likely to contain sensitive or toxic information, increasing AI privacy and ethics concerns.

Data teams are starting to address these problems by “shifting left” to build AI data readiness directly into pipelines and applications. Those at the forefront of AI development are integrating advanced tooling to help optimize costs and query performance while efficiently managing privacy concerns and mitigating ethical issues. 

This blog provides an overview of two major obstacles to making data AI-ready and the current tools for overcoming them, then describes a new solution that’s shaping the future of AI data readiness.

AI data readiness problem #1: A lack of control over the costs and environmental impacts of ever-expanding data sets

Data engineering leaders are tasked with managing ever-expanding AI data sets, but they have few levers to control costs or sustainability impacts. One major cost driver is cloud data lakes (or lakehouses), which store vast amounts of both structured and unstructured data that can easily balloon out of control. The biggest sources of cloud data lake costs directly associated with data are:

  • Storage → This is typically the most significant cost driver, with expenses varying by how much data is stored, the storage tier (hot, cool, or cold), the storage type (object or block), and how frequently data is accessed or moved.
     
  • Transfer → Cloud storage providers typically charge egress fees when data is transferred out to a data pipeline or to a different region or service.

  • Management → The various services used to manage, monitor, and secure data also increase the cost of cloud data lakes.
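
To put these drivers in rough numbers, a simple back-of-envelope model can help. The sketch below is illustrative only: the per-GB rates are assumed placeholders, not any provider’s published pricing.

```python
# Back-of-envelope monthly cost model for a cloud data lake.
# The per-GB prices below are illustrative assumptions, not quotes
# from any specific provider -- substitute your own rate card.

STANDARD_PER_GB = 0.023   # hot/standard object storage, $/GB-month (assumed)
ARCHIVE_PER_GB = 0.002    # archive tier, $/GB-month (assumed)
EGRESS_PER_GB = 0.09      # cross-region / internet egress, $/GB (assumed)

def monthly_cost(hot_gb: float, archive_gb: float, egress_gb: float) -> float:
    """Estimate monthly storage + transfer spend for a simple two-tier layout."""
    storage = hot_gb * STANDARD_PER_GB + archive_gb * ARCHIVE_PER_GB
    transfer = egress_gb * EGRESS_PER_GB
    return storage + transfer

# Example: 500 TB kept hot, 1.5 PB archived, 50 TB egressed per month
print(f"${monthly_cost(500_000, 1_500_000, 50_000):,.0f} / month")
```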

Traditional Solution: Data tiering and archival strategies

Currently, data teams are trying to mitigate soaring costs with data tiering and archival strategies, which prioritize data lake/lakehouse data based on utility and frequency of required access. In theory, AI training data and other high-use data are kept in standard-tier cloud storage, which is the most expensive and easiest to access, while the rest goes into cheaper archival storage. 
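In its simplest form, this kind of tiering is expressed as object-storage lifecycle rules. The sketch below uses boto3 to transition a hypothetical raw/ prefix to colder storage classes over time; the bucket name, prefix, and day thresholds are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical lifecycle policy: keep recent training data in the standard
# class, then step colder as access frequency drops. The bucket, prefix, and
# day thresholds are illustrative assumptions.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-ai-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```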

In reality, however, cloud architects prefer to keep the vast majority of their data in the standard tier, where it’s fast to access, carries the highest availability SLA, and incurs fewer data transfer charges. As a result, data tiering and archival strategies usually aren’t effective at controlling AI data costs.

AI data readiness problem #2: Slow query speeds and increasing compute costs due to expanding data volumes

Data teams, AI developers, and consumers all want query results at ever-accelerating rates. As data volumes continue to expand, however, models require more and more processing (a.k.a., compute) power to find the relevant information they need to respond accurately. Engineers are thus tasked with maintaining a difficult balance between improving model performance and keeping compute costs in check. 

Traditional Solution: Data compression and query engine enhancements

Data teams have a few techniques for striking this balance, although trade-offs exist for each of them.

Data compression reduces the physical bit size of data both at rest and in transit, which can speed queries up, especially for workloads bottlenecked by cloud storage and/or network throughput. Certain data lake file formats like Parquet (summarized in the table below) can further improve query speed because they support compression algorithms such as Snappy, Gzip, LZ4, and ZSTD. For example, the team at InfluxData uses Parquet to improve compression efficiency for the InfluxDB platform. Compression can also help reduce carbon emissions, lowering the environmental impact of utilizing AI technology.
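
To see how codec choice plays out in practice, the sketch below writes the same synthetic Arrow table to Parquet with several of these codecs and compares the resulting file sizes (the column names and data are invented for the example):

```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Synthetic table standing in for a lakehouse dataset (columns are made up).
n = 1_000_000
table = pa.table({
    "event_id": np.arange(n),
    "user_id": np.random.randint(0, 50_000, n),
    "latency_ms": np.random.exponential(scale=120.0, size=n),
})

# Write the same data with different codecs and compare on-disk size.
for codec in ["none", "snappy", "gzip", "lz4", "zstd"]:
    path = f"/tmp/events_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(f"{codec:>6}: {os.path.getsize(path) / 1e6:.1f} MB")
```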

Cloud Data Lake File Formats

| File Format | Description | Used For |
| --- | --- | --- |
| Apache Parquet | A columnar file format within the Apache Hadoop ecosystem | Analytical queries that process a large number of rows but a smaller subset of columns |
| Apache Avro | A row-oriented file format for Hadoop with JSON-defined schemas and built-in data serialization | Real-time data streams and data science processing |
| Apache ORC (Optimized Row Columnar) | A columnar file format for Hadoop workloads and Hive data | Applications that perform substantially more reads than writes |
| CSV (Comma-Separated Values) | A tabular file format that stores structured data in plain text (widely compatible with most platforms) | Structured data ingestion, transformation, and cross-platform analysis |
| JSON (JavaScript Object Notation) | A nested file format that’s widely used for API development | Smaller data sets and API integrations |
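
Converting row-oriented ingest formats such as CSV into a columnar format like Parquet is often one of the first of these “shift-left” readiness steps. A minimal pyarrow sketch, with hypothetical file paths:

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Hypothetical paths; the CSV is assumed to hold raw ingested events.
raw_csv = "/data/ingest/events.csv"
columnar_out = "/data/lake/events.parquet"

# Read the CSV into an Arrow table, then persist it as ZSTD-compressed
# Parquet so downstream analytical queries scan only the columns they need.
table = pv.read_csv(raw_csv)
pq.write_table(table, columnar_out, compression="zstd")
```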

Although compressed data can be queried more compute-efficiently, the compression itself can require substantial amounts of compute power, so companies may not save as much as they hope for with this strategy. Data compression can also be difficult to implement and manage at scale, which may increase complexity and reduce operational efficiency.

Data teams also use query enhancement techniques to streamline information retrieval and deliver faster results. Some of these strategies pre-date AI and are extremely familiar to SQL developers, such as annotating data with machine-readable labels and rewriting queries to avoid producing duplicate results.
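
As a concrete example of that kind of query rewrite, the sketch below uses DuckDB to replace a join-plus-DISTINCT pattern with a semi-join via EXISTS, which avoids producing duplicate rows in the first place; the table and column names are invented for illustration.

```python
import duckdb

con = duckdb.connect()

# Toy tables standing in for lakehouse data (names are invented).
con.execute("CREATE TABLE users AS SELECT * FROM (VALUES (1, 'alice'), (2, 'bo'), (3, 'cy')) t(user_id, name)")
con.execute("CREATE TABLE events AS SELECT * FROM (VALUES (1, 'click'), (1, 'view'), (2, 'click')) t(user_id, action)")

# Join + DISTINCT: produces duplicate rows first, then deduplicates them.
slow = con.execute("""
    SELECT DISTINCT u.user_id, u.name
    FROM users u JOIN events e ON u.user_id = e.user_id
""").fetchall()

# Semi-join via EXISTS: never materializes the duplicate rows at all.
fast = con.execute("""
    SELECT u.user_id, u.name
    FROM users u
    WHERE EXISTS (SELECT 1 FROM events e WHERE e.user_id = u.user_id)
""").fetchall()

assert sorted(slow) == sorted(fast)
print(fast)
```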

Applying these techniques to massive AI data sets can be extraordinarily time-consuming without the assistance of LLMs (large language models) and other artificial intelligence tools, but purchasing and utilizing these tools further drives up costs.

What if a better AI data readiness solution is possible?

The ideal AI data readiness solution can optimize the compression of columnar lakehouse data in a compute-efficient and format-compatible manner, reducing costs while improving query performance. The Granica team has been developing just such a solution with the Crunch platform. 

Granica Crunch provides intelligent compression optimization uniquely tailored for each data lake/lakehouse file. It uses a proprietary compression control system that leverages the columnar nature of modern analytics formats like Parquet to significantly improve compression ratios while still utilizing underlying open source compression algorithms such as zstd. With Granica Crunch, all processed files remain in their open, standards-based format, readable by applications without any changes. Crunch can shrink data storage and transfer costs by up to 60%, and early TPC-DS benchmarks show a 56% increase in query speeds.

Efficient AI: FinOps for Data

Explore an interactive demo of the Granica Crunch AI data readiness platform to see its efficient, cost-optimizing compression capabilities in action. 

** Gartner: A Journey Guide to Delivering AI Success Through ‘AI-Ready’ Data, by Ehtisham Zaidi and Roxane Edjilali, 18 October 2024.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

Post by Granica
November 05, 2024