In Part 1 of this blog series, we saw that computer vision models can be trained on lossily compressed image data with minimal impact on model performance. In this part, we will show what can happen to model performance when we reallocate the freed-up physical storage space to additional training data.
The Opportunity
Suppose we have a fixed physical cloud storage allotment (or, equivalently, a financial budget) for training data over a given period of time, as is the case for many companies and machine learning teams. Company operations often generate large amounts of new data, but budget constraints typically mean that a significant fraction of it cannot be retained and used. This excess data is then permanently deleted, losing the potential benefits it could provide for machine learning models.
In this context, the image compression level determines how much of that otherwise-deleted image data we can keep within the budget: the more we compress each image, the more additional images we can store and use to improve our models.
The Experiment
As an illustrative example, we will use one of the datasets from Part 1, the Food101 dataset. This dataset is used for image classification tasks and contains images of 101 different types of food. We will consider 6 different JPEG XL compression levels, compressing images to Butteraugli distances[1] of 0 (the original JPEG), 2, 4, 6, 8, and 10. The more we compress, the more images we can keep within the same amount of physical cloud storage.
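As a concrete sketch of how such a compression sweep could be produced, the following uses cjxl, the reference JPEG XL encoder, whose -d flag sets the target Butteraugli distance. The directory layout and file names are hypothetical stand-ins, not the exact pipeline used in the experiment.

```python
import subprocess
from pathlib import Path

DISTANCES = [2, 4, 6, 8, 10]       # distance 0 means keeping the original JPEG
SRC_DIR = Path("food101/train")    # hypothetical location of the JPEG images
OUT_ROOT = Path("food101_jxl")

for d in DISTANCES:
    out_dir = OUT_ROOT / f"d{d}"
    for jpeg in SRC_DIR.glob("**/*.jpg"):
        target = out_dir / jpeg.relative_to(SRC_DIR).with_suffix(".jxl")
        target.parent.mkdir(parents=True, exist_ok=True)
        # cjxl's -d flag sets the target Butteraugli distance;
        # larger distances mean heavier compression.
        subprocess.run(["cjxl", str(jpeg), str(target), "-d", str(d)],
                       check=True)
```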
Consider a storage allotment of approximately 0.5 GB.[2] For the original JPEG images, we can fit 9,985 images into the budget; this is our baseline. At higher compression levels, each individual image takes up less space, allowing us to fit additional images into the budget. The following plot shows the number of images we can add to our baseline JPEG capacity at each compression level.
We can keep about 5.6 times as many images when we compress to a Butteraugli distance of 10 compared to when we leave images in their original format.
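Footnote [2] describes how these subsets are chosen in practice. As a minimal sketch of that selection, assuming one directory of compressed images per Butteraugli distance (directory names hypothetical), it could look like this:

```python
import random
from pathlib import Path

BUDGET_BYTES = 508 * 1024**2   # the ~508 MB allotment from footnote [2]

def select_subset(image_dir: Path, budget: int = BUDGET_BYTES) -> list[Path]:
    """Pick a random subset of files whose total size fits within the budget."""
    files = sorted(image_dir.glob("**/*.jxl"))
    random.Random(0).shuffle(files)        # fixed seed for reproducibility
    subset, used = [], 0
    for f in files:
        size = f.stat().st_size
        if used + size > budget:
            break
        subset.append(f)
        used += size
    return subset

# e.g. the number of distance-2 images that fit in the budget:
# len(select_subset(Path("food101_jxl/d2")))
```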
What do these different numbers of images at different image qualities mean for computer vision applications? We find that more images at higher compression levels can lead to better model performance. Consider a Vision Transformer model pre-trained on ImageNet-21k, which we fine-tune on each data subset. To keep the computation comparable, we maintain the same number of training iterations/steps for each model regardless of training data size (meaning that the data at distance 0 is trained for more epochs than the data at distance 10, though results for a constant number of epochs are very similar).
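As a rough sketch of this fine-tuning setup, assuming the Hugging Face transformers and datasets libraries: the checkpoint below is the public ImageNet-21k ViT, while the hyperparameters and data pipeline are illustrative stand-ins rather than the exact values used in the experiment.

```python
import torch
from datasets import load_dataset
from transformers import (Trainer, TrainingArguments,
                          ViTForImageClassification, ViTImageProcessor)

checkpoint = "google/vit-base-patch16-224-in21k"  # ImageNet-21k pre-training
processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(checkpoint, num_labels=101)

# Stand-in for one of the six compressed-and-budgeted subsets; here we simply
# load the public Food101 training split.
dataset = load_dataset("food101", split="train")

def preprocess(batch):
    # Resize and normalize images into the pixel_values tensor the ViT expects.
    inputs = processor([img.convert("RGB") for img in batch["image"]],
                       return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

dataset = dataset.with_transform(preprocess)

def collate(examples):
    return {
        "pixel_values": torch.stack([e["pixel_values"] for e in examples]),
        "labels": torch.tensor([e["labels"] for e in examples]),
    }

args = TrainingArguments(
    output_dir="vit-food101",
    max_steps=5_000,                 # fixed step count; epoch count then varies
    per_device_train_batch_size=32,  # illustrative hyperparameters
    learning_rate=2e-5,
    remove_unused_columns=False,     # keep the raw "image" column for the transform
)

Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collate).train()
```

Fixing max_steps rather than the number of epochs is what keeps compute comparable across subsets: smaller subsets simply cycle through their data more times.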
The Result
The result is 6 models, each trained on a different number of examples at a different image quality.
Within our 508 MB allotment of physical storage (or equivalent financial budget) for training data, having more images, even though they are more heavily compressed, improves this model's performance.
Without increasing the physical storage in use (and thus the cloud costs associated with that storage), we can increase model accuracy by more than 10 percentage points.
Here, having more samples is more important to model performance than any change in image quality resulting from compression.
Wrapping up
Undoubtedly, different models will see different scaling patterns as we add more training samples. At the lower compression levels, this model is in the data-poor regime, with the original JPEG images providing fewer than 100 images per class, which leaves substantial room for improvement from additional samples. More generally, as state-of-the-art models grow in size and training needs, an increasing volume of data is needed to escape the data-poor regime and achieve the best model performance. Lossy compression can help address these growing data needs by increasing training data volume and opening up the possibility of using more sophisticated models, all without requiring more physical cloud storage or a bigger financial budget.
Of course, many teams build recommender systems and train models on tabular data, typically stored in data lakehouses, and this task requires full data fidelity. Granica Crunch, our data lakehouse-native compression service, losslessly optimizes and shrinks the physical size of columnar data files (such as Apache Parquet) by up to 60%, reducing monthly cloud storage costs by the same percentage. The resulting smaller physical files not only reduce at-rest costs; they also reduce the cost (and time) of transferring data across cloud regions, addressing AI-related compute scarcity, compliance, disaster recovery, and other use cases requiring bulk data transfers. Even better, smaller files speed up query performance and reduce data loading time when training models, leading to faster and more cost-effective AI development.
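Granica Crunch itself is a proprietary service, but purely as a generic illustration of lossless columnar recompression, the following sketch rewrites a Parquet file with pyarrow's zstd codec and reports the size change (file names are hypothetical):

```python
import os
import pyarrow.parquet as pq

src, dst = "events.parquet", "events_zstd.parquet"

table = pq.read_table(src)
# Rewriting is lossless: the table read back from dst is identical to src.
pq.write_table(table, dst, compression="zstd", compression_level=9)

before, after = os.path.getsize(src), os.path.getsize(dst)
print(f"{before} -> {after} bytes ({1 - after / before:.0%} smaller)")
```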
To learn more about how to use completely lossless lakehouse-native data compression to improve AI and ML, check out our Granica Crunch product page.
[1] Recall from Part 1 that the Butteraugli distance is a perceptual distance measure that indicates degradation due to lossy compression, where higher Butteraugli distances indicate lower image quality.
[2] In practice, we work in reverse. When we compress the full training data set to the maximum compression level, it is approximately 508 MB in size. We consider this size to be our training data storage budget. Working backwards to lower compression levels, we select the largest possible subset of the training data that fits within this 508 MB limit at each Butteraugli distance.
December 05, 2023