Training Recommender System Models for Efficiency and Scale

Shortly before the holidays, Rahul Ponnala, Granica’s CEO, invited me to deliver a talk as part of Granica’s tech talk series. This article summarizes that session, in which I discussed the unique challenges and the solutions we’ve developed to train recommender system models at scale.

At Pinterest, recommender systems form the backbone of critical product features like the Home feed and Related Pins. These systems drive both user engagement and business performance, including ad revenue. While large language models (LLMs) and generative AI dominate the headlines, recommender systems remain indispensable, offering unique challenges and opportunities, particularly at web scale.

This article explores how we tackle these challenges, focusing on the efficiency and scalability of training recommender system models, especially under constraints imposed by the large-scale nature of our data.

The Nature of Recommender Systems

Unlike generative models, recommender models predict the probability of user actions (e.g., clicks) from structured, tabular data. These models train on large datasets, with each training run consuming more than 100 terabytes of data, and must serve predictions at sub-50 millisecond latencies during inference. Their training pipelines are highly demanding, requiring optimization across the entire data path.

Key Challenges in Training Recommender Models

  1. Data Intensity: Recommender models are data-intensive, requiring large-scale data preprocessing and frequent feature updates. Unlike GenAI workloads, their data-to-compute ratio is skewed heavily toward data.
  2. Pipeline Bottlenecks: Optimizing training throughput requires addressing bottlenecks across the pipeline—from file I/O and CPU preprocessing to GPU utilization.
  3. Infrastructure Constraints: Cloud hardware imposes fixed CPU-to-GPU ratios, complicating scalability for data-heavy workflows.
  4. Iterative Nature: Machine learning workflows are iterative and require fast prototyping of features and models, which demands flexibility in the development environment.

Optimizing Training Throughput

1. Optimizing the Data Pipeline

A key part of our strategy involves improving the data loading pipeline. We focus on maximizing examples per second, which directly reduces job runtime and costs. Key techniques include:

  • Data Format Optimizations: For sequence data, we adopted PyTorch’s sparse CSR format, significantly reducing memory, data-copy, and I/O overhead. This compression improved throughput by up to 2x (see the sketch after this list).
  • Data Storage Optimizations: By sorting dataset rows by ID so that repeated features sit together, we improved Parquet compression efficiency and reduced file sizes, gaining an additional 30% throughput improvement.
  • Data Loading Optimizations: Moving data preprocessing to a distributed CPU infrastructure powered by Ray allowed us to offload expensive CPU-bound operations, breaking free of fixed CPU-to-GPU ratios.
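To make the CSR idea concrete, here is a minimal sketch of encoding variable-length engagement sequences as a PyTorch sparse CSR tensor. The sequence values and shapes are purely illustrative, not our production schema:

```python
import torch

# Three users' engagement sequences of different lengths (illustrative IDs).
sequences = [
    [101.0, 102.0, 103.0],
    [104.0],
    [105.0, 106.0],
]

# A dense layout pads every row to the longest sequence; CSR stores only the
# real entries plus two small index arrays.
values = torch.tensor([v for seq in sequences for v in seq])
crow_indices = torch.tensor([0, 3, 4, 6])  # row i spans values[crow[i]:crow[i+1]]
col_indices = torch.tensor([j for seq in sequences for j in range(len(seq))])

csr = torch.sparse_csr_tensor(
    crow_indices, col_indices, values,
    size=(len(sequences), max(len(s) for s in sequences)),
)
print(csr.to_dense())  # materialize the padded view only when a kernel needs it
```

Because CSR carries no padding, long-tail sequence features avoid the wasted bytes that a dense layout would otherwise drag through memory, copies, and file I/O.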

2. Leveraging Ray for Distributed Data Loading

We implemented a Ray-based distributed data loader to address bottlenecks in training setup. This approach enabled us to:

  • Offload CPU-intensive tasks like data filtering, batching, and tensor conversion to a distributed cluster of CPU nodes.
  • Compress data during network transfers between CPU and GPU nodes using ZSTD, substantially reducing bandwidth usage (see the sketch after this list).
  • Introduce a “compact batch” format, streamlining data movement and unpacking on the GPU side, resulting in a 5x reduction in Python unpickling overhead.
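As an illustration of the transfer path, here is a minimal, hypothetical sketch of ZSTD-compressing a pickled batch on a CPU worker and unpacking it on the trainer side, using the open-source zstandard package. The batch contents and helper names are assumptions for the example, not our actual loader code:

```python
import pickle

import numpy as np
import zstandard  # pip install zstandard

def pack_batch(batch: dict) -> bytes:
    # Serialize once, then compress, before the batch leaves the CPU worker.
    payload = pickle.dumps(batch, protocol=pickle.HIGHEST_PROTOCOL)
    return zstandard.ZstdCompressor(level=3).compress(payload)

def unpack_batch(blob: bytes) -> dict:
    # Decompress and deserialize on the GPU trainer side.
    return pickle.loads(zstandard.ZstdDecompressor().decompress(blob))

# Illustrative batch: low-cardinality ID columns compress especially well.
batch = {
    "pin_category": np.random.randint(0, 32, size=4096),
    "labels": np.random.randint(0, 2, size=4096),
}
blob = pack_batch(batch)
print(f"{len(pickle.dumps(batch))} bytes -> {len(blob)} bytes after ZSTD")
assert np.array_equal(unpack_batch(blob)["labels"], batch["labels"])
```

Packing the whole batch into a single payload before pickling also reflects the compact-batch idea: the GPU side performs one unpickle per batch rather than many small ones.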

3. Developer Iteration-Friendly Workflows

To address the iterative nature of ML workflows, we adopted Ray Data, which allows for user-defined functions (UDFs) in Python (a sketch follows the list below). This framework:

  • Simplifies last-mile feature engineering directly in the training pipeline.
  • Eliminates the need for separate preprocessing steps, reducing time between feature experimentation and model training.
  • Supports dynamic scaling of CPU resources to handle complex feature transformations efficiently.
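For example, a last-mile transform can be expressed as a plain Python UDF and applied with Ray Data’s map_batches. The dataset path and the impressions column below are placeholders, not our real schema:

```python
import numpy as np
import ray

def add_last_mile_features(batch: dict) -> dict:
    # Illustrative last-mile transform: log-scale a raw count feature.
    batch["log_impressions"] = np.log1p(batch["impressions"])
    return batch

ds = (
    ray.data.read_parquet("s3://example-bucket/training-data/")  # placeholder path
    .map_batches(add_last_mile_features, batch_format="numpy")   # runs on CPU workers
)

# Stream the transformed batches straight into the training loop.
for batch in ds.iter_torch_batches(batch_size=4096):
    ...  # model forward/backward pass goes here
```

Because the UDF executes on the distributed CPU pool, adding or changing a feature transform is a code edit in the training job rather than a separate preprocessing pipeline.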

Achieving Scalable ML Training

Unified Development Interface

To enhance developer velocity, we consolidated our workflows into a single, Python-centric interface using Ray. This streamlined approach minimizes the learning curve and removes the overhead of managing multiple frameworks.

On-the-Fly Feature Backfill

We are introducing techniques like in-trainer Iceberg bucket joins, which enable seamless, on-the-fly backfill of features from the feature store into training datasets. This eliminates the need to prepopulate large datasets for every new feature experiment, reducing both storage cost and iteration time.
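The key property of a bucket join is that both tables are hash-partitioned on the join key, so bucket k of the training data only ever needs bucket k of the feature table. Below is a hedged sketch of the per-bucket join step, using pyarrow to stand in for our in-trainer Iceberg integration; the file layout and column names are assumptions:

```python
import pyarrow.parquet as pq

# Both tables are hash-bucketed on the join key, so bucket k of the training
# data only needs bucket k of the feature store.
bucket_id = 7  # each trainer worker handles its own subset of buckets

examples = pq.read_table(f"s3://example/training-data/bucket={bucket_id}/")          # placeholder
new_feature = pq.read_table(f"s3://example/feature-store/featX/bucket={bucket_id}/")  # placeholder

# Attach the backfilled feature on the fly instead of materializing a new
# copy of the full training dataset.
joined = examples.join(new_feature, keys="pin_id", join_type="left outer")
```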

Iterative Optimization

Our approach ensures that engineers can:

  • Quickly test feature hypotheses with minimal overhead.
  • Scale workloads horizontally across distributed CPU nodes.
  • Maintain high throughput and cost efficiency, even under demanding data-intensive scenarios.

Results and Impact

Our optimized infrastructure has delivered remarkable results:

  • 5x Training Throughput: By addressing model and data pipeline inefficiencies, we scaled Home feed ranking from 80,000 to 400,000 examples per second.
  • Better GPU Utilization: Faster data pipelines keep GPUs fed with data rather than idle, improving overall training efficiency.
  • Improved Developer Productivity: Unified workflows and streamlined preprocessing have reduced iteration times, enabling faster innovation.

Closing Thoughts

Training recommender system models at scale requires a holistic approach to training pipeline optimization, infrastructure scalability, and developer experience. By focusing on these aspects, we have not only enhanced our training efficiency but also empowered our teams to iterate faster and deliver impactful business results. 

While the challenges of recommender systems differ significantly from those of GenAI, they present equally rewarding opportunities for innovation, as we continue to push the boundaries of what’s possible in scaling machine learning for real-world applications.

If you have questions or would like to learn more about our work, feel free to reach out!

(Editor's note: Huge thanks to Saurabh for his fantastic talk and guest blog!)