Shortly before the holidays, Rahul Ponnala, Granica’s CEO, invited me to deliver a talk as part of Granica’s tech talk series. This article is a summary of that session, in which I discussed the unique challenges and the solutions we’ve developed for training recommender system models at scale.
At Pinterest, recommender systems form the backbone of critical product features like the Home Feed and Related Pins. These systems drive both user engagement and business performance, including ad-based revenue streams. While large language models (LLMs) and generative AI are trending topics, recommender systems remain indispensable, offering unique challenges and opportunities, particularly at web scale.
This article explores how we tackle these challenges, focusing on the efficiency and scalability of training recommender system models, especially under constraints imposed by the large-scale nature of our data.
Unlike generative models, recommender models predict the probability of user actions (e.g., clicks) from structured, tabular data. These models train on large datasets—a single training run can consume more than 100 terabytes of data—and must serve predictions at sub-50-millisecond latencies during inference. The training pipelines for these models are highly demanding, requiring optimization across the entire data pipeline.
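To make the shape of the problem concrete, here is a minimal PyTorch sketch of this kind of model: categorical IDs become learned embeddings, dense features are concatenated alongside them, and an MLP head outputs a click probability. The feature names and dimensions are illustrative assumptions, not Pinterest’s production architecture.

```python
import torch
import torch.nn as nn

class ClickModel(nn.Module):
    """Toy click-probability model over tabular features."""

    def __init__(self, num_ids: int = 10_000, emb_dim: int = 16, num_dense: int = 8):
        super().__init__()
        # Categorical IDs (e.g., pin or user IDs) become learned embeddings.
        self.emb = nn.Embedding(num_ids, emb_dim)
        # Embeddings are concatenated with dense features and fed to an MLP.
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + num_dense, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, ids: torch.Tensor, dense: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.emb(ids), dense], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # P(click) per example
```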
A key part of our strategy involves improving the data loading pipeline. We focus on maximizing examples per second, which directly reduces job runtime and cost. The key techniques are described below.
We implemented a Ray-based distributed data loader to address bottlenecks in our training setup. By scaling data loading and preprocessing out across a pool of CPU workers, rather than confining them to the trainer hosts, we keep the GPUs fed with ready batches.
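As a rough sketch of the pattern (not our exact implementation; the path, shard count, and batch size are assumptions), Ray Data can decode shards on CPU workers and stream ready torch batches to each trainer process:

```python
import ray

# Illustrative S3 path; in practice this would point at the training shards.
ds = ray.data.read_parquet("s3://training-data/daily/")

# Split the stream across trainer processes. Reading, decoding, and
# shuffling happen on Ray CPU workers, not on the GPU hosts.
shards = ds.streaming_split(n=8, equal=True)

# Inside trainer rank 0 (ranks 1-7 consume shards[1:] concurrently):
for batch in shards[0].iter_torch_batches(batch_size=4096, prefetch_batches=4):
    ...  # forward/backward step on ready-made torch tensors
```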
To support the iterative nature of ML workflows, we adopted Ray Data, which lets engineers express preprocessing as user-defined functions (UDFs) in plain Python and apply them across the cluster.
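For example, a preprocessing step can be written as an ordinary Python function and applied with Ray Data's map_batches; the column names here are hypothetical:

```python
import numpy as np
import ray

def add_derived_features(batch: dict) -> dict:
    # Hypothetical UDF: derive a log-scaled copy of a raw count column.
    batch["impressions_log"] = np.log1p(batch["impressions"])
    return batch

ds = ray.data.read_parquet("s3://training-data/daily/")  # illustrative path
# Ray Data streams the UDF over the dataset in parallel, batch by batch.
ds = ds.map_batches(add_derived_features, batch_format="numpy")
```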
Unified Development Interface
To enhance developer velocity, we consolidated our workflows into a single, Python-centric interface using Ray. This streamlined approach minimizes the learning curve and removes the overhead of managing multiple frameworks.
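As a sketch of what this looks like (assuming Ray Train's TorchTrainer; the path, worker count, and batch size are illustrative), a single Python script declares data loading and distributed training against one Ray cluster:

```python
import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config: dict) -> None:
    # Each training worker pulls its shard of the dataset declared below.
    shard = ray.train.get_dataset_shard("train")
    for _ in range(config["epochs"]):
        for batch in shard.iter_torch_batches(batch_size=4096):
            ...  # model forward/backward (omitted)

# Data loading and distributed training live in one Python script,
# on one Ray cluster. The path and worker count are illustrative.
ds = ray.data.read_parquet("s3://training-data/daily/")
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 1},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    datasets={"train": ds},
)
trainer.fit()
```

Because everything is expressed in Python against one framework, the same script can be debugged locally and then scaled out, which is where the reduction in learning curve and framework overhead comes from.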
On-the-Fly Feature Backfill
We are introducing techniques like in-trainer Iceberg bucket joins, enabling seamless, on-the-fly backfill of features from the feature store into training datasets. This eliminates the need to prepopulate large datasets for every new feature experiment, reducing storage cost and iteration time.
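The sketch below shows the bucket-join idea in miniature, with pandas standing in for the trainer’s data loader and a simple modulo standing in for Iceberg’s bucket transform: because the training log and the feature table are bucketed on the same key, each bucket pair joins locally, with no global shuffle and no prematerialized joined dataset.

```python
import pandas as pd

NUM_BUCKETS = 32  # assumption: both tables share the same bucket spec

def bucket_of(item_id: int) -> int:
    # Stand-in for Iceberg's bucket transform (which uses murmur3 hashing).
    return item_id % NUM_BUCKETS

def backfill_bucket(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    # Rows for a given item_id land in the same bucket on both sides, so
    # each bucket pair joins locally inside the trainer's data loader --
    # no global shuffle, no prematerialized joined dataset.
    return labels.merge(features, on="item_id", how="left")

# Toy usage: item_ids 1, 33, and 65 all fall in bucket 1.
labels = pd.DataFrame({"item_id": [1, 33, 65], "click": [1, 0, 1]})
features = pd.DataFrame({"item_id": [1, 33, 65], "new_feature": [0.2, 0.7, 0.1]})
print(backfill_bucket(labels, features))
```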
Iterative Optimization
Our approach ensures that engineers can iterate quickly on new features and models without rebuilding datasets or data pipelines for each experiment.
Our optimized infrastructure has delivered substantial gains in training throughput, cost efficiency, and developer velocity.
Training recommender system models at scale requires a holistic approach to training pipeline optimization, infrastructure scalability, and developer experience. By focusing on these aspects, we have not only enhanced our training efficiency but also empowered our teams to iterate faster and deliver impactful business results.
While the challenges of recommender systems differ significantly from those of GenAI, they present equally rewarding opportunities for innovation, as we continue to push the boundaries of what’s possible in scaling machine learning for real-world applications.
If you have questions or would like to learn more about our work, feel free to reach out!
(Editor's note: Huge thanks to Saurabh for his fantastic talk and guest blog!)