April 15, 2024
Pinterest Engineering

Yulin Lei | Senior Machine Learning Engineer; Kaili Zhang | Staff Machine Learning Engineer; Sharare Zahtabian | Machine Learning Engineer II; Randy Carlson | Machine Learning Engineer I; Qifei Shen | Senior Staff Machine Learning Engineer

Pinterest strives to deliver high-quality ads and maintain a positive user experience. The platform aims to show ads that align with the user's interests and intentions while also providing inspiration and discovery. The Ads Engagement Modeling team at Pinterest plays a crucial role in delivering effective advertising campaigns and helping businesses reach their target audience in a meaningful way. The goal of engagement modeling is to show users the most relevant and engaging ads based on their interests and preferences. To deliver a personalized and enjoyable ad experience, the Engagement Modeling team built deep neural network (DNN) models that continuously learn and adapt to user feedback and behavior, ensuring that the ads shown are highly targeted and valuable to the user.

Personalized recommendation is essential in the ads recommendation system because it can better capture users' interests, connect users with compelling products, and keep them engaged with the platform. To make ads click-through rate (CTR) predictions more personalized, our team adopted users' real-time behavior histories and applied deep learning algorithms to recommend appropriate ads to users.

In this blog post, we will mainly discuss how we adopted user sequence features and the follow-up optimizations:

  • Designed the sequence features
  • Leveraged Transformers for sequence modeling
  • Improved serving efficiency with half-precision inference

We will also share how we improved model stability with Resilient Batch Norm.

To help the engagement models learn users' feedback and interests, we developed user sequence features, which include users' real-time and historical engagement events and the related information. We defined sequence features along two main aspects: feature types and feature attributes.

Feature Types: Users typically interact with both organic content and promoted Pins, and both indicate users' intent and interest. Organic Pins reflect users' general interests, while promoted Pins reflect users' interest in sales, products, and so on. We therefore created two user sequence features: one with all engaged Pins, and one with ads only. It turned out that both sequence features yielded sizable gains in offline model performance. We also developed user search sequence features, which are likewise very informative and useful, especially for search ads.

Feature Attributes: Besides which sequence features to build, it is also important to decide what to include in each sequence. A sequence of user activity is a popular design choice, and our user sequence is essentially a sequence of user-engaged event representations, including timestamps, item representations, ID features, and taxonomy features. At Pinterest, a pre-trained embedding (GraphSage) is commonly used as the item representation in many models, and we use it as the item representation in our sequence features as well.

Once we have the user sequences, we explore a range of architectures to develop effective sequence modeling techniques.

Transformer [1]: One widely used approach is the Transformer, which serves as our baseline. We start with a single-layer, single-head Transformer and include position embeddings based on the time delta of each event in the sequence. We find that increasing the number of layers improves performance, while increasing the number of heads does not provide additional gains.

Figure 1: Transformer Architecture
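A minimal sketch of this baseline in PyTorch: the class name, dimensions, and the bucketized time-delta scheme are illustrative assumptions; only the overall shape (one encoder layer, one head, time-delta position embeddings added to event representations) follows the description above.

```python
import torch
import torch.nn as nn

class SequenceTransformer(nn.Module):
    """Single-layer, single-head Transformer over a user event sequence,
    with position embeddings derived from each event's time delta."""

    def __init__(self, dim=64, num_heads=1, num_layers=1, num_time_buckets=32):
        super().__init__()
        # A bucketized time-delta embedding stands in for positional encoding.
        self.time_embedding = nn.Embedding(num_time_buckets, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, events, time_deltas):
        # events: (batch, seq_len, dim) event representations
        # time_deltas: (batch, seq_len) bucketized time-since-event indices
        x = events + self.time_embedding(time_deltas)
        return self.encoder(x)  # (batch, seq_len, dim)
```

Stacking more `num_layers` is then a one-argument change, matching the layer-scaling experiment described above.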

Feature Connection: We also experiment with different methods for connecting features within each event, such as concatenation and sum. Both approaches prove effective in certain scenarios. The advantage of the sum connection is that it lets us control the dimensionality of each event, making the self-attention computation in the Transformer faster when using a small fixed dimension.
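The two connection styles can be sketched as follows; the feature widths and the shared dimension are hypothetical, chosen only to show why the sum connection keeps the per-event dimension fixed.

```python
import torch
import torch.nn as nn

# Hypothetical per-event features of different widths.
dims = [48, 32, 16]
shared_dim = 32  # small fixed dimension used by the sum connection

def connect_concat(features):
    # Concatenation: event dimension is the sum of feature dims (here 96),
    # so self-attention cost grows with every added feature.
    return torch.cat(features, dim=-1)

projections = nn.ModuleList(nn.Linear(d, shared_dim) for d in dims)

def connect_sum(features):
    # Sum: project each feature to shared_dim first, so the event stays
    # small and self-attention over events is cheaper.
    return sum(proj(f) for proj, f in zip(projections, features))
```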

Extra Feature Interaction: A common practice when using a Transformer to model a user sequence is to first embed the entire sequence into a vector, then use this vector to interact with other features. However, early-stage feature interaction is essential for ranking models. Thus, we introduce extra feature interactions between the entire sequence and the user- and Pin-side representations. We compute the cosine similarity between the additional features and each event and use the similarities as event attributes. We also incorporate the user- and Pin-side representations directly into the self-attention calculations.
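The cosine-similarity attributes can be sketched as below; the function name is hypothetical, and appending the similarity as an extra channel of each event is one plausible reading of "use them as attributes for events."

```python
import torch
import torch.nn.functional as F

def add_similarity_attributes(events, side_features):
    # events: (batch, seq_len, dim)
    # side_features: list of (batch, dim) tensors, e.g. the user-side
    # and Pin-side representations.
    sims = [F.cosine_similarity(events, f.unsqueeze(1), dim=-1)  # (batch, seq_len)
            for f in side_features]
    # Append each per-event similarity as an extra attribute channel.
    return torch.cat([events] + [s.unsqueeze(-1) for s in sims], dim=-1)
```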

Sum Pooling: In terms of pooling techniques, we experiment with sum pooling, which is traditionally used in user sequence modeling because of its efficiency. We also develop a new approach called interval sum pooling, where we divide the sequence into several intervals and apply sum pooling to each interval. The results are then concatenated to generate the final representation of the sequence. In some scenarios, interval sum pooling outperforms the Transformer baseline.

Figure 2: Sum Pooling
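Interval sum pooling reduces to a reshape, a per-interval sum, and a concatenation. A sketch, assuming equal-length intervals (how the post splits uneven sequences is not specified):

```python
import torch

def interval_sum_pooling(events, num_intervals):
    # events: (batch, seq_len, dim); assumes seq_len is divisible
    # by num_intervals.
    batch, seq_len, dim = events.shape
    chunks = events.view(batch, num_intervals, seq_len // num_intervals, dim)
    pooled = chunks.sum(dim=2)                          # sum within each interval
    return pooled.reshape(batch, num_intervals * dim)   # concatenate intervals
```

With `num_intervals=1` this degenerates to plain sum pooling over the whole sequence.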

Deep Interest Network (DIN) [2]: Although we also explore DIN, a popular architecture introduced in 2018, we find that it does not surpass the performance of the previously mentioned models.

Long-Short Interest: Recognizing that users' long-term and short-term interests may differ, we model both aspects separately. The full sequence represents long-term interests, while the latest eight events are treated as short-term interests. For the short-term sequences, we apply a lightweight attention mechanism similar to DIN. This allows us to capture users' latest interest shifts while still considering their longer-term patterns.

Figure 3: Long-Short Interest Module
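The short-term branch can be sketched as a DIN-style target attention over the latest eight events; the dot-product scoring below is an assumption (DIN itself uses a small MLP for the attention scores), and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def short_term_interest(events, candidate, k=8):
    # events: (batch, seq_len, dim) full user sequence
    # candidate: (batch, dim) representation of the candidate ad/Pin
    recent = events[:, -k:, :]                           # latest k events
    # Lightweight DIN-style attention: score each recent event against
    # the candidate, then take the weighted sum.
    scores = (recent * candidate.unsqueeze(1)).sum(-1)   # (batch, k)
    weights = F.softmax(scores, dim=-1)
    return (weights.unsqueeze(-1) * recent).sum(1)       # (batch, dim)
```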

Overall, by combining different architectures in various online production models, we achieve significant performance improvements across all scenarios.

The new architecture has more modules and larger layers, making it more expensive to serve. While there are many opportunities for optimization, one of the most notable is mixed precision inference.

The GPUs we use for serving have tensor cores. Tensor cores are specialized for one thing: fused matrix multiply and add, but only with certain data types. Our current models use PyTorch's default float32 data type, but tensor cores do not operate on it. To get an inference speedup, we need to use a lower-precision data type, of which PyTorch offers two easy options: float16 and bfloat16. Both of these data types use 16 bits instead of 32 to represent a number, but they make different tradeoffs between range and precision. Float16 has a balanced reduction in both range and precision, while bfloat16 has nearly the same range as float32 but much-reduced precision. We wanted to find out which of these data types performs better in our model and make sure that it is stable.

Because both 16-bit types have lower precision, we want to keep as much of our model as possible in float32 so as not to risk prediction quality, while still getting good reductions in inference time. We found that most of the largest layers had room for improvement, while many of the smaller layers did not affect inference time enough to make a difference.

For these larger layers, we tried both data types. The main pitfall of float16 is that, due to the reduced range, it is easy for the model to overflow to "infinity." We found that one of our main layers, the DCNv2 cross layer, sometimes overflowed during training with float16. This could be mitigated by tuning some hyperparameters (e.g. weight decay), but a slight risk would still remain, and a failure mode of "complete failure, no score predicted" is not ideal.

The main pitfall of bfloat16 is that, due to the reduced precision, the model may make marginally worse predictions. Empirically, we found that our model handles this just fine: there was no reduction in model accuracy. There is also the benefit of a better failure mode; "degraded prediction" is preferable to "no prediction." Based on these results, we selected bfloat16 for the large layers of our model.
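One standard way to get this behavior in PyTorch is autocast, which runs matmul-heavy layers in the lower precision while keeping precision-sensitive ops in float32. This is a generic sketch, not Pinterest's serving code: the model and function names are hypothetical, and the post describes hand-picking specific large layers rather than blanket autocasting.

```python
import torch
import torch.nn as nn

def predict_bf16(model, features):
    # Autocast executes matmul-heavy layers in bfloat16 (the tensor-core
    # friendly path) and leaves the rest in float32. device_type="cpu" is
    # only so this runs anywhere; serving GPUs would use "cuda".
    with torch.inference_mode(), torch.autocast(device_type="cpu",
                                                dtype=torch.bfloat16):
        return model(features)

# Hypothetical stand-in for the large ranking layers.
ranker = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 1))
```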

Finally, there was the benchmarking. In offline testing, we found a 30% reduction in model inference time, with the same prediction accuracy. This inference time reduction translated well into production, and we received a significant reduction in infrastructure costs for our models.

Improving the stability and training speed of deep learning models is a crucial task. To address this challenge, Batch Normalization (Batch Norm) has become a popular normalization method used by many practitioners. At Pinterest, we leverage Batch Norm together with other normalization techniques like min-max clipping, log norm, and layer norm to effectively normalize our input data. However, we have encountered cases where Batch Norm itself can introduce model instability.

Let's take a closer look at the formula for Batch Norm and its underlying process during the forward pass.

Batch Norm has two learnable parameters, namely beta and gamma, along with two non-learnable parameters, the mean moving average and the variance moving average. Here's how the Batch Norm layer operates:

  1. Calculate Mean and Variance: For every activation vector, compute the mean and variance of all the values in the mini-batch.
  2. Normalize: Using the corresponding mean and variance, calculate the normalized value for each activation feature vector.
  3. Scale and Shift: Multiply the normalized values by a factor, gamma, and add a factor, beta, to them.
  4. Moving Average: Maintain an exponential moving average of the mean and variance.
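The four steps above can be sketched directly (simplified relative to `nn.BatchNorm1d`, which uses the unbiased variance for its running statistics):

```python
import torch

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       momentum=0.1, eps=1e-5):
    # x: (batch, features)
    mean = x.mean(dim=0)                       # 1. per-feature mean
    var = x.var(dim=0, unbiased=False)         # 1. per-feature variance
    y = (x - mean) / torch.sqrt(var + eps)     # 2. normalize
    out = gamma * y + beta                     # 3. scale and shift
    # 4. exponential moving averages, updated in place
    running_mean.mul_(1 - momentum).add_(momentum * mean)
    running_var.mul_(1 - momentum).add_(momentum * var)
    return out
```

The `eps` term is the only guard here: if `var` is exactly zero for a column, the normalized values are driven by `1 / sqrt(eps)`, which is the explosion discussed next.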

However, a problem arises when the variance in step 2 becomes extremely small or even zero. In such scenarios, the normalized value, y, becomes abnormally large, leading to a value explosion within the model. Several common causes of this extremely small variance include stale or delayed feature values, feature absence, and distribution shifts with low coverage. To handle these issues, we typically fill zeroes or use default values in the affected scenarios; consequently, the variance computed in step 1 becomes zero. While increasing the mini-batch size and shuffling at the row level can help mitigate this problem, they do not fully solve it. To overcome the instability caused by Batch Norm, we at Pinterest developed a solution called Resilient Batch Norm.

Resilient Batch Norm introduces two crucial hyperparameters: minimal_variance and variance_shift_threshold. The forward pass in Resilient Batch Norm follows these steps:

  1. Calculate Mean and Variance for the mini-batch.
  2. Update the Moving Average, with special conditions:
     • If a variance is smaller than the minimal_variance hyperparameter, mask out that column from the running variance update.
     • If a variance's change ratio exceeds the variance_shift_threshold, mask out that column from the running variance update.
     • Proceed to update the remaining running variance and the running mean.
  3. Normalize using the running variance and running mean.
  4. Scale and Shift.
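A sketch of these steps, under stated assumptions: the hyperparameter names follow the post, but the exact definition of the change ratio and whether the running mean is ever masked are not specified, so the masking details below are guesses.

```python
import torch

def resilient_batch_norm(x, gamma, beta, running_mean, running_var,
                         minimal_variance=1e-4, variance_shift_threshold=10.0,
                         momentum=0.1, eps=1e-5):
    # x: (batch, features)
    mean = x.mean(dim=0)                                 # step 1
    var = x.var(dim=0, unbiased=False)
    # Step 2: mask columns whose batch variance is degenerate or has
    # shifted too far from the running variance.
    shift_ratio = var / (running_var + eps)
    keep = (var >= minimal_variance) & (shift_ratio <= variance_shift_threshold)
    running_mean.mul_(1 - momentum).add_(momentum * mean)
    running_var[keep] = (1 - momentum) * running_var[keep] + momentum * var[keep]
    # Steps 3-4: normalize with the protected running statistics,
    # then scale and shift.
    y = (x - running_mean) / torch.sqrt(running_var + eps)
    return gamma * y + beta
```

Because a zero-variance column never reaches the running variance, the denominator stays healthy and the output remains finite even when a feature's batch goes constant.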

After conducting extensive experiments, we observed no decrease in performance or training speed. By seamlessly replacing Batch Norm with Resilient Batch Norm, our models gain the ability to handle the aforementioned feature problems and similar situations while achieving enhanced stability.

In conclusion, when faced with instability issues due to Batch Norm, adopting Resilient Batch Norm can provide a robust solution and improve the overall efficacy of the models.

In this section, we present some offline and online results for the user action sequence model on different view types (HomeFeed, RelatedPins, Search) and overall. The baseline model is our production model with the DCNv2 [3] architecture and internal training data. Note that a 0.1% offline accuracy improvement in the engagement ranking model is considered significant. Thus, the user action sequence features and modeling techniques improve both online and offline metrics very significantly.

By leveraging real-time user sequence features and employing various modeling techniques such as Transformers, feature interaction, feature connections, and pooling, the engagement model at Pinterest has been able to effectively adapt to users' behavior and feedback, resulting in more personalized and relevant recommendations. Recognizing users' long-term and short-term interests has been instrumental in achieving this goal. To account for both aspects, the full sequence is used to represent long-term interests, while the latest eight events capture short-term interests. This approach has significantly improved the model's prediction performance; however, it has come at a considerable cost in terms of the added features and complexity of the models.

To mitigate the impact on serving efficiency and infrastructure costs, we explored and implemented mixed precision inference techniques, employing lower-precision types (float16, bfloat16). This has effectively improved our serving efficiency while also reducing infrastructure costs. Additionally, we have addressed the challenge of making the model resilient to real-time changes, as we recognized the critical importance of these real-time sequence features. By incorporating a more resilient batch normalization technique, we are able to prevent abnormal value explosions caused by sudden changes in feature coverage or distribution shift.

As a result of these endeavors, Pinterest continues to deliver highly desirable, adaptive, and relevant recommendations that inspire and drive discovery for each unique user.

This work is the result of collaboration among the conversion modeling team members and across multiple teams at Pinterest.

Engineering Teams:

Ads Ranking: Van Wang, Ke Zeng, Han Sun, Meng Qi

Advanced Technology Group: Yi-Ping Hsu, Pong Eksombatchai, Xiangyi Chen

Ads ML Infra: Shantam Shorewala, Kartik Kapur, Matthew Jin, Yiran Zhao, Dongyong Wang

User Sequence Support: Zefan Fu, Kimmie Hua

Indexing Infra: Kangnan Li, Dumitru Daniliuc

Leadership: Ling Leng, Dongtao Liu, Liangzhe Chen, Haoyang Li, Joey Wang, Shun-ping Chiu, Shu Zhang, Jiajing Xu, Xiaofang Chen, Yang Tang, Behnam Rezaei, Caijie Zhang

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

[2] Zhou, Guorui, et al. "Deep interest network for click-through rate prediction." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

[3] Wang, Ruoxi, et al. "DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems." Proceedings of the Web Conference 2021. 2021.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.