September 7, 2024

By Jiangwei Pan, Gary Tang, Henry Wang, and Justin Basilico

Our mission at Netflix is to entertain the world. Our personalization algorithms play a crucial role in delivering on this mission for all members by recommending the right shows, movies, and games at the right time. This goal extends beyond immediate engagement; we aim to create an experience that brings lasting enjoyment to our members. Traditional recommender systems often optimize for short-term metrics like clicks or engagement, which may not fully capture long-term satisfaction. We strive to recommend content that not only engages members in the moment but also enhances their long-term satisfaction, which increases the value they get from Netflix and thus makes them more likely to continue to be a member.

One simple way we can view recommendations is as a contextual bandit problem. When a member visits, that becomes a context for our system, and it selects an action of what recommendations to show, after which the member provides various types of feedback. These feedback signals can be immediate (skips, plays, thumbs up/down, or adding items to their list) or delayed (completing a show or renewing their subscription). We can define reward functions to reflect the quality of the recommendation from these feedback signals and then train a contextual bandit policy on historical data to maximize the expected reward.
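To make the framing concrete, here is a minimal sketch (illustrative Python with made-up names like `LoggedInteraction`, not Netflix's actual system) of how a visit's context, the selected action, and the resulting feedback become training data for the bandit policy:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Feedback:
    """Feedback signals for one recommendation; some arrive immediately, some are delayed."""
    played: bool = False       # immediate
    skipped: bool = False      # immediate
    thumbs_up: bool = False    # immediate or delayed
    completed: bool = False    # often delayed (days or weeks later)

@dataclass
class LoggedInteraction:
    context: Dict[str, float]  # member/session features at visit time
    action: str                # the recommended item that was shown
    feedback: Feedback         # what the member did afterwards

def reward(fb: Feedback) -> float:
    """Placeholder reward; the rest of the post is about choosing this function well."""
    return float(fb.played)

def to_training_data(log: List[LoggedInteraction]) -> List[Tuple[Dict[str, float], str, float]]:
    """(context, action, reward) tuples used to train or update the bandit policy offline."""
    return [(x.context, x.action, reward(x.feedback)) for x in log]
```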

There are many ways in which a recommendation model can be improved. Improvements may come from more informative input features, more data, different architectures, more parameters, and so forth. In this post, we focus on a less-discussed aspect of improving the recommender objective: defining a reward function that tries to better reflect long-term member satisfaction.

Member retention might seem like an obvious reward for optimizing long-term satisfaction, because members should stay if they're satisfied, but it has several drawbacks:

  • Noisy: Retention can be influenced by numerous external factors, such as seasonal trends, marketing campaigns, or personal circumstances unrelated to the service.
  • Low Sensitivity: Retention is only sensitive for members on the verge of canceling their subscription, so it does not capture the full spectrum of member satisfaction.
  • Hard to Attribute: Members might cancel only after a sequence of bad recommendations.
  • Slow to Measure: We only get one signal per account per month.

Due to these challenges, optimizing for retention alone is impractical.

Instead, we can train our bandit policy to optimize a proxy reward function that is highly aligned with long-term member satisfaction while being sensitive to individual recommendations. The proxy reward r(user, item) is a function of the user's interaction with the recommended item. For example, if we recommend “One Piece” and a member plays, subsequently completes, and gives it a thumbs-up, a simple proxy reward might be defined as r(user, item) = f(play, complete, thumb).
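For illustration, a toy version of this reward (hypothetical weights, not our production definition) might look like:

```python
def proxy_reward(played: bool, completed: bool, thumbs_up: bool) -> float:
    # Hypothetical weights: a play is good, a completion is a stronger signal,
    # and an explicit thumbs-up adds further evidence of satisfaction.
    return 1.0 * played + 2.0 * completed + 1.5 * thumbs_up

# The "One Piece" example: play + completion + thumbs-up.
r = proxy_reward(played=True, completed=True, thumbs_up=True)  # -> 4.5
```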

Click-through rate (CTR)

Click-through rate (CTR), or in our case play-through rate, can be viewed as a simple proxy reward where r(user, item) = 1 if the user clicks a recommendation and 0 otherwise. CTR is a common feedback signal that generally reflects users' preferences. It is a simple yet strong baseline for many recommendation applications. In some cases, such as ads personalization where the click is the target action, CTR may even be a reasonable reward for production models. However, in most cases, over-optimizing CTR can lead to promoting clickbaity items, which may harm long-term satisfaction.

Beyond CTR

To align the proxy reward function more closely with long-term satisfaction, we need to look beyond simple interactions, consider all types of user actions, and understand their true implications for user satisfaction.

We give a few examples in the Netflix context; a toy sketch of how such signals might enter the reward follows the list:

  • Fast season completion ✅: Completing a season of a recommended TV show in one day is a strong sign of enjoyment and long-term satisfaction.
  • Thumbs-down after completion ❌: Completing a TV show over a few weeks followed by a thumbs-down indicates low satisfaction despite the significant time spent.
  • Playing a movie for just 10 minutes ❓: In this case, the user's satisfaction is ambiguous. The brief engagement might indicate that the user decided to abandon the movie, or it could simply mean the user was interrupted and plans to finish the movie later, perhaps the next day.
  • Discovering new genres ✅ ✅: Watching more Korean or game shows after “Squid Game” suggests the user is discovering something new. This discovery was likely even more valuable, since it led to a variety of engagements in a new area for the member.
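Building on the toy reward above, the sketch below shows how signals like these might be folded in. The `ItemFeedback` fields and all weights are hypothetical, purely for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ItemFeedback:
    played: bool = False
    minutes_watched: float = 0.0
    completed: bool = False
    days_to_complete: Optional[float] = None   # None if not completed
    thumbs: int = 0                            # +1 thumbs-up, -1 thumbs-down, 0 none
    new_genre_for_member: bool = False         # e.g., first K-drama after "Squid Game"

def proxy_reward(fb: ItemFeedback) -> float:
    r = 1.0 * fb.played + 2.0 * fb.completed + 1.5 * max(fb.thumbs, 0)
    # Fast season completion: strong sign of enjoyment.
    if fb.completed and fb.days_to_complete is not None and fb.days_to_complete <= 1:
        r += 1.0
    # Thumbs-down after completion: low satisfaction despite the time spent.
    if fb.completed and fb.thumbs < 0:
        r -= 2.0
    # A 10-minute play is ambiguous: no credit beyond the play itself.
    # Discovery of a new genre: extra credit for opening up a new area of engagement.
    if fb.new_genre_for_member:
        r += 1.0
    return r
```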

Reward engineering is the iterative process of refining the proxy reward function to align with long-term member satisfaction. It is similar to feature engineering, except that the reward can be derived from data that isn't available at serving time. Reward engineering involves four stages: hypothesis formation, defining a new proxy reward, training a new bandit policy, and A/B testing. For example, we might hypothesize that fast season completion signals satisfaction, add it to the proxy reward, train a new bandit policy, and A/B test it.

User feedback used in the proxy reward function is often delayed or missing. For example, a member may decide to play a recommended show for just a few minutes on the first day and take several weeks to fully complete it. This completion feedback is therefore delayed. Furthermore, some user feedback may never arrive: much as we might wish otherwise, not all members provide a thumbs-up or thumbs-down after completing a show, leaving us uncertain about their level of enjoyment.

We could wait longer to observe feedback, but how long should we wait for delayed feedback before computing the proxy rewards? If we wait too long (e.g., weeks), we miss the opportunity to update the bandit policy with the latest data. In a highly dynamic environment like Netflix, a stale bandit policy can degrade the member experience and be particularly bad at recommending newer items.

Solution: predict missing feedback

We aim to update the bandit policy shortly after making a recommendation while also defining the proxy reward function based on all user feedback, including delayed feedback. Since delayed feedback has not been observed at the time of policy training, we predict it. The prediction is made for each training example with delayed feedback, using the feedback already observed and other relevant information up to the training time as input features. As a result, the prediction also gets better as time progresses.

The proxy reward is then calculated for each training example using both observed and predicted feedback. These training examples are used to update the bandit policy.
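A minimal sketch of this flow, assuming a hypothetical thumbs-up prediction model (here a scikit-learn logistic regression over a few made-up observed-feedback features), might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical observed-feedback features for past training examples:
# [minutes_watched_first_day, fraction_of_title_watched, added_to_list]
X_history = np.array([[95.0, 1.0, 1], [12.0, 0.1, 0], [55.0, 0.6, 1], [3.0, 0.02, 0]])
y_history = np.array([1, 0, 1, 0])  # whether a thumbs-up eventually arrived

# Delayed-feedback prediction model: p(final thumbs-up | observed feedback so far).
thumbs_up_model = LogisticRegression().fit(X_history, y_history)

def proxy_reward(played: float, completed: float, p_thumbs_up: float) -> float:
    # Same toy reward as before, but the thumbs-up term may be a predicted probability.
    return 1.0 * played + 2.0 * completed + 1.5 * p_thumbs_up

# A fresh training example whose thumbs-up feedback has not been observed yet.
observed = np.array([[40.0, 0.45, 1]])
p_thumb = thumbs_up_model.predict_proba(observed)[0, 1]
r = proxy_reward(played=1.0, completed=0.0, p_thumbs_up=p_thumb)
# (context, action, r) is then used to update the bandit policy.
```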

However aren’t we nonetheless solely counting on noticed suggestions within the proxy reward operate? Sure, as a result of delayed suggestions is predicted primarily based on noticed suggestions. Nonetheless, it’s easier to purpose about rewards utilizing all suggestions straight. For example, the delayed thumbs-up prediction mannequin could also be a posh neural community that takes into consideration all noticed suggestions (e.g., short-term play patterns). It’s extra simple to outline the proxy reward as a easy operate of the thumbs-up suggestions fairly than a posh operate of short-term interplay patterns. It may also be used to regulate for potential biases in how suggestions is supplied.

The reward engineering loop described above is thus extended with an optional delayed feedback prediction step.

Two types of ML models

It's worth noting that this approach employs two types of ML models:

  • Delayed Feedback Prediction Models: These models predict p(final feedback | observed feedback). The predictions are used to define and compute proxy rewards for the bandit policy's training examples. As a result, these models are used offline, during bandit policy training.
  • Bandit Policy Models: These models implement the bandit policy π(item | user; r) and are used to generate recommendations online, in real time.
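To make the division of labor concrete, here is a rough sketch with hypothetical interfaces (not our production APIs): the delayed feedback model only participates in building training data offline, while the bandit policy is the model that serves recommendations.

```python
from typing import Dict, List

class DelayedFeedbackModel:
    """Offline model: p(final feedback | observed feedback). Used only when building training data."""
    def predict_thumbs_up(self, observed_feedback: Dict[str, float]) -> float:
        # Placeholder; in practice this could be a neural network over short-term play patterns.
        return min(1.0, observed_feedback.get("fraction_watched", 0.0))

class BanditPolicy:
    """Online model: pi(item | user; r). Serves recommendations in real time."""
    def __init__(self, item_scores: Dict[str, float]):
        self.item_scores = item_scores  # stand-in for learned parameters

    def recommend(self, user_context: Dict[str, float], candidates: List[str]) -> str:
        # Greedy selection by estimated reward; a production policy may also explore.
        return max(candidates, key=lambda item: self.item_scores.get(item, 0.0))

# Offline: delayed-feedback predictions feed proxy rewards, which train/update the policy.
# Online: only the trained BanditPolicy runs at serving time.
```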

Improved input features or neural network architectures often lead to better offline model metrics (e.g., AUC for classification models). However, when these improved models are subjected to A/B testing, we often observe flat or even negative online metrics, which are what quantify long-term member satisfaction.

This online-offline metric disparity usually occurs when the proxy reward used by the recommendation policy is not fully aligned with long-term member satisfaction. In such cases, a model may achieve higher proxy rewards (offline metrics) but result in worse long-term member satisfaction (online metrics).

Nevertheless, the model improvement is genuine. One way to resolve this is to further refine the proxy reward definition so that it aligns better with the improved model. When this tuning results in positive online metrics, the model improvement can be effectively productized. See [1] for more discussion of this challenge.

In this post, we provided an overview of our reward engineering efforts to align Netflix recommendations with long-term member satisfaction. While retention remains our north star, it is not easy to optimize directly, so our efforts focus on defining a proxy reward that is aligned with long-term satisfaction and sensitive to individual recommendations. Finally, we discussed the unique challenge of delayed user feedback at Netflix and proposed an approach that has proven effective for us. Refer to [2] for an earlier overview of the reward innovation efforts at Netflix.

As we continue to improve our recommendations, several open questions remain:

  • Can we learn a good proxy reward function automatically by correlating behavior with retention?
  • How long should we wait for delayed feedback before using its predicted value in policy training?
  • How can we leverage reinforcement learning to further align the policy with long-term satisfaction?

[1] Deep Learning for Recommender Systems: A Netflix Case Study. AI Magazine, 2021. Harald Steck, Linas Baltrunas, Ehtsham Elahi, Dawen Liang, Yves Raimond, Justin Basilico.

[2] Reward Innovation for Long-Term Member Satisfaction. RecSys 2023. Gary Tang, Jiangwei Pan, Henry Wang, Justin Basilico.