YouTube Recommendation Algorithms (2016)

 2024/10/15 

This is a detailed reading of Google’s paper Deep Neural Networks for YouTube Recommendations

Candidate Generation

Problem Setup

We pose recommendation as an extreme multi-class classification problem where we predict which video will be watched next. Specifically, we classify a specific video watch w_t at time t among millions of videos i (classes) from a video corpus V based on a user U and context C. And $u \in \R^d$ represents a high-dimensional “embedding” of the user-context pair and the $v_j \in \R^d$ represent embeddings of each candidate video. P(w_t = i ∣ U, C) = Softmax(V, u)_i

Input

what we didn’t use: explicit feedback (thumbs up/down, in-product surveys) because there’s too few of them in the tail of videos.

embedded video watches:
- representation of each video: in my understanding, YouTube didn’t extract information from their videos, and fed the extracted info into the embedder. Instead, they directly fed the video ids into the embedder. To my understanding, “Inspired by continuous bag of words language models” means they fed the video ids into the embedder, just like NLP feeds BoW representation into embedder. It doesn’t mean YouTube decomposes a video into word count like BoW.
  
  I reached this conclusion from the last sentence of 4.1 Feature Representation - Embedding Categorical Features, where 1000000*32 / (2048*2048) = 7
  
  The overwhelming majority of model parameters are in these high-cardinality embedding spaces - for example, one million IDs embedded in a 32 dimensional space have 7 times more parameters than fully connected layers 2048 units wide.
- representation of watch history: watch history is a variable-length list of videos. YouTube used average-pooling to transform them into a fixed-length vector.
- why order-independent pooling?
embedded search tokens: each query is tokenized into unigrams and bigrams. each token is embedded. All these embedded tokens from all queries are then pooled and fed into the model as a summarized dense search history.
geographic embeddings: The user’s geographic region and device are embedded and concatenated.
Simple binary and continuous features: such as the user’s gender, logged-in state and age are input directly into the network as real values normalized to [0, 1].
example (sample) age: ML model often has bias towards past samples because there’s more data about them. It’s common to promote video freshness during re-rank phase, but YouTube also try to reduce this bias as early as in candidate generation phase. We define the example age to be the time between training and obtaining this sample. e.g. t days earlier, after having watched video v, user searched word w and clicked on another video y. We use this sample in training, so its sample age is t. By introducing this feature, our model no longer reflects the average watch likelihood in the training window of several weeks, but the likelihood at a specific time step. At serving time, this feature is set to zero (or slightly negative) to reflect that the model is making predictions at the very end of the training window.

总结某知乎讨论下的内容: example age 和消除第一种 ad position bias 做法类似
- 纯feed应用中前置内容点击率被高估：新闻客户端，position靠前的是虚高的，这部分叫 position bias，输入给模型的时候也要输入它们的位置，在线上预估时置0
- 搜索偏向应用中前置内容点击率被低估：手百，手淘，美团等，都会在首页默认展示feed，但很多目的明确的用户压根不会用这些推荐功能，导致这部分展示的内容点击率是被低估了。实际操作中大家可能只针对有feed互动的用户进行采样，抛弃了完全过路型用户的行为，也算是修正bias了

Data Gathering Details

Class Balance: generate a fixed number of training examples per user, effectively weighting our users equally in the loss function
Permutation-Invariant Pooling: in pooling, YouTube chose average pooling among sum, max, etc, because average performed the best. The important thing is, they decided to abandon sequence information whatsoever. Their explanation is below and I don’t quite buy it because they’re definitely better way to solve this problem than discarding the info altogether. In addition, they did publish another paper on sequence-based recommendation system later. I think at this paper’s publishing time, they either didn’t want to publicize it or have not tested that in detail.

Consider the user has just issued a search query for “taylor swift”. Since our problem is posed as predicting the next watched video, a classifier given this information will predict that the most likely videos to be watched are those which appear on the corresponding search results page for “taylor swift”. Unsurprisingly, reproducing the user’s last search page as homepage recommendations performs very poorly. By discarding sequence information and representing search queries with an unordered bag of tokens, the classifier is no longer directly aware of the origin of the label.
Next-Watch Heldout: at the time, many collaborative filtering systems implicitly choose the labels and context by holding out a random item and predicting it from other items in the user’s history. They decided to always hold out and predict user’s next watch and achieved much better results. This is now already the standard. In fact it appeared in my college class CS4780 by Killian.

Training:

In these experiments, a vocabulary of 1M videos and 1M search tokens were embedded with 256 floats each in a maximum bag size of 50 recent watches and 50 recent searches. The softmax layer outputs a multinomial distribution over the same 1M video classes with a dimension of 256 (which can be thought of as a separate output video embedding).

negative sampling: for the same user, in addition to his watched videos, we also sample some unwatched videos to generalize the model.
Importance Weighting: we do not take softmax over all the 1M videos. Instead, we sample ~5000 of them, compute their probability, re-weight them based on importance (watch time, click rate?) and only compute loss over these samples.
loss: this is a multi-class classification problem, so we use cross-entropy loss naturally

Inference / Serving:

kNN: In serving, we need an algorithm sublinear to number of classes (videos to recommend). Say the last layer of the network has hidden dimension d and we have N videos to predict. Decoding hidden dimension back into per-video logits and take a softmax takes O(dN). Sorting takes O(Nlog N). The total time is O(dN + Nlog N). On the other hand, naive kNN takes O(dN) time in total and some heuristic version like Ball tree can take O(dlog N).

distance is based on dot product
how do we get video embedding? (where is decoder?)

All classification models have to include a decoder (FC layer) at the end of the network but before the softmax layer to decode the hidden vector back to per-video logits in video ID space to make prediction. If we have 1M videos and hidden vector is of dimension 256, the decoder is a matrix of size [256, 1M]. However, the graph presented in the paper is very confusing because the authors omit drawing the decoder and made it an implicit part of the softmax layer.

Anyway, we know we do have that decoder, so it’s natural to use the vectors in the decoder as our video embedding. The i-th video’s embedding is simply decoder[:, i].
weight sharing / weight tying: this is a concept I encountered in nanoGPT and has become clear here. At the beginning of the network, we have a video encoder from video ID space to hidden space; at the end of the network, we have a video decoder from hidden space back to video ID space. It is possible to share weights (use the same weights) in encoder and decoder to save space (recall this part costs the most parameter). This is just mentioned by people in comment section and is not implemented by Google.

Ranking

Problem Setup

We pose ranking as predicting the expected watch time of a video. Ranking by click-through rate often promotes deceptive videos that the user does not complete (“clickbait”) whereas watch time better captures engagement

Input (Feature Engineering)

user’s previous interaction with the item itself and other similar items: e.g. user’s past history with the channel that uploaded the video being scored (how many videos has the user watched from this channel? When was the last time the user watched a video on this topic?)
propagate information from candidate generation into ranking in the form of features: e.g. which sources nominated this video candidate? What scores did they assign?
frequency of past video impressions: If a user was recently recommended a video but did not watch it then the model will naturally demote this impression on the next page load.

Input Details

embedding space should increase approximately proportional to the logarithm of the number of unique values of data space
very large space (video id & query token) is truncated by click-through-rate. So we only recommend videos above a certain CTR. Note these filtered out videos can still be searched out.
Out-of-vocabulary values (new / truncated videos) are mapped to the zero embedding.
Continuous features are always normalized to [0, 1)
In addition to the raw normalized feature x, we also input $\sqrt x$ and x², giving the network more expressive power by allowing it to easily form super- and sub-linear functions of the feature. Feeding powers of continuous features was found to improve offline accuracy.

Training

We use Logistic Regression to predict expected watch time (EWT) of a video, but LR only predicts 0 or 1. How can we use it to predict EWT? We use weighted logistic regression, where the positive (clicked) impressions are weighted by the observed watch time on the video. Negative (unclicked) impressions all receive unit weight.

In training, we use the weighted cross entropy loss: WeightedCrossEntropy = −∑_i[T_iy_ilog p_i + (1 − y_i)log (1 − p_i)] ### Serving

In serving, we directly output e^θx as the predicted watch time. Why is this the watch time? Recall in Logistic Regression, we have a binary classification problem, so we can define P(Y_i = 1|X_i) = p, P(Y_i = 0|X_i) = 1 − p and $$ p = \frac{1}{1 + e^{-\theta^T x}} $$ In statistics, we define odds as: $$ \texttt{odds} = \frac{p}{1-p} = \frac{1}{e^{-\theta^Tx}} = e^{\theta^Tx} $$ If we take a log at both sides, we have the log odds, or logits $$ \ln(\texttt{odds}) = \ln(\frac{p}{1-p}) = \theta^Tx $$ Now let’s look at the our weighted logistic regression problem: Positive impressions are weighted by watch time T_i. Negative impressions receive unit weight. We have a total N videos, and k of them are positive (clicked). We have k = pN and the expected watch time of all videos is $\mathbb E[T] = \frac{\sum_i^k T_i}{N}$. Now look at our weighted odds: $$ \begin{align} \texttt{wghted odds} &= \frac{\text{weighted pos prob}}{\text{weighted neg prob}}\\ &= \frac{\sum_i^k T_i}{N-k} = \frac{\sum_i^k T_i}{N-pN} = \frac{\sum_i^k T_i}{N(1-p)} = \frac{\sum_i^k T_i}{N}\frac{1}{1-p}\\ & = \mathbb E[T](1+p) \approx \mathbb E[T] \hspace{20px}\text{($p$ is small)} \end{align} $$ Therefore, at serving time, we can directly output e^θx because it is the expected watch time.

Evaluation Metric

Since we’re essentially predicting a video’s watch time, we “wrongly-predicted video’s watch time” as our evaluation metric. In the paper, author called it “weighted, per-user loss”. Specifically, for each user, we feed the model both positive (clicked) and negative (unclicked) impressions shown to him on a single page. We first predict their respective watch time with our model. “mispredicted watch time” is the positive video’s watch time when the negative impression receives a longer predicted watch time than the positive impression. This user’s loss is total mispredicted watch time / total watch time of ground-truth (positive) impressions.

Takeaway

example age

expected watch time

KNN - quick serving

feature engineering

what is surrogate problem?

https://www.youtube.com/watch?v=WK_Nr4tUtl8

Comment

RNN这个方法在17年已经由Alex Beutel做上线了，其实在16年初就想做，只是效果还没有完全出来，后来Alex发现了原先做法的一些弱点，很快改进之后就上线了，作为重要的candidates generation来源；排序目标只写了EWT，一是Google的技术保密要求，毕竟还是要做到HR-safe的，论文只能点到即止，二是相对有效并且能够比分开预测ctr和staytime能节省serving latenc

– 严林 under 重读Youtube深度学习推荐系统论文，字字珠玑，惊为神文