This is a detailed reading of Google's paper <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf">Deep Neural Networks for YouTube Recommendations</a>.

<h2 id="candidate-generation">Candidate Generation</h2><h3 id="problem-setup">Problem Setup</h3><p>We pose recommendation as an extreme multi-class classificationproblem where we predict which video will be watchednext. Specifically, we classify a specific video watch <spanclass=”math inline”>wt</span> at time <spanclass=”math inline”>t</span> among millions of videos <spanclass=”math inline”>i</span> (classes) from a video corpusV based on a user <spanclass=”math inline”>U</span> and context <spanclass=”math inline”>C</span>. And $u\in \R^d$ represents a high-dimensional “embedding” of theuser-context pair and the $v_j \in\R^d$ represent embeddings of each candidate video. <spanclass=”math display”>P(wt = i ∣ U, C) = Softmax(V, u)i</span></p><h3 id="input">Input</h3><p>what we didn’t use: explicit feedback (thumbs up/down, in-productsurveys) because there’s too few of them in the tail of videos.</p><ul><li><p>embedded video watches:</p><ul><li><p>representation of each video: in myunderstanding, YouTube didn’t extract information from their videos, andfed the extracted info into the embedder. Instead, they directly fed thevideo ids into the embedder. To my understanding, “Inspired bycontinuous bag of words language models” means they fed the video idsinto the embedder, just like NLP feeds BoW representation into embedder.It doesn’t mean YouTube decomposes a video into word count like BoW.</p><p>I reached this conclusion from the last sentence of 4.1 FeatureRepresentation - Embedding Categorical Features, where1000000*32 / (2048*2048) = 7</p><blockquote><p>The overwhelming majority of model parameters are in thesehigh-cardinality embedding spaces - for example, one million IDsembedded in a 32 dimensional space have 7 times more parameters thanfully connected layers 2048 units wide.</p></blockquote></li><li><p>representation of watch history: watch historyis a variable-length list of videos. YouTube used average-pooling totransform them into a fixed-length vector.</p></li><li><p>why order-independent pooling?</p></li></ul></li><li><p>embedded search tokens: each query is tokenizedinto unigrams and bigrams. each token is embedded. All these embeddedtokens from all queries are then pooled and fed into the model as asummarized dense search history.</p></li><li><p>geographic embeddings: The user’s geographicregion and device are embedded and concatenated.</p></li><li><p>Simple binary and continuous features: such asthe user’s gender, logged-in state and age are input directly into thenetwork as real values normalized to [0, 1].</p></li><li><p>example (sample) age: ML model often has biastowards past samples because there’s more data about them. It’s commonto promote video freshness during re-rank phase, but YouTube also try toreduce this bias as early as in candidate generation phase. We definethe example age to be the time between training and obtaining thissample. e.g. t days earlier, after having watched videov, user searched word w and clicked on anothervideo y. We use this sample in training, so its sample ageis t. By introducing this feature, our model no longerreflects the average watch likelihood in the training window of severalweeks, but the likelihood at a specific time step. 
<h3 id="data-gathering-details">Data Gathering Details</h3>

<ul>
<li><p>Class Balance: generate a fixed number of training examples per user, effectively weighting users equally in the loss function.</p></li>
<li><p>Permutation-Invariant Pooling: for pooling, YouTube chose average pooling over sum, max, etc. because average performed best. The important thing is that they decided to abandon sequence information altogether. Their explanation is below, and I don't quite buy it, because there are surely better ways to solve this problem than discarding the information entirely. In addition, they later published <a href="https://research.google/pubs/latent-cross-making-use-of-context-in-recurrent-recommender-systems/">another paper on sequence-based recommendation systems</a>. I think at the time this paper was published, they either did not want to publicize that work or had not tested it in detail.</p>
<blockquote><p>Consider the user has just issued a search query for "taylor swift". Since our problem is posed as predicting the next watched video, a classifier given this information will predict that the most likely videos to be watched are those which appear on the corresponding search results page for "taylor swift". Unsurprisingly, reproducing the user's last search page as homepage recommendations performs very poorly. By discarding sequence information and representing search queries with an unordered bag of tokens, the classifier is no longer directly aware of the origin of the label.</p></blockquote></li>
<li><p>Next-Watch Heldout: at the time, many collaborative filtering systems implicitly chose the labels and context by holding out a random item and predicting it from the other items in the user's history. YouTube decided to always hold out and predict the user's next watch, and achieved much better results. This is now the standard; in fact it appeared in my college class CS4780 by Kilian.</p></li>
</ul>

<h3 id="training">Training</h3>

<p>In these experiments, a vocabulary of 1M videos and 1M search tokens were embedded with 256 floats each, with a maximum bag size of 50 recent watches and 50 recent searches. The softmax layer outputs a multinomial distribution over the same 1M video classes with a dimension of 256 (which can be thought of as a separate output video embedding).</p>

<ul>
<li><p>negative sampling: for the same user, in addition to the videos he watched, we also sample some unwatched videos so that the model generalizes (see the sketch after this list).</p></li>
<li><p>importance weighting: we do not take the softmax over all 1M videos. Instead, we sample roughly 5000 of them, compute their probabilities, re-weight them with importance weights (to correct for the sampling distribution), and only compute the loss over these samples.</p></li>
<li><p>loss: this is a multi-class classification problem, so we naturally use the cross-entropy loss.</p></li>
</ul>
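<p>A minimal sketch of what this sampled-softmax-style training could look like (uniform negatives shared across the batch, made-up names like <code>sampled_softmax_loss</code>; the paper does not spell out its exact sampling or importance-weighting scheme):</p>

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_vec, video_emb, pos_ids, n_neg=5000):
    """user_vec:  [batch, dim] output of the user tower.
    video_emb: [n_videos, dim] output video embeddings (the decoder weights).
    pos_ids:   [batch] index of the video actually watched next."""
    n_videos, _ = video_emb.shape
    # sample negatives uniformly and share them across the batch; a real system
    # corrects for the sampling distribution with importance weights
    neg_ids = torch.randint(n_videos, (n_neg,), device=user_vec.device)
    cand_ids = torch.cat(
        [pos_ids.unsqueeze(1), neg_ids.expand(len(pos_ids), -1)], dim=1
    )  # [batch, 1 + n_neg]; a negative may occasionally collide with the positive
    logits = torch.einsum("bd,bkd->bk", user_vec, video_emb[cand_ids])
    # the true class always sits at position 0 of each candidate list
    labels = torch.zeros(len(pos_ids), dtype=torch.long, device=user_vec.device)
    return F.cross_entropy(logits, labels)
```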
<h3 id="inference-serving">Inference / Serving</h3>

<ul>
<li><p>kNN: in serving, we need an algorithm that is sublinear in the number of classes (videos to recommend). Say the last layer of the network has hidden dimension $d$ and we have $N$ videos to score. Decoding the hidden vector back into per-video logits and taking a softmax costs $O(dN)$, and sorting costs $O(N \log N)$, for a total of $O(dN + N \log N)$. On the other hand, naive kNN takes $O(dN)$ time in total, and heuristic versions like ball trees can take $O(d \log N)$.</p>
<p>The distance is based on the dot product.</p></li>
<li><p>How do we get the video embeddings? (Where is the decoder?)</p>

<img src="/images/youtube-candidate-generation-architecture.png" alt="youtube-candidate-generation-architecture" />
<figcaption aria-hidden="true">youtube-candidate-generation-architecture</figcaption>
<p>Every classification model has to include a decoder (an FC layer) at the end of the network, just before the softmax layer, to decode the hidden vector back into per-video logits in video-ID space so it can make predictions. If we have 1M videos and the hidden vector has dimension 256, the decoder is a matrix of size [256, 1M]. The figure presented in the paper is confusing because the authors omit the decoder from the drawing and make it an implicit part of the softmax layer.</p>
<p>In any case, we know this decoder exists, so it is natural to use its vectors as our video embeddings: the i-th video's embedding is simply decoder[:, i]. (A small sketch of serving with these embeddings follows this list.)</p></li>
<li><p>weight sharing / weight tying: this is a concept I encountered in nanoGPT and it became clear here. At the beginning of the network we have a video encoder from video-ID space to hidden space; at the end of the network we have a video decoder from hidden space back to video-ID space. It is possible to share (tie) the weights of the encoder and decoder to save space (recall that this part accounts for the most parameters). This is only mentioned by people in the <a href="https://zhuanlan.zhihu.com/p/52504407">comment section</a> and was not implemented by Google.</p></li>
</ul>
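<p>To make this serving path concrete, here is a brute-force sketch (the function name and shapes are made up, not from the paper): the output-layer weight matrix doubles as the video-embedding table, and serving reduces to a dot-product top-k search against the user vector. Production serving replaces the exact scan below with an approximate nearest-neighbor index to get sublinear lookups.</p>

```python
import numpy as np

def topk_by_dot_product(user_vec, decoder, k=25):
    """user_vec: [d] user/context embedding from the last hidden layer.
    decoder:   [d, N] output-layer weights; column i is video i's embedding.
    Returns indices of the k highest-scoring videos.

    This is the exact O(dN) scan; real serving swaps it for an approximate
    nearest-neighbor index (hashing, trees, ...)."""
    scores = user_vec @ decoder  # [N] dot-product scores
    # softmax is monotonic in the logits, so ranking by dot product
    # yields the same top-k as ranking by probability
    return np.argsort(-scores)[:k]

# usage (small N just for the sketch)
d, N = 256, 10_000
user_vec = np.random.randn(d).astype(np.float32)
decoder = np.random.randn(d, N).astype(np.float32)
top_videos = topk_by_dot_product(user_vec, decoder, k=10)
```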
<h2 id="ranking">Ranking</h2>

<h3 id="problem-setup-1">Problem Setup</h3>

<p>We pose ranking as predicting the expected watch time of a video. Ranking by click-through rate often promotes deceptive videos that the user does not complete ("clickbait"), whereas watch time better captures engagement.</p>

<h3 id="input-feature-engineering">Input (Feature Engineering)</h3>

<ul>
<li>the user's previous interactions with the item itself and with similar items: e.g. the user's past history with the channel that uploaded the video being scored (how many videos has the user watched from this channel? when was the last time the user watched a video on this topic?)</li>
<li>information propagated from candidate generation into ranking in the form of features: e.g. which sources nominated this video candidate? what scores did they assign?</li>
<li>frequency of past video impressions: if a user was recently recommended a video but did not watch it, the model will naturally demote this impression on the next page load.</li>
</ul>

<h3 id="input-details">Input Details</h3>

<ul>
<li>the embedding dimension should grow approximately proportionally to the logarithm of the number of unique values of the feature</li>
<li>very large ID spaces (video IDs and query tokens) are truncated by click-through rate, so we only recommend videos above a certain CTR; note that the filtered-out videos can still be reached through search</li>
<li>out-of-vocabulary values (new / truncated videos) are mapped to the zero embedding</li>
<li>continuous features are always normalized to [0, 1)</li>
<li>in addition to the raw normalized feature $x$, we also input $\sqrt{x}$ and $x^2$, giving the network more expressive power by allowing it to easily form super- and sub-linear functions of the feature; feeding powers of continuous features was found to improve offline accuracy (a small sketch follows this list)</li>
</ul>
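<p>As a small illustration of this feature handling (the helper and feature names are hypothetical, not from the paper): out-of-vocabulary IDs fall back to a reserved all-zero embedding row, and each normalized continuous feature is fed together with its square root and square.</p>

```python
import numpy as np

def prepare_ranking_features(video_id, vocab, watch_frac):
    """video_id:   raw ID of the impression being scored
    vocab:       dict mapping in-vocabulary video IDs to embedding rows >= 1
    watch_frac:  a continuous feature already normalized to [0, 1)"""
    # out-of-vocabulary (new or CTR-truncated) videos map to index 0,
    # whose embedding row is kept at all zeros
    emb_index = vocab.get(video_id, 0)
    # feed the normalized value plus sub- and super-linear powers of it
    continuous = np.array([watch_frac, np.sqrt(watch_frac), watch_frac ** 2])
    return emb_index, continuous

# usage: an unseen video falls back to the zero embedding (index 0)
vocab = {"video_a": 1, "video_b": 2}
print(prepare_ranking_features("video_unseen", vocab, 0.42))
```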
<h3 id="training-1">Training</h3>

<p>We use logistic regression to predict the expected watch time (EWT) of a video, but LR only predicts 0 or 1. How can we use it to predict EWT? We use weighted logistic regression, where positive (clicked) impressions are weighted by the observed watch time on the video, and negative (unclicked) impressions all receive unit weight.</p>

<p>In training, we use the weighted cross-entropy loss:
$$\text{WeightedCrossEntropy} = -\sum_i \left[ T_i\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$</p>

<h3 id="serving">Serving</h3>

<p>In serving, we directly output $e^{\theta^T x}$ as the predicted watch time. Why is this the watch time? Recall that in logistic regression we have a binary classification problem, so we can define $P(Y_i = 1 \mid X_i) = p$, $P(Y_i = 0 \mid X_i) = 1 - p$ and $p = \frac{1}{1 + e^{-\theta^T x}}$. In statistics, the odds are defined as
$$\texttt{odds} = \frac{p}{1-p} = \frac{1}{e^{-\theta^T x}} = e^{\theta^T x}$$
Taking the log of both sides gives the log-odds, or logits: $\ln(\texttt{odds}) = \ln\frac{p}{1-p} = \theta^T x$. Now look at our weighted logistic regression problem: positive impressions are weighted by watch time $T_i$, negative impressions receive unit weight. Out of $N$ impressions in total, $k$ are positive (clicked), so $k = pN$, and the expected watch time per impression is $\mathbb{E}[T] = \frac{\sum_i^k T_i}{N}$. Now look at the weighted odds:
$$\begin{aligned}\texttt{weighted odds} &= \frac{\text{total positive weight}}{\text{total negative weight}} \\&= \frac{\sum_i^k T_i}{N-k} = \frac{\sum_i^k T_i}{N-pN} = \frac{\sum_i^k T_i}{N(1-p)} = \frac{\sum_i^k T_i}{N}\cdot\frac{1}{1-p} \\&\approx \mathbb{E}[T](1+p) \approx \mathbb{E}[T] \quad (\text{$p$ is small})\end{aligned}$$
Therefore, at serving time we can directly output $e^{\theta^T x}$, because the odds learned by the weighted model approximate the expected watch time.</p>

<h3 id="evaluation-metric">Evaluation Metric</h3>

<p>Since we are essentially predicting a video's watch time, we use "mispredicted watch time" as our evaluation metric; the paper calls it "weighted, per-user loss". Specifically, for each user we feed the model both the positive (clicked) and negative (unclicked) impressions shown to him on a single page and predict a watch time for each. The positive video's watch time counts as mispredicted whenever a negative impression receives a longer predicted watch time than the positive impression. The user's loss is the total mispredicted watch time divided by the total watch time of the ground-truth (positive) impressions.</p>

<h2 id="takeaway">Takeaway</h2>

<p>example age</p>
<p>expected watch time</p>
<p>kNN - quick serving</p>
<p>feature engineering</p>
<p>what is the surrogate problem?</p>
<p>https://www.youtube.com/watch?v=WK_Nr4tUtl8</p>

<h2 id="comment">Comment</h2>

<blockquote>
<p>The RNN approach was put into production by Alex Beutel in 2017. We actually wanted to do it in early 2016, but the gains were not fully there yet; later Alex found some weaknesses in the original approach, and after quick improvements it was launched as an important source of candidate generation. As for the ranking objective, only EWT is written up: first, because of Google's confidentiality requirements (the paper has to stay HR-safe, so it can only hint at things), and second, because it is reasonably effective and saves serving latency compared to predicting CTR and stay time separately.</p>
<p>– <a href="https://www.zhihu.com/people/dcb32aa26dd3a456bd79d2c2cdffa433">严林</a> under <a href="https://zhuanlan.zhihu.com/p/52169807">重读Youtube深度学习推荐系统论文,字字珠玑,惊为神文</a></p>
</blockquote>