CLIP investigates whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain (CV).
<blockquote><p>This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text.</p></blockquote>
<h2 id="approach">2 Approach</h2>
<h3 id="advantage-of-natural-language-supervision">2.1 Advantage of Natural Language Supervision</h3>
<ul><li>easy to scale: the amount of natural language data is huge and is much easier to obtain than crowd-sourced labels</li><li>flexible zero-shot transfer: it connects image representations to language, unlike unsupervised or self-supervised models that are limited to the image domain.</li></ul>
<h3 id="constructing-dataset">2.2 Constructing Dataset</h3>
<p>To explore the effects of web-scale pre-training, the authors first build a web-scale dataset.</p>
<ol type="1"><li>Construct a query list of 500,000 entries containing words that occur at least 100 times in English Wikipedia.</li><li>Search for (image, text) pairs whose text contains one of these queries, building a dataset of 400M pairs.</li><li>Approximately class balance (yes, that is the term for “make each class have roughly the same number of samples so it’s fair”) by including up to 20,000 pairs per query.</li></ol>
<h3 id="what-to-predict-what-is-the-loss">2.3 What to Predict? What is the Loss?</h3>
<p>Previous attempts at natural language supervision predicted a bag-of-words (BoW) or phrase n-gram representation of the text. The authors explore different approaches. This work is all about large-scale pre-training, and training efficiency is the key to scaling natural language supervision, so the final pre-training method was selected based on efficiency. They compared three approaches:</p>
<ol type="1"><li>Transformer language model (captioning model): train a transformer to predict the caption of an image. This is a generative task with a standard language-modeling loss. It learns about 3 times slower than the baseline (approach 2).</li><li>A model that predicts the BoW encoding of the caption: this was used as a simple baseline, and the authors found approach 1 couldn’t even beat it. It still tries to predict the exact words of the text, just ignoring their order, which remains hard given the wide variety of descriptions, comments, and related text that co-occur with images.</li><li>A contrastive model that predicts which text as a whole is paired with which image: this shrinks the output space to just the candidate texts in the batch, and it learns 4 times faster than the baseline (approach 2). A toy sketch of the three prediction targets follows the figure below.</li></ol>
<figcaption aria-hidden="true">Accuracy vs #(images processed)</figcaption>
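<p>As a toy illustration (mine, not from the paper), here is what each of the three approaches has to predict for a single (image, caption) pair; the tiny vocabulary and minibatch are made up:</p>
<pre>import numpy as np

vocab = ["a", "photo", "of", "dog", "cat", "grass", "on"]      # toy vocabulary
caption = "a photo of a dog on grass"
batch_captions = [caption, "a photo of a cat", "green grass"]  # toy minibatch of texts

# 1. Captioning (transformer LM): predict the exact token sequence, in order.
target_lm = caption.split()

# 2. Bag of words: predict which vocabulary words appear; order is ignored.
words = set(caption.split())
target_bow = np.array([1.0 if w in words else 0.0 for w in vocab])

# 3. Contrastive (CLIP): predict which text in the minibatch matches the image.
target_contrastive = batch_captions.index(caption)

print(target_lm)           # ['a', 'photo', 'of', 'a', 'dog', 'on', 'grass']
print(target_bow)          # multi-hot vector over the vocabulary
print(target_contrastive)  # 0: a single index among batch_size candidates
</pre>
<p>The target shrinks from an exact word sequence, to a set of words, to a single index among <code>batch_size</code> in-batch candidates, which is the compactness ordering summarized in the table below.</p>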
<table><thead><tr><th>approach</th><th>output (answer) space</th></tr></thead><tbody><tr><td>1. transformer captioning</td><td>word permutations</td></tr><tr><td>2. BoW prediction</td><td>word combinations</td></tr><tr><td>3. contrastive (CLIP)</td><td><code>batch_size</code> pre-selected queries (32768 in CLIP)</td></tr></tbody></table>
<p>It’s worth noting that CLIP uses a very large minibatch size of <span class="math inline">2<sup>15</sup> = 32768</span>.</p>
<h3 id="model-architecture-and-scaling">2.4 Model Architecture and Scaling</h3>
<figcaption aria-hidden="true">Summary of CLIP</figcaption>
<pre># image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter
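#     (log-parameterized: np.exp(t) below recovers the multiplicative scale)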
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
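# logits[i, j]: similarity of image i and text j; matching pairs lie on the diagonal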
# symmetric loss function
labels = np.arange(n)
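# the i-th image is paired with the i-th text, so the correct index is i in both directions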
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
</pre><ul><li><code>d_e</code>
is the dimension of the joint multi-modal embedding space.</li><li>the temperature parameter <span class="math inline">τ</span> is directly optimized as a log-parameterized multiplicative scalar, to avoid tuning it as a hyper-parameter. <a href="https://github.com/openai/CLIP/blob/a1d071733d7111c9c014f024669f959182114e33/clip/model.py#L367-L368">implementation in the original release</a></li></ul>
<p>The authors train CLIP from scratch, without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.</p>
<p>This section also describes how to scale the text encoder and how to scale both kinds of image encoders.</p>
<h2 id="experiments">3 Experiments</h2>
<p>The authors conducted experiments on 36 different datasets.</p>
<h3 id="zero-shot-transfer">3.1 Zero-Shot Transfer</h3>
<p>The authors wanted to experiment with zero-shot transfer because of the ability demonstrated by language models. The following is the most exciting sentence in this paper to me. I think it explains a lot of the large-scale design choices by the OpenAI team. Did this paper inspire Ilya to go all the way down the path of scaling?</p>
<blockquote><p>Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP. To our knowledge Liu et al. (2018) first identified task learning as an “unexpected side-effect” when a language model trained to generate Wikipedia articles learned to reliably transliterate names between languages.</p></blockquote>
<p>The authors explain in detail how zero-shot classification is done and give an interpretation of the pipeline. I wrote the earlier “output space” / “answer space” comparison based on this interpretation.</p>
<blockquote><p>The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling. When interpreted this way, the image encoder is the computer vision backbone which computes a feature representation for the image and the text encoder is a hypernetwork which generates the weights of a linear classifier based on the text specifying the visual concepts that the classes represent. Continuing with this interpretation, every step of CLIP pre-training can be viewed as optimizing the performance of a randomly created proxy to a computer vision dataset which contains 1 example per class and has 32,768 total classes defined via natural language descriptions.</p></blockquote>
<p>prompt engineering and ensembling</p>
<p>Text in the training data is usually a full sentence, but the text at test time is often just a one-word label. To bridge this gap, CLIP uses prompt templates.</p>
<ul><li>default: <code>A photo of a {label}</code>
</li><li>on several fine-grained image classification datasets, it’s helpful to specify the category: <code>A photo of a {label}, a type of pet</code> or <code>a satellite photo of a {label}</code></li><li>ensembling several different prompts improves performance: use different context prompts such as <code>A photo of a big {label}</code> and <code>A photo of a small {label}</code>. The authors construct the ensemble over the embedding space instead of the probability space; this lets them cache a single set of averaged text embeddings, so the amortized compute cost does not increase (see the sketch after this list).</li></ul>
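<p>A minimal sketch of zero-shot classification with prompt ensembling in embedding space. It reuses <code>image_encoder</code>, <code>text_encoder</code>, <code>W_i</code>, <code>W_t</code> and <code>t</code> from the pseudocode above, so it is pseudocode to the same extent; the templates, class names, and <code>image</code> variable are illustrative, and tokenization is omitted:</p>
<pre># Zero-shot classification with prompt ensembling in embedding space (sketch).
# image_encoder, text_encoder, W_i, W_t, t come from the pseudocode above;
# templates, class_names and image are illustrative placeholders.
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

templates = ["A photo of a {label}.",
             "A photo of a big {label}.",
             "A photo of a small {label}."]
class_names = ["cat", "dog", "car"]

# Build one cached embedding per class: embed every prompt for that class,
# average in embedding space (not probability space), then re-normalize.
class_embs = []
for label in class_names:
    prompts = [tpl.format(label=label) for tpl in templates]
    T_f = text_encoder(prompts)                       # [num_templates, d_t]
    T_e = l2_normalize(np.dot(T_f, W_t))              # [num_templates, d_e]
    class_embs.append(l2_normalize(T_e.mean(axis=0, keepdims=True))[0])
class_embs = np.stack(class_embs)                     # [num_classes, d_e]

# Classify a single image against the cached class embeddings.
I_f = image_encoder(image)                            # [1, d_i]
I_e = l2_normalize(np.dot(I_f, W_i))                  # [1, d_e]
logits = np.dot(I_e, class_embs.T) * np.exp(t)        # [1, num_classes]
z = logits - logits.max()
probs = np.exp(z) / np.exp(z).sum()                   # softmax over class names
prediction = class_names[int(np.argmax(probs))]
</pre>
<p>Because the averaged class embeddings are computed once and cached, adding more prompt templates does not increase the per-image cost at inference time.</p>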
<p>scaling law: zero-shot CLIP performance scales smoothly as a function of model compute.</p>
<p>linear probe (representation learning): to use CLIP as a feature extractor with a linear classifier on top, use the image embedding (the image encoder’s output <code>I_f</code>), not the multi-modal embedding (the image embedding after the multi-modal linear projection).</p>
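<p>A minimal linear-probe sketch under the same assumptions: the image encoder comes from the pseudocode above, the <code>train_*</code> / <code>test_*</code> arrays are hypothetical, and scikit-learn’s logistic regression (L2-regularized by default) stands in for the paper’s classifier, with <code>C=1.0</code> as a placeholder where the paper sweeps the regularization strength:</p>
<pre># Linear probe sketch: fit a classifier on the image encoder's output I_f,
# not the multi-modal embedding.
# train_images, train_labels, test_images, test_labels are hypothetical arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression

train_feats = image_encoder(train_images)    # [N, d_i] image embeddings (I_f)
test_feats = image_encoder(test_images)      # [M, d_i]

clf = LogisticRegression(C=1.0, max_iter=1000)   # L2 penalty by default
clf.fit(train_feats, train_labels)
print("linear-probe accuracy:", clf.score(test_feats, test_labels))
</pre>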
<p>results: when comparing against other models with a similar compute requirement, small CLIP models have both wins and losses. However, CLIP scales very well, and the largest model achieves both state-of-the-art scores and the best compute efficiency.</p>
<p>ViT vs ResNet: the authors found CLIP ViT to be about 3x more compute-efficient than CLIP ResNet, which is aligned with the ViT paper’s findings.</p>
<p>Out-of-Domain Performance and Natural Distribution Shift: researchers often find that models exceeding human accuracy on the ImageNet test set still make simple mistakes on other test data and score much lower than humans. A common explanation is that these models are adept at finding patterns within their training dataset, which improves in-distribution performance; however, many of these patterns are spurious and do not hold for other distributions, resulting in large performance drops on other datasets.</p>
<p>Most of the studies that reach this explanation limited their evaluation to models trained on ImageNet. The authors therefore ask: to what degree are these failures attributable to deep learning, to ImageNet, or to some combination of the two? They explore this by evaluating ImageNet models on naturally distribution-shifted datasets.</p>
<p>Natural distribution shift means testing trained models on data that differs in, e.g., image style, image blurriness, geographic location, and camera operation (Hendrycks et al., The Many Faces of Robustness). “Natural” distinguishes these shifts from synthetic distribution shifts created through style transfer or adversarial generation.</p>
<ol type="1"><li>The authors found that CLIP performs much better on these naturally distribution-shifted datasets.</li><li>However, this doesn’t necessarily mean that supervised learning on ImageNet causes a robustness gap. Other details of CLIP, such as its large and diverse pre-training dataset or its use of natural language supervision, could also produce robust models.</li><li>Therefore, OpenAI measured how the performance of CLIP models changes after adapting to the ImageNet distribution via an L2-regularized logistic regression classifier fit to CLIP features on the ImageNet training set. This improves accuracy on ImageNet by 9.2% (to 85.4%), but average accuracy under distribution shift slightly decreases.</li></ol>
<p>To me this doesn’t say much. If you fine-tune (or fit a linear classifier) to a specific dataset, you would of course expect its behavior to degrade on other datasets. Then again, these naturally distribution-shifted datasets are not that different from ImageNet: yes, there are some drawings and sketches, but most are just more pictures of the same classes, and CLIP with an ImageNet linear head cannot get them right. I guess what the authors want to say is that ImageNet is not just an arbitrary dataset; it has almost become the machine learning benchmark dataset. It is supposed to be general, because all models train on it and these models get deployed in all sorts of scenarios.</p>
<p>The authors don’t go so far as to attack the generality of ImageNet, or even to draw a conclusion on why fitting an ImageNet classification head hurts performance under natural distribution shift. They simply caution that, although prior work has also pre-trained models on distributions other than ImageNet, it is common to study and release models only after they have been fine-tuned to ImageNet,
and it would be wise to also study models pre-trained on distributions other than ImageNet.</p>
<p>Results: Taken together, these results suggest that the recent shift towards large-scale task- and dataset-agnostic pre-training, combined with a reorientation towards zero-shot and few-shot benchmarking on broad evaluation suites, promotes the development of more robust systems and provides a more accurate assessment of performance.</p>
<h2 id="data-overlap-analysis">5 Data Overlap Analysis</h2>
<p>A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evals. One option to prevent this is to identify and remove all duplicates before training a model. While this guarantees reporting true hold-out performance, it requires knowing all possible data a model might be evaluated on ahead of time, which has the downside of limiting the scope of benchmarking and analysis.</p>
<p>Therefore, OpenAI instead built a duplicate detector, documented how much overlap occurs, and ran experiments on each evaluation dataset with and without the overlapping examples to measure how performance changes. So instead of simply removing duplicates, they record performance before and after removing them.</p>
<p>They found a median overlap of 2.2% and an average overlap of 3.2%. Due to this small amount of overlap, overall accuracy is rarely shifted by more than 0.1%, with only 7 datasets above this threshold.</p>
<p>It would be useful if OpenAI also released their duplicate detector model. Appendix C discusses it in more detail, but it doesn’t seem like OpenAI ever released it.</p>
<h2 id="limitations">6 Limitations</h2>
<p>Performance:</p>
<ol type="1"><li>CLIP cannot beat models trained and designed for a specific dataset: zero-shot CLIP performs better than pre-trained ResNet-50 features plus a linear classifier, but on most datasets it is well below the SOTA for that specific dataset.</li><li>Zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution for it: CLIP simply has a very large domain; it is not really a general model. For example, images resembling MNIST digits barely appear in its huge web-scraped dataset, so CLIP does surprisingly badly on this very simple dataset.</li><li>CLIP is limited to “choosing”: CLIP cannot just take in a picture and spit out its class. You need to give CLIP a set of candidates to choose from; it “chooses” rather than “generates” (unlike an image captioning model).</li></ol>
<p>Training Methodology:</p>
<ol type="1"><li>During development, the authors repeatedly queried performance on full validation sets to guide design choices. These validation sets often have thousands of examples, which is unrealistic for true zero-shot scenarios. (In contrast, LLMs at training time don’t do this?)</li><li>The training dataset comes from the Internet; its image-text pairs are unfiltered and uncurated, resulting in CLIP models learning many social biases.</li></ol>
<p>Supervision with Natural Language:</p>
<ol type="1"><li>Many complex tasks and visual concepts can be difficult to specify just through text.</li><li>Actual training examples are undeniably useful, but CLIP does not optimize for few-shot performance directly; the paper falls back to fitting linear classifiers on top of CLIP’s features.
This results in a counter-intuitive drop in performance when transitioning from the zero-shot to the few-shot setting.</li></ol>
<h2 id="broader-impacts">7 Broader Impacts</h2>
<p>In this section, the authors mainly discuss the biases that exist in CLIP and the kinds of surveillance it could be used for.</p>
<p>Nothing too interesting, but they discuss how tweaking the category system can improve the model’s performance. This reminds me of what I did in Xiaomi’s overseas app store tagging project, where I added new categories and modified existing categories’ definitions to improve a cosine-similarity-based zero-shot classification model.</p>
<blockquote><p>Given that we observed that people under 20 were the most likely to be classified in both the crime-related and non-human animal categories, we carried out classification for the images with the same classes but with an additional category ‘child’ added to the categories. We found that this drastically reduced the number of images of people under 20 classified in either crime-related categories or non-human animal categories (Table 7). This points to how class design has the potential to be a key factor determining both the model performance and the unwanted biases or behavior the model may exhibit</p></blockquote>
<p>The authors then go on to conclude that</p>
<blockquote><p>Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest</p></blockquote>
<h2 id="takeaways">Takeaways</h2>
<ul><li><p>Data is still king in ML. It is possible to transfer the success of task-agnostic web-scale pre-training in NLP to CV.</p></li><li><p>The key to scaling and training efficiency is how compact your output space is (word permutations -> word combinations -> <code>batch_size</code> candidates).</p></li>
<li><p>We can use prompt ensembling to improve CLIP’s performance.</p></li><li><p>To use CLIP as a feature extractor with a linear classifier on top, use the image embedding (the image encoder’s output), not the multi-modal embedding (the image embedding after the multi-modal linear projection).</p><p>For zero-shot classification, on the other hand, you use the multi-modal embedding, just as in training, except that now you have a single image and compute its cosine similarity with the embeddings of all class names.</p></li><li><p>Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest.</p></li></ul>