Yao Lirong's Blog

CLIP

2024/04/22

CLIP investigates whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain (CV).

This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text.

2 Approach

2.1 Advantage of Natural Language Supervision

  • easy to scale: natural language data is abundant on the web and much easier to obtain than crowd-sourced labels
  • flexible zero-shot transfer: it connects image representations to language, unlike unsupervised or self-supervised models whose representations stay confined to the image domain.

2.2 Constructing Dataset

To explore the effects of web-scale pre-training, we first need to build a web-scale dataset.

  1. Construct a query list of 500,000 queries containing all words that occur at least 100 times in English Wikipedia
  2. Search for images matching these queries, constructing a dataset of 400M (image, text) pairs
  3. Class balance (yes, that's the term for "make each class have roughly the same number of samples so it's fair") by including up to 20,000 pairs per query
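
As a toy illustration of that class-balancing step (my own sketch with hypothetical names, not the paper's actual pipeline), capping the number of (image, text) pairs kept per query could look like this:

from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # the per-query cap mentioned above

def class_balance(pairs):
    # pairs: iterable of (query, image_url, text) tuples from the web search
    kept, counts = [], defaultdict(int)
    for query, image_url, text in pairs:
        if counts[query] < MAX_PAIRS_PER_QUERY:
            counts[query] += 1
            kept.append((query, image_url, text))
    return kept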

2.3 What to Predict? What is the Loss?

Previous attempts at natural language supervision predicted a bag-of-words (BoW) or phrase n-gram representation of the text paired with each image. The authors explore different approaches. This work is all about large-scale pre-training and scaling, and training efficiency is the key to scaling natural language supervision, so the authors selected the final pre-training method based on efficiency. They compared three approaches:

  1. Transformer language model (captioning model): train a transformer to predict the caption of an image. This is a generative task and uses the transformer's standard language-modeling loss. It learns 3 times slower than the baseline (approach 2).
  2. A model that predicts the BoW encoding of the caption: this was used as a simple baseline, and the authors found approach 1 couldn't even beat it. This approach still tries to predict the exact words of the text label, just ignoring the order in which they appear. That is not much easier, because of the wide variety of descriptions, comments, and related text that co-occur with images.
  3. A contrastive model that predicts which text as a whole is paired with which image: this shrinks the output space to just the set of candidate texts we are choosing among. It learns 4 times faster than the baseline (approach 2).
[Figure 2 from the paper: accuracy vs. number of images processed for the three approaches]

See Figure 2 for a detailed comparison of accuracy vs. the number of images processed for these three models; it illustrates how fast or slow each training method learns.

| Approach | Output Space | Answer Space: in the ideal scenario, what do we choose from? |
| --- | --- | --- |
| Transformer language model | All English sentences (permutations of all English words) | 500K queries |
| BoW prediction model | Word counts of all English sentences (combinations of all English words) | 500K queries |
| Contrastive pairing model | Sentences describing classes and labels | batch_size pre-selected queries (32,768 in CLIP) |

It's worth noting that CLIP uses a very large minibatch size of 2^15 = 32,768.

2.4 Model Architecture and Scaling

[Figure: summary of CLIP]

The image encoder comes in two architecture families: ResNet (starting from ResNet-50) and ViT. Below is the paper's numpy-like pseudocode for the core of CLIP:

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter
# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]
# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)
# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2

Note:

  1. d_e is the dimension of the shared multi-modal embedding space.
  2. the temperature parameter τ is directly optimized as a log-parameterized multiplicative scalar (hence the np.exp(t)) to avoid having to tune it as a hyper-parameter (see the implementation in the original release).
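
The snippet above is the paper's numpy-like pseudocode, so l2_normalize and cross_entropy_loss are left undefined. Here is a minimal sketch of what they could look like (my own assumption, not the paper's implementation); it relies on labels = np.arange(n), so the correct (image, text) pairs sit on the diagonal of the logits matrix:

import numpy as np

def l2_normalize(x, axis=1, eps=1e-8):
    # scale rows (axis=1) or columns (axis=0) of x to unit L2 norm
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_entropy_loss(logits, labels, axis=1):
    # log-softmax along `axis`, then mean negative log-likelihood of the
    # matching pairs; with labels = np.arange(n), the correct entries are
    # exactly the diagonal of the [n, n] logits matrix
    logits = logits - logits.max(axis=axis, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    return -log_probs[labels, labels].mean()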

The authors train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.

This section also describes how to scale the text encoder and how to scale both kinds of image encoder.

3 Experiments

The authors conducted experiments on 36 different datasets.

3.1 Zero-Shot Transfer

The authors experiment with zero-shot transfer because of the zero-shot abilities demonstrated by language models. The following is, to me, the most exciting sentence in this paper. I think it explains a lot of the OpenAI team's large-scale design choices. Did this paper inspire Ilya to go all the way down the path of scaling?

Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP. To our knowledge Liu et al. (2018) first identified task learning as an “unexpected side-effect” when a language model trained to generate Wikipedia articles learned to reliably transliterate names between languages.

The authors explain in detail how zero-shot classification is done and give an interpretation of the pipeline. I wrote the earlier "output space" and "answer space" table based on this interpretation.

The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ , and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling. When interpreted this way, the image encoder is the computer vision backbone which computes a feature representation for the image and the text encoder is a hypernetwork which generates the weights of a linear classifier based on the text specifying the visual concepts that the classes represent. Continuing with this interpretation, every step of CLIP pre-training can be viewed as optimizing the performance of a randomly created proxy to a computer vision dataset which contains 1 example per class and has 32,768 total classes defined via natural language descriptions.
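
To make the "text encoder as a hypernetwork that generates classifier weights" interpretation concrete, here is a minimal zero-shot classification sketch (my own illustration; image_encoder, text_encoder, W_i, W_t, t, and l2_normalize reuse the names from the pseudocode and helper sketch above and are assumed to exist):

import numpy as np

def zero_shot_classify(image, class_names, image_encoder, text_encoder, W_i, W_t, t):
    # one "classifier weight vector" per class, generated from its text prompt
    prompts = [f"A photo of a {name}." for name in class_names]
    T_e = l2_normalize(text_encoder(prompts) @ W_t, axis=1)       # [k, d_e]
    # embed the single image into the same multi-modal space
    I_e = l2_normalize(image_encoder(image[None]) @ W_i, axis=1)  # [1, d_e]
    # cosine similarities, temperature scaling, softmax over the k classes
    logits = (I_e @ T_e.T) * np.exp(t)                            # [1, k]
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return class_names[int(probs.argmax())], probs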

prompt engineering and ensembling

Text in the training data is usually a full sentence, but text at test time is just a one-word label. To bridge this gap, CLIP uses prompt templates.

  • default: A photo of a {label}
  • on several fine-grained image classification datasets, it helps to specify the category: A photo of a {label}, a type of pet or a satellite photo of a {label}
  • ensembling several different prompts improves performance: use different context prompts such as A photo of a big {label} and A photo of a small {label}. The authors construct the ensemble in embedding space instead of probability space; this way they can cache a single set of averaged text embeddings, so the amortized compute cost is the same as with a single prompt (see the sketch after this list).
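
A small sketch of ensembling in embedding space (my own illustration; TEMPLATES and the encoder/projection names are assumptions, and l2_normalize is the helper sketched earlier): average each class's prompt embeddings, re-normalize, and cache the result so inference costs the same as a single prompt.

import numpy as np

TEMPLATES = ["A photo of a {}.", "A photo of a big {}.", "A photo of a small {}."]

def ensembled_class_embeddings(class_names, text_encoder, W_t):
    weights = []
    for name in class_names:
        prompts = [tpl.format(name) for tpl in TEMPLATES]
        T_e = l2_normalize(text_encoder(prompts) @ W_t, axis=1)   # [m, d_e]
        mean_e = T_e.mean(axis=0, keepdims=True)                  # average in embedding space
        weights.append(l2_normalize(mean_e, axis=1)[0])
    # computed once per class list, then cached and reused for every image
    return np.stack(weights)                                      # [k, d_e]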

scaling law

[Figure: zero-shot CLIP performance scales with model compute]

A scaling law is an empirical relationship showing that performance is a predictable function of important quantities such as training compute and dataset size.

Across the 36 datasets, ResNet CLIP's average zero-shot error is well modeled by a log-log linear scaling trend, although performance on individual evaluations is much more varied despite the smooth overall trend. The authors did not report ViT CLIP scaling results.
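
For intuition, "well modeled by a log-log linear trend" just means a line fits log(error) vs. log(compute). A tiny sketch with made-up placeholder numbers (NOT the paper's results; the real curve is in the paper's figure):

import numpy as np

# Illustrative placeholders only, not the paper's numbers
compute = np.array([1., 4., 16., 64., 256.])    # relative training compute
error   = np.array([40., 34., 29., 25., 21.])   # average zero-shot error (%)

# fit log(error) = slope * log(compute) + intercept
slope, intercept = np.polyfit(np.log(compute), np.log(error), deg=1)
predicted = np.exp(intercept) * compute ** slope  # a straight line on log-log axes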

3.2 Representation Learning

To use CLIP as an image representation, there are two common approaches:

  • Fitting a linear classifier on a representation extracted from the model
  • End-to-end fine-tuning of the model.

Fine-tuning increases flexibility, and prior work has convincingly demonstrated that it outperforms linear classification on most image classification datasets. However, OpenAI chooses a linear classifier to measure CLIP's performance, for the following reasons:

  • the more official reason: the linear probe was chosen because it is weak and therefore better reveals how dataset-agnostic CLIP's pre-trained representations are

    Our work is focused on developing a high-performing task and dataset-agnostic pre-training approach. Fine-tuning, because it adapts representations to each dataset during the fine-tuning phase, can compensate for and potentially mask failures to learn general and robust representations during the pre-training phase. Linear classifiers, because of their limited flexibility, instead highlight these failures and provide clear feedback during development

  • the more practical reason:

    Fine-tuning opens up a much larger design and hyper-parameter space, which makes it difficult to fairly evaluate and computationally expensive. By comparison, linear classifiers require minimal hyper-parameter tuning and have standardized implementations and evaluation procedures.

  • bonus reason:

    Linear classifier has the added benefit of being very similar to the approach used for its zero-shot classifiers which enables extensive comparisons and analysis

approach: Appendix A.3 provides a full guide to training such a linear classifier, including details on the hyper-parameter search, solver, and train-valid-test split. Notably, the input to the logistic regression is the image embedding (the output of the image encoder, I_f), not the multi-modal embedding (the image embedding after the multi-modal linear projection). A sketch of this setup follows.
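
A minimal linear-probe sketch (my own illustration of the setup, not the paper's exact recipe or hyper-parameters): fit an L2-regularized logistic regression on frozen image-encoder features.

import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels, C=1.0):
    # train_feats / test_feats are I_f, the frozen image-encoder outputs
    # (NOT the multi-modal embeddings); C=1.0 is a placeholder value, the
    # paper sweeps this regularization strength on a validation split
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)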

results: when compared to other models with similar compute requirements, small CLIP models have both wins and losses. However, CLIP scales very well, and the largest model achieves both SOTA performance and better compute efficiency.

ViT vs ResNet: the authors found CLIP ViT is about 3x more compute efficient than CLIP ResNet, which is aligned with the ViT paper's findings.

Out-of-Domain Performance and Natural Distribution Shift: researchers often find that models exceeding human performance on the ImageNet test set still make simple mistakes on other test data and score much lower than humans there. A common explanation is that these models are adept at finding patterns within their training dataset, which improves in-distribution performance, but many of these patterns are spurious and do not hold for other distributions, resulting in large performance drops on other datasets.

Most of the studies that reach the above explanation limited their evaluation to models trained on ImageNet. The authors therefore want to know to what degree these failures are attributable to deep learning, to ImageNet, or to some combination of the two. They explore this by evaluating ImageNet models on naturally distribution-shifted datasets.

Natural distribution shift means testing trained models on data that differs in, e.g., image style, blurriness, geographic location, or camera operation (Hendrycks et al., The Many Faces of Robustness). "Natural" distinguishes these shifts from synthetic distribution shifts created through style transfer or adversarial generation.

  1. The authors found CLIP performs much better on these naturally distribution-shifted datasets.
  2. However, this doesn't necessarily mean supervised learning on ImageNet causes a robustness gap. Other details of CLIP, such as its large and diverse pre-training dataset or its use of natural language supervision, could also produce robust models.
  3. Therefore, OpenAI measured how the performance of CLIP models changes after adapting to the ImageNet distribution via an L2-regularized logistic regression classifier fit to CLIP features on the ImageNet training set. This improves accuracy on ImageNet by 9.2% to 85.4%, but average accuracy under distribution shift slightly decreases.

To me this doesn't say much. If you fine-tune (or fit a linear classifier) to a specific dataset, of course you'd expect its behavior to be worse on some other dataset. On the other hand, these naturally distribution-shifted datasets are not that different from ImageNet. Yes, there are some animations / sketches, but most are just more pictures of the same classes, and CLIP with an ImageNet linear head cannot get them right. I guess what the authors want to say is that ImageNet is not just an arbitrary dataset; it has almost become the machine learning benchmark dataset. It is supposed to be general because all models train on it, and these models will be deployed in all sorts of scenarios.

The authors didn't go so far as to attack the generality of ImageNet, or even draw any conclusion on why fitting an ImageNet classification head hurts natural-distribution-shift performance. They merely caution that although prior work has also pre-trained models on distributions other than ImageNet, it is common to study and release models only after they have been fine-tuned to ImageNet, and it would be wise to also study models pre-trained on distributions other than ImageNet before such fine-tuning.

Results: Taken together, these results suggest that the recent shift towards large-scale task and dataset agnostic pre-training combined with a reorientation towards zero-shot and few-shot benchmarking on broad evaluation suites promotes the development of more robust systems and provides a more accurate assessment of performance.

5 Data Overlap Analysis

A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evals. One option to prevent this is to identify and remove all duplicates before training a model. While this guarantees reporting true hold-out performance, it requires knowing all possible data which a model might be evaluated on ahead of time. This has the downside of limiting the scope of benchmarking and analysis.

Therefore, OpenAI instead built a duplicate detector, documented how much overlap occurs, and ran experiments on each evaluation dataset with and without these overlaps to measure how performance changes. So instead of simply removing the duplicates, they record performance before and after removing them.

They found that there is a median overlap of 2.2% and an average overlap of 3.2%. Due to this small amount of overlap, overall accuracy is rarely shifted by more than 0.1% with only 7 datasets above this threshold.

It would be useful if OpenAI also released their duplicate detector model. Appendix C discusses it in more detail, but it doesn't seem like OpenAI ever released it.

6 Limitations

Performance:

  1. CLIP cannot beat models trained & designed for a specific dataset: zero-shot CLIP performs better than a pre-trained ResNet-50 feature extractor + a linear classifier, but on most datasets it is well below the SOTA for that specific dataset.
  2. zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution for it: CLIP simply has a very large domain; it is not a truly general model. For example, MNIST digits hardly appear in its web-scraped dataset, so CLIP does surprisingly badly on this super simple dataset.
  3. CLIP is limited to "choosing": CLIP cannot just take in a picture and spit out its class. You need to give CLIP a set of candidates to choose from. CLIP is built around "choosing", not "generating" (as an image captioning model does).

Training Methodology:

  1. During development, the authors repeatedly queried performance on full validation sets to guide CLIP's design. These validation sets often have thousands of examples, which is unrealistic for true zero-shot scenarios. By contrast, an LLM at training time doesn't do this (?)
  2. The training dataset comes from the Internet. Its image-text pairs are unfiltered and uncurated, which results in CLIP models learning many social biases.

Supervision with Natural Language:

  1. Many complex tasks and visual concepts can be difficult to specify just through text.
  2. Actual training examples are undeniably useful, but CLIP does not optimize for few-shot performance directly. In the paper, the authors fall back to fitting linear classifiers on top of CLIP's features, which results in a counter-intuitive drop in performance when transitioning from the zero-shot to the few-shot setting.

7 Broader Impacts

In this section, the authors mainly discuss the biases that exist in CLIP and the kinds of surveillance it could be used for.

Nothing too interesting, but they discussed how tweaking the category system can improve the model's performance. This reminds me of what I did in Xiaomi's overseas app store tagging project, where I added new categories and modified existing categories' definitions to improve the cosine-similarity-based zero-shot classification model's performance.

Given that we observed that people under 20 were the most likely to be classified in both the crime-related and non-human animal categories, we carried out classification for the images with the same classes but with an additional category ‘child’ added to the categories. We found that this drastically reduced the number of images of people under 20 classified in either crime-related categories or non-human animal categories (Table 7). This points to how class design has the potential to be a key factor determining both the model performance and the unwanted biases or behavior the model may exhibit

The authors then go on to conclude that

Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest

Takeaways

  • Data is still the king in ML. It is possible to transfer the success of task-agnostic web-scale pre-training in NLP to CV.

  • The key to scaling & training efficiency is how compact your output space is (word permutations -> word combinations -> batch_size)

  • We can use prompt ensembling to improve CLIP’s performance.

  • To use CLIP as a feature extractor and put a linear classifier on top of it, use the image embedding (the image encoder's output), not the multi-modal embedding (the image embedding after the multi-modal linear projection);

    On the other hand, for zero-shot classification, you use the multi-modal embedding, the same as in training, except now you have only one image and compute its cosine similarity with the embeddings of all class names.

  • Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest
