Intro to Matryoshka Representation Learning

In Matryoshka Representation Learning (MRL), we want to construct an encoding $e_d$ with dimension $d$ such that its truncations of different lengths ($e_{d/16}$, $e_{d/8}$, $e_{d/4}$, $e_{d/2}$) are each (somewhat) valid representations. Suppose you're training on a classification problem with the classic encoder + classifier head architecture. At train time, you attach a classifier head to each truncation length and sum the classification losses across all of them, so every nested prefix of the embedding is trained to be a usable representation on its own.
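Here is a minimal PyTorch sketch of that training objective. The sizes (3072-dim embedding, nested dims {192, 384, 768, 1536, 3072}, 1000 classes) and the stand-in encoder are illustrative choices, not from the original post:

```python
import torch
import torch.nn as nn

# Illustrative sizes: full dim d = 3072 with nested prefixes d/16 ... d.
NESTED_DIMS = [192, 384, 768, 1536, 3072]
NUM_CLASSES = 1000

encoder = nn.Linear(512, 3072)  # stand-in encoder; any backbone works
# One classifier head per truncation length.
heads = nn.ModuleList([nn.Linear(m, NUM_CLASSES) for m in NESTED_DIMS])
loss_fn = nn.CrossEntropyLoss()

def mrl_loss(x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum the classification loss over every nested prefix of the embedding."""
    e = encoder(x)  # (batch, 3072)
    return sum(loss_fn(head(e[:, :m]), labels)
               for m, head in zip(NESTED_DIMS, heads))

# Usage:
# loss = mrl_loss(torch.randn(8, 512), torch.randint(0, NUM_CLASSES, (8,)))
```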

Application: Adaptive Retrieval

Online retrieval is one of the tasks where latency matters the most. Given a user query $q$, it is slow to compute KNN over a dataset of 1M ($10^6$) indexed vectors if each vector has dimension 3072. With MRL, we can decompose the process into two stages (sketched in code after the list):

  1. Shortlist: first retrieve 2K candidates, where the distance is computed using only the 1024-d prefix (the first 1024 elements of the 3072-d vector)
  2. Rerank: find the KNN among these 2K candidates, where the distance is computed using the full-length 3072-d vector
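Below is a minimal NumPy sketch of this two-stage process. The function name and default parameters are mine, and it assumes all vectors are L2-normalized so that a dot product acts as cosine similarity:

```python
import numpy as np

def adaptive_retrieval(query, db, shortlist_dim=1024, shortlist_size=2000, k=10):
    """Two-stage KNN: cheap shortlist on a prefix, exact rerank on full vectors.

    query: (d,) float32; db: (n, d) float32; assumes n > shortlist_size and
    that all vectors are L2-normalized (dot product ~ cosine similarity).
    """
    # Stage 1: shortlist by similarity on the first `shortlist_dim` dimensions.
    coarse = db[:, :shortlist_dim] @ query[:shortlist_dim]
    shortlist = np.argpartition(-coarse, shortlist_size)[:shortlist_size]
    # Stage 2: rerank only the shortlist with the full-length vectors.
    fine = db[shortlist] @ query
    return shortlist[np.argsort(-fine)[:k]]
```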

The FLOPs per query are therefore reduced from $3072 \times 10^6$ to $1024 \times 10^6 + 3072 \times 2000 \approx 1.03 \times 10^9$, roughly a 3× reduction. Ce Gao tested full-length 3072-dim vectors against adaptive retrieval with the Matryoshka 1024-dim prefix: accuracy dropped from 99% to 89%, while Requests Per Second (RPS) rose from 300 to 1000.

Find more details on Matryoshka Representation Learning and its applications in this wonderful <a href="https://aniketrege.github.io/blog/2024/mrl/#what-is-mrl-really-this-time">blog post</a>; read from the section "What is MRL? (Really this Time)".

Binary Vector Search

<a href="https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors">Ce Gao suggested</a> another way to reduce memory and FLOP usage. He proposes turning the length-$d$ FP32 vector into a length-$d$ binary vector, where each positive value is set to 1 and each negative value is set to 0.
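As a concrete illustration, here is a small NumPy sketch of the binarization and of Hamming-distance scoring over packed bits (the function names are mine, not Ce Gao's):

```python
import numpy as np

def binarize(vecs):
    """Quantize FP32 vectors to bits: positive -> 1, else 0.

    vecs: (n, d) float32 -> (n, d // 8) uint8, 8 bits packed per byte.
    """
    return np.packbits(vecs > 0, axis=-1)

def hamming_distances(query_bits, db_bits):
    """Hamming distance between one packed query and every packed DB row."""
    xor = np.bitwise_xor(db_bits, query_bits)          # differing bits
    return np.unpackbits(xor, axis=-1).sum(axis=-1)    # popcount per row
```

The XOR + popcount inner loop is far cheaper than FP32 dot products, which is where the latency win comes from.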

Without adaptive retrieval, accuracy dropped from 99% to 83%, but latency (RPS = 3000) and memory improved significantly: previously a single vector/encoding consisted of $d$ 32-bit numbers, whereas now it consists of $d$ 1-bit values, a 32× memory reduction.

If you adapt the Adaptive Retrieval setup mentioned earlier:

  1. Shortlist: retrieve 2K candidates using the full-length but binary vectors
  2. Rerank: find the KNN among these 2K candidates using the full-length FP32 vectors

you get an accuracy drop from 99% to only 96%, with an RPS of 1700 (a sketch of this combination follows).
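A minimal sketch combining the two ideas, with the same illustrative names and defaults as before:

```python
import numpy as np

def binary_adaptive_retrieval(query, db, db_bits, shortlist_size=2000, k=10):
    """Hamming-distance shortlist on packed bits, FP32 rerank on the shortlist.

    query: (d,) float32; db: (n, d) float32, assumed L2-normalized;
    db_bits: (n, d // 8) uint8, e.g. produced by binarize(db) above.
    """
    query_bits = np.packbits(query > 0)
    # Stage 1: shortlist by Hamming distance (smaller is closer).
    dists = np.unpackbits(np.bitwise_xor(db_bits, query_bits),
                          axis=-1).sum(axis=-1)
    shortlist = np.argpartition(dists, shortlist_size)[:shortlist_size]
    # Stage 2: rerank the shortlist with full-precision dot products.
    fine = db[shortlist] @ query
    return shortlist[np.argsort(-fine)[:k]]
```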

P.S. I discovered this method on <a href="https://simonwillison.net/2024/Mar/26/binary-vector-search/">Simon Willison's blog</a>.