I was using LLaVA to ask how many characters there are in an image. For higher accuracy, I decided to employ Chain of Thought (CoT), but struggled to implement it. CoT is conducted through a multi-round conversation. That is easy to do in a graphical chat interface, but how is it done internally, in code?
<h2 id="token-level">Token Level</h2><p>Before diving into instruct / chat model, let’s go to the lowestlevel and think how transformers do generation. Transformer is anautoregressive model: it uses its own output as input for the nextround. Looking at <ahref=”https://github.com/karpathy/nanoGPT/blob/9755682b981a45507f6eb9b11eadef8cb83cebd5/model.py#L328”>nanoGPT’sgenerate
function</a>:</p>
</pre></td><td class="code"><pre>def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
"""
Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
the sequence max_new_tokens times, feeding the predictions back into the model each time.
"""
for _ in range(max_new_tokens):
idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
logits, _ = self(idx_cond)
logits = logits[:, -1, :] / temperature
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, idx_next), dim=1)
return idx
</pre></td></tr></table>
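<p>To see this loop in action, here is a minimal sketch of how one might drive it, modeled on nanoGPT's own sample.py. The checkpoint name and the use of the GPT-2 BPE via tiktoken are my assumptions, not something the snippet above specifies:</p>
<pre>import torch
import tiktoken
from model import GPT  # nanoGPT's model.py

enc = tiktoken.get_encoding("gpt2")          # GPT-2 BPE tokenizer
model = GPT.from_pretrained("gpt2").eval()   # or load your own checkpoint

# encode the prompt into a (1, t) LongTensor of token ids
start_ids = enc.encode("Once upon a time")
idx = torch.tensor(start_ids, dtype=torch.long)[None, ...]

# generate() keeps appending sampled tokens to idx, one per iteration
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=50, temperature=0.8, top_k=200)

print(enc.decode(out[0].tolist()))
</pre>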
</pre></td><td class="code"><pre>token0 = tokenizer(text)
output1 = model(token0)
token1 = get_resposne(output1)
output2 = model(token0 + token1)
token2 = get_resposne(output2)
output3 = model(token0 + token1 + token2)
…
<p>Note that when we compute model(token0 + token1), we forget all the attention we calculated in model(token0), even though the attention for the token0 part is exactly the same. This is why people complain that transformer inference is slow, and this is where inference speed-up techniques like the KV-cache come in.</p>
<p>This also reveals that the very popular graph demonstrating the theory behind transformer inference lied (at least to me). When calculating <span class="math inline">\(y_{i+1}\)</span>, we do not re-use <span class="math inline">\(y_0, \dots, y_i\)</span>, or the attention, or the intermediate activations. We just feed them back into the model as something completely new.</p>
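<p>As an aside, this is exactly the waste the KV-cache avoids. A minimal sketch with a Hugging Face causal LM (GPT-2 purely as a stand-in, greedy decoding, raw forward calls rather than .generate(); none of this is from the original post): pass use_cache=True, keep the returned past_key_values, and feed only the newly sampled token on the next step:</p>
<pre>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The quick brown fox", return_tensors="pt").input_ids

with torch.no_grad():
    # first pass: the full prompt, asking the model to return its key/value cache
    out = model(ids, use_cache=True)

    for _ in range(20):
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        # later passes: only the new token goes in, the cache stands in for the rest
        out = model(next_id, past_key_values=past, use_cache=True)

print(tok.decode(ids[0]))
</pre>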
</pre></td><td class="code"><pre>input1 = tokenizer(text1)
output1 = model(input1)
# output1 contains input1 and model's response 1
response1 = get_resposne(output1)
input2 = tokenizer(text2)
output2 = model(input1 + response1 + input2)
<p>When generating output2, we feed input1 + response1 to the model as if they were brand new, but this shouldn't be a concern anymore, since we feed every token back as new anyway.</p>
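<p>To make input1 + response1 + input2 concrete, here is a sketch of what the raw prompt string could look like across two rounds. The [INST] ... [/INST] wrapping is my assumption (a Llama-2-style template, consistent with the [/INST] prefix used below); the exact template depends on the checkpoint:</p>
<pre># round 1: the prompt is just the first user turn
prompt1 = "[INST] <image>\nHow many animated characters are there in this image? [/INST]"
# ...run the model on prompt1 and extract its reply...
response1 = "...the model's round-1 reply goes here..."

# round 2: the whole history plus the new user turn is re-sent as one string
prompt2 = (
    prompt1
    + " " + response1
    + " [INST] Answer with a single number in decimal format. [/INST]"
)
</pre>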
<h2 id="get_response">get_response</h2>
<p>The question now becomes how to implement get_response, i.e. how to extract the assistant's response from the text-continuation model's output.</p>
<ul>
<li><p>Find the indicator (prefix) of the start of the assistant's message. Note that when the model doesn't follow the instructions and fails to generate such a prefix, this method fails.</p>
</pre></td><td class="code"><pre>prefix = "[/INST]" # escape special characters for regex
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens = 300)
detoked_output = processor.decode(output[0], skip_special_tokens=True)
answer_idx = [m.end() for m in re.finditer(prefix, detoked_output)][-1]
answer = detoked_output[answer_idx:]
<li><p>Cut the detokenized output at the length of the detokenized input. Note the clean_up_tokenization_spaces argument of the decode function, which defaults to False (for what it does, see <a href="https://discuss.huggingface.co/t/what-does-the-parameter-clean-up-tokenization-spaces-do-in-the-tokenizer-decode-function/17399">this discussion</a>). Hugging Face sets it to True in both calls; I tried setting both to False, or one to True and the other to False, and either still gives correct results. That said, it's best to follow what Hugging Face wrote; after all, they know their code best.</p>
</pre></td><td class="code"><pre>with torch.no_grad():
output = model.generate(**inputs, max_new_tokens = 300)
detoked_output = processor.decode(output[0], skip_special_tokens=True,
clean_up_tokenization_spaces=True)
cutoff = len(text_processor.decode(
inputs["input_ids"][0],
skip_special_tokens=True,
clean_up_tokenization_spaces=True,
))
answer = detoked_output[cutoff:]
</pre></td></tr></table>
</pre></td><td class="code"><pre>chat = [
{"role": "user", "content": "<image>\nHow many animated characters are there in this image?"}
]
prompt = text_processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, image, return_tensors="pt").to(device)
…
detoked_output = processor.decode(output[0], skip_special_tokens=True)
cutoff = len(prompt)
<p>This cutoff lands many indexes after the real starting point of the assistant's response. That is because apply_chat_template adds special tokens <s> </s> to mark the start and end of each turn of the conversation, but when we detokenize the output we pass skip_special_tokens to get the response only, which causes the discrepancy (see the quick check after this list).</p></li>
</ul>
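<p>A quick way to see the discrepancy, sketched with the same variable names (prompt, inputs, text_processor) as in the snippets above rather than taken from the original post: compare the length of the templated prompt string with the length of the same token ids decoded with skip_special_tokens=True.</p>
<pre># length of the templated prompt, special tokens and <image> included
print(len(prompt))

# length of the same tokens decoded the way we decode the model output,
# i.e. with special tokens stripped out
prompt_as_decoded = text_processor.decode(
    inputs["input_ids"][0], skip_special_tokens=True
)
print(len(prompt_as_decoded))

# the difference is roughly how far off a cutoff of len(prompt) would be
print(len(prompt) - len(prompt_as_decoded))
</pre>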
<p>I thought at first that this discrepancy came from LLaVA replacing the <image> token with the image embeddings (or pixel_values, as Hugging Face calls them), because <image> also disappeared from the detoked_output. However, after reading LLaVA's paper, <a href="https://arxiv.org/abs/2304.08485">Visual Instruction Tuning</a> (Figure 1: LLaVA network architecture), I realized LLaVA actually puts the image in front of the text input instead of inserting it in the middle.</p>
<p><image> disappeared because it's also a special token. However, it was not inside tokenizer.all_special_tokens. Reading the tokenizer's source code, I'm actually not sure how it was added as a special token, so I was not able to debug why it's not in all_special_tokens. For this specific behavior, I posted <a href="https://discuss.huggingface.co/t/additional-special-tokens-are-not-added/93192">an issue on the Hugging Face forum</a>.</p>
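<p>If you want to poke at this yourself, here is a hedged sketch of the runtime checks I would reach for. The attribute names (all_special_tokens, added_tokens_decoder, chat_template) exist on recent transformers tokenizers, but what they contain for your particular checkpoint is exactly what is in question here:</p>
<pre>tokenizer = processor.tokenizer  # the text side of the LLaVA processor

# the "official" special tokens (bos, eos, unk, pad, ...)
print(tokenizer.all_special_tokens)

# every token registered via added_tokens_decoder in tokenizer_config.json,
# including ones flagged special=True that may not show up above
for token_id, added_token in tokenizer.added_tokens_decoder.items():
    print(token_id, repr(added_token.content), added_token.special)

# the Jinja chat template that apply_chat_template uses
print(tokenizer.chat_template)
</pre>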
<p>You can find the chat template definition in tokenizer_config.json -> "chat_template". Also in this file, the "added_tokens_decoder" attribute defines <image> as a special token.</p>
<h2 id="the-complete-code">The Complete Code</h2>
<p>I referenced the Hugging Face conversation pipeline for <a href="https://huggingface.co/docs/transformers/main/conversations#what-happens-inside-the-pipeline">the general structure</a> and <a href="https://github.com/huggingface/transformers/blob/1c1aec2ef1d6822fae3ffbb973b4c941f65f4ddf/src/transformers/pipelines/text_generation.py#L369-L387">the response extractor</a>.</p>
</pre></td><td class="code"><pre>queries = [
"<image>\nHow many animated characters are there in this image?",
"Answer with a single number in decimal format. Give no explanations."
]
def generate_response(image):
chat = []
for query in queries:
chat.append({"role": "user", "content": query})
prompt = text_processor.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, image, return_tensors="pt").to(device)
with torch.no_grad():
output = model.generate(**inputs, max_new_tokens = 300)
output = processor.decode(output[0], skip_special_tokens=True)
input_ids = inputs["input_ids"]
cutoff = len(text_processor.decode(
input_ids[0],
skip_special_tokens=True,
clean_up_tokenization_spaces=True,
))
answer = output[cutoff:]
chat.append({"role": "assistant", "content": answer})
return answer
</pre></td></tr></table>
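<p>For completeness, here is how I would wire up the pieces the snippet assumes (model, processor, text_processor, device) before calling generate_response. The post doesn't show this part, so the checkpoint name llava-hf/llava-1.5-7b-hf, the dtype, and the device choice are my assumptions:</p>
<pre>import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
text_processor = processor.tokenizer   # used for the chat template and the cutoff
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to(device)

image = Image.open("characters.png")   # any test image
print(generate_response(image))
</pre>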