So Google, fxxk you.
<h2 id="prerequsities">Prerequsities</h2><p>This picture very well explains how TFLite works and also whyTensorFlow 2 has both a tf
and a keras
.</p>data_dir
andoutput_dir
to export TFLite? How do I even read in thepre-trained weights?</li><li>the code itself was pretty messy: why did they have export functionand training function all at this same file run_squad.py
and the only way to tell the program whether to train/export is checkingwhether export_dir is None
rather than passing a flag?</li></ul><p>In figuring out what each part does in this code, I looked upTensorFlow 1’s doc and good lord they were broken. Google doesn’t evenhost it anywhere: you have to go to <ahref=”https://github.com/tensorflow/docs/tree/master/site/en/r1”>aGitHub repo</a> to read them in .md
format. At this momentI decided I will not touch anything written by TensorFlow 1’s API. (Iactually went through this pain back at my first ML intern in Haier, butnot again)</p></li><li><p>Sidenote before this: I didn’t know you can release model’s onKaggle (thought everyone releases on Hugging Face) and Google <ahref=”https://www.kaggle.com/discussions/product-feedback/448425”>movedtheir own TensorFlow Hub to Kaggle</a></p><p>So my supervisor found me <ahref=”https://www.kaggle.com/models/google/mobilebert/tensorFlow1”>amore readable Google release on Kaggle</a> with some high-level API anddoesn’t require you to read the painful source code. The above link hasa redirectto TensorFlow 2 implementation with an official TFLite release. Howneat.</p><p>However, the official TFLite release</p><ol type="1"><li>doesn’t have <ahref=”https://www.tensorflow.org/lite/guide/signatures”>signature</a> -TensorFlow’s specification of input and output (remember when you passinputs to a model you need to give name to theme.g. token_ids = ..., mask = ...
) which is required forXiaomi Service Framework to run a TFLite. P.S. Yes signature is notrequired to specify when exporting, but for god’s sake all your tutorialteaches people to use it and your own released ditched it? WTFGoogle.</li><li>is broken (as expected?). <ahref=”forgot%20where%20the%20guide%20was”>When I tried to run it on myPC</a>, I got the following errorindices_has_only_positive_elements was not true.gather index out of boundsNode number 2 (GATHER) failed to invoke.gather index out of boundsNode number 2 (GATHER) failed to invoke
.Someone encountered <ahref=”https://github.com/tensorflow/tensorflow/issues/59730”>a similarbug</a> while running the example code provided by TensorFlow and theGoogle SWE found a bug in their example. At this moment I decided not totrust this TFLite file anymore and just convert it on my own.</li></ol></li><li><p>So let’s use this official TensorFlow 2 implementation and <ahref=”forgot%20where%20the%20guide%20was”>convert it to TFLite</a>. Itwas all good and running on my PC, but</p><ol type="1"><li>Its output format was really weird<ul><li>It output consists of'mobile_bert_encoder', 'mobile_bert_encoder_1', 'mobile_bert_encoder_2', ..., 'mobile_bert_encoder_51'
</li><li>Each of these has shape (1, 4, 128, 128)
for aseq_length = 128, hidden_dim = 512
model. I figured 4 beingthe number of heads and the other 128 is hidden_dim
foreach head.</li><li>They output attention scores, not the final encoded vector: my inputwas 5 tokens and they output isoutput[0, 0, 0, :] = array([0.198, 0.138, 0.244, 0.148, 0.270, 0. , 0. , ...
.They sum to 1 and any other positions at output
are 0 , soattention score was my best guess.</li></ul></li><li>It doesn’t run on Android phone:tflite engine load failed due to java.lang.IllegalArgumentException: Internal error: Cannot create interpreter: Op builtin_code out of range: 153. Are you using old TFLite binary with newer model?
A <ahref=”https://stackoverflow.com/questions/67883156/tflite-runtime-op-builtin-code-out-of-range-131-are-you-using-old-tflite-bi”>StackOverflow answer</a> suggests the TensorFlow used to export TFLiterunning on my PC doesn’t match the version of TFLite run time on thisAndroid phone. It can also be caused by me messing up with the wholeenvironment while installing <ahref=”https://huggingface.co/docs/optimum/main/en/exporters/tflite/usage_guides/export_a_model”>Optimum</a>to export TFLite last night, but I didn’t bother to look because Ifinally found the solution</li></ol></li><li><p>And comes the savior, the king, the go-to solution in MLOps -Huggingface. Reminded by a discussion I read by chance, I came torealize TFMobileBertModel.from_pretrained
actually returnsthe Keras model (and the without TF
version returns aPyTorch model). That means I can just use Hugging Face API to read itin, then use the native TensorFlow 2 API to export to TFLite. Andeverything works like a charm now. The final output signature is justHugging Face’s familiar['last_hidden_state', 'pooler_output']
</p></li></ol><h2 id="converting-tensorflow-model-to-tflite">Converting TensorFlowModel to TFLite</h2><p>Conversion is pretty straight forward. You can just follow thisofficial guide: <ahref=”https://www.tensorflow.org/lite/models/convert/convert_models”>ForMobile & Edge: Convert TensorFlow models</a>. Though I actuallyfollowed my predecessor’s note (which actually comes from <ahref=”https://www.tensorflow.org/lite/guide/signatures”>another TFtutorial</a>). He also told me to caution that callingtf.disable_eager_execution()
can lead to absence ofsignature, so do not call tf.disable_eager_execution()
todisable eager mode.</p>
<pre><code>import tensorflow as tf
from transformers import MobileBertTokenizerFast, TFMobileBertModel

# Convert Model
if be_sane:
    # Hugging Face gives us a Keras model; from_pt=True converts PyTorch weights on the fly
    bert_model = TFMobileBertModel.from_pretrained(kerasH5_model_path) if keras_file else \
        TFMobileBertModel.from_pretrained(pytorch_model_path, from_pt=True)
    converter = tf.lite.TFLiteConverter.from_keras_model(bert_model)
else:  # be crazy or already know the messy TensorFlow SavedModel format
    converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
tflite_model = converter.convert()

# Save Model
tflite_output_path = '/model.tflite'
with open(tflite_output_path, 'wb') as f:
    f.write(tflite_model)

# Check Signature
# An empty signature means the export went wrong and the file cannot be used by Xiaomi Service Framework
interpreter = tf.lite.Interpreter(model_path=tflite_output_path)  # load from the saved file
interpreter = tf.lite.Interpreter(model_content=tflite_model)     # or from the in-memory bytes
interpreter.allocate_tensors()
signatures = interpreter.get_signature_list()
print("tflite model signatures:", signatures)
</code></pre>
<blockquote>
<pre><code>{'serving_default': {'inputs': ['attention_mask',
                                'input_ids',
                                'token_type_ids'],
                     'outputs': ['last_hidden_state', 'pooler_output']}}
</code></pre>
</blockquote>
<p>With the signature in place, you can tokenize a sentence and run the model through the signature runner:</p>
<pre><code>tokenizer = MobileBertTokenizerFast(f"{model_path}/vocab.txt")
t_output = tokenizer("越过长城,走向世界", return_tensors="tf")
ii, tt, am = t_output['input_ids'], t_output['token_type_ids'], t_output['attention_mask']
# get_signature_runner() with no argument gives the "serving_default" runner
# the runner's input parameters are specified by serving_default['inputs']
runner = interpreter.get_signature_runner()
output = runner(input_ids=ii, token_type_ids=tt, attention_mask=am)
assert list(output.keys()) == ['last_hidden_state', 'pooler_output']
</code></pre>
<p>If a TFLite file has no signature, you have to fall back to the lower-level interpreter API built around <code>input_details</code> and <code>output_details</code>. They specify the following properties, where <code>index</code> is (probably) the index of this tensor in the compute graph. To pass input values and get output values, you need to access them by this index.</p>
<pre><code># Without a signature, tensors are addressed by index instead of by name
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)
</code></pre>
<p>For example, here are the <code>input_details</code> of the non-signature Google-packed MobileBERT.</p>
<pre><code>interpreter.get_input_details()
[{'name': 'model_attention_mask:0',
  'index': 0,
  'shape': array([  1, 512], dtype=int32),
  'shape_signature': array([  1, 512], dtype=int32),
  'dtype': numpy.int64,
  'quantization': (0.0, 0),
  'quantization_parameters': {'scales': array([], dtype=float32),
                              'zero_points': array([], dtype=int32),
                              'quantized_dimension': 0},
  'sparsity_parameters': {}},
 {…}]
</code></pre>
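<p>You can dump the output side the same way to see what the model actually exposes:</p>
<pre><code>for detail in interpreter.get_output_details():
    print(detail['name'], detail['shape'], detail['dtype'])
</code></pre>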
<p>None of the outputs is named <code>last_hidden_state</code> or <code>pooled_output</code>. We do not know where this discrepancy comes from.</p>
<h2 id="converting-tokenizer-to-tflite">Converting Tokenizer to TFLite</h2>
<p>We exported and ran the encoder, but that's not enough. We can't ask the user to type in <code>token_ids</code> every time. Therefore, we need to integrate the preprocessor (tokenizer) into our TFLite file. To do that, we first tried integrating <a href="https://www.kaggle.com/models/tensorflow/bert/TensorFlow2/multi-cased-preprocess/3">Google's official Keras tokenizer implementation</a> into our BERT model and converting them together into one TFLite (yeah, I didn't learn the lesson). This failed in the converting step for reasons that would become clear later. So we switched gears, followed some other guide, and first tried to convert a standalone tokenizer to TFLite.</p>
<p>The tokenizer is part of the TensorFlow Text library. I followed the <a href="https://www.tensorflow.org/text/guide/text_tf_lite">official guide, Converting TensorFlow Text operators to TensorFlow Lite</a>, with <code>text.FastBertTokenizer</code>. When you follow it, do it carefully and closely. I encountered a few problems along the way:</p>
<ol type="1">
<li><p>When you change the <code>text.WhitespaceTokenizer</code> in the guide to our <code>text.FastBertTokenizer</code>, remember to specify <code>text.FastBertTokenizer(vocab=vocab_lst)</code>. We need not the path to the vocab file but the actual list, e.g. <code>[ "[PAD]", "[unused0]", "[unused1]", ...]</code> describes a vocab where <code>[PAD]</code> maps to token id 0, <code>[unused0]</code> to token id 1, and so on.</p></li>
<li><p><code>text.FastBertTokenizer</code> (or the standard version) does not add the <code>[CLS]</code> token for you. Google says this is to make sure "you are able to manipulate the tokens and determine how to construct your segments separately" (<a href="https://github.com/tensorflow/text/issues/146">GitHub issue</a>). How considerate you are, dear Google. I spent one and a half days figuring out how to add these tokens when the model's input length needs to be fixed; otherwise TensorFlow's compute graph throws a "can't get variable-length input" error. I finally found a solution in <a href="https://github.com/google-ai-edge/mediapipe/blob/a91256a42bbe49f8ebdb9e2ec7643c5c69dbec6f/mediapipe/model_maker/python/text/text_classifier/bert_tokenizer.py#L58-L71">Google's mediapipe implementation</a>. A sketch combining this and the previous point follows this list.</p></li>
<li><p><code>Could not translate MLIR to FlatBuffer</code> when running <code>tflite_model = converter.convert()</code>: as mentioned, you must follow the guide very carefully. The guide pins a specific TensorFlow Text version; with any other version, the conversion fails:</p>
</pre></td><td class="code"><pre>pip install -U "tensorflow-text==2.11.*"
<li><p><code>Encountered unresolved custom op: FastBertNormalize</code> when running the converted interpreter / signature runner: as stated in the <a href="https://www.tensorflow.org/text/guide/text_tf_lite#inference">Inference section of the guide</a>, the tokenizer ops are custom operations and need to be registered when running inference. (I can't find docs for <code>InterpreterWithCustomOps</code> anywhere, but it does take a <code>model_path</code> argument.)</p>
<pre><code>import tensorflow_text as tf_text
from tensorflow.lite.python import interpreter

interp = interpreter.InterpreterWithCustomOps(
    model_content=tflite_model,  # or model_path=TFLITE_FILE_PATH
    custom_op_registerers=tf_text.tflite_registrar.SELECT_TFTEXT_OPS)
</code></pre></li>
</ol>
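<p>To make points 1 and 2 concrete, here is a minimal sketch of a TFLite-exportable tokenizer, assuming a fixed input length of 128 and a <code>vocab.txt</code> under <code>model_path</code>. The names are illustrative, and the truncation/padding logic only follows the spirit of the mediapipe code linked above, not its exact implementation.</p>
<pre><code>import tensorflow as tf
import tensorflow_text as tf_text

SEQ_LEN = 128  # assumed fixed input length of the downstream model

class TokenizerModel(tf.keras.Model):
    """FastBertTokenizer + manual [CLS]/[SEP] + padding to a fixed length."""

    def __init__(self, vocab_lst, **kwargs):
        super().__init__(**kwargs)
        self.cls_id = tf.constant(vocab_lst.index("[CLS]"), dtype=tf.int64)
        self.sep_id = tf.constant(vocab_lst.index("[SEP]"), dtype=tf.int64)
        self.pad_id = tf.constant(vocab_lst.index("[PAD]"), dtype=tf.int64)
        self.tokenizer = tf_text.FastBertTokenizer(vocab=vocab_lst)

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.string, name="text")])
    def call(self, text):
        # one sentence at a time: flat_values is that sentence's wordpiece ids
        ids = self.tokenizer.tokenize(text).flat_values
        ids = ids[:SEQ_LEN - 2]                                   # leave room for [CLS] and [SEP]
        ids = tf.concat([[self.cls_id], ids, [self.sep_id]], axis=0)
        pad = tf.fill([SEQ_LEN - tf.shape(ids)[0]], self.pad_id)  # pad up to the fixed length
        input_ids = tf.concat([ids, pad], axis=0)
        mask = tf.concat([tf.ones_like(ids), tf.zeros_like(pad)], axis=0)
        return {
            "input_ids": tf.reshape(input_ids, [1, SEQ_LEN]),
            "attention_mask": tf.reshape(mask, [1, SEQ_LEN]),
            "token_type_ids": tf.zeros([1, SEQ_LEN], dtype=tf.int64),
        }

vocab_lst = open(f"{model_path}/vocab.txt", encoding="utf-8").read().splitlines()
tok_model = TokenizerModel(vocab_lst)
converter = tf.lite.TFLiteConverter.from_keras_model(tok_model)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
converter.allow_custom_ops = True   # the FastBertTokenizer ops are custom ops
tflite_tokenizer = converter.convert()
</code></pre>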
<p>While validating the whole pipeline, we also noticed that our Java tokenizer and Hugging Face's Python <code>BertTokenizer</code> disagree on Chinese text:</p>
<pre><code># Our Java tokenizer gives the following tokens, which detokenize to the following string
tokenizer.decode([101, 6632, 19871, 20327, 14871, 8024, 6624, 14460, 13743, 17575, 102])
'[CLS] 越过长城 , 走向世界 [SEP]'
# On the other hand, the official Hugging Face Python BertTokenizer gives
tokenizer.decode([101, 6632, 6814, 7270, 1814, 8024, 6624, 1403, 686, 4518, 102])
'[CLS] 越 过 长 城 , 走 向 世 界 [SEP]'
# Inspecting the first difference, our Java tokenizer seems to have used sentencepiece
tokenizer.decode([19871])
'##过'
</code></pre>
<p>The cause is in <code>BasicTokenizer</code>: the Python reference implementation splits every Chinese character into its own token before wordpiece:</p>
<pre><code>def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text)
    text = self._clean_text(text)
    # Chinese Logic
    text = self._tokenize_chinese_chars(text)
    orig_tokens = whitespace_tokenize(text)
</code></pre>
<p>while our Java <code>BasicTokenizer</code> goes straight from cleaning to whitespace tokenization; the Chinese step is missing where the comment marks it:</p>
<pre><code>public final class BasicTokenizer {
    public List<String> tokenize(String text) {
        String cleanedText = cleanText(text);
        // Insert Here
        List<String> origTokens = whitespaceTokenize(cleanedText);
</code></pre>
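<p>For reference, this is roughly what the missing step does (a Python rendition of BERT's <code>_tokenize_chinese_chars</code>); the Java tokenizer needs the equivalent at the <code>// Insert Here</code> line:</p>
<pre><code>def _is_chinese_char(cp):
    # CJK codepoint ranges used by BERT's BasicTokenizer
    return ((0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF) or
            (0x20000 <= cp <= 0x2A6DF) or (0x2A700 <= cp <= 0x2B73F) or
            (0x2B740 <= cp <= 0x2B81F) or (0x2B820 <= cp <= 0x2CEAF) or
            (0xF900 <= cp <= 0xFAFF) or (0x2F800 <= cp <= 0x2FA1F))

def tokenize_chinese_chars(text):
    # Surround every CJK character with spaces so whitespace tokenization splits it out,
    # which is exactly the behavior the Java tokenizer above is missing.
    out = []
    for ch in text:
        if _is_chinese_char(ord(ch)):
            out.extend([" ", ch, " "])
        else:
            out.append(ch)
    return "".join(out)
</code></pre>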
<p>For the classification task, we fine-tuned <code>MobileBertForSequenceClassification</code>. The default classification head has only one layer, so I changed its structure to give it more expressive power.</p>
<pre><code>from collections import OrderedDict
import torch
from torch import nn
from transformers import MobileBertForSequenceClassification

model = MobileBertForSequenceClassification.from_pretrained(
    model_path, num_labels=len(labels), problem_type="multi_label_classification",
    id2label=id2label, label2id=label2id)
# Replace the stock 1-layer head with a 2-layer MLP
model.classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(768, 1024)),
    ('relu1', nn.LeakyReLU()),
    ('fc2', nn.Linear(1024, num_labels))
]))
# Fine-tune …
torch.save(model.state_dict(), model_path)
</code></pre>
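<p>The fine-tuning step itself is elided above. For completeness, here is a minimal sketch of how it can be done with the Hugging Face <code>Trainer</code>, using the hyper-parameters reported later in this post; <code>train_ds</code>/<code>eval_ds</code>, the batch size, and the output directory are assumptions, not necessarily what we used:</p>
<pre><code>from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./ckpts",                # assumption: any checkpoint directory works
    num_train_epochs=5,
    learning_rate=2e-5,                  # the setting that ended up working best for us
    weight_decay=2e-4,
    per_device_train_batch_size=32,      # assumption
    evaluation_strategy="epoch",
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds,   # tokenized dataset with multi-hot "labels"
                  eval_dataset=eval_ds)
trainer.train()
</code></pre>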
<p>The problem shows up when you load the checkpoint back: the stock <code>MobileBertForSequenceClassification</code> is set to have a one-layer classification head, so it cannot read in your self-defined classifier's weights.</p>
<pre><code>torch_model = CustomMobileBertForSequenceClassification.from_pretrained(
    model_path, problem_type="multi_label_classification",
    num_labels=len(labels), id2label=id2label, label2id=label2id)
> Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at ./ckpts/ and are newly initialized: ['classifier.bias', 'classifier.weight']
</code></pre>
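<p>A quick way to see the mismatch (a sketch, assuming <code>model_path</code> points at the state dict saved above): compare the classifier keys in the checkpoint with those of the stock class.</p>
<pre><code>import torch
from transformers import MobileBertConfig, MobileBertForSequenceClassification

ckpt = torch.load(model_path, map_location="cpu")
print(sorted(k for k in ckpt if k.startswith("classifier")))
# the custom head: ['classifier.fc1.bias', 'classifier.fc1.weight', 'classifier.fc2.bias', 'classifier.fc2.weight']

stock = MobileBertForSequenceClassification(MobileBertConfig())
print(sorted(k for k in stock.state_dict() if k.startswith("classifier")))
# the stock head is a single Linear layer: ['classifier.bias', 'classifier.weight']
</code></pre>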
<p>So I went through the source of <code>MobileBertForSequenceClassification</code> to understand what exactly needs to be changed. It turns out all we have to do is extend the original class and change its <code>__init__</code>, so that it has a 2-layer classification head.</p>
<pre><code>from collections import OrderedDict
from torch import nn
from tensorflow import keras
from transformers import MobileBertForSequenceClassification, TFMobileBertForSequenceClassification

class CustomMobileBertForSequenceClassification(MobileBertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        self.classifier = nn.Sequential(OrderedDict([
            ('fc1', nn.Linear(768, 1024)),
            ('relu1', nn.LeakyReLU()),
            ('fc2', nn.Linear(1024, 28))
        ]))
        self.post_init()

class TFCustomMobileBertForSequenceClassification(TFMobileBertForSequenceClassification):
    def __init__(self, config, *inputs, **kwargs):
        super().__init__(config, *inputs, **kwargs)
        self.classifier = keras.Sequential([
            keras.layers.Dense(1024, input_dim=768, name='fc1'),
            keras.layers.LeakyReLU(alpha=0.01, name='relu1'),  # Keras defaults alpha to 0.3
            keras.layers.Dense(28, name='fc2')
        ])

torch_model = CustomMobileBertForSequenceClassification.from_pretrained(
    model_path, problem_type="multi_label_classification",
    num_labels=len(labels), id2label=id2label, label2id=label2id)
tf_model = TFCustomMobileBertForSequenceClassification.from_pretrained(
    …, from_pt=True)
</code></pre>
<p>However, you may find that these two models output different values on the same input. A closer look at the weights reveals that Hugging Face didn't convert the classifier's weights from our Torch model to the TensorFlow model correctly. We have to set them manually instead.</p>
<pre><code># Copy the PyTorch Linear weights into the Keras Dense layers (note the transpose)
tf_model.classifier.get_layer("fc1").set_weights([torch_model.classifier.fc1.weight.transpose(1, 0).detach(), torch_model.classifier.fc1.bias.detach()])
tf_model.classifier.get_layer("fc2").set_weights([torch_model.classifier.fc2.weight.transpose(1, 0).detach(), torch_model.classifier.fc2.bias.detach()])
</code></pre>
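<p>A quick sanity check (a sketch reusing names from the snippets above): after copying the weights, the two models should agree on the same sentence up to numerical noise.</p>
<pre><code>import numpy as np
import torch

enc_pt = tokenizer("越过长城,走向世界", return_tensors="pt")
enc_tf = tokenizer("越过长城,走向世界", return_tensors="tf")
with torch.no_grad():
    pt_logits = torch_model(**enc_pt).logits.numpy()
tf_logits = tf_model(enc_tf).logits.numpy()
print(np.abs(pt_logits - tf_logits).max())  # should now be tiny instead of a visible gap
</code></pre>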
<p>Finally, quantization. We exported three variants: plain float32, dynamic-range int8, and float16.</p>
<pre><code># float32 baseline: no quantization
vanilla_converter = tf.lite.TFLiteConverter.from_keras_model(bert_model)
tflite_model = vanilla_converter.convert()

# dynamic-range quantization (int8 weights)
quant8_converter = tf.lite.TFLiteConverter.from_keras_model(bert_model)
quant8_converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant8_model = quant8_converter.convert()

# float16 quantization
quant16_converter = tf.lite.TFLiteConverter.from_keras_model(bert_model)
quant16_converter.optimizations = [tf.lite.Optimize.DEFAULT]
quant16_converter.target_spec.supported_types = [tf.float16]
tflite_quant16_model = quant16_converter.convert()
</code></pre>
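<p>One way to get a "Logit Difference"-style number like the one in the table below (a sketch; not necessarily the exact metric we used, and the output name depends on the exported model):</p>
<pre><code>import numpy as np
import tensorflow as tf

def run_tflite(model_bytes, feeds):
    # build an interpreter from in-memory bytes and call its default signature
    interpreter = tf.lite.Interpreter(model_content=model_bytes)
    runner = interpreter.get_signature_runner()
    return next(iter(runner(**feeds).values()))  # compare the first output tensor

# reuse the tokenized example from earlier; signature runners want numpy arrays
feeds = {'input_ids': ii.numpy(), 'token_type_ids': tt.numpy(), 'attention_mask': am.numpy()}
ref = run_tflite(tflite_model, feeds)
q8  = run_tflite(tflite_quant8_model, feeds)
print("relative difference:", np.abs(q8 - ref).mean() / (np.abs(ref).mean() + 1e-9))
</code></pre>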
<p>The fine-tuned classifier outputs a logit vector of shape <code>[1, class_num]</code>. This was tested on a Xiaomi 12X with a Snapdragon 870. The baseline model is my colleague's BERT-Large implementation, with accuracy 88.50% and size 1230 MB. My model's accuracy was bad at first: 75.01% with hyper-parameters <code>weight_decay = 0.01, learning_rate = 1e-4</code>, but we searched out a better setting of <code>weight_decay = 2e-4, learning_rate = 2e-5</code>, giving 86.01%. We had 28 classes and 38,000 training examples in total, and trained for 5 epochs, at which point the validation accuracy roughly flattens.</p>
<table>
<thead>
<tr><th>Quantization</th><th>Logit Difference</th><th>Accuracy (initial hyper-params)</th><th>Accuracy (after hyper-param search)</th><th>Model Size (MB)</th><th>Inference Time (ms)</th><th>Power Usage (mA)</th><th>CPU (%)</th><th>Memory (MB)</th></tr>
</thead>
<tbody>
<tr><td>float32 (no quant)</td><td>0</td><td>75.01%</td><td>86.094%</td><td>101.4</td><td>1003.3</td><td>89.98</td><td>108.02</td><td>267.11</td></tr>
<tr><td>float16</td><td>0.015%</td><td>75.01%</td><td>86.073%</td><td>51</td><td>838</td><td>64.15</td><td>108.77</td><td>377.11</td></tr>
<tr><td>int8</td><td>4.251%</td><td>63.49%</td><td>85.947%</td><td>25.9</td><td>573.8</td><td>60.09</td><td>110.83</td><td>233.19</td></tr>
</tbody>
</table>
<p>If we look at the not-fine-tuned, vanilla transformer encoder alone, the <code>last_hidden_state</code>
has a difference:</p>
<table>
<thead>
<tr><th>Quantization</th><th>Logit Difference</th><th>Model Size (MB)</th></tr>
</thead>
<tbody>
<tr><td>float32 (no quant)</td><td>0</td><td>97</td></tr>
<tr><td>float16</td><td>0.1%</td><td>48.1</td></tr>
<tr><td>int8</td><td>19.8%</td><td>24.9</td></tr>
</tbody>
</table>
<h2 id="small-language-models">Small Language Models</h2>
<p>BERT is the go-to option for classification tasks. But when it comes to small BERTs, we had several options:</p>
<ul>
<li><p>MobileBERT</p></li>
<li><p>DistilBERT</p></li>
<li><p>TinyBERT</p></li>
</ul>
<p>As this post shows, we went with MobileBERT in the end, because it's by Google Brain and Google probably knows their own thing best.</p>
<p>On the other hand, if you're looking for a small generative model, which people mostly call an SLM (Small Language Model) as opposed to an LLM, I found these options but didn't try them myself:</p>
<ul>
<li>OpenELM: Apple, 1.1B</li>
<li>Phi-2: Microsoft, 2.7B</li>
</ul>
<h2 id="post-script">Post Script</h2>
<p>If you want to build an app utilizing an edge transformer, I would recommend reading the source code of <a href="https://github.com/huggingface/tflite-android-transformers">Hugging Face's toy app</a>. It doesn't have a README or tutorial, nor have I gone through it personally, but everything from TensorFlow sucks (including MediaPipe, unfortunately).</p>
<p>When checking back on this tutorial on 2024/12/28, I found that Google has released <a href="https://github.com/google-ai-edge/ai-edge-torch">AI Edge Torch</a>, the official tool for converting PyTorch models into the .tflite format. So you may want to try this first, but again, don't trust anything from the TensorFlow team.</p>