So Google, fxxk you.

<h2 id="prerequsities">Prerequsities</h2><p>This picture very well explains how TFLite works and also whyTensorFlow 2 has both a tf and a keras.</p>

<img src="https://web.archive.org/web/20220216170621if_/https://www.tensorflow.org/lite/images/convert/workflow.svg" alt="TFLite Workflow" /><figcaption aria-hidden="true">TFLite Workflow</figcaption>
<h2 id="detours">Detours</h2><p>This section is mostly rant, but it is meaningful in preventing youfrom taking any of the wrong path. Skip to the next section for atutorial on what to do.</p><ol type="1"><li><p>We first found the Google’s official release <ahref=”https://github.com/google-research/google-research/tree/master/mobilebert”>http://google-research/mobilebert/</a>,but</p><ul><li>the tutorial was unclear: Why do I need data_dir andoutput_dir to export TFLite? How do I even read in thepre-trained weights?</li><li>the code itself was pretty messy: why did they have export functionand training function all at this same file run_squad.pyand the only way to tell the program whether to train/export is checkingwhether export_dir is None rather than passing a flag?</li></ul><p>In figuring out what each part does in this code, I looked upTensorFlow 1’s doc and good lord they were broken. Google doesn’t evenhost it anywhere: you have to go to <ahref=”https://github.com/tensorflow/docs/tree/master/site/en/r1”>aGitHub repo</a> to read them in .md format. At this momentI decided I will not touch anything written by TensorFlow 1’s API. (Iactually went through this pain back at my first ML intern in Haier, butnot again)</p></li><li><p>Sidenote before this: I didn’t know you can release model’s onKaggle (thought everyone releases on Hugging Face) and Google <ahref=”https://www.kaggle.com/discussions/product-feedback/448425”>movedtheir own TensorFlow Hub to Kaggle</a></p><p>So my supervisor found me <ahref=”https://www.kaggle.com/models/google/mobilebert/tensorFlow1”>amore readable Google release on Kaggle</a> with some high-level API anddoesn’t require you to read the painful source code. The above link hasa redirectto TensorFlow 2 implementation with an official TFLite release. Howneat.</p><p>However, the official TFLite release</p><ol type="1"><li>doesn’t have <ahref=”https://www.tensorflow.org/lite/guide/signatures”>signature</a> -TensorFlow’s specification of input and output (remember when you passinputs to a model you need to give name to theme.g. token_ids = ..., mask = ...) which is required forXiaomi Service Framework to run a TFLite. P.S. Yes signature is notrequired to specify when exporting, but for god’s sake all your tutorialteaches people to use it and your own released ditched it? WTFGoogle.</li><li>is broken (as expected?). <ahref=”forgot%20where%20the%20guide%20was”>When I tried to run it on myPC</a>, I got the following errorindices_has_only_positive_elements was not true.gather index out of boundsNode number 2 (GATHER) failed to invoke.gather index out of boundsNode number 2 (GATHER) failed to invoke.Someone encountered <ahref=”https://github.com/tensorflow/tensorflow/issues/59730”>a similarbug</a> while running the example code provided by TensorFlow and theGoogle SWE found a bug in their example. At this moment I decided not totrust this TFLite file anymore and just convert it on my own.</li></ol></li><li><p>So let’s use this official TensorFlow 2 implementation and <ahref=”forgot%20where%20the%20guide%20was”>convert it to TFLite</a>. Itwas all good and running on my PC, but</p><ol type="1"><li>Its output format was really weird<ul><li>It output consists of'mobile_bert_encoder', 'mobile_bert_encoder_1', 'mobile_bert_encoder_2', ..., 'mobile_bert_encoder_51'</li><li>Each of these has shape (1, 4, 128, 128) for aseq_length = 128, hidden_dim = 512 model. 
I figured 4 beingthe number of heads and the other 128 is hidden_dim foreach head.</li><li>They output attention scores, not the final encoded vector: my inputwas 5 tokens and they output isoutput[0, 0, 0, :] = array([0.198, 0.138, 0.244, 0.148, 0.270, 0. , 0. , ....They sum to 1 and any other positions at output are 0 , soattention score was my best guess.</li></ul></li><li>It doesn’t run on Android phone:tflite engine load failed due to java.lang.IllegalArgumentException: Internal error: Cannot create interpreter: Op builtin_code out of range: 153. Are you using old TFLite binary with newer model?A <ahref=”https://stackoverflow.com/questions/67883156/tflite-runtime-op-builtin-code-out-of-range-131-are-you-using-old-tflite-bi”>StackOverflow answer</a> suggests the TensorFlow used to export TFLiterunning on my PC doesn’t match the version of TFLite run time on thisAndroid phone. It can also be caused by me messing up with the wholeenvironment while installing <ahref=”https://huggingface.co/docs/optimum/main/en/exporters/tflite/usage_guides/export_a_model”>Optimum</a>to export TFLite last night, but I didn’t bother to look because Ifinally found the solution</li></ol></li><li><p>And comes the savior, the king, the go-to solution in MLOps -Huggingface. Reminded by a discussion I read by chance, I came torealize TFMobileBertModel.from_pretrained actually returnsthe Keras model (and the without TF version returns aPyTorch model). That means I can just use Hugging Face API to read itin, then use the native TensorFlow 2 API to export to TFLite. Andeverything works like a charm now. The final output signature is justHugging Face’s familiar['last_hidden_state', 'pooler_output']</p></li></ol><h2 id="converting-tensorflow-model-to-tflite">Converting TensorFlowModel to TFLite</h2><p>Conversion is pretty straight forward. You can just follow thisofficial guide: <ahref=”https://www.tensorflow.org/lite/models/convert/convert_models”>ForMobile & Edge: Convert TensorFlow models</a>. Though I actuallyfollowed my predecessor’s note (which actually comes from <ahref=”https://www.tensorflow.org/lite/guide/signatures”>another TFtutorial</a>). He also told me to caution that callingtf.disable_eager_execution() can lead to absence ofsignature, so do not call tf.disable_eager_execution() todisable eager mode.</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
</pre></td><td class="code"><pre>from transformers import MobileBertTokenizerFast, TFMobileBertModel

# Convert Model
if be_sane:
bert_model = TFMobileBertModel.from_pretrained(kerasH5_model_path) if keras_file else </span>
TFMobileBertModel.from_pretrained(pytorch_model_path, from_pt = True)
converter = tf.lite.TFLiteConverter.from_keras_model(bert_model)
else: # be crazy or already knows the messy TensorFlow.SavedModel format
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
tflite_model = converter.convert()

# Save Model
tflite_output_path = '/model.tflite'
with open(tflite_output_path, 'wb') as f:
f.write(tflite_model)

# Check Signature
# Empty signature means error in the export process and the file cannot be used by Xiaomi Service Framework
interpreter = tf.lite.Interpreter(model_path=tflite_output_path)
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
signatures = interpreter.get_signature_list()
print("tflite model signatures:", signatures)
</pre></td></tr></table></figure><blockquote>
<table><tr><td class="gutter"><pre>1
2
3
4
</pre></td><td class="code"><pre>{'serving_default': {'inputs': ['attention_mask',
'input_ids',
'token_type_ids'],
'outputs': ['last_hidden_state', 'pooler_output']}}
</pre></td></tr></table>
</blockquote><p>In addition, summarizing from the detours I took:</p><ul><li>Do not use Hugging Face’s Optimum for (at least vanilla) conversion, because it just calls the above conversion code (see <a href="https://github.com/huggingface/optimum/blob/e0f58121140ce4baa01919ad70a6c13e936f7605/optimum/exporters/tflite/convert.py#L363-L371">code</a>).</li><li>Do not even bother to look at <a href="https://github.com/google-research/google-research/tree/master/mobilebert#export-mobilebert-to-tf-lite-format">Google’s original code</a> converting MobileBERT to TFLite, because nobody knows what they’re writing.</li></ul><h2 id="running-tflite-on-pc">Running TFLite (on PC)</h2><p>Running TFLite on an Android phone is the other department’s task. I just want to run the TFLite file on a PC to test that everything’s good. To do that, I strictly followed TensorFlow’s official guide: <a href="https://www.tensorflow.org/lite/guide/inference#load_and_run_a_model_in_python">TensorFlow Lite inference: Load and run a model in Python</a>. Our converted models have signatures, so you can just follow the “with a defined SignatureDef” part of the guide.</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>tokenizer = MobileBertTokenizerFast(f"{model_path}/vocab.txt")
t_output = tokenizer("越过长城,走向世界", return_tensors="tf")
ii, tt, am = t_output['input_ids'], t_output['token_type_ids'], t_output['attention_mask']
# get_signature_runner() with no argument gives the "serving_default" runner
# the runner's keyword arguments are specified by serving_default['inputs']
runner = interpreter.get_signature_runner()
output = runner(input_ids=ii, token_type_ids=tt, attention_mask=am)
assert sorted(output.keys()) == ['last_hidden_state', 'pooler_output']
</pre></td></tr></table>
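<p>As a quick sanity check of the conversion itself, you can compare the TFLite output with the original Keras model on the same input. The sketch below assumes the bert_model loaded during conversion is still in scope; the mean-relative-difference metric here is just one reasonable choice of mine, not necessarily the exact metric behind the numbers reported later in the Numerical Accuracy section.</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>import numpy as np

# Reference output from the original (pre-conversion) Keras model
ref = bert_model(input_ids=ii, token_type_ids=tt, attention_mask=am)
ref_last = ref.last_hidden_state.numpy()

rel_diff = np.abs(output['last_hidden_state'] - ref_last).mean() / np.abs(ref_last).mean()
print(f"mean relative difference vs. Keras model: {rel_diff:.3%}")
</pre></td></tr></table>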
<p>On the other hand, for a model without signatures, you need to use the more primitive API of input_details and output_details. They specify the following properties, where index is (probably) the index of this tensor in the compute graph. To pass input values and get output values, you access them by this index.</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)
</pre></td></tr></table>
<p>The following is the input_details of the non-signature, Google-packed MobileBERT.</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="code"><pre>interpreter.get_input_details()
[{'name': 'model_attention_mask:0',
'index': 0,
'shape': array([ 1, 512], dtype=int32),
'shape_signature': array([ 1, 512], dtype=int32),
'dtype': numpy.int64,
'quantization': (0.0, 0),
'quantization_parameters': {'scales': array([], dtype=float32),
'zero_points': array([], dtype=int32),
'quantized_dimension': 0},
'sparsity_parameters': {}},
{…}]
</pre></td></tr></table>
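<p>Compared with the signature path, here you also have to match the fixed input shape and dtype yourself. A rough sketch (the padding helper and the tensor-to-input mapping are mine; only the first entry is shown above, so check each detail's 'name' field to see which slot expects input_ids / token_type_ids):</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre>import numpy as np

SEQ_LEN = 512  # fixed length from the shape_signature above

def pad_to_fixed(t, length=SEQ_LEN):
    """Right-pad a (1, n) token tensor with zeros to the fixed length, as int64."""
    arr = np.zeros((1, length), dtype=np.int64)
    arr[0, :t.shape[-1]] = t.numpy()[0]
    return arr

# input_details[0] is 'model_attention_mask:0', so it gets the attention mask;
# look up the remaining entries' 'name' fields for the other inputs
interpreter.set_tensor(input_details[0]['index'], pad_to_fixed(am))
</pre></td></tr></table>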
<h2 id="numerical-accuracy">Numerical Accuracy</h2><p>Our original torch/TensorFlow encoder and the converted TFLiteencoder, when both running on PC using Python, has a 1.2% difference intheir output (last_hidden_state orpooled_output). We do not know where thisdiscrepancy comes from.</p><h2 id="converting-tokenizer-to-tflite">Converting Tokenizer toTFLite</h2><p>We exported and ran the encoder, but that’s not enough. Wecan’t ask the user to type in token_ids every time.Therefore, we need to integrate the preprocessor (tokenizer) into ourTFLite file. To do that, we first tried integrating <ahref=”https://www.kaggle.com/models/tensorflow/bert/TensorFlow2/multi-cased-preprocess/3”>Google’sofficial Keras tokenizer implementation</a> into our BERT model andconvert them together into a TFLite (yeah I didn’t learn the lesson).This failed in the converting step for reasons that would become clearlater. And we switched gears to follow some other guide and first try toconvert a standalone tokenizer to TFLite.</p><p>Tokenizer is a part of the TensorFlow Text library. I followed the <ahref=”https://www.tensorflow.org/text/guide/text_tf_lite”>officialguide: Converting TensorFlow Text operators to TensorFlowLite</a> with text.FastBertTokenizer. Note whenyou follow it, do it carefully and closely. I encountered a few problemsalong the way:</p><ol type="1"><li><p>When you change the text.WhitespaceTokenizer inguide to our text.FastBertTokenizer, remember to specify atext.FastBertTokenizer(vocab=vocab_lst). We need not thepath to the vocab but the actual liste.g. [ "[PAD]", "[unused0]", "[unused1]", ...] describesthe vocab where [PAD] maps to token id 0,[unused0] to token id 1, and so on.</p></li><li><p>text.FastBertTokenizer (or the standard version)does not add [CLS] token for you. Google says this is tomake sure “you are able to manipulate the tokens and determine how toconstruct your segments separately” (<ahref=”https://github.com/tensorflow/text/issues/146”>GitHub issue</a>).How considerate you are, dear Google. I spent one and a half dayfiguring out how to add these tokens when the model’s input length needsto be fixed, otherwise it triggers TensorFlow’s compute graph to throw“can’t get variable-length input” error. I finally found a solution in<ahref=”https://github.com/google-ai-edge/mediapipe/blob/a91256a42bbe49f8ebdb9e2ec7643c5c69dbec6f/mediapipe/model_maker/python/text/text_classifier/bert_tokenizer.py#L58-L71”>Google’smediapipe’s implementation</a>.</p></li><li><p>Could not translate MLIR to FlatBuffer when runningtflite_model = converter.convert(): as mentioned, you mustfollow the guide very carefully. The guide specifies a TensorFlow Textversion. If not this version, the conversion would fail</p>
<table><tr><td class="gutter"><pre>1
</pre></td><td class="code"><pre>pip install -U "tensorflow-text==2.11.*"
</pre></td></tr></table>
</li><li><p>Encountered unresolved custom op: FastBertNormalize when running the converted interpreter / signature runner: as stated in the <a href="https://www.tensorflow.org/text/guide/text_tf_lite#inference">Inference section of the guide</a>, tokenizers are custom operations and need to be registered when running inference. (I can’t find docs for InterpreterWithCustomOps anywhere, but it does take a model_path argument.)</p>
<table><tr><td class="gutter"><pre>1
2
3
</pre></td><td class="code"><pre>interp = interpreter.InterpreterWithCustomOps(
model_content=tflite_model,# or model_path=TFLITE_FILE_PATH
custom_op_registerers=tf_text.tflite_registrar.SELECT_TFTEXT_OPS)
</pre></td></tr></table>
</li><li><p>TensorFlow Text custom ops are not found on Android: the above inference guide writes</p><blockquote><p>while the example below shows inference in Python, the steps are similar in other languages with some minor API translations</p></blockquote><p>which is a total lie. Android does not support these operations, as the <a href="https://www.tensorflow.org/lite/guide/op_select_allowlist#tensorflow_text_and_sentencepiece_operators">custom text op list</a> only mentions Python support.</p></li></ol><p>In the end, I did manage to (1) merge the above tokenizer and the Hugging Face model, and (2) export a TFLite model that reads in text and outputs the last hidden state. However, I seem to have lost that piece of code. Don’t worry though: thanks to Google’s shitty framework, it only works with very few tokenizer implementations anyway. The work-for-all solution is to build your own tokenizer in Java.</p><blockquote><p>P.S. While debugging the FlatBuffer error, I came across the <a href="https://www.tensorflow.org/lite/guide/authoring">TensorFlow authoring tool</a>, which can explicitly specify a function’s input/output format and detect ops unsupported by TFLite. However, the tool is pretty broken for me. Debugging this tool would probably take longer than finding the problem yourself online / asking on a forum.</p></blockquote><h2 id="writing-your-own-tokenizer">Writing Your Own Tokenizer</h2><p>What’s weird is that TensorFlow does have an official BERT-on-Android example. Reading it again, I found their tokenizer is actually implemented in C++ (<a href="https://www.tensorflow.org/lite/inference_with_metadata/task_library/bert_nl_classifier#key_features_of_the_bertnlclassifier_api">see this example</a>). The repo containing the tokenizer code is called <a href="https://github.com/tensorflow/tflite-support/blob/master/tensorflow_lite_support/cc/text/tokenizers/bert_tokenizer.h">tflite-support</a>. Finding <a href="https://www.tensorflow.org/lite/inference_with_metadata/lite_support#current_use-case_coverage">this library’s doc</a>, it becomes clear that the text-related operations are currently not supported.</p>
<img src="/images/tflite-support.png" alt="TFLite-Support Current use-case coverage" /><figcaption aria-hidden="true">TFLite-Support Current use-case coverage</figcaption>
<p>Google seems to have used JNI to call the C++ implementation of the tokenizer (<a href="https://github.com/tensorflow/tflite-support/blob/8ed4a7b70df385a253aad7ed7df782439f42da6c/tensorflow_lite_support/java/src/java/org/tensorflow/lite/task/text/nlclassifier/BertNLClassifier.java#L39-L53">see code</a>).</p><p>Therefore, we’d better write our own tokenizer. Luckily Hugging Face also has a BERT-on-Android example - <a href="https://github.com/huggingface/tflite-android-transformers/tree/master/bert">tflite-android-transformers</a> - and writes more accessible code. We directly copied <a href="https://github.com/huggingface/tflite-android-transformers/tree/master/bert/src/main/java/co/huggingface/android_transformers/bertqa/tokenization">their tokenizer implementation</a>.</p><p>However, when switching to a Chinese vocabulary, the tokenizer goes glitchy. See the following example where we tokenize the sentence 「越过长城，走向世界」 (“Crossing the Great Wall, marching toward the world”).</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre># Our Java tokenizer gives the following tokens, which detokenizes to the following string
tokenizer.decode([101, 6632, 19871, 20327, 14871, 8024, 6624, 14460, 13743, 17575, 102])
'[CLS] 越过长城 , 走向世界 [SEP]'

# On the other hand, official Hugging Face python BertTokenizer gives
tokenizer.decode([101, 6632, 6814, 7270, 1814, 8024, 6624, 1403, 686, 4518, 102])
'[CLS] 越 过 长 城 , 走 向 世 界 [SEP]'

# Inspecting the first difference, our Java tokenizer seems to have applied WordPiece subword tokenization
tokenizer.decode([19871])
'##过'
</pre></td></tr></table>
<p>It turns out <a href="https://github.com/google-research/bert/blob/master/multilingual.md#tokenization">BERT in its original implementation</a> (<a href="https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/tokenization.py#L207">code</a>) does not apply subword (WordPiece) tokenization across Chinese characters. Instead, it tokenizes them at the character level. Therefore, we need to first insert whitespace around every Chinese character so that subword tokenization isn’t applied. Note that the Hugging Face tokenizer follows BERT’s original Python code very closely, so you can <a href="https://github.com/huggingface/tflite-android-transformers/blob/dcd6da1bfb28e3cd6bc83b58a112cdcd3d6cc2fe/bert/src/main/java/co/huggingface/android_transformers/bertqa/tokenization/BasicTokenizer.java#L34">easily find where to insert</a> that piece of code.</p><ul><li><p>BERT original implementation in Python, with Chinese logic</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
</pre></td><td class="code"><pre>def tokenize(self, text):
"""Tokenizes a piece of text."""
text = convert_to_unicode(text)
text = self._clean_text(text)
# Chinese Logic
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
</pre></td></tr></table>
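<p>For reference, the Chinese logic itself is tiny: it just puts spaces around every CJK codepoint so that the later WordPiece step sees each character as its own token. Roughly (paraphrased, not an exact copy; see the linked tokenization.py for the exact CJK ranges):</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code"><pre>def _tokenize_chinese_chars(self, text):
    """Adds whitespace around any CJK character."""
    output = []
    for char in text:
        cp = ord(char)
        if self._is_chinese_char(cp):  # checks the CJK Unicode blocks
            output.extend([" ", char, " "])
        else:
            output.append(char)
    return "".join(output)
</pre></td></tr></table>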
</li><li><p>Hugging Face tokenizer in Java, without Chinese logic</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
</pre></td><td class="code"><pre>public final class BasicTokenizer {
  public List&lt;String&gt; tokenize(String text) {
    String cleanedText = cleanText(text);
    // Insert Here
    List&lt;String&gt; origTokens = whitespaceTokenize(cleanedText);
</pre></td></tr></table>
</li></ul><h2 id="building-a-classifier">Building a Classifier</h2><p>The final task is actually to build a classifier of 28 online storecommodity classes. As I mentioned in the Detourssection, I do not know and don’t wanna bother to know how to defineor change a signature. Therefore, I again turn to Hugging Face for itsMobileBertForSequenceClassification.</p><p>The default classification head only has 1 layer, I changed itsstructure to give it more expressive power.</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code"><pre>model = MobileBertForSequenceClassification.from_pretrained(
model_path, num_labels=len(labels), problem_type="multi_label_classification",
id2label=id2label, label2id=label2id)
model.classifier = nn.Sequential(OrderedDict([
('fc1', nn.Linear(768, 1024)),
('relu1', nn.LeakyReLU()),
('fc2', nn.Linear(1024, num_labels))
]))
# Fine-tune …
torch.save(model.state_dict(), model_path)
</pre></td></tr></table>
<p>However, this throws an error when you try to read such a fine-tuned model back in. MobileBertForSequenceClassification is set up to have a one-layer classification head, so it cannot read in your self-defined classifier’s weights.</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
</pre></td><td class="code"><pre>torch_model = CustomMobileBertForSequenceClassification.from_pretrained(
model_path, problem_type="multi_label_classification",
num_labels=len(labels), id2label=id2label, label2id=label2id)

> Some weights of MobileBertForSequenceClassification were not initialized from the model checkpoint at ./ckpts/ and are newly initialized: ['classifier.bias', 'classifier.weight']
</pre></td></tr></table>
<p>To solve this, you can</p><ol type="1"><li>Save the encoder weights and the classifier weights separately, then load them separately</li><li>Create a custom class corresponding to your weights and initialize an instance of that class instead</li></ol><p>Option 2 is clearly <a href="https://github.com/huggingface/transformers/issues/1001#issuecomment-520162877">the more sensible way</a>. You should read the very clearly written MobileBertForSequenceClassification source to understand what exactly needs to be changed. It turns out all we have to do is extend the original class and change its __init__, so it has a 2-layer classification head.</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
</pre></td><td class="code"><pre>from transformers import MobileBertForSequenceClassification, TFMobileBertForSequenceClassification

class CustomMobileBertForSequenceClassification(MobileBertForSequenceClassification):
def init(self, config):
super().init(config)
self.classifier = nn.Sequential(OrderedDict([
('fc1', nn.Linear(768, 1024)),
('relu1', nn.LeakyReLU()),
('fc2', nn.Linear(1024, 28))
]))
self.post_init()

class TFCustomMobileBertForSequenceClassification(TFMobileBertForSequenceClassification):
def init(self, config, *inputs, **kwargs):
super().init(config, inputs, **kwargs)</span>
self.classifier = keras.Sequential([
keras.layers.Dense(1024, input_dim=768, name='fc1'),
keras.layers.LeakyReLU(alpha=0.01, name = 'relu1'), # Keras defaults alpha to 0.3
keras.layers.Dense(28, name='fc2')
])

torch_model = CustomMobileBertForSequenceClassification.from_pretrained(
model_path, problem_type="multi_label_classification",
num_labels=len(labels), id2label=id2label, label2id=label2id)
tf_model = TFCustomMobileBertForSequenceClassification.from_pretrained(
…, from_pt=True)
</pre></td></tr></table></figure><p>However, you may find these two models output different values on thesame input. A closer look at weights unveil that Hugging Facedidn’t convert classifier’s weights from our Torch model to TensorFlowmodel correctly. We have to set them manually instead.</p>
<table><tr><td class="gutter"><pre>1
2
</pre></td><td class="code"><pre>tf_model.classifier.get_layer("fc1").set_weights([torch_model.classifier.fc1.weight.transpose(1, 0).detach(), torch_model.classifier.fc1.bias.detach()])
tf_model.classifier.get_layer("fc2").set_weights([torch_model.classifier.fc2.weight.transpose(1, 0).detach(), torch_model.classifier.fc2.bias.detach()])
</pre></td></tr></table>
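<p>After copying the weights over, a quick spot check that the two classifiers now agree can look like the following (a hypothetical check of mine, assuming tokenizer is the MobileBertTokenizerFast from earlier; any short input works):</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>import numpy as np
import torch

# Same input through both models; differences should now be down to float noise
enc_pt = tokenizer("越过长城,走向世界", return_tensors="pt")
enc_tf = tokenizer("越过长城,走向世界", return_tensors="tf")

with torch.no_grad():
    pt_logits = torch_model(**enc_pt).logits.numpy()
tf_logits = tf_model(enc_tf).logits.numpy()
print(np.abs(pt_logits - tf_logits).max())
</pre></td></tr></table>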
<p>And now we are finally ready to go.</p><h2 id="quantization">Quantization</h2><p>I followed this official doc: <a href="https://ai.google.dev/edge/litert/models/post_training_quantization">Post-training quantization</a>. Because of the time limit, I didn’t try Quantization-Aware Training (QAT).</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>vanilla_converter = tf.lite.TFLiteConverter.from_keras_model(bert_model)
tflite_model = vanilla_converter.convert()

quant8_converter = tf.lite.TFLiteConverter.from_keras_model(bert_model)
quant8_converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant8_model = quant8_converter.convert()

quant16_converter = tf.lite.TFLiteConverter.from_keras_model(bert_model)
quant16_converter.optimizations = [tf.lite.Optimize.DEFAULT]
quant16_converter.target_spec.supported_types = [tf.float16]
tflite_quant16_model = quant16_converter.convert()
</pre></td></tr></table>
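<p>To see what each optimization buys before touching a phone, you can dump the three variants and compare their sizes on disk (a small helper sketch of mine; the file names are illustrative):</p>
<table><tr><td class="gutter"><pre>1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>import os

# Write each variant to disk and print its size in MB
variants = {"float32": tflite_model,
            "float16": tflite_quant16_model,
            "int8_dynamic": tflite_quant8_model}
for name, blob in variants.items():
    path = f"mobilebert_{name}.tflite"
    with open(path, "wb") as f:
        f.write(blob)
    print(name, f"{os.path.getsize(path) / 2**20:.1f} MB")
</pre></td></tr></table>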
<p>Below I report several key metrics for this Chinese MobileBERT + a 2-layer classification head of [768*1024, 1024*class_num]. This was tested on a Xiaomi 12X with a Snapdragon 870. The baseline model is my colleague’s BERT-Large implementation, with 88.50% accuracy and a 1230 MB size. My model’s accuracy was bad at first: 75.01% with hyper-parameters weight_decay = 0.01, learning_rate = 1e-4, but we searched out good hyper-parameters of weight_decay = 2e-4, learning_rate = 2e-5, giving 86.01%. We had 28 classes, 38000 training examples in total, and trained for 5 epochs, at which point the validation accuracy roughly flattens.</p><table><colgroup><col style="width: 12%" /><col style="width: 11%" /><col style="width: 5%" /><col style="width: 24%" /><col style="width: 10%" /><col style="width: 12%" /><col style="width: 10%" /><col style="width: 4%" /><col style="width: 7%" /></colgroup><thead><tr class="header"><th>Quantization</th><th>Logit Difference</th><th>Accuracy</th><th>Accuracy (after hyper-param search)</th><th>Model Size (MB)</th><th>Inference Time (ms)</th><th>Power Usage (mA)</th><th>CPU (%)</th><th>Memory (MB)</th></tr></thead><tbody><tr class="odd"><td>float32 (No quant)</td><td>0</td><td>75.01%</td><td>86.094%</td><td>101.4</td><td>1003.3</td><td>89.98</td><td>108.02</td><td>267.11</td></tr><tr class="even"><td>float16</td><td>0.015%</td><td>75.01%</td><td>86.073%</td><td>51</td><td>838</td><td>64.15</td><td>108.77</td><td>377.11</td></tr><tr class="odd"><td>int8</td><td>4.251%</td><td>63.49%</td><td>85.947%</td><td>25.9</td><td>573.8</td><td>60.09</td><td>110.83</td><td>233.19</td></tr></tbody></table><p>If we look at the non-fine-tuned, vanilla transformer encoder only, the last_hidden_state differs as follows:</p><table><thead><tr class="header"><th>Quantization</th><th>Logit Difference</th><th>Model Size (MB)</th></tr></thead><tbody><tr class="odd"><td>float32 (No quant)</td><td>0</td><td>97</td></tr><tr class="even"><td>float16</td><td>0.1%</td><td>48.1</td></tr><tr class="odd"><td>int8</td><td>19.8%</td><td>24.9</td></tr></tbody></table><h2 id="small-language-models">Small Language Models</h2><p>BERT is the go-to option for classification tasks. But when it comes to a small BERT, we had several options:</p><ul><li><p>MobileBERT</p></li><li><p>DistilBERT</p></li><li><p>TinyBERT</p></li></ul><p>As this post shows, we went with MobileBERT in the end because it’s by Google Brain and Google probably knows their own stuff best.</p><p>On the other hand, if you’re looking for a small generative model, which people mostly call an SLM (Small Language Model) as opposed to an LLM, I found these options but didn’t try them myself:</p><ul><li>OpenELM: Apple, 1.1B</li><li>Phi-2: Microsoft, 2.7B</li></ul><h2 id="post-script">Post Script</h2><p>If you want to build an app utilizing an edge transformer, I would recommend reading the source code of <a href="https://github.com/huggingface/tflite-android-transformers">Hugging Face’s toy app</a>. It doesn’t have a README or tutorial, nor have I gone through it personally, but everything from TensorFlow sucks (including MediaPipe, unfortunately).</p><p>When checking back on this tutorial on 2024/12/28, I found Google has released <a href="https://github.com/google-ai-edge/ai-edge-torch">AI Edge Torch</a>, an official tool for converting PyTorch models into the .tflite format. So you may want to try that first, but again, don’t trust anything from the TensorFlow team.</p>