<p>One day before Google I/O, OpenAI held its <a href="https://openai.com/index/spring-update/">Spring Update</a> release, introducing the end-to-end multi-modal model <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>.</p>

<h2 id="capabilities-and-features">Capabilities and Features</h2><p>In their <ahref=”https://www.youtube.com/watch?v=DQacCB9tDaw”>release live</a>, wesee</p><ul><li>Real-time responsiveness in audio mode, ability to beinterruptted</li><li>Detect tone and mood in your speech, including how hard youbreath</li><li>Real-time responsiveness in vision mode: no need to take a picture,just hold your phone’s camera there and it can screenshot(?) foryou</li></ul><p>Right after the live, OpenAI <ahref=”https://openai.com/index/hello-gpt-4o/”>updated their blog</a>,showing more demos:</p><ul><li>Two GPT-4os harmonizing:on the same device same session. They can sing and harmonize. They canfollow user’s instruction to sing faster, sing slower, or sing in ahigher voice.</li><li>Lullaby: user can giveinstruction by speech to tell GPT-4o to go lighter, louder, …</li><li>Taking faster: user cangive instruction by speech to tell GPT-4o to speak faster, slower</li></ul><p>Failure cases: itsometimes</p><ul><li>go wild and speak in another language</li><li>fail in translation tasks</li><li>fail in teaching intonation in Chinese</li></ul><h2 id="technicality">Technicality</h2><ul><li>GPT-4o is a single new model end-to-end across text, vision,and audio, meaning that all inputs and outputs are processed bythe same neural network. Previously, ChatGPT Voice Mode is a pipeline ofthree separate models: audio-text transcription, GPT-3.5/GPT-4 textmodel, text-to-audio conversion model</li><li>Designed a new tokenizer with greater compression: Hindi has 2.9xfewer tokens, Russian 1.7x, Korean 1.7x, Chinese 1.4x, Japanese 1.4x,and European languages, including English, has 1.1x fewer tokens. Newtokenizer means fully new pre-trained model (brought up in <ahref=”https://www.reddit.com/r/MachineLearning/comments/1cr5lv8/comment/l3ww1y7/”>thisreddit thread</a>)</li><li>It is super fast, responding to audio inputs with an average of 320milliseconds, while original ChatGPT Voice Mode has latencies of 2.8seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. At the same time,GPT-4o “achieves GPT-4 Turbo-level performance on text, reasoning, andcoding intelligence.” What did they do to speed up inference? Is itQuantization, MoE or something else? (brought up in <ahref=”https://www.reddit.com/r/MachineLearning/comments/1cr5lv8/comment/l3wqmit/”>thisreddit thread</a>) What’s the model size? Nothing is reported.</li></ul><h2 id="inspecting-the-new-tokenizer">Inspecting the New Tokenizer</h2><p>When I used reddit on the day GPT-4o released, <ahref=”https://www.reddit.com/r/real_China_irl/comments/1crvv4m/openai%E7%AE%80%E5%8D%95%E8%BE%B1%E4%B8%AA%E5%8D%8E/”>thispost</a> came to me suggesting Chinese tokens in OpenAI’s new tokenizerare greatly contaminated.</p><p>The new tokenizer o200k_base is actually twice as largeas the last cl100k_base and has already been loaded toGitHub in <ahref=”https://github.com/openai/tiktoken/commit/9d01e5670ff50eb74cdb96406c7f3d9add0ae2f8”>thiscommit</a>.</p>