One day before Google I/O, OpenAI held its Spring Update event, introducing GPT-4o, a multi-modal end-to-end model.
Capabilities and Features
In the live event, we saw:
- Real-time responsiveness in audio mode, with the ability to be interrupted
- Detection of tone and mood in your speech, including how hard you breathe
- Real-time responsiveness in vision mode: no need to take a picture, just hold your phone's camera up and it can screenshot(?) for you
Right after the live event, OpenAI updated their blog with more demos:
- Two GPT-4os harmonizing: on the same device, in the same session, two GPT-4os can sing and harmonize. They can follow the user's instructions to sing faster, slower, or in a higher voice.
- Lullaby: the user can instruct GPT-4o by speech to go softer, louder, …
- Talking faster: the user can instruct GPT-4o by speech to speak faster or slower
Failure cases: it sometimes
- goes wild and speaks in another language
- fails at translation tasks
- fails at teaching intonation in Chinese
Technicality
- GPT-4o is a single new model trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Previously, ChatGPT Voice Mode was a pipeline of three separate models: an audio-to-text transcription model, a GPT-3.5/GPT-4 text model, and a text-to-audio conversion model
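To make the architectural difference concrete, here is a toy sketch (my own stubs, not OpenAI's code) contrasting the old three-model Voice Mode pipeline with a single end-to-end model:

```python
def transcribe(audio: str) -> str:
    """Stage 1 stub: audio-to-text transcription."""
    return f"text({audio})"

def text_model(text: str) -> str:
    """Stage 2 stub: GPT-3.5/GPT-4 text completion."""
    return f"reply({text})"

def synthesize(text: str) -> str:
    """Stage 3 stub: text-to-audio conversion."""
    return f"audio({text})"

def voice_mode_pipeline(audio: str) -> str:
    # Three sequential models: each hop adds latency, and paralinguistic
    # signals (tone, laughter, breathing) are discarded at transcription.
    return synthesize(text_model(transcribe(audio)))

def gpt4o_end_to_end(audio: str) -> str:
    # One network consumes and emits audio directly (stubbed here),
    # so nothing is lost between stages and there is only one model call.
    return f"audio_reply({audio})"

print(voice_mode_pipeline("hello"))  # audio(reply(text(hello)))
print(gpt4o_end_to_end("hello"))     # audio_reply(hello)
```

The stubs also show why the pipeline loses tone and mood: the text string passed between stages simply has nowhere to carry them.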
- Designed a new tokenizer with greater compression: Hindi uses 2.9x fewer tokens, Russian 1.7x, Korean 1.7x, Chinese 1.4x, Japanese 1.4x, and European languages, including English, use 1.1x fewer tokens. A new tokenizer means a fully new pre-trained model (brought up in this reddit thread)
- It is super fast, responding to audio inputs in an average of 320 milliseconds, while the original ChatGPT Voice Mode had average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). At the same time, GPT-4o "achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence." What did they do to speed up inference? Is it quantization, MoE, or something else? (brought up in this reddit thread) What's the model size? Nothing is reported.
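Quick arithmetic on those reported latencies puts the speedup at roughly 9x over the GPT-3.5 pipeline and 17x over the GPT-4 pipeline:

```python
# Latency figures as reported by OpenAI (milliseconds).
gpt4o_ms = 320
voice_mode_ms = {"GPT-3.5": 2800, "GPT-4": 5400}

for model, ms in voice_mode_ms.items():
    speedup = ms / gpt4o_ms
    print(f"{model} Voice Mode: {ms} ms -> {speedup:.1f}x slower than GPT-4o")
```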
Inspecting the New Tokenizer
When I was browsing reddit on the day GPT-4o was released, this post appeared in my feed, suggesting that the Chinese tokens in OpenAI's new tokenizer are heavily contaminated.
The new tokenizer o200k_base is actually twice as large as the previous cl100k_base and has already been added to GitHub in this commit.