One day before Google I/O, OpenAI held its Spring Update event, introducing GPT-4o, a multi-modal end-to-end model.
Capabilities and Features
In the live event, we saw:
- Real-time responsiveness in audio mode, with the ability to be interrupted
- Detection of tone and mood in your speech, including how hard you breathe
- Real-time responsiveness in vision mode: no need to take a picture, just hold your phone's camera up and it can screenshot(?) for you
Right after the live event, OpenAI updated their blog with more demos:
- Two GPT-4os harmonizing: on the same device, in the same session, two GPT-4os can sing and harmonize. They can follow the user's instructions to sing faster, slower, or in a higher voice.
- Lullaby: the user can instruct GPT-4o by speech to go softer, louder, …
- Talking faster: the user can instruct GPT-4o by speech to speak faster or slower
Failure cases: it sometimes
- goes wild and speaks in another language
- fails at translation tasks
- fails at teaching intonation in Chinese
Technicality
- GPT-4o is a single new model trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Previously, ChatGPT Voice Mode was a pipeline of three separate models: an audio-to-text transcription model, a GPT-3.5/GPT-4 text model, and a text-to-audio conversion model
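To make the architectural difference concrete, here is a toy sketch (my own stubs, not OpenAI's code) contrasting the old three-model Voice Mode pipeline with a single end-to-end model:

```python
def transcribe(audio: str) -> str:
    """Stage 1 stub: audio-to-text transcription."""
    return f"text({audio})"

def text_model(text: str) -> str:
    """Stage 2 stub: GPT-3.5/GPT-4 text completion."""
    return f"reply({text})"

def synthesize(text: str) -> str:
    """Stage 3 stub: text-to-audio conversion."""
    return f"audio({text})"

def voice_mode_pipeline(audio: str) -> str:
    # Three sequential models: each hop adds latency, and paralinguistic
    # signals (tone, laughter, breathing) are discarded at transcription.
    return synthesize(text_model(transcribe(audio)))

def gpt4o_end_to_end(audio: str) -> str:
    # One network consumes and emits audio directly (stubbed here),
    # so nothing is lost between stages and there is only one model call.
    return f"audio_reply({audio})"

print(voice_mode_pipeline("hello"))  # audio(reply(text(hello)))
print(gpt4o_end_to_end("hello"))     # audio_reply(hello)
```

The stubs also show why the pipeline loses tone and mood: the text string passed between stages simply has nowhere to carry them.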
- Designed a new tokenizer with greater compression: Hindi uses 2.9x fewer tokens, Russian 1.7x, Korean 1.7x, Chinese 1.4x, Japanese 1.4x, and European languages, including English, use 1.1x fewer tokens. A new tokenizer means a fully new pre-trained model (brought up in this reddit thread)
- It is super fast, responding to audio inputs in an average of 320 milliseconds, while the original ChatGPT Voice Mode had average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). At the same time, GPT-4o "achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence." What did they do to speed up inference? Is it quantization, MoE, or something else? (brought up in this reddit thread) What's the model size? Nothing is reported.
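Quick arithmetic on those reported latencies puts the speedup at roughly 9x over the GPT-3.5 pipeline and 17x over the GPT-4 pipeline:

```python
# Latency figures as reported by OpenAI (milliseconds).
gpt4o_ms = 320
voice_mode_ms = {"GPT-3.5": 2800, "GPT-4": 5400}

for model, ms in voice_mode_ms.items():
    speedup = ms / gpt4o_ms
    print(f"{model} Voice Mode: {ms} ms -> {speedup:.1f}x slower than GPT-4o")
```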
Inspecting the New Tokenizer
When I was browsing reddit on the day GPT-4o was released, this post appeared in my feed, suggesting that the Chinese tokens in OpenAI's new tokenizer are heavily contaminated.
The new tokenizer o200k_base is actually twice as large as the previous cl100k_base and has already been added to GitHub in this commit.