Yao Lirong's Blog

GPT-4o Release

2024/05/14

One day before Google I/O, OpenAI held its Spring Update event, introducing GPT-4o, a multi-modal end-to-end model.

Capabilities and Features

In the release livestream, we saw:

  • Real-time responsiveness in audio mode, with the ability to be interrupted
  • Detection of tone and mood in your speech, including how hard you breathe
  • Real-time responsiveness in vision mode: no need to take a picture, just hold your phone’s camera up and it can screenshot(?) for you

Right after the livestream, OpenAI updated their blog with more demos:

  • Two GPT-4os harmonizing: on the same device, in the same session, they can sing and harmonize, and follow the user’s instructions to sing faster, slower, or in a higher voice.
  • Lullaby: the user can instruct GPT-4o by voice to go softer, louder, …
  • Talking faster: the user can instruct GPT-4o by voice to speak faster or slower

Failure cases: it sometimes

  • goes wild and speaks in another language
  • fails at translation tasks
  • fails at teaching intonation in Chinese

Technicality

  • GPT-4o is a single new model trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. Previously, ChatGPT Voice Mode was a pipeline of three separate models: an audio-to-text transcription model, the GPT-3.5/GPT-4 text model, and a text-to-audio conversion model.
  • OpenAI designed a new tokenizer with greater compression: Hindi uses 2.9x fewer tokens, Russian 1.7x, Korean 1.7x, Chinese 1.4x, Japanese 1.4x, and European languages, including English, 1.1x fewer. A new tokenizer implies a fully new pre-trained model (brought up in this reddit thread). A quick way to spot-check the compression ratios is sketched after this list.
  • It is super fast, responding to audio inputs in 320 milliseconds on average, while the original ChatGPT Voice Mode had average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). At the same time, GPT-4o “achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence.” What did they do to speed up inference? Is it quantization, MoE, or something else? (brought up in this reddit thread) What’s the model size? Nothing is reported.
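
Since both vocabularies ship with the tiktoken library, the compression claim can be spot-checked by encoding the same sentence under cl100k_base and o200k_base. A minimal sketch; the sample sentences are my own, not OpenAI’s benchmark set, so the ratios will only roughly match the reported numbers:

```python
import tiktoken

# Old tokenizer (GPT-4 / GPT-4 Turbo) vs. new tokenizer (GPT-4o).
cl100k = tiktoken.get_encoding("cl100k_base")
o200k = tiktoken.get_encoding("o200k_base")

samples = {
    "English": "Hello, my name is GPT-4o. I am a new kind of language model.",
    "Chinese": "你好，我的名字是GPT-4o。我是一种新型的语言模型。",
    "Hindi": "नमस्ते, मेरा नाम जीपीटी-4o है।",
}

for lang, text in samples.items():
    old_n = len(cl100k.encode(text))
    new_n = len(o200k.encode(text))
    # A ratio above 1 means the new tokenizer needs fewer tokens.
    print(f"{lang}: {old_n} -> {new_n} tokens ({old_n / new_n:.1f}x fewer)")
```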

Inspecting the New Tokenizer

When I was browsing reddit on the day GPT-4o was released, this post came up, suggesting that the Chinese tokens in OpenAI’s new tokenizer are heavily contaminated.

The new tokenizer, o200k_base, is actually twice as large as the previous cl100k_base and has already been uploaded to GitHub in this commit.
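
Since tiktoken already ships o200k_base, the alleged contamination can be inspected directly by dumping the longest Chinese tokens in the vocabulary. A minimal sketch, assuming decode_single_token_bytes raises KeyError for unused token ids; the 5-character cutoff and the CJK range check are arbitrary choices of mine:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def is_cjk(s: str) -> bool:
    # Rough check: any character in the main CJK Unified Ideographs block.
    return any("\u4e00" <= ch <= "\u9fff" for ch in s)

long_cjk = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip unused ids and tokens that are partial UTF-8 sequences
    if is_cjk(text) and len(text) >= 5:
        long_cjk.append((token_id, text))

# Long multi-character Chinese tokens are where contaminated phrases show up.
for token_id, text in sorted(long_cjk, key=lambda t: -len(t[1]))[:30]:
    print(token_id, repr(text))
```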
