OpenAI President Greg Brockman has posted from his X account what appears to be the first public image generated with the company's brand-new GPT-4o model.
As you'll see in the image below, it is quite convincingly photorealistic: it shows a person in a black T-shirt bearing an OpenAI logo, writing chalk text on a blackboard that reads, "Transfer between Modalities. Suppose we directly model P(text, pixels, sound) with one big autoregressive transformer. What are the pros and cons?"
The new GPT-4o model, which debuted on Monday, improves on the prior GPT-4 family of models (GPT-4, GPT-4 Vision, and GPT-4 Turbo): it is faster and cheaper, and it retains more information from inputs such as audio and images.
It can do this because OpenAI took a different approach from its prior GPT-4-class LLMs. Those chained multiple separate models together, converting media such as audio and images to text and back; GPT-4o was instead trained on multimodal tokens from the start, allowing it to analyze and interpret images and audio directly, without first converting them to text.
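To make the distinction concrete, here is a minimal toy sketch in Python. OpenAI has not published GPT-4o's internals, so every function below is a hypothetical stand-in; the point is only to contrast the two designs.

```python
# Conceptual sketch only -- all functions are toy stand-ins,
# not OpenAI's actual models or code.

def speech_to_text(audio: bytes) -> str:
    """Toy ASR stand-in: pretend the clip says 'hello'."""
    return "hello"

def text_llm(prompt: str) -> str:
    """Toy text-only LLM stand-in."""
    return f"response to: {prompt}"

def text_to_speech(text: str) -> bytes:
    """Toy TTS stand-in."""
    return text.encode()

def chained_pipeline(audio: bytes) -> bytes:
    # GPT-4-era design: audio -> text -> LLM -> text -> audio.
    # Tone, intonation, and background sound are lost at the first hop,
    # because only the transcribed text reaches the language model.
    return text_to_speech(text_llm(speech_to_text(audio)))

def tokenize(media: bytes, modality: str) -> list[tuple[str, int]]:
    """Toy tokenizer: tags each byte with its modality."""
    return [(modality, b) for b in media]

def unified_model(tokens: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Toy stand-in for one autoregressive transformer over all modalities."""
    return tokens  # a real model would predict the next tokens

def native_multimodal(audio: bytes, pixels: bytes) -> list[tuple[str, int]]:
    # GPT-4o-style design: one model consumes (and can emit) a single
    # interleaved token stream, so nothing is flattened to text first.
    stream = tokenize(audio, "audio") + tokenize(pixels, "image")
    return unified_model(stream)

print(chained_pipeline(b"\x01\x02"))
print(native_multimodal(b"\x01\x02", b"\x03\x04")[:2])
```

The practical upshot of the second design is that information which never survives a text transcription, such as a speaker's tone or what is in the background of an image, stays available to the model end to end.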
Judging by the image above, the new approach is a noticeable improvement over OpenAI's previous image generation model, DALL-E 3, which debuted in September 2023. I ran a similar prompt through DALL-E 3 in ChatGPT, and here is the result.

As you can see, the GPT-4o image shared by Brockman is a significant improvement in quality, photorealism, and text rendering accuracy.
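Readers who want to run the same comparison themselves can also reach DALL-E 3 through OpenAI's Images API rather than ChatGPT. A minimal sketch using the `openai` Python SDK follows; the prompt is my paraphrase of the blackboard text, since Brockman's exact prompt has not been published.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="dall-e-3",
    prompt=(
        "A photorealistic photo of a person in a black T-shirt with an "
        "OpenAI logo, writing on a blackboard: 'Transfer between "
        "Modalities. Suppose we directly model P(text, pixels, sound) "
        "with one big autoregressive transformer. What are the pros "
        "and cons?'"
    ),
    size="1024x1024",
    n=1,  # the dall-e-3 model accepts only one image per request
)
print(result.data[0].url)  # URL of the generated image
```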
However, GPT-4o's native image generation capabilities are not yet publicly available, as Brockman alluded to in his X post: "Team is working hard to bring those to the world."
