Was OpenAI GPT-4o Hype A Troll On Google? via @sejournal, @martinibuster

OpenAI's GPT-4o performs worse than GPT-4 Turbo in reading comprehension and is barely better elsewhere. Was it worth the hype? The post Was OpenAI GPT-4o Hype A Troll On Google? appeared first on Search Engine Journal.

May 15, 2024 - 21:05

OpenAI managed to steal attention away from Google in the weeks leading up to Google’s biggest event of the year (Google I/O). When the big announcement arrived, all OpenAI had to show was a language model that was slightly better than the previous one, with the “magic” part not even in the Alpha testing stage.

OpenAI may have left users feeling like a mom receiving a vacuum cleaner for Mother’s Day, but it surely succeeded in minimizing press attention for Google’s important event.

The Letter O

The first hint that there’s at least a little trolling going on is the name of the new GPT model, GPT-4o, with the letter “o” echoing the name of Google’s event, I/O.

OpenAI says that the letter O stands for Omni, which means everything, but it sure seems like there’s a subtext to that choice.

GPT-4o Oversold As Magic

In a tweet the Friday before the announcement, Sam Altman promised “new stuff” that felt like “magic” to him:

“not gpt-5, not a search engine, but we’ve been hard at work on some new stuff we think people will love! feels like magic to me.”

OpenAI co-founder Greg Brockman tweeted:

“Introducing GPT-4o, our new model which can reason across text, audio, and video in real time.

It’s extremely versatile, fun to play with, and is a step towards a much more natural form of human-computer interaction (and even human-computer-computer interaction):”

The announcement itself explained that previous versions of ChatGPT used three models to process audio input: one model turned the audio input into text, a second completed the task and output a text response, and a third turned that text output into audio. The breakthrough for GPT-4o is that it can process the audio input and output within a single model, responding in about the same amount of time that it takes a human to listen and respond to a question.
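The three-stage arrangement described above can be pictured with a small sketch. Everything in it is a hypothetical stand-in: the function names and the toy byte/text conversions are assumptions made for illustration, not OpenAI code or API calls. It only shows the shape of the older pipeline, where each stage hands plain text to the next.

```python
# Illustrative only: these stubs stand in for three separate models.
# None of the names below are real OpenAI functions.

def speech_to_text(audio: bytes) -> str:
    """Stage 1 (hypothetical): transcribe audio into text."""
    return audio.decode("utf-8")


def text_model(prompt: str) -> str:
    """Stage 2 (hypothetical): produce a text answer from the transcript."""
    return f"reply to: {prompt}"


def text_to_speech(text: str) -> bytes:
    """Stage 3 (hypothetical): turn the text answer back into audio."""
    return text.encode("utf-8")


def cascaded_voice_pipeline(audio_in: bytes) -> bytes:
    """Chain the three stages; only text passes between them."""
    transcript = speech_to_text(audio_in)
    reply = text_model(transcript)
    return text_to_speech(reply)


audio_out = cascaded_voice_pipeline(b"hello")
```

In the single-model design that the announcement describes, the three hops would collapse into one call that takes audio in and returns audio out.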

But the problem is that the audio part isn’t online yet. OpenAI is still working on getting the guardrails in place, and it will take weeks before an Alpha version is released to a few users for testing. Alpha versions are expected to have bugs, while Beta versions are generally closer to the final product.

This is how OpenAI explained the disappointing delay:

“We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities.”

The most important part of GPT-4o, the audio input and output, is finished but the safety level is not yet ready for public release.

Some Users Disappointed

It’s inevitable that an incomplete and oversold product would generate some negative sentiment on social media.

AI engineer Maziyar Panahi tweeted his disappointment:

“I’ve been testing the new GPT-4o (Omni) in ChatGPT. I am not impressed! Not even a little! Faster, cheaper, multimodal, these are not for me.
Code interpreter, that’s all I care and it’s as lazy as it was before!”

He followed up with:

“I understand for startups and businesses the cheaper, faster, audio, etc. are very attractive. But I only use the Chat, and in there it feels pretty much the same. At least for Data Analytics assistant.

Also, I don’t believe I get anything more for my $20. Not today!”

Others across Facebook and X expressed similar sentiments, although many were happy with what they felt was an improvement in speed and cost for API usage.

Did OpenAI Oversell GPT-4o?

Given that GPT-4o is in an unfinished state, it’s hard to miss the impression that the release was timed to coincide with and detract from Google I/O. Releasing a half-finished product on the eve of Google’s big day may have inadvertently created the impression that GPT-4o in its current state is a minor iterative improvement.

In its current state it’s not a revolutionary step forward, but once the audio portion of the model exits the Alpha testing stage and makes it through Beta testing, we can start talking about revolutions in large language models. By the time that happens, though, Google and Anthropic may already have planted a flag on that mountain.

OpenAI’s announcement paints a lackluster image of the new model, promoting its performance as on the same level as GPT-4 Turbo. The only bright spots are the significant improvements in languages other than English and the gains for API users.

OpenAI explains:

“It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API.”

Here are the scores across six benchmarks, which show GPT-4o barely squeaking past GPT-4 Turbo (GPT-4T) in most tests but falling behind it in an important benchmark for reading comprehension:

MMLU (Massive Multitask Language Understanding)
This is a benchmark for multitasking accuracy and problem solving in over fifty topics like math, science, history and law. GPT-4o (88.7) is slightly ahead of GPT-4T (86.9).

GPQA (Graduate-Level Google-Proof Q&A Benchmark)
This is 448 multiple-choice questions written by human experts in various fields like biology, chemistry, and physics. GPT-4o (53.6) outscores GPT-4T (48.0) by 5.6 points.

Math
GPT-4o (76.6) outscores GPT-4T (72.6) by four points.

HumanEval
This is the coding benchmark. GPT-4o (90.2) outperforms GPT-4T (87.1) by about three points.

MGSM (Multilingual Grade School Math Benchmark)
This tests LLM grade-school level math skills across ten different languages. GPT-4o scores 90.5 versus 88.5 for GPT-4T.

DROP (Discrete Reasoning Over Paragraphs)
This is a benchmark comprised of 96k questions that tests language model comprehension over the contents of paragraphs. GPT-4o (83.4) scores nearly three points lower than GPT-4T (86.0).
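For readers who want the gaps at a glance, the short sketch below computes the point difference for each benchmark. The scores are the ones quoted above; the variable and function names are made up for illustration.

```python
# Scores as quoted in this article: (GPT-4o, GPT-4 Turbo).
scores = {
    "MMLU": (88.7, 86.9),
    "GPQA": (53.6, 48.0),
    "Math": (76.6, 72.6),
    "HumanEval": (90.2, 87.1),
    "MGSM": (90.5, 88.5),
    "DROP": (83.4, 86.0),
}


def score_deltas(table):
    """Return GPT-4o minus GPT-4 Turbo for each benchmark, to one decimal."""
    return {name: round(new - old, 1) for name, (new, old) in table.items()}


deltas = score_deltas(scores)
for name, delta in deltas.items():
    leader = "GPT-4o" if delta > 0 else "GPT-4 Turbo"
    print(f"{name:<10} {delta:+.1f}  ({leader} ahead)")
```

On these numbers GPT-4o leads on five of the six benchmarks, with GPQA showing the widest gap (5.6 points), and trails only on DROP, by 2.6 points.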

Did OpenAI Troll Google With GPT-4o?

Given the provocatively named model with the letter o, it’s hard not to suspect that OpenAI was trying to steal media attention in the lead-up to Google’s important I/O conference. Whether that was the intention or not, OpenAI wildly succeeded in minimizing the attention given to Google’s upcoming developer conference.

Is a language model that barely outperforms its predecessor worth all the hype and media attention it received? The pending announcement dominated news coverage over Google’s big event, so for OpenAI the answer is clearly yes, it was worth the hype.

Featured Image by Shutterstock/BeataGFX