Analysis: OpenAI Announcements 2024-05-14
Exec Summary
An ad hoc event, primarily designed to regain narrative control in response to Meta’s recent AI moves and to preempt further announcements from Google, Apple and Microsoft in the coming days, showing largely predictable progress with non-committal timelines.
No immediate action is required from most companies as a result of this announcement. Regulators should take note of the capability trajectory.
Overview
- OpenAI announced an upgraded GPT-4 (GPT-4o) that is faster and cheaper (for them) and is extending access to it to free users.
- Mobile app upgrade with a new UI; desktop app for Windows.
- GPT-4o adds low-latency audio processing and video/vision capabilities and combines them with OpenAI’s cutting-edge voice model to drive real-time conversations.
- The voice model is shown to be top of market, rivaling or exceeding existing products from ElevenLabs, but no details are provided on customizability, limits or cost.
- No details are provided on API costs, benchmark data, etc. No details on improvements to reasoning, factuality, or known limitations (hallucinations, etc.).
Analysis
This event appears to be partially reactive and aimed at regaining narrative initiative as the leader in the Generative AI field after being put on the defensive by Meta’s Llama3 release^[https://docs.ailti.org/s/Analysis-Meta-Strikes-Back?both].
The timing of the event, its short announcement window and the undisclosed but likely extensive rollout period for the changes indicate that it was not something OpenAI had planned long beforehand.
A slightly unfair, cynical commentary would read:
“Meta successfully destroyed our ability to monetize our top end model by releasing an equivalent for free across all their surfaces, so we’re doing the only thing we can do to stay in the game: Burn a lot more investor dollars on trying to forestall Llama3-450 and staying competitive. Thank god for quantization.”
This follows a pattern of strategic moves made by Meta over the years.
Commoditized as a Complement
Making GPT-4o available to free users is now a necessary competitive move, one that will significantly increase OpenAI’s capital burn.
A textbook example of “Commoditize your complements”^[“Laws of Tech: Commoditize your Complement”], with Meta having shifted the expectation for top-end, expensive AI to be “free” and the ecosystem having no choice but to run with it, forced to compete with Meta on their terms (free) and turf (ads) where Meta has every advantage.
Even if OpenAI wanted to switch to ads for monetization, they would be forced to inject them into their main product, degrading it. Meta, however, has time and a long history of not degrading the user experience with ads until the competition in an area has been wiped out.
Fundamental progress and trends
- First model on the market to feature both audio and vision transformers for comprehension.
- Demonstrated performance among the top-of-the-line audio generation models as well as in video/vision. It is unclear how much of the demo was scripted.
- Cost reduction and efficiency increase in line with similar open source progress, likely through quantization (see the sketch after this list).
- The efficiency increase, plus the removal of the separate audio transcription step by handling audio natively in the model, enables much lower-latency operation.
- No details on cost, rate limits, or technical specifics (context window).
- No details on improvements to fundamental limitations (hallucinations, jailbreaks, recall, etc.).
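As a rough illustration of why quantization cuts serving cost, the sketch below applies generic symmetric int8 weight quantization to a single layer; the layer size and values are purely illustrative and this is not OpenAI’s disclosed method.

```python
import numpy as np

# Generic symmetric int8 weight quantization -- illustrative only, not
# OpenAI's disclosed method. Layer shape and values are made up.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4096, 4096)).astype(np.float32)

scale = np.abs(weights).max() / 127.0             # one scale per tensor (coarse)
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print(f"memory: {weights.nbytes / 1e6:.0f} MB fp32 -> {quantized.nbytes / 1e6:.0f} MB int8")
print(f"mean abs error: {np.abs(weights - dequantized).mean():.5f}")
```

The 4x memory reduction (and the corresponding memory-bandwidth saving during inference) is the mechanism by which quantization lowers per-token cost, at the price of a small reconstruction error.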
Platform
- No significant changes. No details on what specifically will be enabled on the API beyond “GPT-4o” (see the sketch after this list).
- The 1 million GPTs cited as an achievement are meaningless, as the GPT platform continues to be a proof-of-concept / prompt-sharing mechanism rather than a serious developer platform.
- The main issue hampering it, massive reliability problems rooted in core technology limitations, fundamentally confounds any attempt to build reliable agent technology because errors compound.
- There’s reason to be sceptical about how scalable these capabilities are given the current state of their platform. If the technology relies significantly on performance gains from adopting Nvidia’s Blackwell platform, “over the coming weeks” could be a very extended time frame.
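If GPT-4o is exposed through OpenAI’s existing Chat Completions endpoint, developer adoption would be a one-line model swap; a minimal sketch, assuming the model identifier “gpt-4o” and text-only access via the standard OpenAI Python SDK (pricing, rate limits and additional modalities on the API remain undisclosed):

```python
from openai import OpenAI  # standard OpenAI Python SDK (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumption: GPT-4o is served under the model identifier "gpt-4o" on the
# existing Chat Completions endpoint; audio/video input, pricing and rate
# limits were not detailed at the event.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the GPT-4o announcement in one sentence."}],
)
print(response.choices[0].message.content)
```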
Latency Improvement
- Demonstrating the latency improvement was very important for OpenAI: achieving close to real-time performance is required to stay competitive against the edge (on-device) models expected to be rolled out by Google and Apple on their own hardware (a back-of-the-envelope comparison follows this list).
- Demonstrating that realistic, real-time latency is achievable for audio and video processing will help reduce investor anxiety around disruption from that direction.
- Without understanding the cost profile and resource requirements for these use cases, it is impossible to predict long-term success in this regard.
- They did mention leveraging Nvidia’s Blackwell platform, but if that is required to operate the product at scale, “a few weeks” would be a very optimistic rollout period.
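A back-of-the-envelope comparison of a cascaded ASR → LLM → TTS pipeline against a single native audio-in/audio-out model shows where the latency gain comes from; every per-stage number below is an assumption for illustration, not a measurement:

```python
# Illustrative latency budgets (milliseconds) -- all figures are assumptions.
cascaded_ms = {
    "speech_to_text": 300,    # separate ASR pass
    "llm_response": 600,      # text-only model generates the reply
    "text_to_speech": 250,    # separate TTS pass
    "glue_and_network": 150,  # serialization plus extra round trips
}
native_ms = {
    "audio_to_audio_model": 450,  # single end-to-end pass
    "network": 100,
}

print(f"cascaded pipeline: {sum(cascaded_ms.values())} ms")
print(f"native model:      {sum(native_ms.values())} ms")
```

Whatever the exact figures, collapsing three model invocations into one removes the fixed per-stage overheads, which is what pushes round-trip times toward the conversational, sub-second range.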
Voice and Video Capabilities
- The multimodal voice/video comprehension capabilities were expected, and the execution, should real-world performance match the demo, is in line with those expectations. We can expect open source models to achieve parity on voice/video comprehension within the next 6-9 months.
- The voice generation capabilities match those of the best specialized players in the field (e.g. ElevenLabs), with emotion, real-time style changes (robotic, dramatic), etc. It is likely that the remaining smaller companies in this field will exit over time or attempt to compete by enabling “more permissive” terms / high-risk use cases.
- The focus on voice and real-time interaction capabilities is likely a preemptive move against an expected Siri refresh from Apple.
- A key constraint remains that voice is a very situational and often inefficient interface, even when it works, due to privacy concerns, public decorum and hardware limitations. With much of the device-using population trained on efficient communication, behavioral barriers to fully “native” conversations with AI assistants or search interfaces remain.
Safety and Security
- While OpenAI made some mentions of safety, no details were divulged. Some aspects (such as global clipboard monitoring and auto-sharing with ChatGPT on Windows) do not look well thought through and should put companies worried about accidental data exfiltration on high alert.
- While the progress on video and audio capabilities was in the ballpark of industry insiders’ expectations, we can expect some outsiders to perceive it as an unexpected, even shocking leap, enabling (for them) unforeseen forms of abuse. The demonstrated latency and human-like performance enable fully autonomous AI robocalls that are indistinguishable from humans and, given the model’s ability to read emotions and perceive video events, are likely able to fool the majority of the non-tech-aware population. Potential dynamite in election season.
- We can (and OpenAI certainly does) expect strong reactions in the press in the coming days, and it would not be surprising to see OpenAI leveraging those reactions to push the narrative that “these capabilities should only be in the hands of completely trustworthy US tech companies and never in open source,” which they desperately need in order to protect themselves from Meta’s onslaught.
- OpenAI did not detail their safety mechanisms, but we can expect them to be confined to a specific set of (over time, highly iconic) voices plus strong API limits. Overall, however, the options are limited: watermarking, for example, would be ineffective for audible content.
Industry-specific impact
Edutech / Conversational
- Companies in edutech are impacted, as the event showed the steady, unrelenting path toward indistinguishable-from-human avatars. If their value is tied to human talent or technology, they are at high risk of disintermediation.
- Duolingo (an OpenAI GTM partner) was specifically hit because its use case (translation / foreign-language teaching) was demoed verbatim on stage, and the realisation is setting in that its business lives on OpenAI’s whim the same way Stack Overflow’s now does. Given its partner status, we can assume Duolingo will soon announce that it is leveraging OpenAI’s technology; however, investors will take note of the permanent impairment tied to reliance on OpenAI technology until open source commoditizes it.
- Every company and startup in the field will have to ask whether it has any moat or advantage against large AI companies deciding to take an interest in its market.
Games / Entertainment
- Real-time conversation is a capability of high interest for the gaming and entertainment industry, especially in controller-limited metaverse environments. While impressive, the cutting-edge technology demonstrated here will not be commercially feasible in games, at scale, for several years at current trajectories.
- Top-end voice models are now breaking into the 95th percentile of performance, posing an existential threat to the voice-over profession, especially at rapidly decreasing costs. We can expect that within 9-18 months, fully automated voice generation pipelines will be a technological possibility for all but the most discerning of games (see the sketch after this list).
- Audiobook narration is on an accelerated path to extinction given the demonstrated ability to extract emotion from context.
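A minimal sketch of what such a pipeline could look like; `synthesize_line`, the CSV schema and the emotion tags are hypothetical placeholders, since no specific vendor API is assumed here:

```python
import csv
from pathlib import Path

def synthesize_line(text: str, voice: str, emotion: str) -> bytes:
    """Hypothetical TTS call -- stand-in for whichever voice-generation API is chosen."""
    raise NotImplementedError("wire up the selected vendor API here")

def build_voice_over(script_csv: str, out_dir: str) -> None:
    """Render every scripted line to an audio file.
    Illustrative CSV schema: line_id, character, emotion, text."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(script_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            audio = synthesize_line(row["text"], voice=row["character"], emotion=row["emotion"])
            (out / f'{row["line_id"]}.wav').write_bytes(audio)

# Example (hypothetical files): build_voice_over("game_script.csv", "voice_lines/")
```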
Call centers, Support, Outreach.
- While certainly inspiring dreams of radical cost cutting in support organisations or even low-cost sales outreach, professional adoption in call centers is currently limited by fundamental, unsolved challenges with the technology (prompt injection, jailbreaks) and brittle, expensive scaffolding (RAG).
- As a result, the apparent increase in quality and the move towards real time are not fundamentally changing the adoption trajectory yet.
- For less conservative / outright shady use cases (such as scam call centers), the increased ability to pass as human will likely drive immediate adoption wherever it is commercially or technically (with stolen credit cards) feasible.
Cybersecurity professionals / CISOs
- The demonstrated ChatGPT for Windows clipboard feature seems to lack consideration for corporate security needs; an initial block-until-evaluated stance is advisable.
Market Regulators
- Regulators should take note of this textbook example of overwhelming market dominance in technology, which shows the highly centralized and captured nature of the market. Understanding “commoditize your complements” as a disruptive, likely anticompetitive pattern is necessary to understand the direction of AI.
- Developing a deeper understanding of the “dark variable” of open source is absolutely critical, as Meta has been combining both effects in its strategy to extremely potent effect.
- If even the best-funded startups like OpenAI and Inflection [https://www.forbes.com/sites/alexkonrad/2024/03/19/inflection-abandons-chatgpt-challenger-ceo-suleyman-joins-microsoft/] are unable to survive without selling themselves to large tech companies, it should raise serious questions about the ROI of current AI incentive schemes.
Cyber Security
- The fundamental trend of AI-generated content approaching “indistinguishable from human” quality continues to accelerate and will likely trigger increasingly aggressive responses in society. (See the step progress of less than 12 months here and here.)
- Attempts to “train people to recognize AI” are doomed to fail for text, images and audio within this year, and for video within 1-2 years.
- Forward progress at this point is entirely predictable from existing data points and scaling laws.
- Mandated technical mitigations will be largely ineffective within current parameters (open-access compute), as they do not constrain bad actors, especially nation-state actors leveraging the technology for influence operations.
- Decades of experience with content protection indicate that watermarks are a technological dead end doomed to fail. The industry understands that, but welcomes the ability to demonstrate “compliance” and to seize gatekeeper opportunities.
- Higher-level mitigations (understanding context, fostering critical thinking, access to quality information) are required.
- The most realistic regulatory targets for impact are the content distribution platforms and social media.
- “Coming Later This Year” is becoming a pattern.
TL;DR
No unexpected technical breakthroughs were demonstrated. The progress shown follows the expected scaling of the technology, powered by efficiency gains.
Primarily aimed at regaining narrative control (competition with the free Llama 3), driving specific policy objectives (safety, deepfakes) and forestalling threats (Siri / edge models) in the immediate future, the event offered very little for companies apart from the potential availability of a state-of-the-art video vision transformer (at unknown cost).
Investors and industry were likely more interested in the expected GPT-5 or true breakthroughs in learning (Q*). They are also likely feeling some unease about the massive increase in costs without a corresponding increase in revenue.