AI Tools - Video Ecosystem Update July 2024
Jul 08, 2024 - C4AIL Staff

An overview of progress on production-ready GenAI products in the Video Generation Space.


State of the Video Ecosystem

It’s been 5 months since OpenAI announced Sora, teasing the next generation of video models, and while OpenAI’s product remains unavailable to the general public, several competitors have released products that achieve state-of-the-art performance.

Improvements

Here’s the quick delta of progress on video over the last half year:

  • More Choices
    There are now multiple vendors in the market, adding some competition to Runway’s previously uncontested position.

  • Increased Resolution
    Resolution increases are following the path of image generation and have accelerated since the start of the year. Most platforms now offer resolution at 1024x1024 or even 1080p, up from 512x512 in the previous generation, with additional upscaling options available in some products.

  • Longer Videos
    Most solutions now offer video generation in excess of the early 30 and 60 second limitations, at least in their paid tiers.

  • Increased Quality
    Similar to image generation, there’s a marked increase in quality, catching up with the expectations set by the Sora announcement. Video’s key challenge, temporal consistency, is showing significant improvements compared to six months ago: most models now retain enough context to keep faces and scene composition consistent, especially in longer shots.

  • Still Identifiable as AI Generation
    The technology continues to struggle with complex scenes, and as a result AI generated videos are still identifiable as such. Beyond consistency challenges such as depth sorting issues, unnatural teeth and unreadable text, most AI videos tend to utilize a specific type of moving pan and zoom effect, making them highly recognizable as AI.

  • Commercial Products Ahead of Open Source
    As expected given the cost and data volume required for training video models, commercial models are about a year ahead of Open Source products at this point in time. Major Open Source contributors, like Meta, are careful about releasing video models into the public domain due to the very high misuse potential of the technology. Outside of niche use cases we don’t expect this to change anytime soon - the complexity and hardware needs of video processing pipelines strongly favor cloud based solutions for now.

  • Scaling as expected
    Video generation leverages the same diffusion model technology as image generation, and progress on image generation tends to benefit video generation as well. Technical progress has therefore been more or less predictable, and the future trajectory is becoming clearer: in the short term we will continue to see better, cheaper, faster models every couple of months, as technical progress improves generation efficiency by 10-50x over the next 12 months.

  • From consumers to creators
    Similar to music and image generation, video generation products are currently targeting consumers, primarily to pad their user numbers and maximize marketing effects. Over time, once easy scaling through larger models is exhausted, much more attention will be given to creator UX and professional audiences.

  • Video as a major driver of future compute demand
    Video is emerging as a major hope for both Nvidia and cloud compute providers, likely following the decades-long pattern observed in video games: continued efficiency increases through hardware and software progress will drive increasingly realistic video generation capabilities - all the way to 4k/8k 144fps real-time generation over many hardware generations.

    From a business perspective, video will likely make up an increasing share of global compute capacity. Driven by the ever increasing hunger for short form video content on social media, much of today’s creator economy will likely be forced, by competitive dynamics and indirect incentivisation, to adopt increasing amounts of AI. There are risks to this bet: consumers are not exactly welcoming AI content with open arms (see below).

  • Still contentious and risky to use in public facing use-cases
    As observed in the recent Toys R Us AI commercial row, consumers, especially in the West, continue to be negatively predisposed to AI generated video, and audience reactions on social media tend to be overwhelmingly negative. We expect this to get worse, not better, due to the increased visibility of AI related job losses in creative professions, and advise our members to avoid obvious “AI” investor signalling through consumer marketing.

Video Generation

While there were only a few “text-to-video” solutions in the market earlier this year, the palette of available tools has broadened, and some of them offer additional control beyond simple prompting.

Image to Video

An image is worth a thousand words, and thus image to video provides much more detailed control over the resulting video than using a prompt alone. This particular use case has become popular since the release of Luma. Input images can be anything from historic photos to AI generated scenes, usually created by top end models like Midjourney.
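For members who want to script this workflow, hosted inference services expose image-to-video models behind simple APIs. Below is a minimal sketch using the Replicate Python client; the model slug and input field names are assumptions on our part, so check the model page on replicate.com for the exact interface.

```python
# Minimal image-to-video sketch using the Replicate Python client
# (pip install replicate; set the REPLICATE_API_TOKEN environment variable).
# The model slug and input names are assumptions - verify them against
# the model page on replicate.com before use.
import replicate

with open("historic_photo.jpg", "rb") as image:
    output = replicate.run(
        "stability-ai/stable-video-diffusion",  # assumed slug for a hosted image-to-video model
        input={
            "input_image": image,     # source frame the clip is animated from
            "frames_per_second": 6,   # playback rate of the returned clip
        },
    )

print(output)  # URL of the generated video file
```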

Keyframe interpolation

Keyframe interpolation provides more control over creation, using AI to blend between provided reference images / key-frames, guided by a prompt. First teased in an Open Source model called ToonCrafter (see below), other AI tools like RunwayML and Kling have started offering it as well, opening up much more direct control for creators.

Here’s a YouTube Video showing Keyframe Interpolation in Luma AI.

Flagship Products

While there are dozens of startups in the field, there are currently three products we recommend as state-of-the-art:

RunwayML Gen-3

Already on its third model generation in the market, RunwayML currently offers the most advanced creator tooling.

Kling AI

Kling is a Chinese video model produced by Kuaishou, able to generate clips of up to 3 minutes with competitive quality and cost.

Access to Kling is not straightforward for Western audiences, as the product requires a Chinese phone number - a limitation that can be circumvented by signing up for an account at the parent company and using that to access the model. Since July 2024, Kuaishou has offered an English version of their page, indicating possible moves to officially enter the non-Chinese market.

LumaLabs DreamMachine

Luma, a company already in the market with a number of Video to 3D services, such as a Video to 3D Object (NeRF) API as well as an iPhone App, has released their DreamMachine webapp for text-to-video and image-to-video use cases.

They offer a free tier with 20 generations (120s per creation) as well as paid tiers with more generous limits. For more details and how to get started, refer to their documentation.

StabilityAI Stable Video

Stability AI is one of the oldest players in the space, and they offer a video creation product powered by Stable Video Diffusion, a video variant of their popular Stable Diffusion model.

Unfortunately, StableVideo is extremely limited and not competitive with the above options. Additionally, perpetual challenges around Stability’s financial state make them a poor choice for companies looking for longer-term adoption.
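That said, teams that want to evaluate Stability’s open video model locally, rather than through the StableVideo webapp, can run Stable Video Diffusion via Hugging Face’s diffusers library. A minimal sketch, assuming a CUDA GPU with enough VRAM for fp16 inference:

```python
# Minimal local image-to-video sketch with Stable Video Diffusion via
# Hugging Face diffusers (pip install diffusers transformers accelerate).
# Hardware note: even fp16 inference requires a high-end GPU.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # published model ID
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

image = load_image("input.jpg").resize((1024, 576))  # SVD's native resolution

frames = pipe(image, decode_chunk_size=4).frames[0]  # ~25 frames for the XT variant
export_to_video(frames, "output.mp4", fps=7)
```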

OpenAI Sora

While announced earlier this year, Sora is still not available to the public at this point in time.

Google

Like OpenAI, Google has teased video AI models for about a year now, but none of the products are available to customers yet. Their Google Vids for Workspace indicates a Gemini branded model coming to Google Workspace at some point in the future.

Odyssey Systems

Odyssey Systems is a startup looking to provide “Hollywood-grade visual AI” with extensive controllability. Their product is unreleased, but probably worth keeping an eye on in the future as a more professional than consumer oriented product.

Cartoon Frame Interpolation

  • ToonCrafter
    is an implementation of the Open Source ToonCrafter paper, which promises anime frame interpolation. Given two key-frames and a prompt, the AI will interpolate between them, promising a rapid speed-up of animation related workflows. The product is currently in an early research state and not production ready, but foreshadows the kind of productivity gains AI will create in video creation.
    The model is also available for a quick test-drive on Replicate.com (see the first sketch after this list).

  • LivePortrait [code]
    is an open source model implementing portrait animation from an image and a driving video, similar to Microsoft’s unreleased VASA-1 research. It can be used to create high quality avatars / character portraits on consumer GPUs using Open Source tooling like ComfyUI [Tutorial Video]. The model is also available for a quick test-drive on Replicate.com (see the second sketch below).

Adding Sound Effects

As extensively covered in our State of Audio Update, several video focused products exist to help add sound and music to existing videos:

  • ElevenLabs Eleven Studios offers high quality, automatic video dubbing / translation with high production values for content with fewer than 10 unique speakers, and makes sense for cutting down social media content production times.

  • For generating sound effects matching a video, ElevenLabs has released an API and a demo webpage at videotosoundeffects.com with the ability to analyze an input video and generate custom sound effects for it (a minimal sketch of the API follows below).
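The sound effects endpoint itself is exposed in ElevenLabs’ official Python SDK. The automatic video analysis step from the demo page does not appear to be part of the public SDK as far as we can tell, so the sketch below assumes you describe the scene yourself.

```python
# Sketch: generating a sound effect with the ElevenLabs Python SDK
# (pip install elevenlabs). The demo page's automatic video analysis is
# not exposed in the public SDK as far as we can tell, so the scene
# description is written by hand here.
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

audio = client.text_to_sound_effects.convert(
    text="ocean waves crashing on a rocky shore, distant seagulls",
    duration_seconds=5.0,   # length of the generated clip
    prompt_influence=0.3,   # 0..1, how literally to follow the prompt
)

# The SDK streams the audio back in chunks; write them to an mp3 file.
with open("waves_sfx.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)
```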