xAI

Grok Imagine Video

AI Video Generation Model

xAI's video generation model, based on the Aurora architecture. It supports text-to-video, image-to-video, and video editing, with native audio-visual synthesis at up to 720p.

Grok Imagine Video Benchmarks

Grok Imagine Video ranks #1 in Text-to-Video on the Lumenfall Arena with an Elo of 1076. In the Arena, real users pick the better video in blind comparisons. This ranking is based on 2 blind-vote competitions.

Lumenfall Arena

Text-to-Video

1076 Elo
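For readers unfamiliar with Elo, the sketch below shows how a rating could move after a single blind vote between two models. It uses the textbook Elo formula. The function names, the K-factor of 32, and the 400-point scale are assumptions for illustration, not details published by the Lumenfall Arena, which may compute its ratings differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A wins the vote under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind comparison.

    k is an assumed K-factor; the Arena's real value is not stated.
    """
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The winner gains exactly what the loser gives up.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical vote: a 1076-rated model beats a 1000-rated one.
    print(update_elo(1076.0, 1000.0, a_won=True))
```

Under this model a win against a lower-rated opponent moves the rating less than a win against an equal or stronger one, which is why a small number of competitions can still produce a meaningful ordering.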

Text-to-Video Landscape

Charts: Elo vs Cost, Elo vs Speed

Speed data is still warming up. So far there are only enough recent requests for Grok Imagine Video (384 ms average); 1 model is waiting for enough speed data.

Competition Results

Uncategorized

Neon Rain Reverie

5 models

Text-to-Video

Prompt

“Hyper-realistic cinematic video of an elegant young woman in a flowing white silk dress dancing gracefully in heavy pouring rain at night on a neon-lit Tokyo street. Her long wet hair whips dramatically in the wind, the dress clings and flows with realistic fabric and water physics, raindrops splash and create perfect reflections of pink and blue neon signs on the wet pavement. Subtle emotional expression of freedom mixed with melancholy on her face, water droplets on skin and eyelashes catching the light. Smooth dynamic orbiting camera with slight cinematic handheld feel, dramatic volumetric lighting with god rays piercing through the rain, photorealistic, 8K, film grain, shallow depth of field, anamorphic lens flare.”

The Rubik's Gauntlet

5 models

Text-to-Video

Prompt

“Hyper-realistic cinematic close-up of a professional speedcuber solving a 3x3 Rubik's Cube at world-record pace. His hands move with insane precision and blistering speed — fingers flying across the glossy colored faces in a complex sequence of advanced algorithms, rapid twists, and smooth layer turns. The cube rotates with perfect realistic physics, slight motion blur on fast turns, and flawless color consistency as it progresses toward a solved state. Subtle sweat glistening on skin, visible veins, hyper-detailed fingerprints and nail textures. Intense focused facial expression with micro-expressions of concentration in shallow depth of field. Dramatic cinematic side lighting with strong specular highlights and reflections dancing across the cube surfaces and skin. Smooth slow orbiting camera that circles the hands and cube, capturing every intricate finger movement from dynamic angles. Photorealistic, 8K, subtle film grain, anamorphic lens flare, moody intense atmosphere, 24fps.”