Wan 2.6 is Alibaba's multimodal generation model from the Wan AI suite. It supports text-to-video, image-to-video, reference-to-video with audio, and text-to-image generation, with prompts in both Chinese and English.

Wan 2.6 Benchmarks

Wan 2.6 ranks 23rd globally in text-to-image with an Elo of 1214, and 7th in image editing with an Elo of 1220. The model performs consistently on both English and Chinese prompts.

Lumenfall Arena: #2 in Image-to-Video (1044 Elo)

Image-to-Video Landscape

Competition Results

Image-to-Video

Cinematic

#4 Celebrity Arrival (5 models, Image-to-Video)
Prompt

“Ultra-realistic celebrity arrival scene in New York City, filmed as one continuous handheld shot from inside a dense crowd behind barricades. Documentary realism, natural micro-shake, no cuts, no artificial camera moves. Use the subject from the reference image as the main character. Keep face identity highly consistent across all frames. The outfit must match the reference image exactly. The subject should appear calm and controlled, with a subtle confident smile. Nighttime outside a fancy hotel. Lighting comes from street lights, hotel entrance lights, media flashes, phone screens, and reflections on polished cars and glass. Soft realistic shadows and slight atmospheric haze. Natural environment audio only: loud crowd cheering, overlapping voices, shouting fans, camera shutter clicks, phones recording, distant city noise, footsteps, fabric movement, and security activity. Scene flow: The camera begins inside the crowd, partially blocked by heads, raised phones, waving hands, and phone screens. The crowd is chaotic and excited. The camera lifts slightly above shoulder level. Focus shifts naturally between the crowd and the hotel entrance. The subject exits the hotel in the distance as camera flashes go off. Security pushes the crowd back, causing natural camera shake. Through gaps in the crowd, the subject becomes visible, first soft and partially obscured, then clearer as he walks forward with a small escort team. The subject walks up to a fan and signs a printed photo of himself from the reference image. The camera pushes in naturally, not digitally. The subject is now clearly visible near center frame. He walks confidently, raises one hand, and gives a calm wave with a slight smile while the camera struggles to keep him framed. A convoy of three large premium SUVs appears by the curb. Security opens the back door of the middle black Suburban. The subject enters quickly, rolls the window down, and waves to the crowd as the vehicles begin moving away. 
The crowd jumps, cheers, and records the moment on their phones.”
