The Reversed Rodeo

Text-to-Image, Photorealism, Art

4 models were given the same prompt, and the community voted blind on which outputs looked best.

This competition tests how well AI image models truly understand language versus how much they rely on visual habits from their training data. The prompt is deliberately simple on the surface but devilishly hard in practice. Most models default to the familiar trope of an astronaut riding a horse. By forcing the reversal, we measure three critical capabilities that separate good models from great ones:

  • Strict instruction following (including negations)
  • Accurate subject-object relationships and spatial hierarchy
  • Resistance to strong dataset biases
Prompt
Horse riding astronaut in space — horse on top, not vice versa. Surreal, highly detailed, cinematic.
Voters were asked to judge by one criterion: the horse is actually on the back of the astronaut.

Challenge Rankings

4 models
#  Model                        Elo
1  GPT Image 2                  1143
2  Wan 2.7                      1089
3  Qwen Image 2.0               1062
4  Stable Diffusion 3.5 Medium  1061

GPT Image 2 leads the challenge with a 100% win rate and an Elo of 1143, a 54-point gap over its closest competitor, Wan 2.7 (1089 Elo). Despite being the slowest and most expensive model in the top three, OpenAI's model is the only one to handle the prompt's spatial hierarchy perfectly. The same requirement pulled the win rate of Alibaba's Qwen Image 2.0 (1062 Elo) down to 25%.
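To put the Elo gaps in perspective, here is a minimal sketch that converts a rating difference into an expected head-to-head score. It assumes the conventional logistic Elo formula with a 400-point scale. The page does not say which rating formula or K-factor it uses, so the function name `expected_score` and the `scale` parameter are illustrative assumptions, not the site's actual method.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional logistic Elo model.

    Assumption: the leaderboard uses the standard 400-point scale.
    The source page does not confirm this.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings as reported in the challenge summary.
ratings = {
    "GPT Image 2": 1143,
    "Wan 2.7": 1089,
    "Qwen Image 2.0": 1062,
}

leader = "GPT Image 2"
for rival in ("Wan 2.7", "Qwen Image 2.0"):
    gap = ratings[leader] - ratings[rival]
    p = expected_score(ratings[leader], ratings[rival])
    print(f"{leader} vs {rival}: gap {gap}, expected score {p:.3f}")
```

Under that assumption, a 54-point gap corresponds to an expected score of roughly 0.58 for the leader in a single pairwise vote. That is a modest edge compared with the reported 100% win rate, so the ratings here are best read as a ranking signal rather than a precise win-probability estimate.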


Elo vs Speed

Speed data is still warming up

We only have enough recent requests for GPT Image 2 (73.8s average).

3 models waiting for enough speed data

Competitors

4 models, ranked by Elo
1  GPT Image 2
2  Wan 2.7
3  Qwen Image 2.0
4  Stable Diffusion 3.5 Medium