The Reversed Rodeo
4 models were given the same prompt, and the community voted blind on which outputs looked best.
How it works
This competition tests how well AI image models truly understand language versus how much they rely on visual habits from their training data. The prompt is deliberately simple on the surface but devilishly hard in practice. Most models default to the familiar trope of an astronaut riding a horse. By forcing the reversal, we measure three critical capabilities that separate good models from great ones:
- Strict instruction following (including negations)
- Accurate subject-object relationships and spatial hierarchy
- Resistance to strong dataset biases
#1 — GPT Image 2
Challenge Rankings
| # | Model | Elo |
|---|---|---|
| 1 | GPT Image 2 | 1143 |
| 2 | Wan 2.7 | 1089 |
| 3 | Qwen Image 2.0 | 1062 |
| 4 | | 1061 |
GPT Image 2 leads the challenge with a 100% win rate and a 1143 Elo, a 54-point gap over its closest competitor, Wan 2.7 (1089 Elo). Despite being the slowest and most expensive model in the top three, OpenAI's model is the only one to perfectly navigate the spatial hierarchy requirements that dragged the win rate of Alibaba's Qwen Image 2.0 (1062 Elo) down to 25%.
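To put the 54-point gap in perspective, here is a minimal sketch that converts an Elo difference into an expected head-to-head win probability. It assumes the textbook logistic Elo formula with a scale of 400; the leaderboard's actual rating method is not stated on this page, so the function name `expected_score` and the `scale` parameter are illustrative assumptions, not the site's implementation.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score (win probability, ignoring ties) of A against B.

    Assumption: standard logistic Elo curve with a 400-point scale.
    The challenge page does not say which variant it uses.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the rankings above.
gpt_image_2 = 1143
wan_2_7 = 1089

p = expected_score(gpt_image_2, wan_2_7)
print(f"Expected win probability for the leader: {p:.3f}")
```

Under these assumptions a 54-point lead corresponds to winning a bit under 58% of head-to-head votes, so the rating gap by itself describes a moderate edge rather than a sweep. The 100% win rate quoted above is a separate, directly observed statistic.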
[Charts: Elo vs Cost; Elo vs Speed]
Speed data is still warming up: we only have enough recent requests for GPT Image 2 (73.8s average).
Competitors
4 models, ranked by Elo
Highlighted Battles
The most competitive head-to-head matchups, selected by closeness and vote count.