The Capybara Taxi Driver
15 models were given the same prompt, and the community voted blind on which outputs looked best.
This challenge seems to be difficult for models because it mixes reality with fiction. Most models struggle to keep the taxi realistic or lose track of instructions, for example by not placing the passenger in the back seat.
#1 — GPT Image 2
Challenge Rankings
| # | Model | Elo |
|---|---|---|
| 1 | GPT Image 2 | 1210 |
| 2 | Z-Image Turbo | 1180 |
| 3 | | 1158 |
| 4 | | 1150 |
| 5 | | 1138 |
| 6 | | 1133 |
| 7 | | 1133 |
| 8 | | 1128 |
| 9 | | 1112 |
| 10 | | 1110 |
| 11 | | 1103 |
| 12 | | 1098 |
| 13 | | 1087 |
| 14 | | 991 |
| 15 | | 989 |
GPT Image 2 leads with a 1210 Elo and a 71.4% win rate, though the budget-friendly Z-Image Turbo (1180 Elo) remains highly competitive at nearly one-eighth of the price and with significantly faster generation. Seedream 4.5 and Nano Banana Pro share the highest individual win rate, 87.5%, which suggests they handle the passenger-placement constraints better than lower-ranked premium models.
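To put the Elo gap in perspective, here is a minimal sketch of how a rating difference maps to an expected head-to-head win probability. It assumes the textbook Elo logistic formula with a scale of 400 and a K-factor of 32. The page does not state which rating variant or constants it uses, so `expected_score`, `update`, `scale`, and `k` are illustrative names and values, not the site's actual implementation.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    The scale of 400 is the conventional chess value (an assumption here).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one vote.

    score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie.
    The K-factor of 32 is an assumed value, not one taken from the page.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating sum is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Ratings taken from the rankings: GPT Image 2 (1210) vs Z-Image Turbo (1180).
    p = expected_score(1210, 1180)
    print(f"Expected win probability for the 1210-rated model: {p:.3f}")
    print("Ratings after one win for the leader:", update(1210, 1180, 1.0))
```

Under these assumptions, the 30-point gap between the top two models corresponds to an expected win probability of roughly 54% for the leader, which is consistent with describing the runner-up as highly competitive.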
Elo vs Cost
Elo vs Speed
Competitors
15 models, ranked by Elo
Highlighted Battles
The most competitive head-to-head matchups, selected by closeness and vote count.