Arena Methodology

How We Rank AI Models.

We don't use benchmarks or take sponsorships. Thousands of people vote blind on which image looks better.

The Process

Three steps. No tricks.

1. We give a prompt

Every challenge starts with a prompt — the same text instruction sent to every competing model. Same input, different outputs.

2. You vote blind

You see two images side by side. No model names, no labels, no hints. You just pick the one you think looks better — or call it a tie. Your gut feeling matters here.

3. Math does the rest

Every vote updates both models' ratings using a proven algorithm. Win against a strong model? Big jump. Lose to a weak one? Big drop. Over hundreds of votes, the true ranking emerges.

The Rating System

What the Elo score actually means

You've probably seen Elo ratings in chess. Magnus Carlsen is rated around 2830; a total beginner sits near 400. Higher number = stronger player. We use the same idea for AI models, but with a twist.

We call our rating "Elo" because the concept is familiar — you've seen it in chess, gaming, and across AI comparison platforms like Chatbot Arena and Artificial Analysis. Under the hood, we use TrueSkill, an algorithm developed by Microsoft Research. Like chess Elo, it produces a single number that represents skill. But TrueSkill is smarter about one thing: uncertainty.

Every model has two hidden numbers behind its Elo score:

  • Mu (μ) — The system's best guess of the model's true skill. Think of it as "probably this good."
  • Sigma (σ) — How uncertain the system is about that guess. High sigma means "we're not sure yet." Low sigma means "we're pretty confident."

The Elo score you see on the leaderboard is calculated as:

Elo = 1000 + 10 × (μ − 3σ)

The inner part (μ − 3σ) is a conservative estimate — the lower bound of what we think a model's true skill is. We then scale it to a familiar Elo range starting at 1000, matching scales used by other AI comparison platforms. New models start at exactly 1000 (no uncertainty penalty yet). As they win consistently, their Elo climbs into the 1100–1300 range. You have to prove you're good over many matchups — a few lucky wins won't cheese the leaderboard.
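As a sanity check, the normalization is simple enough to compute by hand. A minimal sketch (the function name is ours, not part of the platform):

```python
def displayed_elo(mu: float, sigma: float) -> float:
    """Conservative Elo: lower bound (mu - 3*sigma), scaled to a familiar range."""
    return 1000 + 10 * (mu - 3 * sigma)

# A brand-new model: mu = 25.0, sigma = 8.333 (the documented starting values).
print(round(displayed_elo(25.0, 8.333), 1))  # → 1000.0

# A hypothetical proven model: uncertainty has shrunk, skill estimate has risen.
print(round(displayed_elo(32.0, 1.5), 1))    # → 1275.0
```

Note how shrinking sigma raises the displayed score even if mu stays put: confidence itself is rewarded.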

Rating Updates

How votes change the ranking

Upset win

A low-ranked model beats a top model. The winner's score jumps significantly, the loser takes a bigger hit. Upsets are the fastest way for a new model to climb.

Expected win

The top model beats a weaker one? Expected. Both ratings barely move. The algorithm already predicted this outcome, so there's not much new information.

Tie

Can't decide? Ties are valid votes. If two similarly ranked models tie, barely anything changes. But if a weak model ties with a strong one, the weak model gets a small boost — holding your own against a champion counts.

New model

Every model starts with high uncertainty (sigma). Early votes cause big swings in either direction. After 20-30 matchups, the rating stabilizes and each vote has a smaller effect. New models find their place quickly.

A model needs at least 4 battles before it appears on any leaderboard. This prevents models with just one lucky win from showing up in the ranking.
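The update dynamics above can be sketched with a simplified TrueSkill win/loss step. This is not the production code — it ignores the draw margin and the dynamics factor τ, and the β value is taken from the parameter table below — but it shows why an upset moves ratings far more than an expected win:

```python
import math

BETA = 4.167  # performance variance, from the parameter table

def _pdf(x: float) -> float:
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def _cdf(x: float) -> float:
    return (1 + math.erf(x / math.sqrt(2))) / 2

def rate_win(winner: tuple, loser: tuple) -> tuple:
    """One win/loss update on (mu, sigma) pairs. Simplified: no draw margin, no tau."""
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = math.sqrt(2 * BETA ** 2 + s_w ** 2 + s_l ** 2)
    t = (mu_w - mu_l) / c           # how favored the winner was going in
    v = _pdf(t) / _cdf(t)           # surprise factor: large for upsets
    w = v * (v + t)                 # how much to shrink uncertainty
    mu_w += s_w ** 2 / c * v
    mu_l -= s_l ** 2 / c * v
    s_w *= math.sqrt(max(1 - s_w ** 2 / c ** 2 * w, 0.0))
    s_l *= math.sqrt(max(1 - s_l ** 2 / c ** 2 * w, 0.0))
    return (mu_w, s_w), (mu_l, s_l)

strong, weak = (30.0, 3.0), (20.0, 3.0)

# Upset: the weak model wins — its mu jumps a lot.
(up_w, _), _ = rate_win(weak, strong)

# Expected: the strong model wins — ratings barely move.
(ex_w, _), _ = rate_win(strong, weak)

print(up_w - 20.0, ex_w - 30.0)  # upset gain is roughly 10x the expected-win gain
```

The asymmetry falls directly out of the surprise factor v: a result the model's prior already predicted carries almost no new information.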

Ranking Scopes

Rankings at every level

Every vote updates ratings at three independent levels simultaneously. A model can be #1 overall but #3 in a specific category — because different prompts test different strengths.

Overall ranking

The big picture. Every vote for a competition type (Text-to-Image or Image Editing) feeds into one overall rating per model. This is what you see on the main leaderboard page: the overall ranking across all challenges combined.

Category ranking

Zoom in by style. Categories like "Photorealism" group related challenges together. A model might dominate photorealistic portraits but struggle with abstract art. Category rankings reveal these strengths and weaknesses.

Challenge ranking

The most granular view. Each challenge is one specific prompt given to all models. Challenge rankings show exactly which model nailed that particular prompt.
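The three-level bookkeeping can be pictured as one rating per (model, scope) pair, with every vote touching three keys at once. A hypothetical sketch — the scope keys and the simple counter standing in for the real TrueSkill update are ours:

```python
# One entry per (model, scope). A counter stands in for the real rating update.
ratings: dict[tuple[str, str], int] = {}

def record_vote(model: str, challenge: str, category: str, comp_type: str) -> None:
    """Each vote updates three independent scopes for the same model."""
    for scope in (f"overall:{comp_type}",
                  f"category:{category}",
                  f"challenge:{challenge}"):
        key = (model, scope)
        ratings[key] = ratings.get(key, 0) + 1

record_vote("model-a", "sunset-portrait", "Photorealism", "text-to-image")
```

Because the three scopes never share state, a model really can sit at #1 overall while trailing in one category.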

Fair Play

What makes it fair

Completely blind

Model names are hidden during voting. You can't be biased toward a brand if you don't know which brand made which image. Names are revealed only after you've voted.

Randomized sides

Left or right position is randomized for every matchup. No model gets an advantage from always being on a particular side.

Same prompt, every model

Within each challenge, every model generates from the exact same prompt. No cherry-picking prompts that favor one model over another.

Exploration-first matchups

Our matchup selector prioritizes models with fewer votes and higher uncertainty. New models get tested quickly against the field, instead of the same popular models battling over and over.
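One way such a selector could work — this is an illustrative sketch, not the platform's actual code; the "votes" and "sigma" fields and the scoring function are our assumptions:

```python
import random

def pick_matchup(models: list[dict]) -> tuple[str, str]:
    """Pick two models, favoring those with few votes and high uncertainty."""
    def need(m: dict) -> float:
        # Higher score = more in need of data: uncertain and under-voted.
        return m["sigma"] / (1 + m["votes"])
    first = max(models, key=need)
    rest = [m for m in models if m is not first]
    second = random.choice(rest)  # a random opponent keeps coverage broad
    return first["name"], second["name"]

models = [
    {"name": "veteran", "votes": 500, "sigma": 1.2},
    {"name": "rookie",  "votes": 3,   "sigma": 8.0},
]
a, b = pick_matchup(models)  # the under-tested "rookie" is always picked first here
```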

Conservative ranking

Subtracting 3× uncertainty from the skill estimate means a model must prove itself consistently. A few lucky wins won't inflate a ranking. Only sustained performance across many matchups pushes a model to the top.

For The Curious

Technical details

The exact parameters behind the system, for those who want to know.

  • μ₀ = 25.0 — Starting skill estimate for every new model. After normalization, new models begin at Elo 1000.
  • σ₀ = 8.333 — Starting uncertainty. High on purpose — we don't know anything yet.
  • β = 4.167 — Performance variance. Accounts for random fluctuations in generation quality.
  • τ = 0.083 — Dynamics factor. Prevents sigma from hitting zero — keeps the system responsive to real skill changes over time.
  • Draw probability = 10% — Expected rate of ties. Calibrated from observed voting patterns.
  • Conservative multiplier = 3 — How many sigmas we subtract. 3σ gives 99.87% confidence — the displayed Elo is almost certainly not an overestimate.
  • Minimum battles = 4 — Models need at least this many matchups to appear on leaderboards.
  • Elo normalization = 1000 + 10× — Offset and scale applied to the conservative score (μ − 3σ) to produce the displayed Elo number.
  • Algorithm — TrueSkill, developed by Microsoft Research.
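These parameters map directly onto the open-source trueskill Python package, if you want to experiment yourself (the variable names are ours):

```python
import trueskill

# All values taken from the parameter table above.
env = trueskill.TrueSkill(
    mu=25.0,                # μ₀: starting skill estimate
    sigma=8.333,            # σ₀: starting uncertainty
    beta=4.167,             # β: performance variance
    tau=0.083,              # τ: dynamics factor
    draw_probability=0.10,  # expected rate of ties
)

new_model = env.create_rating()  # starts at mu=25.0, sigma=8.333
```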

The leaderboard is only as good as the people behind it

Vote blind and help the community figure out which AI image models are actually the best.