Ranking

Each vote that counts is one head-to-head result: model A beat model B on a controlled prompt from the Random pool. The leaderboard is what falls out when you fit all of those results together.

The rating

We rank models with a Bradley–Terry model - a standard way to turn pairwise wins and losses into a single strength score per competitor. It's fit over the entire vote history at once, so a model's rating reflects every matchup it has been in, not just its recent ones. The result doesn't depend on the order votes arrived in.

Ratings are centered around 1500, so the numbers read like familiar Elo scores. The fit is cached and refreshed as votes come in.
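As a rough illustration of how such a fit works, here is a minimal sketch of a Bradley–Terry fit using the standard iterative (minorize-maximize) update. It is not the site's actual code: the function name, the (winner, loser) vote format, and the 400-points-per-factor-of-ten scale are assumptions borrowed from the usual Elo convention. It also assumes every model has at least one win and one loss, with no regularization.

```python
import math
from collections import defaultdict

def fit_bradley_terry(votes, iters=200, center=1500.0, scale=400.0):
    """Fit Bradley-Terry strengths from a list of (winner, loser) pairs.

    Hypothetical helper for illustration. Assumes winner != loser and that
    every model has at least one win and one loss.
    """
    wins = defaultdict(float)    # total wins per model
    games = defaultdict(float)   # number of votes per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            new[m] = wins[m] / denom
        # Normalise so the geometric mean strength is 1 (average rating = center).
        log_mean = sum(math.log(s) for s in new.values()) / len(new)
        strength = {m: s / math.exp(log_mean) for m, s in new.items()}

    # Map strengths onto an Elo-like scale (assumed: 400 points per factor of 10).
    return {m: center + scale * math.log10(s) for m, s in strength.items()}
```

Because the fit only uses win counts per pair, shuffling the votes gives the same ratings, which is the order-independence described above.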

Rank by the lower bound

A model's rating comes with a confidence interval - wide when there's little data, narrow once it has played a lot. We show the rating but sort by the bottom of that interval.

The effect: a model can't shoot to the top off a handful of lucky wins. It has to be both good and well-tested to rank highly. Each row shows a ± next to its rating so you can see how settled it is.
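A small sketch of that sort, with made-up numbers. The `Row` type and its field names are hypothetical, and `margin` stands for the ± shown next to each rating.

```python
from dataclasses import dataclass

@dataclass
class Row:
    name: str
    rating: float
    margin: float  # half-width of the confidence interval (the ± on the board)

    @property
    def lower_bound(self) -> float:
        return self.rating - self.margin

def rank(rows):
    # Sort by the bottom of the interval, best first.
    return sorted(rows, key=lambda r: r.lower_bound, reverse=True)

rows = [
    Row("lucky-newcomer", 1620, 140),  # lower bound 1480
    Row("veteran", 1580, 15),          # lower bound 1565
]
leaderboard = rank(rows)
```

Here the veteran ranks first despite the lower headline rating, because its interval is much tighter.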

New models

A model needs 100 votes to appear on the board. Below that the rating swings too much to mean anything.
Under 300 votes it's marked Preliminary - ranked normally, but still moving.
Brand-new models with only a few votes are hidden by default. Tick "Show new models with few votes" on the leaderboard to see them, with wide error bars.
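The thresholds above can be sketched as a small classifier. The function names and status strings are hypothetical; only the 100 and 300 vote cutoffs come from the rules above.

```python
MIN_VOTES = 100          # needed to appear on the board
PRELIMINARY_BELOW = 300  # under this, the row is marked Preliminary

def board_status(votes: int) -> str:
    if votes < MIN_VOTES:
        return "hidden"
    if votes < PRELIMINARY_BELOW:
        return "preliminary"
    return "established"

def is_visible(votes: int, show_new: bool = False) -> bool:
    # show_new mirrors the "Show new models with few votes" checkbox.
    return show_new or board_status(votes) != "hidden"
```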

To keep tiny samples from producing absurd numbers (a model that has only ever won would otherwise rate infinitely high), the fit is lightly regularized toward the average. The nudge fades as real votes accumulate and is gone within a few hundred.
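To see why a small pull toward the average helps, consider a simplified case of one model playing an average field. This page does not say how the regularization is implemented. The sketch below assumes one common approach, adding a few virtual wins and virtual losses, and the `prior` value and 400-point scale are illustrative assumptions.

```python
import math

def regularized_rating(wins, losses, prior=2.0, center=1500.0, scale=400.0):
    """Rating of one model against an average opponent.

    `prior` is the number of virtual wins AND virtual losses added
    (an assumed form of regularization, not necessarily the site's).
    """
    p = (wins + prior) / (wins + losses + 2 * prior)
    return center + scale * math.log10(p / (1 - p))

undefeated = regularized_rating(5, 0)             # finite, not infinite
settled = regularized_rating(300, 200)            # with the nudge
raw = regularized_rating(300, 200, prior=0.0)     # without it
```

With 5 wins and no losses the rating stays finite, and at 500 votes the regularized and raw ratings differ by less than one point. In this sketch the prior is fixed, so its influence only shrinks relative to the data rather than being switched off.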

Why blind and pairwise

Knowing a model's name changes how people hear it, so identities stay hidden until after the vote. And "which of these two is better?" is a far easier and more reliable judgment than scoring a single clip out of context - it's how listening tests have always been run.

Only clean votes on first-use Random prompts count. Typed custom prompts, votes flagged by the anti-fraud system, and votes from quarantined accounts are left out of the fit. See Voting.
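As a sketch of that filter: the `Vote` record and every field name below are hypothetical stand-ins for whatever the real vote data looks like.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    winner: str
    loser: str
    prompt_source: str         # assumed values: "random" or "custom"
    first_use: bool            # first time this voter saw the prompt
    flagged: bool              # flagged by the anti-fraud system
    account_quarantined: bool

def counts_toward_fit(v: Vote) -> bool:
    return (
        v.prompt_source == "random"
        and v.first_use
        and not v.flagged
        and not v.account_quarantined
    )

def clean_votes(votes):
    # Keep only the votes that feed the rating fit.
    return [v for v in votes if counts_toward_fit(v)]
```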
