Introduction
A crowdsourced, blind benchmark for text-to-speech.
TTS Arena ranks text-to-speech models by ear. You type a line, two anonymous models read it back, and you pick the one that sounds more human. Each vote feeds the leaderboard.
The models stay hidden until you've voted, so the choice is about the audio, not the name attached to it.
Why
There hasn't been a good way to measure how natural a synthetic voice sounds. Word error rate tells you whether speech is intelligible, not whether it sounds alive. Mean opinion scores rely on a small panel in a lab. TTS Arena uses large-scale human preference instead: anyone can listen, compare, and vote, and the resulting leaderboard is open.
Start here
- Voting: how to vote and the rules that keep the board fair.
- Ranking: how votes become a leaderboard.
- Submit a model: add your model, publicly or under a codename.
- Provider API: the HTTP contract your TTS endpoint needs to meet.
Quick facts
- Sign in with Hugging Face to vote; accounts must be at least 30 days old.
- Prompts are English-only for now, capped at 1,000 characters.
- Models are revealed only after you vote.
- TTS Arena is open source under the Apache 2.0 license; the source is on GitHub.
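
As a rough illustration of two of the limits above, here is a minimal Python sketch of a client-side pre-check. It is not TTS Arena's own code: the names `can_vote` and `prompt_fits` are hypothetical, and it assumes the 1,000-character cap counts Python string length and that empty prompts are rejected. The English-only rule is not checked here.

```python
from datetime import datetime, timedelta, timezone

# Limits taken from the quick facts above.
MAX_PROMPT_CHARS = 1000
MIN_ACCOUNT_AGE = timedelta(days=30)


def can_vote(account_created: datetime, now: datetime) -> bool:
    """Return True if the account is at least 30 days old at `now`."""
    return now - account_created >= MIN_ACCOUNT_AGE


def prompt_fits(prompt: str) -> bool:
    """Return True if the prompt is within the character cap.

    Assumption: the cap is measured with len(), and empty prompts
    are treated as invalid.
    """
    return 0 < len(prompt) <= MAX_PROMPT_CHARS


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    created = now - timedelta(days=45)
    print(can_vote(created, now))
    print(prompt_fits("The quick brown fox jumps over the lazy dog."))
```

The service remains the authority on both rules; a check like this only avoids sending a request that would be rejected anyway.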