TTS Arena Docs

Introduction

A crowdsourced, blind benchmark for text-to-speech.

TTS Arena ranks text-to-speech models by ear. You type a line, two anonymous models read it back, and you pick the one that sounds more human. Each vote feeds the leaderboard.

The models stay hidden until you've voted, so the choice is about the audio, not the name attached to it.

Open the arena and vote →

Why

There hasn't been a good way to measure how natural a synthetic voice sounds. Word error rate tells you whether speech is intelligible, not whether it sounds alive. Mean opinion scores rely on a small panel in a lab. TTS Arena uses large-scale human preference instead: anyone can listen, compare, and vote, and the resulting leaderboard is open.

Quick facts

  • Sign in with Hugging Face to vote; accounts must be at least 30 days old.
  • Prompts are English-only for now, capped at 1,000 characters.
  • Models are revealed only after you vote.
  • TTS Arena is open source under Apache 2.0; the source is on GitHub.
