// AI Model Leaderboards

The AI Model Leaderboards.

Top 10 models for every modality: image, video, speech, and music. Ranked by performance benchmarks, with pricing and access details where available. Updated weekly.

Categories tracked: 7
Models ranked: 70
Last update: May 2026 · live · updated weekly

Text-to-Image

10 models · sorted by Arena ELO · verified May 2026

Models that generate images from text prompts. Ranked by Artificial Analysis Image Arena ELO and community quality benchmarks.

| # | Model | Vendor | Arena ELO | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|---|
| 1 | Midjourney v7 | Midjourney | 1267 | ~30s/img | Up to 2048×2048 · v7 ultra mode | Subscription · $10–60/mo | Subscription |
| 2 | GPT-Image 2 | OpenAI | 1248 | ~15s/img | Up to 1792×1024 · ChatGPT integrated | $0.04 / image | API |
| 3 | Flux 1.2 Ultra | Black Forest Labs | 1239 | ~12s/img | Up to 4MP · best-in-class detail | $0.055 / image | API |
| 4 | Imagen 4 Ultra | Google | 1234 | ~10s/img | Up to 2048×2048 · Vertex AI | $0.04 / image | API |
| 5 | Recraft V3 | Recraft | 1221 | ~18s/img | Brand-style controls · vector output | $0.04 / image | API |
| 6 | Ideogram 3.0 | Ideogram | 1215 | ~14s/img | Best-in-class text rendering in images | $0.08 / image · API tier | API |
| 7 | Stable Diffusion 3.5 Large | Stability AI | 1198 | ~6s/img on A100 | 8B params · run anywhere | Open source | Open Source |
| 8 | Flux 1.1 Pro | Black Forest Labs | 1191 | ~8s/img | Production tier · 1080p+ | $0.04 / image | API |
| 9 | Adobe Firefly Image 4 | Adobe | 1184 | ~12s/img | Commercial-safe training · CC integrated | Subscription · Adobe CC | Subscription |
| 10 | Leonardo Phoenix 2 | Leonardo.ai | 1176 | ~10s/img | Style-consistent prompt adherence | Subscription · $10–48/mo | Subscription |

Image Generation & Editing

10 models · sorted by Edit Arena ELO · verified May 2026

Models that edit existing images via instruction, inpainting, or guided generation. Ranked by edit fidelity, instruction adherence, and quality preservation.

| # | Model | Vendor | Edit Arena ELO | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|---|
| 1 | Flux 1.2 Fill Ultra | Black Forest Labs | 1281 | ~10s/edit | Best instruction-following on edit prompts | $0.05 / edit | API |
| 2 | GPT-Image 2 Edit | OpenAI | 1262 | ~14s/edit | Conversational editing · multi-turn refinement | $0.05 / edit | API |
| 3 | Imagen 4 Edit | Google | 1248 | ~9s/edit | Mask-guided + instruction-guided edits | $0.05 / edit | API |
| 4 | Adobe Firefly Edit 4 | Adobe | 1241 | ~11s/edit | Generative Fill in Photoshop · commercial-safe | Subscription · Adobe CC | Subscription |
| 5 | Recraft Edit V3 | Recraft | 1227 | ~16s/edit | Brand-consistent edits · style locks | $0.05 / edit | API |
| 6 | Ideogram Magic Fill 3.0 | Ideogram | 1213 | ~12s/edit | Text-preserving inpainting | $0.08 / edit | API |
| 7 | Stable Diffusion 3.5 ControlNet | Stability AI | 1206 | ~5s/edit on A100 | Full open ecosystem · LoRAs · masks | Open source | Open Source |
| 8 | Midjourney Editor v7 | Midjourney | 1197 | ~25s/edit | Variations, vary region, zoom out | Subscription · $10–60/mo | Subscription |
| 9 | FLUX.1 Kontext Pro | Black Forest Labs | 1188 | ~7s/edit | Context-aware multi-image editing | $0.04 / edit | API |
| 10 | Leonardo Canvas | Leonardo.ai | 1172 | ~14s/edit | Inpaint, outpaint, sketch-to-image | Subscription · $10–48/mo | Subscription |

Text-to-Video

10 models · sorted by Video Arena ELO · verified May 2026

Models that generate video clips from text prompts. Ranked by motion quality, prompt adherence, and visual fidelity at native resolution.

| # | Model | Vendor | Video Arena ELO | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|---|
| 1 | Veo 3 | Google | 1294 | ~90s/clip | Up to 8s · 1080p · synchronized audio gen | $0.35 / second | API |
| 2 | Sora 2 | OpenAI | 1281 | ~120s/clip | Up to 20s · 1080p · best physics | Subscription · ChatGPT Pro+ | Subscription |
| 3 | Kling 2.0 Master | Kuaishou | 1267 | ~80s/clip | Up to 10s · 1080p · strong human motion | $0.28 / second | API |
| 4 | Runway Gen-4 | Runway | 1255 | ~60s/clip | Up to 16s · 1080p · production tooling | $0.05 / second · subscription tiers | Subscription |
| 5 | Hailuo 02 | MiniMax | 1241 | ~70s/clip | Up to 10s · 1080p · cinematic camera | $0.22 / second | API |
| 6 | Pika 2.2 | Pika Labs | 1228 | ~50s/clip | Up to 10s · 1080p · Pikaframes editing | Subscription · $10–95/mo | Subscription |
| 7 | Luma Ray 2.5 | Luma AI | 1217 | ~75s/clip | Up to 10s · 1080p · ray-traced lighting | $0.06 / second · subscription tiers | Subscription |
| 8 | Wan 2.5 | Alibaba | 1204 | ~70s/clip | Up to 10s · 1080p · multilingual prompts | $0.20 / second | API |
| 9 | HunyuanVideo 1.5 | Tencent | 1196 | ~3min/clip on H100 | 13B params · runs locally | Open source | Open Source |
| 10 | Mochi 1.5 | Genmo | 1188 | ~4min/clip on H100 | 10B params · Apache 2.0 | Open source | Open Source |
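
The per-second prices above turn into a per-clip cost once a clip length is fixed. The sketch below is a rough illustration, not vendor billing logic: it assumes each vendor bills every generated second at the listed rate and that the clip runs to the listed maximum length. The helper name `clip_cost_usd` is hypothetical.

```python
# Listed per-second price (USD) and maximum clip length (seconds),
# copied from the Text-to-Video table. Linear per-second billing is an assumption.
PER_SECOND_MODELS = {
    "Veo 3": (0.35, 8),
    "Kling 2.0 Master": (0.28, 10),
    "Hailuo 02": (0.22, 10),
    "Wan 2.5": (0.20, 10),
}


def clip_cost_usd(price_per_second: float, clip_seconds: float) -> float:
    """Cost of one clip, assuming billing is linear in generated seconds."""
    if price_per_second < 0 or clip_seconds < 0:
        raise ValueError("price and clip length must be non-negative")
    return price_per_second * clip_seconds


for name, (price, seconds) in PER_SECOND_MODELS.items():
    print(f"{name}: ${clip_cost_usd(price, seconds):.2f} per {seconds}s clip")
```

Under these assumptions an 8-second Veo 3 clip and a 10-second Kling 2.0 Master clip both come to about $2.80, so the per-second price alone does not settle a cost comparison.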

Image-to-Video

10 models · sorted by Video Arena ELO (I2V) · verified May 2026

Models that animate or extend a still image into video. Ranked by motion realism, subject preservation, and edit-prompt adherence.

| # | Model | Vendor | Video Arena ELO (I2V) | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|---|
| 1 | Kling 2.0 Master I2V | Kuaishou | 1288 | ~80s/clip | Up to 10s · best human-figure animation | $0.28 / second | API |
| 2 | Runway Gen-4 I2V | Runway | 1272 | ~55s/clip | Up to 16s · ACT-1 motion transfer | $0.05 / second · subscription tiers | Subscription |
| 3 | Luma Ray 2.5 I2V | Luma AI | 1258 | ~70s/clip | Up to 10s · keyframe interpolation | $0.06 / second · subscription tiers | Subscription |
| 4 | Hailuo 02 I2V | MiniMax | 1244 | ~70s/clip | Up to 10s · strong subject fidelity | $0.22 / second | API |
| 5 | Veo 3 I2V | Google | 1231 | ~85s/clip | Up to 8s · audio-aware animation | $0.35 / second | API |
| 6 | Pika 2.2 I2V | Pika Labs | 1218 | ~50s/clip | Up to 10s · Pikaffects motion presets | Subscription · $10–95/mo | Subscription |
| 7 | Wan 2.5 I2V | Alibaba | 1207 | ~70s/clip | Up to 10s · first-frame conditioning | $0.20 / second | API |
| 8 | Stable Video 4D | Stability AI | 1192 | ~2min/clip on H100 | 4-second clips · multi-view 3D-aware | Open source | Open Source |
| 9 | HunyuanVideo I2V | Tencent | 1183 | ~3min/clip on H100 | 13B params · open weights | Open source | Open Source |
| 10 | CogVideoX-5B I2V | Zhipu AI | 1171 | ~90s/clip on A100 | 5B params · efficient inference | Open source | Open Source |

Speech-to-Text

10 models · sorted by WER (lower = better) · verified May 2026

Models that transcribe spoken audio into text. Ranked by Word Error Rate (lower is better) on standard benchmarks (LibriSpeech, FLEURS, AMI).

| # | Model | Vendor | WER (lower = better) | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|---|
| 1 | Universal-2 | AssemblyAI | 1.8% | ~0.05× realtime | 99 languages · best diarization in class | $0.27 / hour audio | API |
| 2 | Nova-3 | Deepgram | 1.9% | ~0.04× realtime | 99 languages · streaming-optimized | $0.26 / hour audio | API |
| 3 | Whisper v4 Large | OpenAI | 2.1% | ~0.1× realtime | 99 languages · open weights + API | $0.36 / hour (API) · free self-hosted | API + Open Source |
| 4 | Speechmatics Ursa 2 | Speechmatics | 2.2% | ~0.08× realtime | 55 languages · enterprise-grade accents | Custom enterprise pricing | API |
| 5 | Scribe v1 | ElevenLabs | 2.4% | ~0.05× realtime | 99 languages · speaker tags · sound events | $0.40 / hour audio | API |
| 6 | Chirp 3 | Google | 2.5% | ~0.05× realtime | 125+ languages · Google Cloud STT | $0.36 / hour audio | API |
| 7 | Voxtral Small | Mistral | 2.7% | ~0.1× realtime | 24B params · multilingual native | $0.30 / hour audio | API |
| 8 | Whisper v3 Turbo | OpenAI | 3.1% | ~8× faster than Whisper v3 | Distilled · 8× speedup | Open source | Open Source |
| 9 | Parakeet-TDT 1.1B | NVIDIA | 3.3% | ~0.02× realtime on H100 | Fastest open STT · NeMo framework | Open source | Open Source |
| 10 | Canary-1B | NVIDIA | 3.6% | ~0.03× realtime on H100 | 4 languages · translation built-in | Open source | Open Source |
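
For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that textbook definition only. The text normalization (lowercasing, whitespace splitting) is an assumption, and the benchmarks named above may normalize differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.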

Text-to-Speech

10 models · sorted by TTS Arena ELO · verified May 2026

Models that generate spoken audio from text. Ranked by TTS Arena ELO and voice quality / naturalness benchmarks.

| # | Model | Vendor | TTS Arena ELO | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|---|
| 1 | ElevenLabs v3 | ElevenLabs | 1276 | ~0.5s latency | 32 languages · emotion control · voice cloning | $0.30 per 1K chars · subscription tiers | API + Subscription |
| 2 | OpenAI tts-1-hd | OpenAI | 1252 | ~0.8s latency | 6 voices · multilingual · GPT integrated | $30 / 1M chars | API |
| 3 | Cartesia Sonic 2 | Cartesia | 1247 | ~90ms latency | Lowest latency in class · real-time apps | $0.18 per 1K chars | API |
| 4 | Play.ht 3.0 | Play.ht | 1232 | ~0.6s latency | 800+ voices · 50+ languages | Subscription · $39–99/mo | Subscription |
| 5 | Hume Octave 2 | Hume AI | 1218 | ~0.4s latency | Emotional intelligence · voice design | $0.20 per 1K chars | API |
| 6 | Resemble AI v3 | Resemble AI | 1207 | ~0.5s latency | Voice cloning · localization · 60+ languages | Custom enterprise pricing | API |
| 7 | Coqui XTTS-v3 | Coqui | 1192 | ~0.7s latency | 17 languages · voice cloning · open weights | Open source | Open Source |
| 8 | Fish Audio S1 | Fish Audio | 1184 | ~0.5s latency | Multilingual · open community model | $0.15 per 1K chars | API |
| 9 | Kokoro TTS | Hexgrad | 1171 | ~0.3s latency | 82M params · runs on CPU · Apache 2.0 | Open source | Open Source |
| 10 | Bark Large | Suno | 1158 | ~1.5s latency | Music + non-verbal sounds · open weights | Open source | Open Source |
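
The pricing column above mixes two units (per 1K characters and per 1M characters), which makes rows hard to compare at a glance. The sketch below rescales the listed API prices to a common per-1M-character figure. The helper name `usd_per_million_chars` is hypothetical, and the sketch assumes the listed rates apply linearly with no volume discounts or subscription credits.

```python
def usd_per_million_chars(price_usd: float, per_chars: int) -> float:
    """Rescale a price quoted per `per_chars` characters to a per-1M-character price."""
    if per_chars <= 0:
        raise ValueError("per_chars must be positive")
    return price_usd * 1_000_000 / per_chars


# (listed price in USD, number of characters that price covers), from the table.
LISTED_PRICES = {
    "ElevenLabs v3": (0.30, 1_000),
    "OpenAI tts-1-hd": (30.00, 1_000_000),
    "Cartesia Sonic 2": (0.18, 1_000),
    "Hume Octave 2": (0.20, 1_000),
    "Fish Audio S1": (0.15, 1_000),
}

normalized = {
    name: usd_per_million_chars(price, per_chars)
    for name, (price, per_chars) in LISTED_PRICES.items()
}
for name, price in sorted(normalized.items(), key=lambda item: item[1]):
    print(f"{name}: ${price:,.0f} per 1M chars")
```

On the listed figures, OpenAI tts-1-hd works out to $30 per 1M characters, against $150 to $300 for the other per-character APIs in the table.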

Music Generation

10 models · sorted by Composite Score · verified May 2026

Models that generate full musical compositions from text prompts. Ranked by composite community evaluations and producer review scores — no consensus benchmark exists.

| # | Model | Vendor | Composite Score | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|---|
| 1 | Suno v4.5 | Suno | 94 | ~30s/song | Up to 4min songs · vocals + instrumentation · 100+ genres | Subscription · $8–24/mo | Subscription |
| 2 | Udio v1.6 | Udio | 92 | ~35s/song | Up to 2.5min/clip · advanced producer mode | Subscription · $10–30/mo | Subscription |
| 3 | Stable Audio 2.5 | Stability AI | 86 | ~20s/song | Up to 3min instrumental · commercial use friendly | $0.10 / song · subscription tiers | API + Subscription |
| 4 | Riffusion FUZZ | Riffusion | 81 | ~25s/song | Full song generation · stems export | Subscription · $9–25/mo | Subscription |
| 5 | MusicGen Large v2 | Meta | 76 | ~30s/song on A100 | 3.3B params · open weights · text + melody conditioned | Open source | Open Source |
| 6 | AudioCraft Hybrid | Meta | 73 | ~40s/song on A100 | Music + sound effects + audio editing | Open source | Open Source |
| 7 | AIVA 4.0 | AIVA | 71 | ~45s/song | Classical/cinematic specialist · MIDI export | Subscription · $11–48/mo | Subscription |
| 8 | Beatoven AI 2 | Beatoven | 68 | ~60s/song | Background music for videos · royalty-free | Subscription · $20–60/mo | Subscription |
| 9 | Soundraw 3 | Soundraw | 64 | ~50s/song | Royalty-free · genre and mood controls | Subscription · $17–50/mo | Subscription |
| 10 | YuE Music v1 | M-A-P | 61 | ~90s/song on A100 | 7B params · open weights · full vocal songs | Open source | Open Source |

Frequently asked questions.

How are the leaderboards ranked?

Each leaderboard uses the most credible benchmark available for that modality. Text-to-Image uses Artificial Analysis Image Arena ELO, and Image Generation & Editing uses Edit Arena ELO. Text-to-Video and Image-to-Video use Artificial Analysis Video Arena ELO. Speech-to-Text uses Word Error Rate (WER) on LibriSpeech, FLEURS, and AMI. Text-to-Speech uses TTS Arena ELO. Music Generation uses a composite community evaluation because no consensus benchmark exists yet. The methodology is credited in every section.

What's the best text-to-image model in 2026?

Midjourney v7 leads the Image Arena ELO at 1267 as of May 2026, followed closely by GPT-Image 2 (OpenAI) at 1248 and Flux 1.2 Ultra (Black Forest Labs) at 1239. For text rendering specifically (signage, posters, UI mockups), Ideogram 3.0 remains the leader. For brand-safe commercial use, Adobe Firefly Image 4 leads. Best open-source: Stable Diffusion 3.5 Large at 1198 ELO.
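
To put those rating gaps in perspective, an Elo difference can be converted into an expected head-to-head preference rate. The sketch below uses the standard Elo logistic curve with a 400-point scale. Whether the arena computes its ratings with exactly this formula is an assumption, so treat the output as a rough guide rather than the arena's own statistic.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings taken from the Text-to-Image leaderboard.
midjourney_v7 = 1267
gpt_image_2 = 1248
sd35_large = 1198

print(f"Midjourney v7 vs GPT-Image 2: {elo_expected_score(midjourney_v7, gpt_image_2):.3f}")
print(f"Midjourney v7 vs SD 3.5 Large: {elo_expected_score(midjourney_v7, sd35_large):.3f}")
```

Under this formula a 19-point lead corresponds to roughly a 53% expected preference rate, which is consistent with describing the top of the table as close.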

What's the best text-to-video model in 2026?

Google Veo 3 holds the top Video Arena ELO at 1294 with synchronized audio generation as a differentiator. OpenAI Sora 2 follows at 1281 with the strongest physics simulation. Kling 2.0 Master leads on human motion realism at 1267. Best open-source: HunyuanVideo 1.5 (Tencent, 13B params).

What's the fastest speech-to-text model?

Among hosted APIs, Deepgram Nova-3 is the fastest at ~0.04× realtime (1 hour of audio in ~2.4 minutes) and posts 1.9% WER on LibriSpeech clean. AssemblyAI Universal-2 narrowly leads on accuracy at 1.8% WER. For self-hosted speed, NVIDIA Parakeet-TDT 1.1B is the fastest open-source entry at ~0.02× realtime on an H100 (1 hour of audio in ~1.2 minutes).
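
The realtime-factor arithmetic is simple to reproduce: processing time is the audio duration multiplied by the factor, and the reciprocal gives the speed-up over realtime. The helper names in the sketch below are illustrative, and it assumes the factor stays constant regardless of audio length or batching.

```python
def processing_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Minutes needed to transcribe `audio_minutes` of audio at a given realtime factor."""
    if audio_minutes < 0 or realtime_factor <= 0:
        raise ValueError("audio_minutes must be >= 0 and realtime_factor must be > 0")
    return audio_minutes * realtime_factor


def speedup_over_realtime(realtime_factor: float) -> float:
    """How many times faster than realtime a given factor is."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be > 0")
    return 1.0 / realtime_factor


for name, factor in [("Nova-3", 0.04), ("Parakeet-TDT 1.1B on H100", 0.02)]:
    minutes = processing_minutes(60, factor)
    print(f"{name}: 1 hour of audio in ~{minutes:.1f} min "
          f"(~{speedup_over_realtime(factor):.0f}x realtime)")
```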

Why do some models show "Open Source" instead of a price?

Open-source models like Stable Diffusion 3.5, Whisper v4, MusicGen, and HunyuanVideo publish their weights and run on your own infrastructure. License terms vary by model, so check them before commercial use. The "Pricing" cell shows "Open source" instead of a per-call price because the cost is your compute (GPU hours, electricity) rather than a vendor fee. This is a performance ranking, not a cost ranking: open-source models earn their rank on the same benchmarks as commercial APIs.
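
A self-hosted compute cost can be estimated from the generation times in the tables once a GPU price is chosen. In the sketch below the $2.00 per GPU-hour rate is a hypothetical placeholder, not a quoted price. It also assumes the GPU is fully utilized and ignores idle time, storage, and engineering effort.

```python
# Hypothetical rental rate for illustration only; substitute your own figure.
ASSUMED_GPU_USD_PER_HOUR = 2.00


def self_hosted_cost_usd(seconds_per_output: float,
                         gpu_usd_per_hour: float = ASSUMED_GPU_USD_PER_HOUR) -> float:
    """Compute cost of one output, assuming a fully utilized GPU billed by the hour."""
    if seconds_per_output < 0 or gpu_usd_per_hour < 0:
        raise ValueError("inputs must be non-negative")
    return gpu_usd_per_hour * seconds_per_output / 3600.0


# Generation times taken from the leaderboard tables.
print(f"SD 3.5 Large (~6s/img): ${self_hosted_cost_usd(6):.4f} per image")
print(f"HunyuanVideo 1.5 (~3min/clip): ${self_hosted_cost_usd(180):.2f} per clip")
```

Real costs are usually higher than this lower bound, because a GPU that sits idle between requests is still billed.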

How often is this data updated?

The dataset is reviewed weekly against published benchmarks and vendor releases. New flagship model launches are typically added within 7 days. Quality scores update as new arena evaluations are published — the "Verified May 2026" badge on each section shows the most recent review date.

Why isn't there a single Quality Score across all leaderboards?

Each modality has its own canonical evaluation methodology. Image generation uses pairwise arena rankings (ELO). Speech-to-text uses Word Error Rate (a percentage, where lower is better). Text-to-speech uses arena ELO with different judging criteria. Normalizing these to a single 0–100 score would obscure the actual evaluation signal. We preserve the native benchmark per category and credit the source, so builders can trust the signal more than they could a derivative composite.

Where can I see text-LLM rankings?

Text large language models have their own dedicated page: the LLM Cost Calculator. It tracks the top 100 text LLMs by Artificial Analysis Intelligence Index alongside cost, speed, latency, and context window. Link in the page header.

How do you handle models with subscription-only pricing?

Some models (Midjourney, Suno, Udio, Adobe Firefly) are subscription-only with no per-output API pricing. We show "Subscription · $X–$Y/mo" with the range, plus a link to the vendor's pricing page. For builders comparing per-output cost specifically, we recommend either choosing an API-accessible alternative or budgeting at the subscription tier.
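
One way to budget at a subscription tier is to compute the monthly volume at which a flat fee matches a per-output API price. The sketch below does that with figures from the Text-to-Image table. It assumes the plan actually allows that many outputs per month, which depends on each vendor's quota, and the helper names are hypothetical.

```python
def effective_cost_per_output(monthly_fee_usd: float, outputs_per_month: int) -> float:
    """Flat subscription fee spread over the outputs actually generated."""
    if outputs_per_month <= 0:
        raise ValueError("outputs_per_month must be positive")
    return monthly_fee_usd / outputs_per_month


def break_even_outputs(monthly_fee_usd: float, api_price_per_output_usd: float) -> float:
    """Monthly volume at which the subscription and the per-output API cost the same."""
    if api_price_per_output_usd <= 0:
        raise ValueError("api_price_per_output_usd must be positive")
    return monthly_fee_usd / api_price_per_output_usd


# Midjourney's listed range ($10 to $60/mo) against a $0.04 / image API price.
for fee in (10, 60):
    volume = break_even_outputs(fee, 0.04)
    print(f"${fee}/mo breaks even with $0.04/image at about {volume:.0f} images per month")
```

Below the break-even volume the per-output API is cheaper. Above it the subscription wins, provided the plan's quota covers the volume.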

Will you add other categories (3D generation, agents, embeddings)?

Probably, depending on user demand. Text-to-3D, agentic frameworks, and embeddings are tracked as planned categories. Subscribe to the API waitlist above and we'll notify you when they ship.

Missing a model or category?

We add new entries when they earn a leaderboard spot. Suggest a model or modality — we review submissions within 7 days.