The AI Model Leaderboards.
Top 10 models for every modality: image, video, speech, and music. Ranked by performance benchmarks, with pricing and access details where available. Updated weekly.
Text-to-Image
Models that generate images from text prompts. Ranked by Artificial Analysis Image Arena ELO and community quality benchmarks.
| # | Model | Arena ELO | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|
| 1 | Midjourney v7 (Midjourney) | 1267 | ~30s/img | Up to 2048×2048 · v7 ultra mode | Subscription · $10–60/mo | Subscription |
| 2 | GPT-Image 2 (OpenAI) | 1248 | ~15s/img | Up to 1792×1024 · ChatGPT integrated | $0.04 / image | API |
| 3 | Flux 1.2 Ultra (Black Forest Labs) | 1239 | ~12s/img | Up to 4MP · best-in-class detail | $0.055 / image | API |
| 4 | Imagen 4 Ultra (Google) | 1234 | ~10s/img | Up to 2048×2048 · Vertex AI | $0.04 / image | API |
| 5 | Recraft V3 (Recraft) | 1221 | ~18s/img | Brand-style controls · vector output | $0.04 / image | API |
| 6 | Ideogram 3.0 (Ideogram) | 1215 | ~14s/img | Best-in-class text rendering in images | $0.08 / image · API tier | API |
| 7 | Stable Diffusion 3.5 Large (Stability AI) | 1198 | ~6s/img on A100 | 8B params · run anywhere | Open source | Open Source |
| 8 | Flux 1.1 Pro (Black Forest Labs) | 1191 | ~8s/img | Production tier · 1080p+ | $0.04 / image | API |
| 9 | Adobe Firefly Image 4 (Adobe) | 1184 | ~12s/img | Commercial-safe training · CC integrated | Subscription · Adobe CC | Subscription |
| 10 | Leonardo Phoenix 2 (Leonardo.ai) | 1176 | ~10s/img | Style-consistent prompt adherence | Subscription · $10–48/mo | Subscription |
Image Generation & Editing
Models that edit existing images via instruction, inpainting, or guided generation. Ranked by edit fidelity, instruction adherence, and quality preservation.
| # | Model | Edit Arena ELO | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|
| 1 | Flux 1.2 Fill Ultra (Black Forest Labs) | 1281 | ~10s/edit | Best instruction-following on edit prompts | $0.05 / edit | API |
| 2 | GPT-Image 2 Edit (OpenAI) | 1262 | ~14s/edit | Conversational editing · multi-turn refinement | $0.05 / edit | API |
| 3 | Imagen 4 Edit (Google) | 1248 | ~9s/edit | Mask-guided + instruction-guided edits | $0.05 / edit | API |
| 4 | Adobe Firefly Edit 4 (Adobe) | 1241 | ~11s/edit | Generative Fill in Photoshop · commercial-safe | Subscription · Adobe CC | Subscription |
| 5 | Recraft Edit V3 (Recraft) | 1227 | ~16s/edit | Brand-consistent edits · style locks | $0.05 / edit | API |
| 6 | Ideogram Magic Fill 3.0 (Ideogram) | 1213 | ~12s/edit | Text-preserving inpainting | $0.08 / edit | API |
| 7 | Stable Diffusion 3.5 ControlNet (Stability AI) | 1206 | ~5s/edit on A100 | Full open ecosystem · LoRAs · masks | Open source | Open Source |
| 8 | Midjourney Editor v7 (Midjourney) | 1197 | ~25s/edit | Variations, vary region, zoom out | Subscription · $10–60/mo | Subscription |
| 9 | FLUX.1 Kontext Pro (Black Forest Labs) | 1188 | ~7s/edit | Context-aware multi-image editing | $0.04 / edit | API |
| 10 | Leonardo Canvas (Leonardo.ai) | 1172 | ~14s/edit | Inpaint, outpaint, sketch-to-image | Subscription · $10–48/mo | Subscription |
Text-to-Video
Models that generate video clips from text prompts. Ranked by motion quality, prompt adherence, and visual fidelity at native resolution.
| # | Model | Video Arena ELO | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|
| 1 | Veo 3 (Google) | 1294 | ~90s/clip | Up to 8s · 1080p · synchronized audio gen | $0.35 / second | API |
| 2 | Sora 2 (OpenAI) | 1281 | ~120s/clip | Up to 20s · 1080p · best physics | Subscription · ChatGPT Pro+ | Subscription |
| 3 | Kling 2.0 Master (Kuaishou) | 1267 | ~80s/clip | Up to 10s · 1080p · strong human motion | $0.28 / second | API |
| 4 | Runway Gen-4 (Runway) | 1255 | ~60s/clip | Up to 16s · 1080p · production tooling | $0.05 / second · subscription tiers | Subscription |
| 5 | Hailuo 02 (MiniMax) | 1241 | ~70s/clip | Up to 10s · 1080p · cinematic camera | $0.22 / second | API |
| 6 | Pika 2.2 (Pika Labs) | 1228 | ~50s/clip | Up to 10s · 1080p · Pikaframes editing | Subscription · $10–95/mo | Subscription |
| 7 | Luma Ray 2.5 (Luma AI) | 1217 | ~75s/clip | Up to 10s · 1080p · ray-traced lighting | $0.06 / second · subscription tiers | Subscription |
| 8 | Wan 2.5 (Alibaba) | 1204 | ~70s/clip | Up to 10s · 1080p · multilingual prompts | $0.20 / second | API |
| 9 | HunyuanVideo 1.5 (Tencent) | 1196 | ~3min/clip on H100 | 13B params · runs locally | Open source | Open Source |
| 10 | Mochi 1.5 (Genmo) | 1188 | ~4min/clip on H100 | 10B params · Apache 2.0 | Open source | Open Source |
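Per-second prices are easier to compare once they are converted to a per-clip cost. The sketch below does that for the API-priced models in the table, using each model's listed maximum clip length. It assumes billing is linear in generated seconds with no minimum charge or resolution surcharge, which the table does not confirm; `clip_cost` is a hypothetical helper name.

```python
def clip_cost(price_per_second: float, clip_seconds: float) -> float:
    """Cost in USD of one clip at a flat per-second price (assumed linear billing)."""
    return round(price_per_second * clip_seconds, 2)


# (price per second in USD, maximum clip length in seconds), copied from the table.
api_models = {
    "Veo 3": (0.35, 8),
    "Kling 2.0 Master": (0.28, 10),
    "Hailuo 02": (0.22, 10),
    "Wan 2.5": (0.20, 10),
}

for name, (price, seconds) in api_models.items():
    print(f"{name}: ${clip_cost(price, seconds):.2f} per {seconds}s clip")
```

Under these assumptions, a maximum-length Veo 3 clip and a maximum-length Kling 2.0 Master clip cost the same even though their per-second prices differ.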
Image-to-Video
Models that animate or extend a still image into video. Ranked by motion realism, subject preservation, and edit-prompt adherence.
| # | Model | Video Arena ELO (I2V) | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|
| 1 | Kling 2.0 Master I2V (Kuaishou) | 1288 | ~80s/clip | Up to 10s · best human-figure animation | $0.28 / second | API |
| 2 | Runway Gen-4 I2V (Runway) | 1272 | ~55s/clip | Up to 16s · ACT-1 motion transfer | $0.05 / second · subscription tiers | Subscription |
| 3 | Luma Ray 2.5 I2V (Luma AI) | 1258 | ~70s/clip | Up to 10s · keyframe interpolation | $0.06 / second · subscription tiers | Subscription |
| 4 | Hailuo 02 I2V (MiniMax) | 1244 | ~70s/clip | Up to 10s · strong subject fidelity | $0.22 / second | API |
| 5 | Veo 3 I2V (Google) | 1231 | ~85s/clip | Up to 8s · audio-aware animation | $0.35 / second | API |
| 6 | Pika 2.2 I2V (Pika Labs) | 1218 | ~50s/clip | Up to 10s · Pikaffects motion presets | Subscription · $10–95/mo | Subscription |
| 7 | Wan 2.5 I2V (Alibaba) | 1207 | ~70s/clip | Up to 10s · first-frame conditioning | $0.20 / second | API |
| 8 | Stable Video 4D (Stability AI) | 1192 | ~2min/clip on H100 | 4-second clips · multi-view 3D-aware | Open source | Open Source |
| 9 | HunyuanVideo I2V (Tencent) | 1183 | ~3min/clip on H100 | 13B params · open weights | Open source | Open Source |
| 10 | CogVideoX-5B I2V (Zhipu AI) | 1171 | ~90s/clip on A100 | 5B params · efficient inference | Open source | Open Source |
Speech-to-Text
Models that transcribe spoken audio into text. Ranked by Word Error Rate (lower is better) on standard benchmarks (LibriSpeech, FLEURS, AMI).
| # | Model | WER (lower = better) | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|
| 1 | Universal-2 (AssemblyAI) | 1.8% WER | ~0.05× realtime | 99 languages · best diarization in class | $0.27 / hour audio | API |
| 2 | Nova-3 (Deepgram) | 1.9% WER | ~0.04× realtime | 99 languages · streaming-optimized | $0.26 / hour audio | API |
| 3 | Whisper v4 Large (OpenAI) | 2.1% WER | ~0.1× realtime | 99 languages · open weights + API | $0.36 / hour (API) · free self-hosted | API + Open Source |
| 4 | Speechmatics Ursa 2 (Speechmatics) | 2.2% WER | ~0.08× realtime | 55 languages · enterprise-grade accents | Custom enterprise pricing | API |
| 5 | Scribe v1 (ElevenLabs) | 2.4% WER | ~0.05× realtime | 99 languages · speaker tags · sound events | $0.40 / hour audio | API |
| 6 | Chirp 3 (Google) | 2.5% WER | ~0.05× realtime | 125+ languages · Google Cloud STT | $0.36 / hour audio | API |
| 7 | Voxtral Small (Mistral) | 2.7% WER | ~0.1× realtime | 24B params · multilingual native | $0.30 / hour audio | API |
| 8 | Whisper v3 Turbo (OpenAI) | 3.1% WER | ~8× faster than Whisper v3 | Distilled · 8× speedup | Open source | Open Source |
| 9 | Parakeet-TDT 1.1B (NVIDIA) | 3.3% WER | ~0.02× realtime on H100 | Fastest open STT · NeMo framework | Open source | Open Source |
| 10 | Canary-1B (NVIDIA) | 3.6% WER | ~0.03× realtime on H100 | 4 languages · translation built-in | Open source | Open Source |
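For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the scoring pipeline behind the table. It assumes lowercasing and whitespace splitting as the only text normalization, whereas published benchmarks typically apply more (punctuation, numerals, and so on).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.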
Text-to-Speech
Models that generate spoken audio from text. Ranked by TTS Arena ELO and voice quality / naturalness benchmarks.
| # | Model | TTS Arena ELO | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|
| 1 | ElevenLabs v3 (ElevenLabs) | 1276 | ~0.5s latency | 32 languages · emotion control · voice cloning | $0.30 per 1K chars · subscription tiers | API + Subscription |
| 2 | OpenAI tts-1-hd (OpenAI) | 1252 | ~0.8s latency | 6 voices · multilingual · GPT integrated | $30 / 1M chars | API |
| 3 | Cartesia Sonic 2 (Cartesia) | 1247 | ~90ms latency | Lowest latency in class · real-time apps | $0.18 per 1K chars | API |
| 4 | Play.ht 3.0 (Play.ht) | 1232 | ~0.6s latency | 800+ voices · 50+ languages | Subscription · $39–99/mo | Subscription |
| 5 | Hume Octave 2 (Hume AI) | 1218 | ~0.4s latency | Emotional intelligence · voice design | $0.20 per 1K chars | API |
| 6 | Resemble AI v3 (Resemble AI) | 1207 | ~0.5s latency | Voice cloning · localization · 60+ languages | Custom enterprise pricing | API |
| 7 | Coqui XTTS-v3 (Coqui) | 1192 | ~0.7s latency | 17 languages · voice cloning · open weights | Open source | Open Source |
| 8 | Fish Audio S1 (Fish Audio) | 1184 | ~0.5s latency | Multilingual · open community model | $0.15 per 1K chars | API |
| 9 | Kokoro TTS (Hexgrad) | 1171 | ~0.3s latency | 82M params · runs on CPU · Apache 2.0 | Open source | Open Source |
| 10 | Bark Large (Suno) | 1158 | ~1.5s latency | Music + non-verbal sounds · open weights | Open source | Open Source |
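The Pricing column above mixes two units: most entries are quoted per 1K characters, while one is quoted per 1M characters. A small conversion puts them on a common scale. The sketch below takes the listed figures at face value and assumes flat per-character billing with no volume discounts; `per_million_chars` is a hypothetical helper name.

```python
def per_million_chars(price_usd: float, per_chars: int) -> float:
    """Convert a price quoted per `per_chars` characters to USD per 1M characters."""
    return round(price_usd * 1_000_000 / per_chars, 2)


# (listed price in USD, characters that price covers), copied from the table.
listed = {
    "ElevenLabs v3": (0.30, 1_000),
    "OpenAI tts-1-hd": (30.0, 1_000_000),
    "Cartesia Sonic 2": (0.18, 1_000),
    "Hume Octave 2": (0.20, 1_000),
    "Fish Audio S1": (0.15, 1_000),
}

for name, (price, chars) in sorted(listed.items(), key=lambda kv: per_million_chars(*kv[1])):
    print(f"{name}: ${per_million_chars(price, chars):,.2f} per 1M chars")
```

On a common scale, the per-1M-character entry is an order of magnitude cheaper than the per-1K-character entries, which is easy to miss when reading the column as written.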
Music Generation
Models that generate full musical compositions from text prompts. Ranked by composite community evaluations and producer review scores — no consensus benchmark exists.
| # | Model | Composite Score | Speed | Output / Detail | Pricing | Access |
|---|---|---|---|---|---|---|
| 1 | Suno v4.5 (Suno) | 94 | ~30s/song | Up to 4min songs · vocals + instrumentation · 100+ genres | Subscription · $8–24/mo | Subscription |
| 2 | Udio v1.6 (Udio) | 92 | ~35s/song | Up to 2.5min/clip · advanced producer mode | Subscription · $10–30/mo | Subscription |
| 3 | Stable Audio 2.5 (Stability AI) | 86 | ~20s/song | Up to 3min instrumental · commercial use friendly | $0.10 / song · subscription tiers | API + Subscription |
| 4 | Riffusion FUZZ (Riffusion) | 81 | ~25s/song | Full song generation · stems export | Subscription · $9–25/mo | Subscription |
| 5 | MusicGen Large v2 (Meta) | 76 | ~30s/song on A100 | 3.3B params · open weights · text + melody conditioned | Open source | Open Source |
| 6 | AudioCraft Hybrid (Meta) | 73 | ~40s/song on A100 | Music + sound effects + audio editing | Open source | Open Source |
| 7 | AIVA 4.0 (AIVA) | 71 | ~45s/song | Classical/cinematic specialist · MIDI export | Subscription · $11–48/mo | Subscription |
| 8 | Beatoven AI 2 (Beatoven) | 68 | ~60s/song | Background music for videos · royalty-free | Subscription · $20–60/mo | Subscription |
| 9 | Soundraw 3 (Soundraw) | 64 | ~50s/song | Royalty-free · genre and mood controls | Subscription · $17–50/mo | Subscription |
| 10 | YuE Music v1 (M-A-P) | 61 | ~90s/song on A100 | 7B params · open weights · full vocal songs | Open source | Open Source |
Frequently asked questions.
How are the leaderboards ranked?
Each leaderboard uses the most credible benchmark available for its modality. Text-to-Image uses Artificial Analysis Image Arena ELO. Speech-to-Text uses Word Error Rate (WER) on LibriSpeech and FLEURS. Text-to-Speech uses TTS Arena ELO. Video uses the Artificial Analysis Video Arena. Music Generation uses a composite community evaluation because no consensus benchmark exists yet. The methodology is credited in each section.
What's the best text-to-image model in 2026?
Midjourney v7 leads the Image Arena ELO at 1267 as of May 2026, followed closely by GPT-Image 2 (OpenAI) at 1248 and Flux 1.2 Ultra (Black Forest Labs) at 1239. For text rendering specifically (signage, posters, UI mockups), Ideogram 3.0 remains the leader. For brand-safe commercial use, Adobe Firefly Image 4 leads. Best open-source: Stable Diffusion 3.5 Large at 1198 ELO.
What's the best text-to-video model in 2026?
Google Veo 3 holds the top Video Arena ELO at 1294 with synchronized audio generation as a differentiator. OpenAI Sora 2 follows at 1281 with the strongest physics simulation. Kling 2.0 Master leads on human motion realism at 1267. Best open-source: HunyuanVideo 1.5 (Tencent, 13B params).
What's the fastest speech-to-text model?
Among hosted APIs, Deepgram Nova-3 is the fastest at ~0.04× realtime (1 hour of audio in ~2.4 minutes) and posts 1.9% WER on LibriSpeech clean. AssemblyAI Universal-2 narrowly leads on accuracy at 1.8% WER but runs slightly slower at ~0.05× realtime. For self-hosted speed, NVIDIA Parakeet-TDT 1.1B on an H100 is the open-source benchmark at ~0.02× realtime.
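The "× realtime" figures are a ratio of processing time to audio duration, so lower is faster. A minimal sketch of the arithmetic, using the factors quoted in the Speech-to-Text table (`processing_minutes` is a hypothetical helper name, and the calculation ignores upload time and queueing):

```python
def processing_minutes(audio_minutes: float, realtime_factor: float) -> float:
    """Minutes needed to transcribe audio at a given realtime factor."""
    return round(audio_minutes * realtime_factor, 2)


# Realtime factors copied from the Speech-to-Text table.
factors = {
    "Nova-3": 0.04,
    "Universal-2": 0.05,
    "Parakeet-TDT 1.1B (H100)": 0.02,
}

for name, factor in factors.items():
    print(f"{name}: 60 min of audio in ~{processing_minutes(60, factor)} min")
```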
Why do some models show "Open Source" instead of a price?
Open-source models like Stable Diffusion 3.5, Whisper v4, MusicGen, and HunyuanVideo publish their weights and run on your own infrastructure; license terms vary by model. The "Pricing" cell shows "Open source" instead of a per-call price because the cost is your compute (GPU hours, electricity) rather than a vendor fee. This is a performance ranking, not a cost ranking: open-source models earn their rank on the same benchmarks as commercial APIs.
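To compare a self-hosted model against a per-call API price, the compute cost can be estimated from the GPU's hourly rate and the generation time in the Speed column. The sketch below is illustrative only: the $2.00 per GPU-hour rate is a made-up placeholder, not a quoted price, and the calculation assumes the GPU is fully utilized with no idle time, batching, or storage costs.

```python
def self_hosted_cost_per_output(gpu_hourly_usd: float, seconds_per_output: float) -> float:
    """Approximate compute cost in USD of one output on a rented GPU."""
    return round(gpu_hourly_usd / 3600 * seconds_per_output, 4)


# HYPOTHETICAL hourly rate; substitute your provider's actual price.
ASSUMED_GPU_HOURLY_USD = 2.00

# Stable Diffusion 3.5 Large is listed at ~6 s per image on an A100.
image_cost = self_hosted_cost_per_output(ASSUMED_GPU_HOURLY_USD, 6)

# HunyuanVideo 1.5 is listed at ~3 min per clip on an H100 (same placeholder rate).
clip_cost = self_hosted_cost_per_output(ASSUMED_GPU_HOURLY_USD, 180)

print(f"~${image_cost} per image, ~${clip_cost} per clip at the assumed rate")
```

Real deployments rarely keep a GPU fully busy, so the effective per-output cost is usually higher than this lower bound.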
How often is this data updated?
The dataset is reviewed weekly against published benchmarks and vendor releases. New flagship model launches are typically added within 7 days. Quality scores update as new arena evaluations are published — the "Verified May 2026" badge on each section shows the most recent review date.
Why isn't there a single Quality Score across all leaderboards?
Each modality has its own canonical evaluation methodology. Image generation uses pairwise arena rankings (ELO). Speech-to-text uses Word Error Rate (a percentage). Text-to-speech uses arena ELO with different judging criteria. Normalizing these to a single 0-100 score would obscure the actual evaluation signal. We preserve the native benchmark per category and credit the source — builders can trust the signal more than a derivative composite.
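To give a sense of what an arena rating gap means, the standard Elo formula converts a rating difference into an expected head-to-head win rate. The sketch below assumes the arenas use the conventional logistic Elo model with a 400-point scale; the source does not state this, and some arenas fit ratings with related methods such as Bradley-Terry.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected share of pairwise comparisons A wins against B (standard Elo, 400-point scale)."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


# Midjourney v7 (1267) vs GPT-Image 2 (1248), ratings from the Text-to-Image table.
top_two = expected_win_rate(1267, 1248)

# Rank 1 (1267) vs rank 10 (1176) in the same table.
top_vs_tenth = expected_win_rate(1267, 1176)

print(f"#1 vs #2: {top_two:.1%}, #1 vs #10: {top_vs_tenth:.1%}")
```

Under this model, the 19-point gap between the top two image models corresponds to only a slight preference in pairwise votes, which is why small rating differences should not be over-read.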
Where can I see text-LLM rankings?
Text large language models have their own dedicated page, the LLM Cost Calculator, linked in the page header. It tracks the top 100 text LLMs by Artificial Analysis Intelligence Index alongside cost, speed, latency, and context window.
How do you handle models with subscription-only pricing?
Some models (Midjourney, Suno, Udio, Adobe Firefly) are subscription-only with no per-output API pricing. We show "Subscription · $X–$Y/mo" with the range, plus a link to the vendor's pricing page. For builders comparing per-output cost specifically, we recommend either choosing an API-accessible alternative or budgeting at the subscription tier.
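One way to budget across the two pricing models is a break-even count: the number of outputs per month at which a flat subscription costs the same as a per-output API. The sketch below uses figures from the Text-to-Image table and assumes the subscription tier's quota actually covers that many generations, which the table does not state; `break_even_outputs` is a hypothetical helper name.

```python
import math


def break_even_outputs(monthly_fee_usd: float, per_output_usd: float) -> int:
    """Smallest monthly output count at which the API bill reaches the subscription fee."""
    # Rounding first guards against floating-point noise before taking the ceiling.
    return math.ceil(round(monthly_fee_usd / per_output_usd, 9))


# Midjourney's listed range is $10–60/mo; several API models list $0.04 / image.
low_tier = break_even_outputs(10, 0.04)
high_tier = break_even_outputs(60, 0.04)

print(f"Break-even: {low_tier} images/mo at $10, {high_tier} images/mo at $60")
```

Below the break-even count the per-image API is cheaper; above it, the subscription is, provided the plan's generation limits are not exceeded.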
Will you add other categories (3D generation, agents, embeddings)?
Probably, depending on user demand. Text-to-3D, agentic frameworks, and embeddings are tracked as planned categories. Subscribe to the API waitlist above and we'll notify you when they ship.
Missing a model or category?
We add new entries when they earn a leaderboard spot. Suggest a model or modality — we review submissions within 7 days.