AI Model Leaderboards

The top 10 models in each of seven categories spanning image, video, speech, and music. Models are ranked by performance benchmarks, with pricing and access details where available. Updated weekly.


Categories tracked: 7
Models ranked: 70
Last updated: July 2026 (live, updated weekly)


Text-to-Image

10 models, sorted by Arena ELO. Verified July 2026.

Models that generate images from text prompts. Ranked by Artificial Analysis Image Arena ELO and community quality benchmarks.

#	Model	Vendor	Arena ELO	Speed	Output / Detail	Pricing	Access
1	GPT Image 2 (high)	OpenAI	1339	~12s/img	ChatGPT integrated · arena leader	$0.05 / image	API
2	MAI-Image-2.5	Microsoft AI	1273	~15s/img	Microsoft Foundry · new June 2 release	$0.04 / image	API
3	HiDream-O1-Image-1.5	HiDream	1264	~14s/img	O1-line reasoning · text rendering specialist	$0.05 / image	API
4	GPT Image 1.5 (high)	OpenAI	1262	~10s/img	Previous-gen ChatGPT image, still strong	$0.04 / image	API
5	Nano Banana 2 (Gemini 3.1 Flash Image)	Google	1254	~8s/img	Photoreal · Gemini-integrated	$0.04 / image	API
6	Reve 2.0	Reve	1218	~12s/img	Native 4K · June 3 release	$0.06 / image	API
7	Midjourney v8	Midjourney	1212	~30s/img	Stylized leader · v8 ultra mode	Subscription · $10–60/mo	Subscription
8	Ideogram 3.5	Ideogram	1198	~14s/img	Best-in-class text rendering in images	$0.08 / image · API tier	API
9	Adobe Firefly Image 5	Adobe	1191	~12s/img	Commercial-safe training · CC integrated	Subscription · Adobe CC	Subscription
10	Stable Diffusion 3.5 Large	Stability AI	1184	~6s/img on A100	8B params · open weights · run anywhere	Open source	Open Source

Image Generation & Editing

10 models, sorted by Edit Arena ELO. Verified July 2026.

Models that edit existing images via instruction, inpainting, or guided generation. Ranked by edit fidelity, instruction adherence, and quality preservation.

#	Model	Vendor	Edit Arena ELO	Speed	Output / Detail	Pricing	Access
1	Flux 1.2 Fill Ultra	Black Forest Labs	1281	~10s/edit	Best instruction-following on edit prompts	$0.05 / edit	API
2	GPT-Image 2 Edit	OpenAI	1262	~14s/edit	Conversational editing · multi-turn refinement	$0.05 / edit	API
3	Imagen 4 Edit	Google	1248	~9s/edit	Mask-guided + instruction-guided edits	$0.05 / edit	API
4	Adobe Firefly Edit 4	Adobe	1241	~11s/edit	Generative Fill in Photoshop · commercial-safe	Subscription · Adobe CC	Subscription
5	Recraft Edit V3	Recraft	1227	~16s/edit	Brand-consistent edits · style locks	$0.05 / edit	API
6	Ideogram Magic Fill 3.0	Ideogram	1213	~12s/edit	Text-preserving inpainting	$0.08 / edit	API
7	Stable Diffusion 3.5 ControlNet	Stability AI	1206	~5s/edit on A100	Full open ecosystem · LoRAs · masks	Open source	Open Source
8	Midjourney Editor v7	Midjourney	1197	~25s/edit	Variations, vary region, zoom out	Subscription · $10–60/mo	Subscription
9	FLUX.1 Kontext Pro	Black Forest Labs	1188	~7s/edit	Context-aware multi-image editing	$0.04 / edit	API
10	Leonardo Canvas	Leonardo.ai	1172	~14s/edit	Inpaint, outpaint, sketch-to-image	Subscription · $10–48/mo	Subscription

Text-to-Video

10 models, sorted by Video Arena ELO. Verified July 2026.

Models that generate video clips from text prompts. Ranked by motion quality, prompt adherence, and visual fidelity at native resolution.

#	Model	Vendor	Video Arena ELO	Speed	Output / Detail	Pricing	Access
1	Dreamina Seedance 2.0 720p	ByteDance Seed	1219	~60s/clip	Up to 10s · 720p · arena leader	$9.07 / min	API
2	HappyHorse-1.1	Alibaba-ATH	1151	~70s/clip	Up to 10s · 1080p · fast-rising	$9.90 / min	API
3	HappyHorse-1.0	Alibaba-ATH	1123	~80s/clip	Up to 10s · 1080p	$13.20 / min	API
4	SkyReels V4	Skywork AI	1106	~90s/clip	Up to 10s · 1080p · narrative-friendly	$21.00 / min	API
5	Kling 3.0 1080p (Pro)	KlingAI	1104	~80s/clip	Up to 10s · 1080p · Pro tier	$20.16 / min	API
6	Kling 3.0 Omni 1080p (Pro)	KlingAI	1098	~80s/clip	Up to 10s · 1080p · multimodal Omni	$16.80 / min	API
7	Kling 3.0 720p (Standard)	KlingAI	1097	~70s/clip	Up to 10s · 720p · standard tier	$15.12 / min	API
8	Kling 3.0 Omni 720p (Standard)	KlingAI	1094	~70s/clip	Up to 10s · 720p · multimodal Omni	$13.44 / min	API
9	Veo 3.1	Google	1094	~90s/clip	Up to 8s · 1080p · synchronized audio	$24.00 / min	API
10	Wan 2.7	Alibaba	1094	~80s/clip	Up to 10s · 1080p · multilingual prompts	$16.90 / min	API

Image-to-Video

10 models, sorted by Video Arena ELO (I2V). Verified July 2026.

Models that animate or extend a still image into video. Ranked by motion realism, subject preservation, and edit-prompt adherence.

#	Model	Vendor	Video Arena ELO (I2V)	Speed	Output / Detail	Pricing	Access
1	Dreamina Seedance 2.0 720p	ByteDance Seed	1194	~60s/clip	Up to 10s · 720p · I2V leader	$9.07 / min	API
2	HappyHorse-1.1	Alibaba-ATH	1117	~70s/clip	Up to 10s · 1080p · fast-rising	$9.90 / min	API
3	grok-imagine-video-1.5-preview	xAI	1110	~60s/clip	Up to 10s · 1080p · preview	$8.40 / min	API
4	Wan 2.7	Alibaba	1090	~80s/clip	Up to 10s · 1080p · multilingual prompts	$16.90 / min	API
5	HappyHorse-1.0	Alibaba-ATH	1089	~80s/clip	Up to 10s · 1080p	$13.20 / min	API
6	Veo 3.1	Google	1087	~90s/clip	Up to 8s · 1080p · synchronized audio	$24.00 / min	API
7	SkyReels V4	Skywork AI	1082	~90s/clip	Up to 10s · 1080p · narrative-friendly	$21.00 / min	API
8	grok-imagine-video	xAI	1081	~55s/clip	Up to 10s · 1080p · cheapest I2V tier	$4.20 / min	API
9	PixVerse V6	PixVerse	1075	~50s/clip	Up to 8s · 1080p	$6.90 / min	API
10	Veo 3.1 Fast	Google	1075	~45s/clip	Up to 8s · 1080p · speed-optimized	$9.00 / min	API

Speech-to-Text

10 models, sorted by WER (lower is better). Verified July 2026.

Models that transcribe spoken audio into text. Ranked by Artificial Analysis Word Error Rate (AA-WER, lower is better), measured on AA-AgentTalk, VoxPopuli-Cleaned, and Earnings22-Cleaned.

#	Model	Vendor	WER (lower = better)	Speed	Output / Detail	Pricing	Access
1	Fun-Realtime-ASR-preview	Alibaba (Fun)	1.7%	realtime	AA-WER 1.7% · preview	Preview · contact	Preview
2	Scribe v2	ElevenLabs	2.2%	realtime	AA-WER 2.2% · production	$3.67 / 1k min	API
3	MAI-Transcribe-1.5	Microsoft Azure	2.4%	realtime	AA-WER 2.4% · Azure-native	$6.00 / 1k min	API
4	Smallest AI Pulse Pro	Smallest.ai	2.4%	realtime	AA-WER 2.4% · low-cost tier	$3.50 / 1k min	API
5	MAI-Transcribe-1	Microsoft Azure	2.6%	realtime	AA-WER 2.6% · prior Azure gen	$6.00 / 1k min	API
6	Voxtral Small	Mistral	2.8%	realtime	AA-WER 2.8%	$4.00 / 1k min	API
7	Gemini 3.1 Pro (High)	Google	2.8%	~realtime	AA-WER 2.8% · Pro reasoning	$18.15 / 1k min	API
8	Gemini 3 Flash (High)	Google	2.9%	realtime	AA-WER 2.9% · Flash tier	$13.70 / 1k min	API
9	Gemini 2.5 Pro	Google	2.9%	~realtime	AA-WER 2.9% · prior Pro gen	$11.39 / 1k min	API
10	Solaria-3	Gladia	3.2%	realtime	AA-WER 3.2% · multilingual	$10.16 / 1k min	API

Text-to-Speech

10 models, sorted by TTS Arena ELO. Verified July 2026.

Models that generate spoken audio from text. Ranked by TTS Arena ELO and voice quality / naturalness benchmarks.

#	Model	Vendor	TTS Arena ELO	Speed	Output / Detail	Pricing	Access
1	Sonic 3.5	Cartesia	1218	~realtime	Quality ELO 1218 · arena leader	$0.03 / 1k chars	API
2	Gemini 3.1 Flash TTS	Google	1216	~realtime	Quality ELO 1216 · Flash-tier price	$0.075 / 1k chars	API
3	Fun-Realtime-TTS	Alibaba (Fun)	1210	realtime	Quality ELO 1210 · preview	Preview · contact	Preview
4	Realtime TTS-2 (Research Preview)	OpenAI	1202	realtime	Quality ELO 1202 · research preview	Preview · OpenAI tier	Preview
5	xAI Text to Speech	xAI	1197	~realtime	Quality ELO 1197	$0.06 / 1k chars	API
6	ElevenLabs v3 Multilingual	ElevenLabs	1188	~realtime	32 languages · vocal-realism leader	Subscription · $5–330/mo	Subscription
7	PlayHT 3.0 Conversational	PlayHT	1172	~realtime	Conversational tuning · streaming	$0.06 / 1k chars · API tier	API
8	Azure Neural Voice Pro	Microsoft Azure	1157	~realtime	Enterprise-grade · 140+ languages	$0.024 / 1k chars (Neural)	API
9	OpenAI tts-1-hd	OpenAI	1148	~realtime	6 voices · simple integration	$0.030 / 1k chars	API
10	Coqui XTTS-v2	Coqui AI	1132	~3x realtime on A100	Open weights · voice cloning · 17 languages	Open source	Open Source

Music Generation

10 models, sorted by Composite Score. Verified July 2026.

Models that generate full musical compositions from text prompts. Because no consensus benchmark exists for music, models are ranked by a composite of community evaluations and producer review scores.

#	Model	Vendor	Composite Score	Speed	Output / Detail	Pricing	Access
1	Suno v5	Suno	1293	~25s/song	Quality leader · vocal realism + structure · ELO 1293	Subscription · $10–30/mo	Subscription
2	Suno v5.5	Suno	1278	~25s/song	Refined v5 · easiest "idea → release" workflow	Subscription · $10–30/mo	Subscription
3	Udio v1.5	Udio	1232	~30s/song	Stems + section regeneration · closest to AI DAW	Subscription · $10–30/mo	Subscription
4	ElevenLabs Music	ElevenLabs	1218	~25s/song	Licensed-data trained · 44.1kHz · commercial-safe	Subscription · $5–330/mo	Subscription
5	Google Lyria 3	Google	1205	~30s/song	Vocal-capable (new Feb 2026) · API/platform path	API · Vertex AI pricing	API
6	Stable Audio 2.5	Stability AI	1187	~20s/song	Sound design + instrumental beds · commercial use	$0.10 / song · subscription tiers	API
7	AIVA 4.0	AIVA	1172	~45s/song	Classical/cinematic specialist · MIDI export	Subscription · $11–48/mo	Subscription
8	Riffusion FUZZ	Riffusion	1164	~25s/song	Full song generation · stems export	Subscription · $9–25/mo	Subscription
9	MusicGen Large v2	Meta	1145	~30s/song on A100	3.3B params · open weights · text + melody conditioned	Open source	Open Source
10	YuE Music v1	M-A-P	1128	~90s/song on A100	7B params · open weights · full vocal songs	Open source	Open Source

Frequently asked questions.

How are the leaderboards ranked?

Each leaderboard uses the most credible benchmark available for its modality. Text-to-Image and Text-to-Speech use Artificial Analysis Arena ELO. Image Generation & Editing uses Edit Arena ELO. Speech-to-Text uses Artificial Analysis WER (AA-AgentTalk + VoxPopuli-Cleaned + Earnings22-Cleaned). Video (T2V + I2V) uses the Artificial Analysis Video Arena. Music Generation uses a composite of Suno Arena ELO and community evaluations. The methodology is credited in every section.

What's the best text-to-image model in 2026?

OpenAI GPT Image 2 leads the AA Image Arena at ELO 1339 as of July 2026, followed by Microsoft MAI-Image-2.5 (1273), HiDream-O1-Image-1.5 (1264), GPT Image 1.5 (1262), and Nano Banana 2 / Gemini 3.1 Flash Image (1254). For native 4K output, Reve 2.0 is worth tracking. Ideogram 3.5 remains the text-rendering specialist. For commercial-safe use, Adobe Firefly Image 5 is the highest-ranked option. The best open-weights model is Stable Diffusion 3.5 Large.
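To give a rating gap some intuition, the sketch below converts two arena ratings into an expected head-to-head preference rate. It assumes the classic ELO logistic curve with a 400-point scale. The page does not state the arena's exact rating model, so `expected_win_rate` and its `scale` parameter are illustrative assumptions rather than the arena's published method.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes won by A over B.

    Assumes the classic ELO logistic curve; the arena's actual
    rating model may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings copied from the Text-to-Image table.
gpt_image_2 = 1339
mai_image_2_5 = 1273

p = expected_win_rate(gpt_image_2, mai_image_2_5)
print(f"Expected preference rate for the leader: {p:.1%}")
```

Under that assumption, the 66-point gap between first and second place corresponds to the leader being preferred in roughly 59% of pairwise votes, which is a real but modest edge.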

What's the best text-to-video model in 2026?

ByteDance Dreamina Seedance 2.0 leads the AA Video Arena at ELO 1219 as of July 2026, well ahead of Alibaba HappyHorse-1.1 (1151) and HappyHorse-1.0 (1123). Skywork SkyReels V4 (1106) and the Kling 3.0 family (1094–1104) round out the top tier. Google Veo 3.1 (1094) stands out for synchronized audio.
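Video pricing in the tables is quoted per minute of output, while clips top out at 8 to 10 seconds, so a per-clip figure can be easier to compare. The sketch below prorates the listed price to one maximum-length clip. It assumes billing scales linearly with output seconds, which the page does not confirm; vendors may round up or apply minimums. The helper name `cost_per_clip` is hypothetical.

```python
def cost_per_clip(price_per_minute_usd: float, clip_seconds: float) -> float:
    """Prorate a per-minute price to one clip, assuming linear per-second billing."""
    if clip_seconds < 0:
        raise ValueError("clip_seconds must be non-negative")
    return price_per_minute_usd * clip_seconds / 60.0


# (price per minute in USD, maximum clip length in seconds),
# copied from the Text-to-Video table.
models = {
    "Dreamina Seedance 2.0 720p": (9.07, 10),
    "Kling 3.0 1080p (Pro)": (20.16, 10),
    "Veo 3.1": (24.00, 8),
}

for name, (price, seconds) in models.items():
    print(f"{name}: ${cost_per_clip(price, seconds):.2f} per {seconds}s clip")
```

With these inputs, a maximum-length clip works out to about $1.51 for Seedance 2.0, $3.36 for Kling 3.0 Pro, and $3.20 for Veo 3.1.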

What's the fastest speech-to-text model?

Every model in the top 10 runs at realtime or near-realtime speed, so the leaderboard differentiates on accuracy. On the Artificial Analysis STT leaderboard (AA-AgentTalk + VoxPopuli-Cleaned + Earnings22-Cleaned), Alibaba's Fun-Realtime-ASR-preview leads at AA-WER 1.7% (preview). ElevenLabs Scribe v2 leads production-ready accuracy at 2.2% and $3.67 / 1k min. Microsoft MAI-Transcribe-1.5 follows at 2.4% and $6.00 / 1k min.
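For readers unfamiliar with the metric, Word Error Rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the reference word count. The sketch below is a minimal implementation of that standard definition using word-level edit distance. It is not the Artificial Analysis scoring pipeline, and it skips the text normalization (casing, punctuation) that benchmark harnesses usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")
```

The example transcript has one substituted word and one dropped word against a nine-word reference, so it scores 2/9, or about 22.2%. A leaderboard value of 2.2% means roughly one error per 45 reference words.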

Why do some models show "Open Source" instead of a price?

Open-weights models such as Stable Diffusion 3.5 Large, Coqui XTTS-v2, MusicGen Large v2, and YuE run on your own infrastructure. License terms vary by model, and some restrict commercial use, so check each license before deploying. The "Pricing" cell shows "Open source" instead of a per-call price because the cost is your compute (GPU hours, electricity) rather than a vendor fee. This is a performance ranking, not a cost ranking: open-source models earn their rank on the same benchmarks as commercial APIs.
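A rough way to put a number on that compute cost is to multiply a GPU's hourly rate by the generation time. The sketch below does this for Stable Diffusion 3.5 Large using the ~6s/img A100 figure from the table. The $2.00 hourly rate and the utilization values are placeholder assumptions, not quotes from any provider, and the helper name is hypothetical.

```python
def self_hosted_cost_per_output(
    gpu_hourly_usd: float,
    seconds_per_output: float,
    utilization: float = 1.0,
) -> float:
    """Estimate compute cost per output for a self-hosted model.

    utilization is the fraction of paid GPU time spent generating;
    idle time raises the effective cost per output.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_hourly_usd * seconds_per_output / 3600.0 / utilization


ASSUMED_A100_HOURLY_USD = 2.00  # placeholder rate, not a real quote
SD35_SECONDS_PER_IMAGE = 6      # from the Text-to-Image table (~6s/img on A100)

busy = self_hosted_cost_per_output(ASSUMED_A100_HOURLY_USD, SD35_SECONDS_PER_IMAGE)
half_idle = self_hosted_cost_per_output(
    ASSUMED_A100_HOURLY_USD, SD35_SECONDS_PER_IMAGE, utilization=0.5
)
print(f"Fully utilized GPU: ${busy:.4f} per image")
print(f"50% utilized GPU:   ${half_idle:.4f} per image")
```

Under these assumptions the estimate is about a third of a cent per image on a fully busy GPU and twice that at 50% utilization. The figure excludes storage, networking, and engineering time.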

How often is this data updated?

The dataset is reviewed weekly against published benchmarks and vendor releases. New flagship model launches are typically added within 7 days. Quality scores update as new arena evaluations are published. The "verified" badge on each section shows the most recent review date.

Why isn't there a single Quality Score across all leaderboards?

Each modality has its own canonical evaluation methodology. Image generation uses pairwise arena rankings (ELO). Speech-to-text uses Word Error Rate (a percentage). Text-to-speech uses arena ELO with different judging criteria. Normalizing these to a single 0-100 score would obscure the actual evaluation signal. We preserve the native benchmark per category and credit the source — builders can trust the signal more than a derivative composite.

Where can I see text-LLM rankings?

Text large language models have their own dedicated page: the LLM Cost Calculator. It tracks 100 top text-LLMs by Artificial Analysis Intelligence Index alongside cost, speed, latency, and context window. Link in the page header.

How do you handle models with subscription-only pricing?

Some models (Midjourney, Suno, Udio, Adobe Firefly) are subscription-only, with no per-output API pricing. We show "Subscription · $X–$Y/mo" with the price range, plus a link to the vendor's pricing page. For builders comparing per-output cost specifically, we recommend either choosing an API-accessible alternative or budgeting at the subscription tier.
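One way to budget at the subscription tier is to convert the monthly fee into an effective per-output cost at your expected volume, and to find the volume at which a subscription matches a per-output API price. The sketch below uses the Midjourney range ($10–60/mo) and a $0.05 per-image API price from the Text-to-Image table. The monthly volume of 400 images is a made-up example, plan output limits are not modeled, and both helper names are hypothetical.

```python
def effective_cost_per_output(monthly_fee_usd: float, outputs_per_month: int) -> float:
    """Spread a flat monthly fee across the outputs actually generated."""
    if outputs_per_month <= 0:
        raise ValueError("outputs_per_month must be positive")
    return monthly_fee_usd / outputs_per_month


def break_even_outputs(monthly_fee_usd: float, api_price_per_output_usd: float) -> float:
    """Monthly volume at which the subscription costs the same as the API."""
    if api_price_per_output_usd <= 0:
        raise ValueError("api_price_per_output_usd must be positive")
    return monthly_fee_usd / api_price_per_output_usd


EXAMPLE_MONTHLY_IMAGES = 400  # hypothetical workload
API_PRICE_PER_IMAGE = 0.05    # e.g. a $0.05 / image API from the table

for fee in (10, 60):  # low and high ends of the listed subscription range
    per_image = effective_cost_per_output(fee, EXAMPLE_MONTHLY_IMAGES)
    threshold = break_even_outputs(fee, API_PRICE_PER_IMAGE)
    print(f"${fee}/mo: ${per_image:.3f} per image, break-even near {threshold:.0f} images")
```

At this example volume, the $10 tier comes out cheaper per image than the $0.05 API price while the $60 tier does not. The comparison only holds if the plan actually allows that many outputs.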

Will you add other categories (3D generation, agents, embeddings)?

Probably — based on user demand. Text-to-3D, agentic frameworks, and embeddings are tracked as planned categories. Subscribe to the API waitlist above and we'll notify you when they ship.

Missing a model or category?

We add new entries when they earn a leaderboard spot. Suggest a model or modality — we review submissions within 7 days.
