What does "building with AI" actually mean?

Building with AI means shipping software where one or more LLMs are a structural component — not a wrapper around a hosted chat product. It covers LLM-as-feature work (adding AI to an existing product), LLM-native software (where the LLM is the core value), agentic systems (where the LLM makes decisions and calls tools), and internal AI tooling. Each has different production patterns and reliability bars.

Which model should I use for my project?

It depends on the task. Classification and high-volume routine work usually fits smaller models (Haiku, GPT-4o-mini, Flash). Mid-complexity work like summarization fits mid-tier models. Complex reasoning, agent loops, and code generation usually justify frontier models. Many production systems use multiple models in the same workflow — cheap for routing, frontier for the hard step.

Do I need evals to ship AI features?

For hobby projects, no. For production AI software, yes — meaningfully. A versioned eval set is what lets you ship prompt and model changes confidently. Without evals you're shipping on vibes, and quality drift goes unnoticed until users complain.

What's the difference between RAG and an agent?

RAG (Retrieval-Augmented Generation) retrieves relevant context and injects it into a single prompt. The LLM doesn't make decisions about what to retrieve — that's handled by retrieval logic. An agent is an LLM that decides which tools to call, including possibly retrieval, and adapts based on results. RAG is simpler and often the right choice; agents are more powerful and more complex to ship reliably.

How do I keep LLM costs from getting out of hand?

Five practices: route by complexity (smaller models for routine work), cap agent loops with step limits, filter context to what's necessary, use prompt caching where available, and instrument cost monitoring with alerts before scale. Cost discipline is a build-time decision, not a fix-it-later decision.

How does multi-agent compare to single-agent?

Anthropic's published research shows multi-agent systems perform roughly 90% better than single-agent on hard tasks. But multi-agent is also more complex to build, evaluate, and operate. The right call: start single-agent, measure where it fails, escalate to multi-agent only when single-agent has measurably hit its ceiling.

Building with AI · A Practitioner's Guide

Building software with AI inside it has gotten dramatically easier in the last two years and dramatically harder in the same window. Easier because the models are better, the SDKs are mature, and most of the obvious patterns are documented. Harder because the standards for what counts as "production-grade" have moved — users now expect AI features to work reliably, respond fast, and not silently degrade.

A demo that works once in a notebook is not the same as software users can depend on. The gap between the two is most of the work. This pillar is the practitioner's guide to closing that gap — the architecture decisions, model selection patterns, evaluation discipline, and cost mechanics that separate hobby projects from products that hold up.

It's written for builders. If you're not writing code or making architecture decisions directly, this pillar will be technical-heavy — the broader Pillar 09 on AI Powered Workflows is the better starting point. If you are building, this is the operator-tier reference.

01What "building with AI" actually means

The phrase covers a wide range. It helps to be precise about which kind of building you're doing, because the patterns differ:

LLM-as-feature. Adding AI capability to an existing product. A summarization feature in your note app, a categorization step in your CRM, a draft-generation tool inside your form builder. The LLM is one component among many.
LLM-native software. Software whose core value proposition only exists because of the LLM. Most of the AI-native products launched in 2024-2026 fall here — the LLM is the central component, not a side feature.
Agentic systems. Software where one or more LLMs operate as agents — making decisions, calling tools, maintaining state, working across long-running tasks. The hardest production category, with the highest reliability bar.
Internal tooling. AI-assisted developer tooling, internal automations, and back-office systems. Lower reliability bar than customer-facing work, but the patterns transfer up.

The architecture decisions, evaluation discipline, and cost discipline all scale with the criticality of the use case. A user-facing agentic product has a very different bar than an internal categorization helper. Both are "building with AI" — both have different right answers.

02Architecture patterns that survive production

Four patterns cover most production AI software in 2026. Each has a fit, each has a failure mode.

PATTERN 1

Single call, single response

One LLM call per user action. Input goes in, output comes back, response renders. The simplest pattern and the right answer surprisingly often — summarization, classification, draft generation, Q&A.

When it fits: task is well-defined, context is bounded, no multi-step reasoning needed.

PATTERN 2

RAG (Retrieval-Augmented Generation)

Retrieve relevant context from a knowledge base, inject it into the prompt, generate. The dominant pattern for "AI over your data" use cases. Quality depends almost entirely on retrieval quality, not model quality.

When it fits: answers depend on private/specific data the model doesn't know.

PATTERN 3

Tool-using agent

LLM decides which tools to call to complete a task. APIs, database queries, MCP servers, calculators. The agent reasons about which tool, calls it, integrates the result, decides next step. Powerful — and the highest production complexity.

When it fits: task requires interacting with external systems and adapting based on what comes back.

PATTERN 4

Multi-agent orchestration

Multiple specialized agents coordinating on a complex task. A planner agent decomposes work, specialist agents handle sub-tasks, a synthesizer puts results together. Anthropic's research shows ~90% improvement on hard tasks vs single-agent.

When it fits: task is too complex or context-heavy for a single agent to handle reliably.

The mistake teams make: jumping to Pattern 3 or 4 because they're interesting, when Pattern 1 or 2 would have solved the problem with less complexity and cost. Start with the simplest pattern that could work. Escalate only when the simpler pattern has measurably failed.

03Picking the right model for the job

Model selection in 2026 isn't "pick the best one." Cost varies by 50-100× between frontier and small models. Latency varies by 10×. Quality varies by task, not in general. The right model depends on what you're doing.

Task type	Typical right answer	Why
Classification / categorization	Smaller models (Haiku 4.5, GPT-5.5 Mini, Gemini Flash)	Cheap, fast, capable enough. Save frontier models for harder work.
Summarization of moderate-length text	Mid-tier (Claude Sonnet 4.6, GPT-5.5)	Balance of quality and cost. Streaming UX is good.
Complex reasoning, multi-step tasks	Frontier models (Claude Opus 4.7, GPT-5.5, OpenAI o3, Gemini 3 Pro)	Where the model quality gap matters most. Worth the cost on hard tasks.
Code generation	Claude Opus 4.7, Claude Sonnet 4.5/4.6, or code-specialized models	Code quality compounds — model strength pays back in fewer iterations.
Tool-using agent reasoning	Frontier models with strong tool-use training (Claude Opus 4.7, GPT-5.5)	Agent loops cost compounds; you want the model that makes the right call first.
Long-context document understanding	Claude Opus 4.7 (up to 1M tokens) or Gemini 3 Pro (up to 10M tokens)	Long-context capability matters more than raw quality when context dominates.
High-volume routine tasks	Smallest model that hits quality threshold	At scale, cost dominates. Run evals to confirm a smaller model is good enough.
Self-hosted / data-sensitive deployment	Open-source: DeepSeek V3.1/R1, Kimi K2 Thinking, Qwen	Complete data control, no per-token API costs, tier-one reasoning on top open-source.

The general pattern: route requests by complexity. Cheap models handle the easy tasks; frontier models handle the hard ones. Sophisticated production systems often use multiple models in the same workflow — Haiku 4.5 for routing decisions, Sonnet 4.6 for the actual work, Opus 4.7 only when the work demands it.

04Benchmark performance across the frontier

Public evaluation data shows distinct frontrunners across the categories that matter most in production work. The table below summarizes where each frontier model is leading at time of publication. Numbers shift fast — treat this as a snapshot, not a permanent ranking, and re-check before making major architecture commitments.

Category	Top performer	Key runner-up	Best open-source alternative
Deep reasoning (GPQA Diamond)	Claude 3 Opus (95.4%)	Claude Opus 4.7 (94.2%)	Kimi K2 Thinking / DeepSeek-R1
Agentic coding (SWE Bench)	Claude Opus 4.7 (87.6%)	Claude Sonnet 4.5 (82.0%)	DeepSeek-V4-Pro
Advanced math (AIME 2025)	Gemini 3 Pro (100%)	GPT 5.2 (100%)	Kimi K2 Thinking (99.1%)
Visual reasoning (ARC-AGI 2)	GPT-5.5 (85.0%)	Claude Opus 4.6 (68.8%)	Qwen2.5-VL-32B

Proprietary frontier models

The commercial frontier offers premium, managed intelligence primarily through API integrations or official chat tiers. Three families currently dominate:

Anthropic Claude suite. Claude Opus 4.7 and Claude Sonnet 4.6 lead in long-context document understanding (up to 1 million tokens) and complex software architecture execution. Heavily favored by enterprises prioritizing trust, guardrails, and system reliability.
OpenAI GPT and o-series. GPT-5.5 and the OpenAI o3 series provide broad generalization across analytics, image rendering, and conversational structure. Massive concurrent token throughput makes them well-suited to customer-facing workspace agents.
Google Gemini series. Gemini 3 Pro sets the benchmark for data processing scale, with a 10-million token context window. Optimal for processing entire code repositories, hours of raw video, or deep multimodal document audits.

Leading open-source models

For self-hosting, complete data control, and zero per-token API markups, open-source has caught up materially. Two families to know:

DeepSeek (V3.1 and R1). Uses efficient Mixture-of-Experts architectures to switch between lightweight "non-thinking" generation and heavyweight logical reasoning. Delivers tier-one reasoning performance at a fraction of standard operational costs.
Kimi K2 series (K2.6 / Thinking). Emerged as a dominant force in advanced math and reasoning benchmarks, challenging the major Western labs on technical problem-solving capabilities.

The strategic implication for builders: a 2026 production system that needs deep reasoning, long context, and reliable tool use has Claude Opus 4.7 as the default. One that needs massive context windows over multimodal data leans Gemini 3 Pro. One that needs maximum data control or self-hosting can reach tier-one quality through DeepSeek or Kimi. The frontier is plural, not single.

05Context management and retrieval

Most "the AI hallucinated" complaints in 2026 trace to context, not model quality. The model didn't have what it needed to do the work — either because retrieval was weak, the prompt was incomplete, or context windows weren't managed well.

Four practical principles:

Retrieval quality dominates RAG outcomes.

Most RAG quality problems are retrieval problems, not LLM problems. Investing in better chunking, hybrid search (keyword + semantic), and metadata filtering pays back faster than upgrading models.

Context windows are not free even when they're large.

1M-token context windows exist. Using them costs real money and slows responses. Filter to what's necessary; don't dump everything in.

System prompts are infrastructure.

Treat system prompts like code — version them, test them, measure changes. A change to the system prompt is a deployment.

Memory is its own architecture decision.

For multi-turn agents, decide explicitly how memory works — what persists, what summarizes, what gets discarded. Default behavior is rarely right at scale.

# Bad pattern — dump everything into context prompt = f"Here's everything in the database: {entire_db}. Now answer..." # Good pattern — retrieve, filter, structure relevant = retrieve(query=user_input, k=5, filter={"recent": True}) prompt = format_with_context(user_input, relevant)

06Evaluation and observability

The single biggest delta between hobby AI projects and production AI projects: evals.

A hobby project ships when "it looks good in the demo." A production project ships when "it scores above threshold on the eval set, doesn't regress on the safety eval, and the latency p95 is acceptable." Without evals, you can't know if a prompt change is an improvement, a regression, or a coin flip.

What a real eval setup includes:

Reference test set. 50-500 representative inputs with expected outputs (or scoring criteria). Curated, versioned, treated as production data.
Automated scoring. Either rule-based (for structured tasks) or LLM-as-judge (for open-ended). Reproducible runs.
Regression tracking. Every prompt change, model change, or pipeline change runs the eval. Scores tracked over time.
Safety / refusal eval. Separate set covering edge cases — prompt injection attempts, harmful requests, ambiguous edge cases. Run on every change.
Production logging. Inputs, outputs, model used, tokens, latency, cost, user feedback signals. Mineable for the next eval expansion.

Teams who don't have this end up shipping prompt changes on vibes, getting surprised when production quality drops, and arguing about whether the new model is "really better." Teams who do have this can ship confidently every week.

07Cost discipline at scale

LLM API costs in 2026 are 90%+ lower than 2023. The economics of AI features are real. They're also easy to mismanage.

Five patterns that cause runaway cost:

Agent loops with no termination guard. An agent that keeps calling tools without a max-step limit. One bug, one bad input — and you're billing for a thousand iterations on a single user action.
Frontier model for routine tasks. Using Opus or GPT-4.5+ for classification. Spending 50× what Haiku or Mini would have cost for the same answer.
Verbose context. Dumping unnecessary context into every prompt. Pays per token both directions, every request.
No caching. Identical or near-identical prompts being re-computed. Prompt caching exists; use it.
No observability. Cost surprises arrive monthly because nobody's watching daily. Set up dashboards before you scale, not after.

The discipline: budget per workflow, monitor in real time, alert on threshold breaches, and run cost evals when changing models or prompts. A change that improves quality by 5% and increases cost by 10× is rarely worth it. A change that drops cost by 50% with 3% quality reduction often is.

The cost rule

Cost is a quality dimension. A feature that ships great quality at 10× the budget didn't ship — it leaked.

08Latency, streaming, and UX

The user experience of AI software is built on three measurements: time to first token, total response time, and stream smoothness. Each matters for different reasons.

Time to first token (TTFT)

The single most important latency number for user-facing AI. Once tokens start streaming, perceived wait time drops dramatically. Sub-1s TTFT feels instant; over 3s feels broken. Smaller models, prompt caching, and streaming inference all help.

Total response time

How long the full response takes. Matters less for streaming UX but matters absolutely for agentic flows where the user waits for completion before moving on. Multi-step agents need careful budgeting here — each step compounds.

Stream smoothness

Whether tokens stream consistently or in bursts. Smooth streams feel professional; choppy streams feel broken even when total time is identical. Network and infrastructure choices matter more than people expect.

A well-designed AI feature usually streams aggressively, shows progress indicators between agent steps, and budgets total response time at the design phase — not after users complain.

▣

Architecture review for your AI build

Senior AI engineer review. Model selection, evals, cost, latency — checked before you scale.

Book a scoping call →

09Production readiness checklist

Before shipping any AI feature to production, run this checklist. Most teams skip 3-5 of these and pay for it later.

Building with AI — Production Readiness

You have a versioned eval set with at least 50 representative inputs.
You have a separate safety / refusal eval covering edge cases.
You can run the full eval suite in CI on every meaningful change.
Cost per user action is measured and budgeted, not assumed.
Time to first token and total latency are measured and within targets.
Agent loops have explicit step limits and fallback behavior.
System prompts are versioned and reviewed like code.
Production logging captures inputs, outputs, models, tokens, latency, cost.
You have an alerting strategy for cost anomalies and quality regressions.
You have a rollback plan when a model or prompt change causes regression.
Sensitive data handling, PII redaction, and access controls are in place.
You've planned for what happens when the model or API is unavailable.

This is the floor for production AI software in 2026. Hobby projects can skip most of these. Software users depend on shouldn't.

10Where AI ARMY fits

AI ARMY's engineering work covers architecture review, custom agent builds, eval design, and production deployment for teams shipping AI software. Engagements vary from focused architecture reviews (1-2 weeks, scoped to validate a specific design before you scale) through full custom agent builds (4-12 weeks) to embedded engineering partnership for longer-term build programs.

If you're earlier in the journey and trying to figure out whether to build versus buy, the Custom AI Deployments pillar is a better starting point — it covers the build-vs-buy decision before the actual build patterns.

If you're already building and want a sanity check on architecture, evals, or cost before scaling, a scoping call surfaces the right scope in a single conversation.