AI Procurement — Token Pricing Guide 2026

AI Token Pricing:
Understand and Negotiate
Consumption Models

Input vs output token costs, context window economics, prompt caching, model routing, and how to negotiate enterprise AI contracts that don't explode your budget when usage scales.

Editorial Disclosure: Pricing data sourced from publicly published API pricing pages and verified enterprise deal benchmarks. Figures represent early 2026 rates and are subject to change.
3–5×: Output tokens cost vs input tokens
50%: Cost reduction via prompt caching
17×: GPT-4o vs GPT-4o mini price gap
83%: GPT-4 class price drop since 2023

Token pricing is the single most misunderstood cost driver in enterprise AI deployments. Most organisations sign AI vendor contracts without understanding how token costs compound at scale — and discover their AI budget is 3–10× higher than forecast within the first year. This guide is part of our comprehensive AI procurement guide and gives enterprise buyers the quantitative foundation to model, negotiate, and manage token costs effectively.

1. What Are Tokens? A Practical Definition

Tokens are the fundamental unit of measurement for large language models. A token is roughly 3–4 characters of text, or approximately 0.75 words in English. The relationship is not precise — punctuation, whitespace, code, and non-English languages all tokenise differently.

Token Estimation Rules of Thumb

  • 1,000 tokens ≈ 750 words ≈ 1.5 pages of A4 text
  • 1,000 tokens ≈ 500–700 words of Python code
  • 1,000 tokens ≈ 400–500 words of dense JSON/XML
  • 100-page PDF ≈ 50,000–80,000 tokens (document-dependent)
  • 1 hour of meeting transcript ≈ 15,000–25,000 tokens
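
A minimal sketch of these heuristics in code. The 0.75-words-per-token and 4-characters-per-token ratios are the rough English-prose rules of thumb above, not exact tokeniser output:

```python
def estimate_tokens_from_words(word_count: float) -> int:
    """Rough English-prose estimate: ~0.75 words per token."""
    return round(word_count / 0.75)

def estimate_tokens_from_chars(char_count: float) -> int:
    """Rough estimate: ~4 characters per token."""
    return round(char_count / 4)

# A 750-word page of prose lands near 1,000 tokens.
print(estimate_tokens_from_words(750))    # → 1000
print(estimate_tokens_from_chars(4_000))  # → 1000
```

For billing-grade numbers, use the provider's own tokeniser or the token counts returned in API responses; these heuristics are for budget modelling only.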

Every API call consumes tokens in two directions: input tokens (everything sent to the model — your prompt, system instructions, conversation history, retrieved documents) and output tokens (everything the model generates in response). Both are billed, at different rates.

2. Input vs Output Token Pricing

Output tokens cost significantly more than input tokens across all major AI platforms. This asymmetry reflects the higher computational cost of generating text versus processing it. The ratio varies by model but typically ranges from 3:1 to 5:1 (output cost to input cost).

Model | Input (per 1M) | Output (per 1M) | Output:Input Ratio
OpenAI GPT-4o | $2.50 | $10.00 | 4:1
OpenAI GPT-4o mini | $0.15 | $0.60 | 4:1
OpenAI o3-mini | $1.10 | $4.40 | 4:1
Anthropic Claude 3.5 Sonnet | $3.00 | $15.00 | 5:1
Anthropic Claude 3 Haiku | $0.25 | $1.25 | 5:1
Google Gemini 1.5 Pro | $1.25–$2.50* | $5.00–$10.00* | 4:1
Google Gemini 2.0 Flash | $0.10 | $0.40 | 4:1
Meta Llama 3 70B (Groq) | $0.59 | $0.79 | ~1.3:1
AWS Bedrock (Claude 3.5 Sonnet) | $3.00 | $15.00 | 5:1

* Gemini 1.5 Pro pricing is tiered: lower rate for prompts under 128K tokens, higher rate above. Prices as of early 2026; subject to change.

Cost Modelling Insight

When modelling AI costs, do not assume a 50/50 input/output token split. Most enterprise use cases are input-heavy: a RAG query with a 4,000-token context window and 500-token response is 89% input, 11% output. The expensive output tokens are a relatively small share of total cost for retrieval-augmented use cases — but a large share for code generation and long-form content creation where output is proportionally larger.
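
To make the split concrete: the 89/11 figure above is by token count. By dollars, at the GPT-4o rates in the table, input is roughly two-thirds of the cost, because each output token is priced 4× higher. A small sketch:

```python
def cost_split(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Return (input share of tokens, input share of cost) for one call.

    Rates are dollars per 1M tokens.
    """
    in_cost = input_tokens * input_rate_per_m / 1_000_000
    out_cost = output_tokens * output_rate_per_m / 1_000_000
    token_share = input_tokens / (input_tokens + output_tokens)
    cost_share = in_cost / (in_cost + out_cost)
    return token_share, cost_share

# RAG query from the text: 4,000-token context, 500-token answer, GPT-4o rates.
tok, cost = cost_split(4_000, 500, 2.50, 10.00)
print(f"input token share: {tok:.0%}, input cost share: {cost:.0%}")  # → 89%, 67%
```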

3. Context Window Economics

The context window is the maximum number of tokens a model can process in a single API call (combined input and output). Large context windows are powerful — they allow models to reference entire documents, long conversation histories, or complex multi-file codebases. They are also expensive.

Context Window Size | What Fits | Input Cost at $2.50/1M (GPT-4o)
8K tokens | ~6,000 words / 6 pages | $0.02 per call
32K tokens | ~24,000 words / 25 pages | $0.08 per call
128K tokens | ~96,000 words / 100 pages | $0.32 per call
200K tokens | ~150,000 words / 160 pages | $0.50 per call
1M tokens | ~750,000 words / an entire book | $2.50 per call

The critical design decision: should you use large context windows, or retrieve and inject only relevant chunks? Retrieval-Augmented Generation (RAG) that injects 3,000–5,000 tokens of relevant context is typically 5–20× cheaper than stuffing the full 100K-token document into every call. For high-frequency applications, this architecture choice can reduce monthly AI costs by $50K–$500K.
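
A back-of-envelope comparison under assumed volumes (10,000 calls/day, the GPT-4o input rate from the table, a 5,000-token RAG context versus a 100K-token full document stuffed into every call):

```python
RATE_IN = 2.50 / 1_000_000  # GPT-4o input rate, dollars per token (table above)

def monthly_input_cost(tokens_per_call, calls_per_day, days=30):
    """Input-side cost only, in dollars per month."""
    return tokens_per_call * RATE_IN * calls_per_day * days

full_doc = monthly_input_cost(100_000, 10_000)  # whole 100K-token document each call
rag = monthly_input_cost(5_000, 10_000)         # ~5K tokens of retrieved chunks
print(f"full context ${full_doc:,.0f}/mo vs RAG ${rag:,.0f}/mo "
      f"({full_doc / rag:.0f}x)")  # → $75,000 vs $3,750 (20x)
```

The ratio is simply the ratio of tokens sent per call, which is why the savings scale directly with how aggressively retrieval trims the context.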

Context Window Cost Trap: Conversation History

Chat applications that maintain full conversation history send all prior messages as input on every turn. A 20-turn conversation with 300 tokens per turn accumulates 6,000 tokens of history by turn 20, so the 20th API call sends 6,000 tokens of history plus the new message. Early turns are cheap, but because every call resends the growing history, cumulative token costs grow quadratically with conversation length. Implement conversation summarisation at 10–15 turns to reset the effective context window.
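
A quick sketch of the quadratic growth, counting each turn's text as roughly 300 tokens as in the example above:

```python
def conversation_input_tokens(turns, tokens_per_turn=300):
    """Input tokens billed per call when the full history is resent each turn.

    Call t carries t turns' worth of text: (t-1) turns of history plus the
    new ~300-token message.
    """
    return [t * tokens_per_turn for t in range(1, turns + 1)]

calls = conversation_input_tokens(20)
print(calls[0], calls[-1], sum(calls))  # → 300 6000 63000
```

Sending only the latest message would bill 20 × 300 = 6,000 input tokens in total; resending full history bills 63,000 — more than 10× as much, and the gap widens with every additional turn.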

4. Prompt Caching: The Hidden Cost Reducer

All major AI platforms now offer prompt caching — a mechanism that caches repeated prefixes of your input, reducing the cost of re-processing identical system prompts and context on every API call.


Platform | Cache Read Rate | Cache Write Rate | Cache Duration
OpenAI (GPT-4o) | 50% off input price | Full input price | 5–10 minutes (automatic)
Anthropic (Claude) | 90% off input price | 25% premium to write | 5 minutes (configurable)
Google (Gemini 1.5) | 75% off input price | Full input price | 60 minutes (configurable)
AWS Bedrock | Follows underlying model | Follows underlying model | Follows underlying model

For applications with large, consistent system prompts (e.g., a RAG system with a 10,000-token instruction set), prompt caching can reduce effective input token costs by 40–70%. The economics are especially strong for Anthropic's Claude, where cache reads cost only 10% of the full input rate.

Prompt Caching Savings Example Use case: customer service bot with 8,000-token system prompt, 2,000-token RAG context
Per-query input: 10,000 tokens → $0.025 at GPT-4o rates
With the 8K system prompt served from cache (OpenAI's 50% read discount): effective cost = (8,000 × 50% + 2,000) × $2.50/1M = $0.015
At 10,000 queries/day: $250/day uncached → $150/day with caching
Annual saving: ~$36,500 from caching alone on this one application
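
The per-query arithmetic can be parameterised. The function below is a sketch using the read discounts from the platform table above (50% for OpenAI, 90% for Anthropic) and assuming the 8K system prompt is always served from cache:

```python
def cached_query_cost(cached_tokens, uncached_tokens, rate_per_m, read_discount):
    """Per-query input cost (dollars) with the cached prefix billed at a
    discounted read rate. read_discount is the fraction saved on cache reads
    (0.5 = 50% off)."""
    rate = rate_per_m / 1_000_000
    return cached_tokens * rate * (1 - read_discount) + uncached_tokens * rate

openai_cost = cached_query_cost(8_000, 2_000, 2.50, 0.50)  # GPT-4o, 50% off reads
claude_cost = cached_query_cost(8_000, 2_000, 3.00, 0.90)  # Sonnet, 90% off reads
print(f"${openai_cost:.4f}  ${claude_cost:.4f}")  # → $0.0150  $0.0084
```

Note that Claude's deeper read discount more than offsets its higher base rate for this cache-heavy workload, which is the point made above about Anthropic's economics.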

5. Model Routing: Matching Cost to Task

Model routing is the practice of directing different query types to different AI models based on complexity, cost, and quality requirements. This is the single highest-impact optimisation available to enterprises with multiple concurrent AI workloads.

Task Type | Recommended Model Tier | Reason | Cost vs Premium Alternative
Simple FAQ lookup | GPT-4o mini / Gemini 2.0 Flash / Claude Haiku | Binary recall, no reasoning needed | 17–20× cheaper than GPT-4o
Document summarisation | GPT-4o mini or equivalent | Structured extraction, not reasoning | 17× cheaper
Sentiment / classification | GPT-4o mini or fine-tuned small model | Pattern matching, simple output | 17–50× cheaper
Code generation | GPT-4o / Claude 3.5 Sonnet | Complex reasoning, precision required | Full price justified
Complex analysis / reasoning | GPT-4o / Claude 3.5 Sonnet / o3-mini | Multi-step reasoning required | Full price justified
Legal/contract review | Claude 3.5 Sonnet or GPT-4o | Nuance, instruction-following critical | Full price justified
Translation (major languages) | GPT-4o mini or Gemini Flash | High-quality small models match GPT-4o | 17× cheaper
Embeddings generation | text-embedding-3-small (OpenAI) | Dedicated embedding model, high quality | 50–100× cheaper than generation

Enterprises implementing intelligent model routing typically reduce per-query costs by 50–70% with no measurable quality degradation on most tasks. The engineering investment (a routing classifier layer) is typically $50K–$150K — recovering itself within 1–3 months at scale. For detailed guidance on platform selection, see our AI platform contract negotiation guide.
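
A router can be as simple as a lookup keyed on task type. The sketch below is illustrative only — the model names and task taxonomy are assumptions, not a vendor API — and a production deployment would typically replace the hand-labelled task type with a small classifier:

```python
# Illustrative tiers; substitute whichever cheap/premium pair your contract covers.
CHEAP, PREMIUM = "gpt-4o-mini", "gpt-4o"

# Task types that genuinely need multi-step reasoning (per the table above).
COMPLEX_TASKS = {"code_generation", "complex_analysis", "legal_review"}

def route(task_type: str) -> str:
    """Send only reasoning-heavy tasks to the premium tier."""
    return PREMIUM if task_type in COMPLEX_TASKS else CHEAP

print(route("faq_lookup"), route("code_generation"))  # → gpt-4o-mini gpt-4o
```

Even this crude split captures most of the savings: if 70% of traffic is routable to the cheap tier at a 17× price gap, the blended per-query cost falls by well over half.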

6. Cross-Vendor Token Pricing Comparison

Token pricing varies substantially across the major AI platforms. The "best" platform depends on your specific model quality requirements, not just headline price.

Vendor | Flagship Input | Flagship Output | Budget Input | Budget Output
OpenAI | $2.50 (GPT-4o) | $10.00 | $0.15 (4o mini) | $0.60
Anthropic | $3.00 (Sonnet 3.5) | $15.00 | $0.25 (Haiku 3) | $1.25
Google | $1.25 (Gemini 1.5 Pro) | $5.00 | $0.10 (Gemini 2.0 Flash) | $0.40
AWS Bedrock (Claude) | $3.00 | $15.00 | $0.25 | $1.25
AWS Bedrock (Titan) | $0.50 | $1.50 | $0.10 | $0.30
Azure OpenAI (GPT-4o) | $2.50 | $10.00 | $0.15 | $0.60
Mistral Large | $2.00 | $6.00 | $0.10 (Mistral 7B) | $0.30
Meta Llama 3 70B (hosted) | $0.50–$0.90 | $0.70–$0.90 | $0.07 (8B) | $0.07

For workloads where quality parity exists, Google Gemini 2.0 Flash ($0.10 input / $0.40 output) represents the most cost-effective option among the major platforms' commercial API tiers in early 2026. Self-hosted open-source models (Meta Llama, Mistral) offer lower marginal per-token costs but require infrastructure investment and engineering overhead.

7. Building an Enterprise Consumption Model

Before any AI vendor negotiation, you need a defensible consumption model. Without one, you cannot calculate committed spend levels, forecast annual costs, or identify the right discount tier to target.

Monthly Cost Calculation Formula Monthly cost = (Queries/month)
× [(Avg input tokens × Input rate/1M × Cache adjustment) + (Avg output tokens × Output rate/1M)]
× (Model mix weighted cost factor)

where Cache adjustment = 1 − (cache hit rate × cache read discount). Caching discounts input tokens only; never apply it to the output term.

Example (10,000 queries/day, GPT-4o, 5K input + 500 output, 30% cache hit at a 50% read discount → adjustment 0.85):
= 300,000 × [(5,000 × $2.50/1M × 0.85) + (500 × $10.00/1M)]
= 300,000 × [$0.010625 + $0.005]
= 300,000 × $0.015625
= $4,687.50/month → ~$56,250/year

The Five Consumption Variables to Model

  • Query volume: Daily/monthly API calls. Include peak traffic (often 3–5× average for consumer-facing apps). Budget for 2–3× growth in year 2 as adoption increases.
  • Average input tokens: Break down by component — system prompt, conversation history, retrieved context, user message. RAG applications typically have large, variable input token counts.
  • Average output tokens: Model the output length distribution, not just the average. Long-tail outputs (reports, summaries, code) drive disproportionate cost.
  • Model mix: What percentage of queries go to premium vs budget models? A 70/30 split (70% to cheap models) dramatically changes your cost model.
  • Cache hit rate: Estimate based on your system prompt consistency and context reuse patterns. High-volume customer service bots may achieve 60–80% cache hit rates; ad-hoc research tools may be near 0%.
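
The formula above can be sketched as a function. This version applies the cache saving to input tokens only (at an assumed 50% read discount, OpenAI's published rate), which is more conservative than treating the cache factor as a discount on the whole call:

```python
def monthly_cost(queries_per_month, in_tokens, out_tokens,
                 in_rate, out_rate, cache_hit=0.0, cache_discount=0.5,
                 model_mix_factor=1.0):
    """Monthly API cost in dollars. Rates are dollars per 1M tokens.

    Caching discounts the input side only: effective input rate is scaled by
    1 - (hit rate x read discount). model_mix_factor blends across tiers.
    """
    in_cost = in_tokens * in_rate / 1e6 * (1 - cache_hit * cache_discount)
    out_cost = out_tokens * out_rate / 1e6
    return queries_per_month * (in_cost + out_cost) * model_mix_factor

# 10K queries/day (300K/month), GPT-4o, 5K input + 500 output, 30% cache hit.
print(round(monthly_cost(300_000, 5_000, 500, 2.50, 10.00, cache_hit=0.30), 2))
```

Run the function once per use case and per model tier, then sum: a consumption model is just this calculation repeated over your workload inventory with the five variables above as inputs.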

8. Negotiating Committed Spend Discounts

All major AI API providers offer volume discounts for committed annual spend. The discount structure varies significantly by vendor and is negotiable above minimum thresholds.

Vendor | Committed Spend Level | Typical Discount vs List | Commitment Type
OpenAI | $100K/year | 10–15% | Prepay credits or annual commit
OpenAI | $500K/year | 20–30% | Enterprise agreement
OpenAI | $1M+/year | 30–40% | Custom enterprise agreement
Anthropic | $250K/year | 15–20% | Annual prepay
Anthropic | $1M+/year | 25–35% | Enterprise agreement
Google Vertex AI | CUD $100K+ | 20–40% | Committed Use Discount (1–3 year)
AWS Bedrock | Via EDP ($1M+) | 15–25% effective | Enterprise Discount Programme
Azure OpenAI | Via MACC ($1M+) | 15–30% effective | Draws Azure committed spend

The most commercially efficient path for enterprises already on Azure or AWS: route AI workloads through Azure OpenAI (draws MACC) or AWS Bedrock (draws EDP) rather than buying direct from OpenAI or Anthropic. At scale, the effective discount from drawdown can exceed anything negotiable on a standalone AI contract. See our cloud enterprise discount negotiation guide for detailed MACC and EDP strategy.

9. Ten Tactics for Token Pricing Negotiations

Tactic 01
Build a Consumption Model Before Any Negotiation
Never enter an AI API negotiation without a documented consumption model. Your model should show monthly token volumes by model tier, growth projections for 12–36 months, and the committed spend level you're targeting. Vendors respond differently to buyers who demonstrate analytical sophistication — you're more likely to receive accurate discounting information and less likely to be oversold on prepay amounts that don't match your actual usage.
Tactic 02
Use Price Deflation History as Aggressive Leverage
AI API pricing has declined 80%+ since 2023 and continues to fall. If you've received any quote in the last 12 months, compare it against current published list pricing — you may already be overpaying before negotiating. Use the documented price trajectory to argue for steeper discounts, shorter lock-in terms, or MFN provisions that automatically pass on future price reductions.
Tactic 03
Negotiate Blended Rate Across Model Tiers
Most AI API discounts are negotiated at the model level. If you use multiple model tiers (e.g., 70% GPT-4o mini, 30% GPT-4o), negotiate a blended rate applied to total token spend regardless of which model is invoked. This simplifies billing, prevents tier arbitrage, and often achieves a better effective discount than negotiating each model tier individually.
Tactic 04
Lock In Prompt Caching Availability and Rates
Prompt caching is a major cost reduction mechanism, but vendors can change its pricing or availability. In your contract, explicitly confirm that prompt caching will remain available at the same or better rates for the contract term, and that negotiated discounts apply to both cached and non-cached token rates. Some vendors have priced prompt cache reads separately from main token rates — ensure your discount applies consistently to both.
Tactic 05
Demand Free Tier or Credits for Non-Production Environments
Developer environments, testing, QA, and model evaluation all consume tokens but generate no production value. Negotiate a free tier (typically $10K–$50K/year for large enterprise customers) for development and testing workloads, or apply a 50% discount rate for non-production API keys. This is standard in well-negotiated enterprise agreements and rarely offered proactively.
Tactic 06
Use Competitive Open-Source Alternatives as Price Anchors
Meta's Llama 3, Mistral, and other open-source models are now competitive with GPT-4o on many enterprise tasks. Self-hosted Llama 3 70B costs approximately $0.50–$1.00/million tokens in infrastructure — well below commercial API rates. Document that you are evaluating self-hosted deployment as a credible fallback. This shifts the negotiation dynamic: vendors must justify their pricing premium against a viable zero-marginal-cost alternative.
Tactic 07
Commit Below Forecast and Use Overage Protections
AI usage is notoriously difficult to forecast. Commit to 60–70% of your modelled consumption and negotiate overage rates at 10–15% above your committed rate (not list price) for usage above commitment. This protects you if consumption exceeds forecast without exposing you to the full list-price overage rates that apply to uncommitted usage. Build contractual flexibility to increase commitment at agreed-upon rates mid-year if growth exceeds projections.
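
A toy comparison with hypothetical numbers — 1,000M tokens/year forecast, a $2.00/1M committed rate, a 12.5% overage premium (mid-range of the 10–15% above), and 900M tokens actually used:

```python
def annual_spend(actual_m_tokens, committed_m_tokens, committed_rate,
                 overage_premium=0.125):
    """Annual spend under a commit, in dollars: committed volume at the
    discounted rate, overage at the committed rate plus a premium.
    Volumes in millions of tokens; rate in dollars per 1M tokens."""
    overage = max(0.0, actual_m_tokens - committed_m_tokens)
    return (committed_m_tokens * committed_rate
            + overage * committed_rate * (1 + overage_premium))

under_commit = annual_spend(900, 650, 2.00)  # commit ~65% of forecast
full_commit = 1_000 * 2.00                   # prepay full forecast; 100M unused
print(f"${under_commit:,.2f} vs ${full_commit:,.2f}")  # → $1,862.50 vs $2,000.00
```

Under-committing costs slightly more per overage token but avoids paying for the 100M tokens that never materialised; the break-even shifts with the overage premium and your forecast error, which is exactly why the premium is worth negotiating down.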
Tactic 08
Negotiate Annual Rate Review Aligned to Market Benchmarks
Commit to multi-year contracts only with annual rate review rights. Specify that rates are reviewed each year against a published index (e.g., published list pricing for the equivalent tier) and adjusted downward if market rates fall more than 10% below your contracted rate. AI pricing deflation makes this provision extremely valuable — it's the commercial equivalent of a downward-only floating rate.
Tactic 09
Bundle Cloud Commitments with AI Spend
If you have AWS, Azure, or Google Cloud committed spend (EDP, MACC, or GCP CUDs), negotiate to have AI API spend draw down those commitments rather than being billed separately. At $1M+ AI spend on a vendor whose parent is a major cloud provider (Azure OpenAI, AWS Bedrock, Google Vertex AI), the effective discount from committed spend drawdown can be 20–40% better than standalone AI negotiations. This is the most underutilised commercial lever in enterprise AI procurement.
Tactic 10
Right to Audit Token Usage Calculations
Token billing is automated and opaque. Request contractual rights to audit token counting methodology, especially for: (1) how partial tokens are billed, (2) whether system prompt tokens are charged on every call, (3) how tool call tokens in agentic workflows are counted, and (4) whether streaming responses bill differently from synchronous calls. Token counting disputes are rare but occur — having audit rights is a prerequisite for any enterprise-scale commitment.


FAQ: AI Token Pricing

Why are output tokens more expensive than input tokens?
Generating tokens (output) is computationally more expensive than processing them (input). During inference, every output token requires a sequential forward pass conditioned on all prior context, while input tokens can be processed in parallel during prefill. This difference in GPU economics, autoregressive generation versus parallel processing of the prompt, drives the 3–5:1 output-to-input price ratio across all major models.
How do I estimate my organisation's AI token consumption before deploying?
Run a structured 30-day pilot on representative use cases with full API instrumentation logging every call's token consumption. Alternatively, model each use case using the formula: (system prompt tokens) + (average context tokens) + (average user message tokens) = input; (average response length) = output. Apply expected query volume and scale to monthly/annual figures. Add 2× buffer for growth and peak traffic. The pilot approach is significantly more accurate than theoretical modelling.
Should I use pay-as-you-go or committed spend for AI APIs?
Pay-as-you-go for the first 3–6 months of any new AI deployment — the consumption model is too uncertain to commit. Once you have 3 months of production usage data, use that to model annual consumption and negotiate a committed spend agreement targeting 60–70% of expected usage. Committed spend discounts of 20–35% are material at scale. Never commit 100% of forecast consumption — AI usage spikes and forecast errors are common.
Can I negotiate token pricing mid-contract if rates drop significantly?
Without contractual rights to renegotiate, it is difficult. However, AI pricing has fallen so dramatically (80%+ since 2023) that most vendors will engage in repricing discussions with major enterprise customers to avoid losing the account. Your leverage is: (a) documented market rate comparison, (b) credible competitive alternatives, and (c) expansion commitments in exchange for rate reductions. Including an MFN or annual market review clause in your initial contract is far more effective than trying to renegotiate mid-term.