AI Procurement — Token Pricing Guide 2026

AI Token Pricing:
Understand and Negotiate
Consumption Models

Input vs output token costs, context window economics, prompt caching, model routing, and how to negotiate enterprise AI contracts that don't explode your budget when usage scales.

Editorial Disclosure: Pricing data sourced from publicly published API pricing pages and verified enterprise deal benchmarks. Figures represent early 2026 rates and are subject to change.
3–5×: Output tokens cost vs input tokens
50%: Cost reduction via prompt caching
17×: GPT-4o vs GPT-4o mini price gap
83%: GPT-4 class price drop since 2023

Token pricing is the single most misunderstood cost driver in enterprise AI deployments. Most organisations sign AI vendor contracts without understanding how token costs compound at scale — and discover their AI budget is 3–10× higher than forecast within the first year. This guide is part of our comprehensive AI procurement guide and gives enterprise buyers the quantitative foundation to model, negotiate, and manage token costs effectively.

1. What Are Tokens? A Practical Definition

Tokens are the fundamental unit of measurement for large language models. A token is roughly 3–4 characters of text, or approximately 0.75 words in English. The relationship is not precise — punctuation, whitespace, code, and non-English languages all tokenise differently.

Token Estimation Rules of Thumb

  • 1,000 tokens ≈ 750 words ≈ 1.5 pages of A4 text
  • 1,000 tokens ≈ 500–700 words of Python code
  • 1,000 tokens ≈ 400–500 words of dense JSON/XML
  • 100-page PDF ≈ 50,000–80,000 tokens (document-dependent)
  • 1 hour of meeting transcript ≈ 15,000–25,000 tokens
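
A minimal sketch of these heuristics in code. The 0.75-words-per-token and 4-characters-per-token ratios are the rough English-prose rules of thumb above, not exact tokeniser output:

```python
def estimate_tokens_from_words(word_count: float) -> int:
    """Rough English-prose estimate: ~0.75 words per token."""
    return round(word_count / 0.75)

def estimate_tokens_from_chars(char_count: float) -> int:
    """Rough estimate: ~4 characters per token."""
    return round(char_count / 4)

# A 750-word page of prose lands near 1,000 tokens.
print(estimate_tokens_from_words(750))    # → 1000
print(estimate_tokens_from_chars(4_000))  # → 1000
```

For billing-grade numbers, use the provider's own tokeniser or the token counts returned in API responses; these heuristics are for budget modelling only.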

Every API call consumes tokens in two directions: input tokens (everything sent to the model — your prompt, system instructions, conversation history, retrieved documents) and output tokens (everything the model generates in response). Both are billed, at different rates.

2. Input vs Output Token Pricing

Output tokens cost significantly more than input tokens across all major AI platforms. This asymmetry reflects the higher computational cost of generating text versus processing it. The ratio varies by model but typically ranges from 3:1 to 5:1 (output cost to input cost).

Model | Input (per 1M) | Output (per 1M) | Output:Input Ratio
OpenAI GPT-4o | $2.50 | $10.00 | 4:1
OpenAI GPT-4o mini | $0.15 | $0.60 | 4:1
OpenAI o3-mini | $1.10 | $4.40 | 4:1
Anthropic Claude 3.5 Sonnet | $3.00 | $15.00 | 5:1
Anthropic Claude 3 Haiku | $0.25 | $1.25 | 5:1
Google Gemini 1.5 Pro | $1.25–$2.50* | $5.00–$10.00* | 4:1
Google Gemini 2.0 Flash | $0.10 | $0.40 | 4:1
Meta Llama 3 70B (Groq) | $0.59 | $0.79 | ~1.3:1
AWS Bedrock (Claude 3.5 Sonnet) | $3.00 | $15.00 | 5:1

* Gemini 1.5 Pro pricing is tiered: lower rate for prompts under 128K tokens, higher rate above. Prices as of early 2026; subject to change.

Cost Modelling Insight

When modelling AI costs, do not assume a 50/50 input/output token split. Most enterprise use cases are input-heavy: a RAG query with a 4,000-token context window and 500-token response is 89% input, 11% output. The expensive output tokens are a relatively small share of total cost for retrieval-augmented use cases — but a large share for code generation and long-form content creation where output is proportionally larger.
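
To make the split concrete: the 89/11 figure above is by token count. By dollars, at the GPT-4o rates in the table, input is roughly two-thirds of the cost, because each output token is priced 4× higher. A small sketch:

```python
def cost_split(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Return (input share of tokens, input share of cost) for one call.

    Rates are dollars per 1M tokens.
    """
    in_cost = input_tokens * input_rate_per_m / 1_000_000
    out_cost = output_tokens * output_rate_per_m / 1_000_000
    token_share = input_tokens / (input_tokens + output_tokens)
    cost_share = in_cost / (in_cost + out_cost)
    return token_share, cost_share

# RAG query from the text: 4,000-token context, 500-token answer, GPT-4o rates.
tok, cost = cost_split(4_000, 500, 2.50, 10.00)
print(f"input token share: {tok:.0%}, input cost share: {cost:.0%}")  # → 89%, 67%
```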

3. Context Window Economics

The context window is the maximum number of tokens a model can process in a single API call (combined input and output). Large context windows are powerful — they allow models to reference entire documents, long conversation histories, or complex multi-file codebases. They are also expensive.

Context Window Size | What Fits | Input Cost at $2.50/1M (GPT-4o)
8K tokens | ~6,000 words / 6 pages | $0.02 per call
32K tokens | ~24,000 words / 25 pages | $0.08 per call
128K tokens | ~96,000 words / 100 pages | $0.32 per call
200K tokens | ~150,000 words / 160 pages | $0.50 per call
1M tokens | ~750,000 words / an entire book | $2.50 per call

The critical design decision: should you use large context windows, or retrieve and inject only relevant chunks? Retrieval-Augmented Generation (RAG) that injects 3,000–5,000 tokens of relevant context is typically 5–20× cheaper than stuffing the full 100K-token document into every call. For high-frequency applications, this architecture choice can reduce monthly AI costs by $50K–$500K.
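
A back-of-envelope comparison under assumed volumes (10,000 calls/day, the GPT-4o input rate from the table, a 5,000-token RAG context versus a 100K-token full document stuffed into every call):

```python
RATE_IN = 2.50 / 1_000_000  # GPT-4o input rate, dollars per token (table above)

def monthly_input_cost(tokens_per_call, calls_per_day, days=30):
    """Input-side cost only, in dollars per month."""
    return tokens_per_call * RATE_IN * calls_per_day * days

full_doc = monthly_input_cost(100_000, 10_000)  # whole 100K-token document each call
rag = monthly_input_cost(5_000, 10_000)         # ~5K tokens of retrieved chunks
print(f"full context ${full_doc:,.0f}/mo vs RAG ${rag:,.0f}/mo "
      f"({full_doc / rag:.0f}x)")  # → $75,000 vs $3,750 (20x)
```

The ratio is simply the ratio of tokens sent per call, which is why the savings scale directly with how aggressively retrieval trims the context.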

Context Window Cost Trap: Conversation History

Chat applications that maintain full conversation history send all prior messages as input on every turn. A 20-turn conversation with 300 tokens per turn accumulates 6,000 tokens of history by turn 20, so the 20th API call sends 6,000 tokens of history plus the new message. Early turns are cheap, but because every call resends the growing history, cumulative token costs grow quadratically with conversation length. Implement conversation summarisation at 10–15 turns to reset the effective context window.
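
A quick sketch of the quadratic growth, counting each turn's text as roughly 300 tokens as in the example above:

```python
def conversation_input_tokens(turns, tokens_per_turn=300):
    """Input tokens billed per call when the full history is resent each turn.

    Call t carries t turns' worth of text: (t-1) turns of history plus the
    new ~300-token message.
    """
    return [t * tokens_per_turn for t in range(1, turns + 1)]

calls = conversation_input_tokens(20)
print(calls[0], calls[-1], sum(calls))  # → 300 6000 63000
```

Sending only the latest message would bill 20 × 300 = 6,000 input tokens in total; resending full history bills 63,000 — more than 10× as much, and the gap widens with every additional turn.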

4. Prompt Caching: The Hidden Cost Reducer

All major AI platforms now offer prompt caching — a mechanism that caches repeated prefixes of your input, reducing the cost of re-processing identical system prompts and context on every API call.


Platform | Cache Read Rate | Cache Write Rate | Cache Duration
OpenAI (GPT-4o) | 50% off input price | Full input price | 5–10 minutes (automatic)
Anthropic (Claude) | 90% off input price | 25% premium to write | 5 minutes (configurable)
Google (Gemini 1.5) | 75% off input price | Full input price | 60 minutes (configurable)
AWS Bedrock | Follows underlying model | Follows underlying model | Follows underlying model

For applications with large, consistent system prompts (e.g., a RAG system with a 10,000-token instruction set), prompt caching can reduce effective input token costs by 40–70%. The economics are especially strong for Anthropic's Claude, where cache reads cost only 10% of the full input rate.

Prompt Caching Savings Example Use case: customer service bot with 8,000-token system prompt, 2,000-token RAG context
Per-query input: 10,000 tokens → $0.025 at GPT-4o rates
With the 8K system prompt served from cache (OpenAI's 50% read discount): effective cost = (8,000 × 50% + 2,000) × $2.50/1M = $0.015
At 10,000 queries/day: $250/day uncached → $150/day with caching
Annual saving: ~$36,500 from caching alone on this one application
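
The per-query arithmetic can be parameterised. The function below is a sketch using the read discounts from the platform table above (50% for OpenAI, 90% for Anthropic) and assuming the 8K system prompt is always served from cache:

```python
def cached_query_cost(cached_tokens, uncached_tokens, rate_per_m, read_discount):
    """Per-query input cost (dollars) with the cached prefix billed at a
    discounted read rate. read_discount is the fraction saved on cache reads
    (0.5 = 50% off)."""
    rate = rate_per_m / 1_000_000
    return cached_tokens * rate * (1 - read_discount) + uncached_tokens * rate

openai_cost = cached_query_cost(8_000, 2_000, 2.50, 0.50)  # GPT-4o, 50% off reads
claude_cost = cached_query_cost(8_000, 2_000, 3.00, 0.90)  # Sonnet, 90% off reads
print(f"${openai_cost:.4f}  ${claude_cost:.4f}")  # → $0.0150  $0.0084
```

Note that Claude's deeper read discount more than offsets its higher base rate for this cache-heavy workload, which is the point made above about Anthropic's economics.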

5. Model Routing: Matching Cost to Task

Model routing is the practice of directing different query types to different AI models based on complexity, cost, and quality requirements. This is the single highest-impact optimisation available to enterprises with multiple concurrent AI workloads.

Task Type | Recommended Model Tier | Reason | Cost vs Premium Alternative
Simple FAQ lookup | GPT-4o mini / Gemini 2.0 Flash / Claude Haiku | Binary recall, no reasoning needed | 17–20× cheaper than GPT-4o
Document summarisation | GPT-4o mini or equivalent | Structured extraction, not reasoning | 17× cheaper
Sentiment / classification | GPT-4o mini or fine-tuned small model | Pattern matching, simple output | 17–50× cheaper
Code generation | GPT-4o / Claude 3.5 Sonnet | Complex reasoning, precision required | Full price justified
Complex analysis / reasoning | GPT-4o / Claude 3.5 Sonnet / o3-mini | Multi-step reasoning required | Full price justified
Legal/contract review | Claude 3.5 Sonnet or GPT-4o | Nuance, instruction-following critical | Full price justified
Translation (major languages) | GPT-4o mini or Gemini Flash | High-quality small models match GPT-4o | 17× cheaper
Embeddings generation | text-embedding-3-small (OpenAI) | Dedicated embedding model, high quality | 50–100× cheaper than generation

Enterprises implementing intelligent model routing typically reduce per-query costs by 50–70% with no measurable quality degradation on most tasks. The engineering investment (a routing classifier layer) is typically $50K–$150K — recovering itself within 1–3 months at scale. For detailed guidance on platform selection, see our AI platform contract negotiation guide.
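
A router can be as simple as a lookup keyed on task type. The sketch below is illustrative only — the model names and task taxonomy are assumptions, not a vendor API — and a production deployment would typically replace the hand-labelled task type with a small classifier:

```python
# Illustrative tiers; substitute whichever cheap/premium pair your contract covers.
CHEAP, PREMIUM = "gpt-4o-mini", "gpt-4o"

# Task types that genuinely need multi-step reasoning (per the table above).
COMPLEX_TASKS = {"code_generation", "complex_analysis", "legal_review"}

def route(task_type: str) -> str:
    """Send only reasoning-heavy tasks to the premium tier."""
    return PREMIUM if task_type in COMPLEX_TASKS else CHEAP

print(route("faq_lookup"), route("code_generation"))  # → gpt-4o-mini gpt-4o
```

Even this crude split captures most of the savings: if 70% of traffic is routable to the cheap tier at a 17× price gap, the blended per-query cost falls by well over half.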

6. Cross-Vendor Token Pricing Comparison

Token pricing varies substantially across the major AI platforms. The "best" platform depends on your specific model quality requirements, not just headline price.

Vendor | Flagship Input | Flagship Output | Budget Input | Budget Output
OpenAI | $2.50 (GPT-4o) | $10.00 | $0.15 (4o mini) | $0.60
Anthropic | $3.00 (Sonnet 3.5) | $15.00 | $0.25 (Haiku 3) | $1.25
Google | $1.25 (Gemini 1.5 Pro) | $5.00 | $0.10 (Gemini 2.0 Flash) | $0.40
AWS Bedrock (Claude) | $3.00 | $15.00 | $0.25 | $1.25
AWS Bedrock (Titan) | $0.50 | $1.50 | $0.10 | $0.30
Azure OpenAI (GPT-4o) | $2.50 | $10.00 | $0.15 | $0.60
Mistral Large | $2.00 | $6.00 | $0.10 (Mistral 7B) | $0.30
Meta Llama 3 70B (hosted) | $0.50–$0.90 | $0.70–$0.90 | $0.07 (8B) | $0.07

For workloads where quality parity exists, Google Gemini 2.0 Flash ($0.10 input / $0.40 output) represents the most cost-effective option among the major platforms' commercial API tiers in early 2026. Self-hosted open-source models (Meta Llama, Mistral) offer lower marginal per-token costs but require infrastructure investment and engineering overhead.

7. Building an Enterprise Consumption Model

Before any AI vendor negotiation, you need a defensible consumption model. Without one, you cannot calculate committed spend levels, forecast annual costs, or identify the right discount tier to target.

Monthly Cost Calculation Formula Monthly cost = (Queries/month)
× [(Avg input tokens × Input rate/1M × Cache adjustment) + (Avg output tokens × Output rate/1M)]
× (Model mix weighted cost factor)

where Cache adjustment = 1 − (cache hit rate × cache read discount). Caching discounts input tokens only; never apply it to the output term.

Example (10,000 queries/day, GPT-4o, 5K input + 500 output, 30% cache hit at a 50% read discount → adjustment 0.85):
= 300,000 × [(5,000 × $2.50/1M × 0.85) + (500 × $10.00/1M)]
= 300,000 × [$0.010625 + $0.005]
= 300,000 × $0.015625
= $4,687.50/month → ~$56,250/year

The Five Consumption Variables to Model

  • Query volume: Daily/monthly API calls. Include peak traffic (often 3–5× average for consumer-facing apps). Budget for 2–3× growth in year 2 as adoption increases.
  • Average input tokens: Break down by component — system prompt, conversation history, retrieved context, user message. RAG applications typically have large, variable input token counts.
  • Average output tokens: Model the output length distribution, not just the average. Long-tail outputs (reports, summaries, code) drive disproportionate cost.
  • Model mix: What percentage of queries go to premium vs budget models? A 70/30 split (70% to cheap models) dramatically changes your cost model.
  • Cache hit rate: Estimate based on your system prompt consistency and context reuse patterns. High-volume customer service bots may achieve 60–80% cache hit rates; ad-hoc research tools may be near 0%.
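
The formula above can be sketched as a function. This version applies the cache saving to input tokens only (at an assumed 50% read discount, OpenAI's published rate), which is more conservative than treating the cache factor as a discount on the whole call:

```python
def monthly_cost(queries_per_month, in_tokens, out_tokens,
                 in_rate, out_rate, cache_hit=0.0, cache_discount=0.5,
                 model_mix_factor=1.0):
    """Monthly API cost in dollars. Rates are dollars per 1M tokens.

    Caching discounts the input side only: effective input rate is scaled by
    1 - (hit rate x read discount). model_mix_factor blends across tiers.
    """
    in_cost = in_tokens * in_rate / 1e6 * (1 - cache_hit * cache_discount)
    out_cost = out_tokens * out_rate / 1e6
    return queries_per_month * (in_cost + out_cost) * model_mix_factor

# 10K queries/day (300K/month), GPT-4o, 5K input + 500 output, 30% cache hit.
print(round(monthly_cost(300_000, 5_000, 500, 2.50, 10.00, cache_hit=0.30), 2))
```

Run the function once per use case and per model tier, then sum: a consumption model is just this calculation repeated over your workload inventory with the five variables above as inputs.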

8. Negotiating Committed Spend Discounts

All major AI API providers offer volume discounts for committed annual spend. The discount structure varies significantly by vendor and is negotiable above minimum thresholds.

Vendor | Committed Spend Level | Typical Discount vs List | Commitment Type
OpenAI | $100K/year | 10–15% | Prepay credits or annual commit
OpenAI | $500K/year | 20–30% | Enterprise agreement
OpenAI | $1M+/year | 30–40% | Custom enterprise agreement
Anthropic | $250K/year | 15–20% | Annual prepay
Anthropic | $1M+/year | 25–35% | Enterprise agreement
Google Vertex AI | CUD $100K+ | 20–40% | Committed Use Discount (1–3 year)
AWS Bedrock | Via EDP ($1M+) | 15–25% effective | Enterprise Discount Programme
Azure OpenAI | Via MACC ($1M+) | 15–30% effective | Draws Azure committed spend

The most commercially efficient path for enterprises already on Azure or AWS: route AI workloads through Azure OpenAI (draws MACC) or AWS Bedrock (draws EDP) rather than buying direct from OpenAI or Anthropic. At scale, the effective discount from drawdown can exceed anything negotiable on a standalone AI contract. See our cloud enterprise discount negotiation guide for detailed MACC and EDP strategy.

9. Ten Tactics for Token Pricing Negotiations

Tactic 01
Build a Consumption Model Before Any Negotiation
Never enter an AI API negotiation without a documented consumption model. Your model should show monthly token volumes by model tier, growth projections for 12–36 months, and the committed spend level you're targeting. Vendors respond differently to buyers who demonstrate analytical sophistication — you're more likely to receive accurate discounting information and less likely to be oversold on prepay amounts that don't match your actual usage.
Tactic 02
Use Price Deflation History as Aggressive Leverage
AI API pricing has declined 80%+ since 2023 and continues to fall. If you've received any quote in the last 12 months, compare it against current published list pricing — you may already be overpaying before negotiating. Use the documented price trajectory to argue for steeper discounts, shorter lock-in terms, or MFN provisions that automatically pass on future price reductions.
Tactic 03
Negotiate Blended Rate Across Model Tiers
Most AI API discounts are negotiated at the model level. If you use multiple model tiers (e.g., 70% GPT-4o mini, 30% GPT-4o), negotiate a blended rate applied to total token spend regardless of which model is invoked. This simplifies billing, prevents tier arbitrage, and often achieves a better effective discount than negotiating each model tier individually.
Tactic 04
Lock In Prompt Caching Availability and Rates
Prompt caching is a major cost reduction mechanism, but vendors can change its pricing or availability. In your contract, explicitly confirm that prompt caching will remain available at the same or better rates for the contract term, and that negotiated discounts apply to both cached and non-cached token rates. Some vendors have priced prompt cache reads separately from main token rates — ensure your discount applies consistently to both.
Tactic 05
Demand Free Tier or Credits for Non-Production Environments
Developer environments, testing, QA, and model evaluation all consume tokens but generate no production value. Negotiate a free tier (typically $10K–$50K/year for large enterprise customers) for development and testing workloads, or apply a 50% discount rate for non-production API keys. This is standard in well-negotiated enterprise agreements and rarely offered proactively.
Tactic 06
Use Competitive Open-Source Alternatives as Price Anchors
Meta's Llama 3, Mistral, and other open-source models are now competitive with GPT-4o on many enterprise tasks. Self-hosted Llama 3 70B costs approximately $0.50–$1.00/million tokens in infrastructure — well below commercial API rates. Document that you are evaluating self-hosted deployment as a credible fallback. This shifts the negotiation dynamic: vendors must justify their pricing premium against a viable zero-marginal-cost alternative.
Tactic 07
Commit Below Forecast and Use Overage Protections
AI usage is notoriously difficult to forecast. Commit to 60–70% of your modelled consumption and negotiate overage rates at 10–15% above your committed rate (not list price) for usage above commitment. This protects you if consumption exceeds forecast without exposing you to the full list-price overage rates that apply to uncommitted usage. Build contractual flexibility to increase commitment at agreed-upon rates mid-year if growth exceeds projections.
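
A toy comparison with hypothetical numbers — 1,000M tokens/year forecast, a $2.00/1M committed rate, a 12.5% overage premium (mid-range of the 10–15% above), and 900M tokens actually used:

```python
def annual_spend(actual_m_tokens, committed_m_tokens, committed_rate,
                 overage_premium=0.125):
    """Annual spend under a commit, in dollars: committed volume at the
    discounted rate, overage at the committed rate plus a premium.
    Volumes in millions of tokens; rate in dollars per 1M tokens."""
    overage = max(0.0, actual_m_tokens - committed_m_tokens)
    return (committed_m_tokens * committed_rate
            + overage * committed_rate * (1 + overage_premium))

under_commit = annual_spend(900, 650, 2.00)  # commit ~65% of forecast
full_commit = 1_000 * 2.00                   # prepay full forecast; 100M unused
print(f"${under_commit:,.2f} vs ${full_commit:,.2f}")  # → $1,862.50 vs $2,000.00
```

Under-committing costs slightly more per overage token but avoids paying for the 100M tokens that never materialised; the break-even shifts with the overage premium and your forecast error, which is exactly why the premium is worth negotiating down.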
Tactic 08
Negotiate Annual Rate Review Aligned to Market Benchmarks
Commit to multi-year contracts only with annual rate review rights. Specify that rates are reviewed each year against a published index (e.g., published list pricing for the equivalent tier) and adjusted downward if market rates fall more than 10% below your contracted rate. AI pricing deflation makes this provision extremely valuable — it's the commercial equivalent of a downward-only floating rate.
Tactic 09
Bundle Cloud Commitments with AI Spend
If you have AWS, Azure, or Google Cloud committed spend (EDP, MACC, or GCP CUDs), negotiate to have AI API spend draw down those commitments rather than being billed separately. At $1M+ AI spend on a vendor whose parent is a major cloud provider (Azure OpenAI, AWS Bedrock, Google Vertex AI), the effective discount from committed spend drawdown can be 20–40% better than standalone AI negotiations. This is the most underutilised commercial lever in enterprise AI procurement.
Tactic 10
Right to Audit Token Usage Calculations
Token billing is automated and opaque. Request contractual rights to audit token counting methodology, especially for: (1) how partial tokens are billed, (2) whether system prompt tokens are charged on every call, (3) how tool call tokens in agentic workflows are counted, and (4) whether streaming responses bill differently from synchronous calls. Token counting disputes are rare but occur — having audit rights is a prerequisite for any enterprise-scale commitment.


FAQ: AI Token Pricing

Why are output tokens more expensive than input tokens?
Generating tokens (output) is computationally more expensive than processing them (input). During inference, every output token requires a sequential forward pass conditioned on all prior context, while input tokens can be processed in parallel during prefill. This difference in GPU economics, autoregressive generation versus parallel processing of the prompt, drives the 3–5:1 output-to-input price ratio across all major models.
How do I estimate my organisation's AI token consumption before deploying?
Run a structured 30-day pilot on representative use cases with full API instrumentation logging every call's token consumption. Alternatively, model each use case using the formula: (system prompt tokens) + (average context tokens) + (average user message tokens) = input; (average response length) = output. Apply expected query volume and scale to monthly/annual figures. Add 2× buffer for growth and peak traffic. The pilot approach is significantly more accurate than theoretical modelling.
Should I use pay-as-you-go or committed spend for AI APIs?
Pay-as-you-go for the first 3–6 months of any new AI deployment — the consumption model is too uncertain to commit. Once you have 3 months of production usage data, use that to model annual consumption and negotiate a committed spend agreement targeting 60–70% of expected usage. Committed spend discounts of 20–35% are material at scale. Never commit 100% of forecast consumption — AI usage spikes and forecast errors are common.
Can I negotiate token pricing mid-contract if rates drop significantly?
Without contractual rights to renegotiate, it is difficult. However, AI pricing has fallen so dramatically (80%+ since 2023) that most vendors will engage in repricing discussions with major enterprise customers to avoid losing the account. Your leverage is: (a) documented market rate comparison, (b) credible competitive alternatives, and (c) expansion commitments in exchange for rate reductions. Including an MFN or annual market review clause in your initial contract is far more effective than trying to renegotiate mid-term.