Compare token costs across GPT-4o, Claude, Gemini, and open-source models. Uncover hidden costs, master enterprise discount negotiation, and build a data-driven TCO model to control your AI spending before it controls your budget.
AI platform pricing is fundamentally unlike traditional enterprise software licensing. There's no per-seat model, no multi-year discount ladder, and no standard contract. Instead, you pay per token consumed—but token pricing varies 5–7× across platforms, consumption patterns are unpredictable, and hidden costs (context windows, fine-tuning, embeddings, inference infrastructure) can easily triple your apparent bill. This is part of our comprehensive AI software procurement guide, which covers licensing, contract negotiation, and cost optimisation across the entire AI stack.
Enterprise buyers have never needed this level of pricing granularity before. But as AI consumption grows from pilot to production, the difference between a poorly benchmarked deployment and an optimised one can be millions annually. This guide provides the benchmarking framework that separates strategic buyers from those reacting to surprise invoices.
Traditional software benchmarking focuses on per-user cost and contract terms. AI pricing benchmarking must account for five dimensions that interact in non-obvious ways:
Unlike a per-user licence (which is static once assigned), token consumption depends on model choice, prompt engineering, retrieval augmented generation (RAG) depth, reasoning-model selection, and batch vs. real-time processing. A poorly optimised prompt can consume 3× more input tokens than a well-engineered one. A model switch from Claude to Llama can swing costs 40–60% in either direction.
OpenAI charges per input token and output token separately. AWS Bedrock uses a per-token model but bundles pricing with other services. Google Vertex AI uses monthly capacity commitments. Anthropic offers per-token and enterprise contracts. There is no universal unit of comparison, which means benchmark comparisons require building a custom cost model for your workload.
Token price is only the beginning. Context window length affects per-request cost (longer context = more input tokens). Fine-tuning compute is priced separately from inference. Embeddings, batch APIs, and vision processing add to the bill. Inference infrastructure (GPU time, storage, monitoring) is often a larger expense than the models themselves.
Volume discounts range from non-existent (OpenAI for most customers) to 30–50% (Anthropic, Google for large commitments). Discount triggers vary wildly: some require $1M+ annual commitments; others offer discounts at $100K. Negotiating without benchmarks means accepting the first discount offered.
Between Q3 2023 and Q3 2024, OpenAI dropped GPT-4 prices by 50%. Claude 2 pricing fell 40% within 6 months of release. Open-source models (Llama) saw effective price cuts of 85% due to higher availability on cheaper infrastructure. Benchmark data older than 3 months is unreliable.
Effective AI pricing benchmarking requires: (1) current pricing from at least 3 vendors, (2) a realistic workload cost model, (3) infrastructure cost estimates, (4) hidden cost inventory, and (5) enterprise discount baseline data by vendor and commitment level.
The following table captures Q1 2026 pricing for the most commonly deployed AI models in enterprise settings. Pricing is subject to change; verify with vendors before entering contracts. All prices are for on-demand inference (not batch, not enterprise contracts). Input and output token prices are per 1 million tokens.
| Platform | Model | Input $/1M | Output $/1M | Context Window | Enterprise Discount |
|---|---|---|---|---|---|
| Azure OpenAI | GPT-4o | $5.00 | $15.00 | 128K | 15–35% |
| Azure OpenAI | GPT-4o mini | $0.15 | $0.60 | 128K | 15–35% |
| Bedrock / Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K | 10–30% |
| Bedrock / Anthropic | Claude 3 Haiku | $0.80 | $4.00 | 200K | 10–30% |
| Vertex AI | Gemini 1.5 Pro | $7.50 | $30.00 | 1M | 10–25% |
| Vertex AI | Gemini 1.5 Flash | $0.075 | $0.30 | 1M | 10–25% |
| Self-Hosted | Llama 3.1 70B | $0.60–$2.00 | $0.60–$2.00 | 128K | Infra negotiable |
| Bedrock | Cohere Command R+ | $3.00 | $15.00 | 128K | 5–20% |
Critical Notes: Prices fluctuate monthly. Output tokens are always more expensive than input tokens (typically 3–5× higher). Context length affects per-request costs significantly: a request that sends 200K tokens of retrieved context pays for all 200K as input tokens, no matter how short the user's actual query is. Gemini's 1M context window is a pricing trap only in the sense that it invites over-stuffing: you pay for every context token you send, so filling a huge window with marginally relevant retrieval inflates the bill on every single request.
The per-token price is the headline number, but it captures only 40–60% of true AI platform costs in production. The remainder comes from six hidden cost categories that vendors rarely highlight.
Every token you place in the context window counts toward your input token bill. If you're using RAG (retrieving documents into context), you pay for that retrieved context on every request. With Gemini 1.5 Pro at $7.50 per 1M input tokens, sending 100K tokens of context 1,000 times daily costs $750/day in context overhead alone. Optimising context length and retrieval strategy is often the highest-ROI optimisation.
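The arithmetic above can be captured in a small helper — a sketch only; the function name and the example figures mirror the Gemini 1.5 Pro scenario in the text:

```python
def context_overhead_per_day(context_tokens: int,
                             requests_per_day: int,
                             input_price_per_1m: float) -> float:
    """Daily cost of the input tokens consumed by retrieved context alone."""
    return context_tokens * requests_per_day * input_price_per_1m / 1_000_000

# 100K-token context, 1,000 requests/day, Gemini 1.5 Pro at $7.50 per 1M input tokens
cost = context_overhead_per_day(100_000, 1_000, 7.50)
print(f"${cost:,.2f}/day")  # → $750.00/day
```

Running the same numbers against a cheaper model (e.g. Gemini 1.5 Flash at $0.075) is a one-line change, which is exactly why context overhead belongs in any cross-platform benchmark.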
Building a RAG system requires generating embeddings for your document corpus. OpenAI's text-embedding-3-small costs $0.02 per 1M tokens, so embedding a 100M-token corpus costs roughly $2, and even the pricier text-embedding-3-large ($0.13 per 1M tokens) keeps a full quarterly refresh in the low tens of dollars. Embedding generation itself is rarely the cost driver; the vector database that stores and serves those embeddings at query time usually is.
Fine-tuning a model requires GPU time separate from the model itself. OpenAI charges $8 per 1M input tokens for fine-tuning GPT-4o. If you fine-tune on 50M tokens of proprietary data, that's a $400 one-time cost. Fine-tuning ROI requires the throughput gains and cost reductions to exceed the tuning cost—which is not always the case.
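That breakeven logic can be sketched as follows; the dollar figures in the usage example are hypothetical, not vendor quotes:

```python
def breakeven_months(tuning_cost: float,
                     monthly_cost_before: float,
                     monthly_cost_after: float) -> float:
    """Months of inference savings needed to recover a one-time tuning cost."""
    monthly_savings = monthly_cost_before - monthly_cost_after
    if monthly_savings <= 0:
        return float("inf")  # tuning never pays for itself
    return tuning_cost / monthly_savings

# Hypothetical: $400 one-time tuning cost, fine-tuned prompts cut monthly
# inference spend from $900 to $820
print(breakeven_months(400, 900, 820))  # → 5.0
```

If the breakeven horizon exceeds the model's expected deployment lifetime (or the vendor's next price cut), fine-tuning is a cost, not an optimisation.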
Running inference at scale requires infrastructure costs that vendors hide in "pricing simplicity." GPU instances for batch processing, API gateway costs, logging and monitoring infrastructure, and data storage for inference logs can easily exceed model costs. A $1M/year Claude spend might require $400K in supporting infrastructure.
Output tokens cost 3–5× more than input tokens. An application that generates 500-word responses (approximately 667 tokens) has an output-to-input ratio that dramatically shifts the cost model. For some applications (search, classification), output is minimal. For others (code generation, content creation), output dominates costs.
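A quick sketch of how the output-to-input ratio shifts per-request cost, using the GPT-4o table prices above; the two request profiles are illustrative:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     in_price: float, out_price: float) -> float:
    """Per-request cost in dollars; prices are per 1M tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# GPT-4o table prices ($5 input / $15 output); same input, two output profiles
classify = cost_per_request(1_000, 20, 5.00, 15.00)   # classification: tiny output
generate = cost_per_request(1_000, 667, 5.00, 15.00)  # 500-word generated response
print(f"{generate / classify:.1f}x")  # → 2.8x
```

Same prompt, same model, nearly 3× the cost — which is why a benchmark built only on input-token prices will mislead you for generation-heavy workloads.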
Production AI systems require detailed logging of prompts, completions, latency, and cost allocation. Third-party observability tools (Weights & Biases, Fiddler, Arize) cost $10K–$100K+/year depending on volume. Building internal logging infrastructure requires 2–4 FTE engineering months.
Building a reliable total cost of ownership (TCO) model requires a 5-step process. Skip any step and your benchmark is incomplete.
Document: (1) transaction volume (requests per day/month), (2) average prompt size (tokens), (3) average context size, (4) required response length, (5) latency requirements, (6) model tier required (reasoning vs. standard), (7) fine-tuning or embeddings needed. Without this, you're benchmarking a theoretical workload, not yours.
For each model candidate, calculate: (1) input tokens = (average prompt tokens + context tokens) × transaction volume, (2) output tokens = average response tokens × transaction volume, (3) overhead tokens = buffer for retries/error handling (~5–10%). For a 1M request/month workload with 500-token context and 200-token responses:
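The worked example this step sets up might look like the following sketch; the 100-token average prompt and 7% overhead buffer are assumptions, not figures from the text:

```python
def monthly_tokens(requests: int, prompt_tokens: int, context_tokens: int,
                   response_tokens: int, overhead: float = 0.07):
    """Return (input, output) monthly token volumes with a retry/error buffer."""
    inputs = (prompt_tokens + context_tokens) * requests * (1 + overhead)
    outputs = response_tokens * requests * (1 + overhead)
    return inputs, outputs

# 1M requests/month, 500-token context, 200-token responses,
# assuming a hypothetical 100-token average prompt and 7% overhead
inputs, outputs = monthly_tokens(1_000_000, 100, 500, 200)
print(f"{inputs / 1e6:.0f}M input, {outputs / 1e6:.0f}M output tokens/month")
# → 642M input, 214M output tokens/month
```

Those two volumes, multiplied by each candidate model's per-1M prices from the table, give a like-for-like monthly token cost per vendor.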
Add: (1) API gateway and networking costs, (2) logging infrastructure (CloudWatch, DataDog, etc.), (3) vector database for embeddings (Pinecone, Weaviate, Milvus), (4) database for storing results/audit logs, (5) monitoring and alerting tools. For a 1M request/month system, expect $3K–$15K/month in infrastructure.
Include: (1) embedding generation cost for initial corpus + quarterly refreshes, (2) fine-tuning if required, (3) observability tools, (4) contingency buffer (15–20% for price increases and unexpected usage). These often total 40–60% of token costs.
Request enterprise discount tiers from each vendor. Most offer tiered pricing: $500K commitment gets 15% off, $1M gets 25% off, $5M gets 35% off. Plug your annual token cost into each discount tier and compare the net cost.
Annual AI Cost = (Input tokens × Input price + Output tokens × Output price) × (1 – Enterprise discount %) + Infrastructure costs + Hidden costs + Contingency buffer
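The formula can be written directly as a function. This sketch treats the contingency buffer as a percentage applied to the subtotal, which is one reasonable reading of Step 4; the usage figures are hypothetical:

```python
def annual_ai_cost(input_tokens_m: float, output_tokens_m: float,
                   input_price: float, output_price: float,
                   discount: float, infra: float, hidden: float,
                   contingency: float = 0.15) -> float:
    """Annual cost in dollars; token volumes in millions, prices per 1M tokens."""
    token_cost = (input_tokens_m * input_price + output_tokens_m * output_price)
    token_cost *= (1 - discount)  # apply negotiated enterprise discount
    return (token_cost + infra + hidden) * (1 + contingency)

# Hypothetical: 7,700M input + 2,570M output tokens/year at GPT-4o table prices,
# 20% enterprise discount, $60K infrastructure, $25K hidden costs, 15% contingency
cost = annual_ai_cost(7_700, 2_570, 5.00, 15.00, 0.20, 60_000, 25_000)
print(f"${cost:,.0f}")  # → $168,636
```

Re-running this function per vendor and per discount tier is the whole benchmark in miniature: one workload, many price books.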
Enterprise discount negotiation is where most buyers leave money on the table. The following benchmarks show what's negotiable by platform as of Q1 2026.
**Azure OpenAI.** Discount Range: 15–35% off on-demand pricing. Commitment Level: $500K–$10M annually. Terms: Typically 12-month contracts with yearly true-ups. Microsoft's CSP channel sometimes offers better discounts than direct sales. Negotiation Leverage: Switching costs to Gemini or Claude are moderate; OpenAI has limited negotiating power if you have competitive alternatives.

**AWS Bedrock.** Discount Range: 5–20% off on-demand pricing. Commitment Level: $250K–$2M annually via Savings Plans. Terms: Flexible; can be combined with other AWS Savings Plans. Negotiation Leverage: Bedrock pricing is bundled with broader AWS relationships. Enterprise Discount Program (EDP) negotiations are conducted at the AWS account level, not per-service.

**Google Vertex AI.** Discount Range: 10–25% off on-demand pricing. Commitment Level: $500K–$5M annually. Terms: Offers monthly commitment options (more flexible than annual). Negotiation Leverage: Google is aggressive on AI pricing to gain market share; their discounts are the most negotiable if you have a large commitment and competitive alternatives.

**Anthropic Direct.** Discount Range: 10–30% off on-demand pricing. Commitment Level: $250K–$10M+. Terms: Flexible commitment structures; can negotiate monthly true-ups. Negotiation Leverage: Anthropic is rebuilding sales capacity; they have incentive to negotiate on large deals. Model quality and safety features are differentiators they defend on.

**Claude via AWS Bedrock.** Discount Range: 5–15% off Bedrock pricing (which is already roughly 10% higher than Anthropic direct). Commitment Level: $100K–$2M via AWS commitment. Note: Using Bedrock adds roughly 10% markup vs. Anthropic direct, but provides AWS infrastructure integration and support. Direct Anthropic contracts are cheaper if you can manage the relationship.

**Self-Hosted (Llama).** Discount Range: Model is free; negotiation focuses on infrastructure. Commitment Level: Infrastructure contracts ($100K–$10M+ annually). Terms: Reserved capacity discounts via cloud providers (30–40% for 3-year RIs on compute). Negotiation Leverage: Very high. You control the relationship and can negotiate directly with cloud providers.
| Platform | $500K Annual | $1M Annual | $5M+ Annual | Contract Flexibility |
|---|---|---|---|---|
| Azure OpenAI | 15% | 20% | 30–35% | 12-month minimum |
| AWS Bedrock | 5% | 10% | 15–20% | 1-year SP |
| Vertex AI | 12% | 18% | 25% | Monthly flexible |
| Anthropic Direct | 15% | 20% | 28–30% | Highly flexible |
| Self-Hosted Infra | 25–35% | 35–50% | 50%+ | Infrastructure negotiable |
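The table can be turned into a quick net-cost comparison. This sketch applies each platform's mid-tier ($1M annual) discount to the same list-price spend, which is a simplification, since per-token list prices also differ across platforms:

```python
# Mid-tier ($1M annual) discounts taken from the benchmark table above
DISCOUNTS_AT_1M = {
    "Azure OpenAI": 0.20,
    "AWS Bedrock": 0.10,
    "Vertex AI": 0.18,
    "Anthropic Direct": 0.20,
}

def net_costs(annual_spend: float, discounts: dict) -> dict:
    """Map each platform to its discounted annual spend, cheapest first."""
    costs = {p: annual_spend * (1 - d) for p, d in discounts.items()}
    return dict(sorted(costs.items(), key=lambda kv: kv[1]))

for platform, cost in net_costs(1_000_000, DISCOUNTS_AT_1M).items():
    print(f"{platform:18s} ${cost:,.0f}")
```

For a real decision, replace the flat `annual_spend` with each platform's token cost from your workload model before applying its discount; the discount alone can make a pricier platform look cheaper than it is.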
Third-party calculators (TokenCost.co, MLPricingGuide.com) aggregate vendor pricing and allow quick comparisons. Strengths: fast, visual, updated monthly. Weaknesses: pricing data lags 1–2 months, doesn't account for your specific workload, no enterprise discount integration. Use these for quick sanity checks, not for final procurement decisions.
Azure provides an official pricing calculator, Google a Vertex AI cost estimator, and Anthropic a token counter. Strengths: authoritative pricing, enterprise discount integration in some cases. Weaknesses: opaque assumptions and no easy cross-platform comparison. Use vendor calculators for final verification only.
Building your own TCO model in Excel/Sheets is the most reliable approach for mid-to-large enterprises. Requires: (1) historical usage data or realistic estimates, (2) pricing from each vendor's rate card, (3) infrastructure cost projections, (4) hidden cost inventory. Spreadsheet models are repeatable and can be updated monthly as vendor pricing changes.
Tools like Weights & Biases, Fiddler, and Arize provide cost tracking and optimization recommendations. Once your system is in production, these become essential for detecting cost anomalies and identifying optimization opportunities. Implementation cost is $10K–$50K initially, but ROI is typically 3–6 months for large deployments.
Benchmarks are negotiating weapons only if wielded strategically. The following 8 tactics convert pricing data into contract savings.
Download our AI Procurement Checklist to benchmark vendors, negotiate contracts, and avoid common traps that cost enterprises millions.
This is part of our comprehensive AI software procurement cluster. Explore related articles on AI platform selection, contract negotiation, and cost optimization:
Our negotiation consulting team has benchmarked over 200 enterprise AI deployments. We'll validate your pricing against market benchmarks and negotiate better terms.