Cloud Spot Strategy — Cost Optimization Guide

Cloud Spot Instance Strategy: Cut Compute Costs by 70–90% in 2026

Master spot and preemptible instances across AWS, Azure, and GCP. Learn fault-tolerant architecture patterns, interruption handling, and Kubernetes integration to maximise savings without sacrificing reliability.

Editorial Disclosure: This guide is based on current cloud provider pricing (as of 2026), publicly available interruption data, and production deployment patterns. Spot pricing fluctuates; always verify rates in your target regions. For enterprise negotiation of committed spend, consider consulting a cloud cost optimization firm.
90%
Max Spot Discount
$0.026/hr
AWS Spot c5.xlarge (us-east-1)
68%
Batch Cluster Cost Saving (100% Spot)
2min
AWS Interruption Notice Window

What Are Spot Instances?

Spot instances are interruptible cloud compute resources sold at a deep discount to regular on-demand pricing. Cloud providers use spot pricing to sell excess capacity: when demand is low, you get deep discounts (up to 90%); when demand spikes, your instance may be reclaimed with minimal notice. Trading availability for cost is the core value proposition.

AWS Spot Instances have been available since 2009 and represent spare EC2 capacity. AWS provides a two-minute warning before reclaiming the instance (via EC2 Spot Instance Interruption Notices).

Azure Spot VMs are Azure's equivalent, generally available since 2020 (succeeding the earlier low-priority VMs). They deliver a 30-second eviction notice via Azure Scheduled Events and are fully integrated with virtual machine scale sets (VMSS) for orchestrated deployments.

GCP Spot VMs and Preemptible VMs are Google's offering. Preemptible VMs provide up to 90% discounts, a hard 24-hour runtime limit, and a 30-second termination notice. Spot VMs (the newer offering) keep the same 30-second notice but remove the 24-hour limit and offer better pricing stability.

Key Insight

The fundamental trade-off: you save 70–90% on compute but accept interruption risk. This makes spot ideal for batch jobs, CI/CD pipelines, and non-critical compute—but unsuitable for stateful databases or real-time customer-facing workloads without significant architectural changes.

Discount Levels Across Clouds

Spot pricing varies by instance family, region, and time. Here's a snapshot of typical discounts as of Q1 2026:

Cloud Provider Instance Type Region On-Demand/hr Spot Price/hr Discount %
AWS c5.xlarge us-east-1 $0.085 $0.026 69%
AWS m5.large us-west-2 $0.096 $0.013 86%
Azure Standard_D4s_v3 East US $0.192 $0.058 70%
Azure Standard_D2s_v3 West Europe $0.096 $0.009 91%
GCP n1-standard-4 us-central1 $0.152 $0.030 80%
GCP n2-standard-2 us-east1 $0.101 $0.011 89%

Regional variation is significant. The same instance family in different regions can have 15–40% variance in both on-demand and spot pricing. Cheaper regions like us-central1 (GCP) or us-east-1 (AWS) offer better spot economics.
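The discount column in the table is just a ratio of the two prices; a small helper makes that explicit and can be pointed at live quotes later (AWS exposes spot price history via the EC2 `describe_spot_price_history` API; the call below is commented out because it needs credentials):

```python
# Compute the effective spot discount from a pair of hourly prices.
def spot_discount(on_demand_hr: float, spot_hr: float) -> int:
    """Return the discount as a whole percentage, e.g. 69 for c5.xlarge."""
    return round((1 - spot_hr / on_demand_hr) * 100)

# Sanity-check rows from the table above:
assert spot_discount(0.085, 0.026) == 69   # AWS c5.xlarge, us-east-1
assert spot_discount(0.096, 0.009) == 91   # Azure Standard_D2s_v3, West Europe
assert spot_discount(0.152, 0.030) == 80   # GCP n1-standard-4, us-central1

# Live AWS quotes can be pulled with boto3 (requires credentials):
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.describe_spot_price_history(InstanceTypes=["c5.xlarge"],
#                                   ProductDescriptions=["Linux/UNIX"])
```

Running the same check across regions each month is a cheap way to catch the 15–40% regional variance described above.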

Understanding Interruption Risk

Interruptions are the cost of the discount. Typical interruption rates vary by instance family and time of day:

Instance Family Typical Interruption Rate Interruption Notice Max Continuous Runtime
General Purpose (t3, t4g, m5, m6i) 2–5% 2 minutes (AWS) Unlimited
Compute Optimized (c5, c6i) 3–6% 2 minutes Unlimited
Memory Intensive (r5, r6i) 4–7% 2 minutes Unlimited
GCP Preemptible (older generation) 15–25% 30 seconds 24 hours hard limit
GCP Spot (newer) 5–15% 30 seconds Unlimited

AWS instances enjoy relatively stable interruption rates (2–6% depending on family), while GCP Preemptible instances experience higher volatility. GCP's newer Spot VMs bridge the gap with unlimited runtime while keeping the 30-second preemption notice.

Interruption Gotcha

Interruption rates spike during peak hours (8am–10pm local time) and during AWS/Azure/GCP maintenance windows. If your batch job must complete within 4 hours, target off-peak times and diversify across AZs to reduce cascade risk.

Workloads Suited for Spot

Ideal Spot Candidates

  • Batch Processing: MapReduce, data ETL, video transcoding. Resume-able jobs can checkpoint progress and restart on new instances.
  • CI/CD Pipelines: Build agents, test runners. Jobs are stateless and expected to be ephemeral.
  • Machine Learning Training: Model training with frequent checkpoints. TensorFlow/PyTorch supports interrupt-aware checkpointing.
  • Stateless Web Tier: Web servers behind load balancers. If one spot instance dies, traffic shifts to others.
  • Data Pipelines: Kafka consumers, stream processing with state stores. Kubernetes jobs handle restart logic.
  • Scheduled Analytics: Hourly/daily aggregations, reporting. Spot clusters can scale down after completion.

Poor Spot Candidates

  • Relational Databases: MySQL, PostgreSQL, Oracle. Data consistency requires graceful shutdown; interruptions cause data loss or corruption.
  • Stateful Applications: Session stores, in-memory caches without replication. Interruption means lost state.
  • Real-Time Customer APIs: APIs with strict latency SLAs. Interruptions cause user-visible failures.
  • Long-Running Transactions: Queries lasting 8+ hours. Interruption rollback is expensive.
  • License-Bound Software: Some enterprise software charges per-instance; frequent restart penalties apply.
Best Practice

Use a mixed strategy: On-Demand for stateful/critical workloads, Spot for resilient batch/compute. Most mature cloud shops run 60–70% Spot for non-prod and 40–50% Spot for prod (with Spot + On-Demand mixed deployments).

AWS Spot Strategy

Spot Fleet & Spot Instance Pools

AWS Spot Fleet is the primary abstraction for managing Spot instances at scale. Instead of requesting individual instances, you define a launch template with a target capacity (in vCPU, memory, or instance count), and Spot Fleet distributes requests across multiple instance types and AZs.

Key advantages: Spot Fleet automatically balances across instance families (e.g., c5.xlarge, c5.2xlarge, c6i.xlarge) to maximize fulfillment probability. If c5.xlarge is scarce, it falls back to c6i.xlarge. This diversification reduces single-family interruption impact.

Spot Instance Advisor

AWS provides the Spot Instance Advisor, a tool that ranks instance types by interruption rate. When configuring a Spot Fleet, reference this data to prioritize stable families. General purpose (t3, m5, m6i) typically have lower interruption rates than older generation instances (t2, m4).

Availability Zone Diversification

Spread requests across 3+ AZs (e.g., us-east-1a, us-east-1b, us-east-1c). AWS capacity constraints are AZ-specific; diversification means if us-east-1a hits capacity, instances in us-east-1b/1c can still launch. Spot Fleet configuration example:

AWS Spot Fleet Best Practice

  • LaunchTemplateConfigs: Define 6–8 instance type + AZ combinations (e.g., c5.xlarge in 1a/1b/1c, m5.xlarge in 1a/1b/1c).
  • AllocationStrategy: Use "capacity-optimized" to place instances in pools with the deepest available capacity.
  • TargetCapacity: Set the cluster's target (e.g., 100 units); the fleet distributes across pools.
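As a concrete sketch of that configuration, here is the shape of a boto3 `request_spot_fleet` payload covering 2 families × 3 AZs (6 pools). The role ARN and launch template ID are placeholders, and note the Spot Fleet API spells the strategy in camelCase:

```python
# Diversified Spot Fleet request: 6 pools (2 instance families x 3 AZs).
INSTANCE_TYPES = ["c5.xlarge", "m5.xlarge"]
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

fleet_config = {
    "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",  # placeholder
    "AllocationStrategy": "capacityOptimized",  # camelCase in this API
    "TargetCapacity": 100,  # instances, unless WeightedCapacity is set per override
    "LaunchTemplateConfigs": [{
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-0abc1234def567890",  # placeholder
            "Version": "$Latest",
        },
        # One override per (type, AZ) pair; the fleet picks the best pools.
        "Overrides": [
            {"InstanceType": t, "AvailabilityZone": z}
            for t in INSTANCE_TYPES
            for z in ZONES
        ],
    }],
}

# Submitting requires AWS credentials; shown for reference only:
#   import boto3
#   boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=fleet_config)
```

Adding a third family (e.g., c6i.xlarge) to INSTANCE_TYPES grows the pool count to 9 with no other changes, which is how the 6–8 combination guidance above is usually met.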

Azure Spot Strategy

Azure Spot VMs & Eviction Policies

Azure Spot VMs integrate seamlessly with Virtual Machine Scale Sets (VMSS). When deploying a web tier on VMSS, configure:

  • Priority: Spot – Request Spot capacity.
  • Eviction Policy: Deallocate – On eviction, the VM is stopped rather than deleted; it can be restarted later when Spot capacity returns (attached disks persist and keep accruing storage charges). Useful for workloads that can pause and resume.
  • Eviction Policy: Delete – Immediately terminate. Use for immutable workloads that scale horizontally (e.g., web tier behind a load balancer).
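A minimal sketch of what those settings look like as an ARM-template `virtualMachineProfile` fragment (field names follow Azure's VMSS schema; a maxPrice of -1 means "pay up to the on-demand rate, never evict on price"):

```python
# Spot-related fields of a VMSS virtualMachineProfile, as a plain dict
# in ARM-template shape (the rest of the profile is omitted).
spot_profile = {
    "priority": "Spot",
    "evictionPolicy": "Deallocate",      # or "Delete" for immutable web tiers
    "billingProfile": {"maxPrice": -1},  # -1: cap at the on-demand price
}
```

The same three knobs appear under the same names in Bicep, Terraform's azurerm provider, and the azure-mgmt-compute SDK, so the fragment transfers across tooling.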

VMSS Autoscaling with Spot

Configure VMSS with a mix of Spot and On-Demand instances. For example:

  • 60% Spot (cheaper baseline).
  • 40% On-Demand (stability for critical traffic).

VMSS autoscale rules trigger on CPU/memory metrics. As spot instances are evicted, scale-up rules provision replacement on-demand instances, ensuring service continuity.

Regional Considerations

Azure Spot pricing varies significantly by region. West Europe and UK South typically offer 75–85% discounts; US East offers 65–80%. Storage egress costs differ too; factor into total cost model.

GCP Spot Strategy

Spot VMs vs. Preemptible VMs

Preemptible VMs (older offering) have a 24-hour runtime limit and a guaranteed 30-second termination notice. They offer up to 90% discount.

Spot VMs (launched 2021, recommended) have no runtime limit, the same 30-second preemption notice, and more stable pricing. Spot is the preferred path forward; Google is moving away from Preemptible.

Managed Instance Groups (MIG) & Auto-Healing

Deploy Spot VMs in a Managed Instance Group with autohealing enabled. Configure an instance template, desired capacity (e.g., 100 instances), and healthcheck URL. When a Spot instance is interrupted:

  1. GCP sends a 30-second termination notice.
  2. Shutdown scripts drain connections (graceful termination).
  3. Instance is deleted.
  4. MIG autoscaler provisions a replacement instance from the pool.
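The replacement loop above hinges on the instance template requesting Spot capacity and deleting on preemption. A sketch of the relevant pieces (field names follow the Compute Engine API; the surrounding template, MIG, and the drain script path are illustrative):

```python
# Scheduling fragment for a Compute Engine instance template: request
# Spot capacity and delete on preemption so the MIG replaces the
# instance (step 4 above).
scheduling = {
    "provisioningModel": "SPOT",
    "instanceTerminationAction": "DELETE",
}

# Steps 1-2 (notice + connection draining) run from a shutdown script
# attached to the same template via metadata. The drain.sh path is a
# hypothetical stand-in for your own drain logic.
shutdown_metadata = {
    "key": "shutdown-script",
    "value": "#!/bin/bash\n/usr/local/bin/drain.sh",
}
```

Keep the drain logic well under 30 seconds; GCP does not wait for the script to finish before terminating.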

Pricing Stability & CUD Stacking

GCP Spot pricing is historically more stable than AWS Spot (fewer sudden spikes). You can stack Spot discounts with Committed Use Discounts (CUDs) on non-Spot instances within the same group, further reducing blended costs.

Fault-Tolerant Architecture Patterns for Spot

Checkpoint & Resume Pattern

For long-running batch jobs, implement checkpoint logic:

  1. Periodic Checkpointing: Every 10–30 minutes, write job state to persistent storage (S3, GCS, Azure Blob Storage).
  2. Graceful Shutdown Handler: Listen for termination signals (SIGTERM on Linux). On interruption notice, finalize current work unit and write final checkpoint.
  3. Resume on New Instance: New spot instance reads last checkpoint and continues from that point.

Example: A Spark batch job processing 1 million records. Checkpoint every 100k records. If interrupted at record 250k, restart instance resumes at record 250k, not from 0.
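A runnable sketch of the pattern, with a local file standing in for S3/GCS/Blob Storage. One nuance worth seeing in code: with only periodic checkpoints, a hard interruption at record 250k resumes from the last durable checkpoint at 200k; it is the graceful-shutdown handler (step 2), writing a final checkpoint during the notice window, that gets you back to exactly 250k as in the Spark example:

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 100_000   # records per checkpoint, as in the example

def load_checkpoint(path):
    """Return the record offset to resume from (0 on first run)."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_record"]
    return 0

def process(path, total, interrupt_at=None):
    """Process records with periodic checkpoints; returns last record reached."""
    for record in range(load_checkpoint(path), total):
        if record == interrupt_at:
            return record                    # simulate a hard spot interruption
        if (record + 1) % CHECKPOINT_EVERY == 0:
            with open(path, "w") as f:       # durable progress marker
                json.dump({"next_record": record + 1}, f)
    return total

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
process(ckpt, total=1_000_000, interrupt_at=250_000)  # killed at record 250k
assert load_checkpoint(ckpt) == 200_000               # resume point: last checkpoint
assert process(ckpt, total=1_000_000) == 1_000_000    # second run finishes the job
```

Shrinking CHECKPOINT_EVERY trades extra storage writes for less rework after an ungraceful interruption.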

Graceful Shutdown Handlers

Spot interruption delivers a signal two minutes before termination (AWS) or 30 seconds before (GCP and Azure). Your application should:

  • Stop accepting new requests.
  • Drain in-flight requests (wait for active connections to close).
  • Flush logs, metrics, and state to persistent storage.
  • Exit cleanly.

Kubernetes example: Use a PreStop lifecycle hook that runs a sleep and drain script before the container is killed.
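A minimal sketch of that shutdown sequence. The work and flush functions are stand-ins for your own handlers; in Kubernetes, the same SIGTERM arrives after the PreStop hook completes:

```python
import os
import signal
import time

draining = False

def handle_sigterm(signum, frame):
    """Spot interruption arrives as SIGTERM: flip into drain mode."""
    global draining
    draining = True

signal.signal(signal.SIGTERM, handle_sigterm)

def serve(work_queue, do_work, flush_state):
    """Drain-aware loop: stop taking work, finish in-flight, flush, return."""
    done = 0
    for unit in work_queue:
        if draining:
            break                # 1. stop accepting new work
        do_work(unit)            # 2. finish the in-flight unit
        done += 1
    flush_state()                # 3. flush logs/metrics/state to storage
    return done                  # 4. caller exits cleanly afterwards

# Simulated run: the "interruption notice" fires during the second unit.
log = []
def fake_work(u):
    log.append(u)
    if u == 2:
        os.kill(os.getpid(), signal.SIGTERM)  # simulate the notice
        time.sleep(0)                          # let the handler run

done = serve([1, 2, 3, 4], fake_work, lambda: log.append("flushed"))
assert done == 2 and log == [1, 2, "flushed"]
```

The whole sequence must fit the notice window: budget two minutes on AWS, 30 seconds on GCP and Azure.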

Spot + On-Demand Mixed Deployments

Deploy a service with both Spot and On-Demand instances. For example, a web tier with 100 instances:

  • 60 Spot instances (cost: ~$6/hr at $0.1/hr each).
  • 40 On-Demand instances (cost: ~$40/hr at $1/hr each).
  • Total: ~$46/hr vs. $100/hr (all on-demand). Savings: 54%.

If 5 Spot instances are interrupted, load balancer shifts traffic to the remaining 55 Spot + 40 On-Demand (total 95), maintaining SLA until autoscaling provisions new Spot instances.
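The blend is worth encoding so the spot/on-demand ratio can be tuned as interruption rates change. A sketch using the example's illustrative $0.10/$1.00 rates:

```python
def blended_hourly(n_spot, spot_hr, n_od, od_hr):
    """Hourly cost of a mixed spot + on-demand fleet."""
    return n_spot * spot_hr + n_od * od_hr

all_od = blended_hourly(0, 0.0, 100, 1.0)    # all on-demand baseline
mixed  = blended_hourly(60, 0.10, 40, 1.0)   # the 60/40 split above
saving = round((1 - mixed / all_od) * 100)

assert round(all_od) == 100   # $100/hr
assert round(mixed) == 46     # $46/hr
assert saving == 54           # 54% saving, matching the example
```

Re-running with a 70/30 split shows the marginal saving of each extra spot node, which helps decide how much interruption risk the SLA can absorb.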

Architectural Insight

Spot is not a 100% cost-free discount. You're trading capital savings for operational complexity: monitoring, graceful shutdown, checkpointing, and orchestration. Factor in engineering effort when evaluating ROI.

Kubernetes & Spot

Node Pools for Spot Instances

Most Kubernetes clusters run multiple node pools: one for on-demand (critical workloads) and one for spot (batch jobs, non-critical services). In GKE, create a node pool with:

  • Node Taints: cloud.google.com/gke-spot=true:NoSchedule (or gke-preemptible on older pools) keeps pods without a matching toleration off spot nodes.
  • Node Affinity: Workloads with spot tolerance use nodeAffinity selectors to prefer spot pools.
  • Pod Disruption Budgets (PDB): Limit concurrent pod evictions during node termination.

Spot Interruption Handlers

aws-node-termination-handler (for EKS) monitors EC2 Spot interruption notices and:

  1. Detects the termination event up to 2 minutes before shutdown.
  2. Cordons the node (marks as unschedulable).
  3. Gracefully evicts all pods with a grace period (default 120 seconds).
  4. Drains the node; Kubernetes reschedules pods to other nodes.

Similar handling exists on the other clouds: AKS eviction events surface through Azure Scheduled Events (consumed by community node-termination handlers), and GKE drains Spot nodes automatically via graceful node shutdown.

Karpenter: Next-Gen Spot Orchestration

Karpenter (CNCF project) automates node provisioning for Kubernetes clusters, optimizing for both cost and availability. Key features:

  • Consolidation: Automatically removes underutilized nodes, reducing spend.
  • Spot Diversification: Spreads workloads across instance types and AZs to minimize interruption impact.
  • Fallback to On-Demand: If Spot capacity unavailable, Karpenter automatically falls back to on-demand (with configurable ratios).
  • Right-Sizing: Provisions the smallest instance type that fits pod requirements, saving on per-instance overhead.

Karpenter can replace the Kubernetes Cluster Autoscaler, typically delivering better spot economics with less operational overhead.

Spot + Savings Plans / Reserved Instances Stacking Strategy

Many teams ask: "Can I combine Spot discounts with Savings Plans or Reserved Instances?" The short answer is partially, and it depends on the cloud provider.

AWS: Savings Plans + Spot

AWS Savings Plans provide 20–40% discounts on on-demand pricing, and are independent of Spot. You can layer them:

  • Buy a 1-year Compute Savings Plan for your baseline on-demand workloads.
  • Use Spot for variable/batch workloads on top.
  • Example: 60 vCPU baseline with 1-year Savings Plan (~$12k/year), + 40 vCPU Spot for peaks (~$2k/year). Total: ~$14k vs. $25k (all on-demand). Savings: 44%.

Azure: Reserved Instances + Spot

Azure Reserved Instances provide 25–55% discounts on on-demand pricing. RIs are instance-size specific. You can:

  • Buy RIs for your baseline (guaranteed) workloads.
  • Layer Spot for additional capacity.
  • Blended savings are substantial but depend on RI allocation accuracy. Overprovisioning RIs wastes capital.

GCP: CUDs + Spot

GCP Committed Use Discounts (CUDs) provide 25–70% discounts on on-demand pricing. CUDs and Spot are mutually exclusive on the same instance but can be blended:

  • Buy 1-year CUD for 60 vCPU baseline (e.g., n2-standard-8).
  • Use Spot for additional capacity or bursty workloads.
  • Blended discounts can exceed 60% when optimized.
Commitment + Spot Best Practice

Size commitments conservatively. Overcommitting wastes capital; undercommitting increases on-demand overage costs. Use 12–24 months of historical usage to right-size, then layer Spot for variable demand. Typical maturity: 50% CUD/RI + 30% Spot + 20% On-Demand flex.

Cost Modelling Example: 100-Node Batch Cluster

Scenario: A data engineering team runs a 100-node batch cluster for daily ETL (6 hours/day). Each node is c5.2xlarge in us-east-1.

Current On-Demand Baseline

  • 100 nodes × c5.2xlarge × $0.34/hr × 6 hrs/day × 250 working days/year = $51,000/year.

Scenario 1: 100% Spot

  • Spot price: c5.2xlarge = $0.11/hr (68% discount).
  • 100 nodes × $0.11/hr × 6 hrs/day × 250 days = $16,500/year.
  • Savings: $34,500/year (68%).
  • Risk: Interruptions during batch window could delay jobs. Mitigation: checkpointing, retry logic.

Scenario 2: 70% Spot + 30% On-Demand (Mixed)

  • 70 nodes Spot @ $0.11/hr + 30 nodes On-Demand @ $0.34/hr = (70 × 0.11) + (30 × 0.34) = $17.90/hr.
  • 100 nodes × $0.179/hr average × 6 hrs/day × 250 days = $26,850/year.
  • Savings: $24,150/year (47%).
  • Risk: Lower. If 5–10 Spot nodes are interrupted, on-demand buffer absorbs delay.

Scenario 3: 60% Spot + 1-Year Savings Plan (40% baseline)

  • 40 nodes via Savings Plan @ $0.24/hr (30% discount) + 60 nodes Spot @ $0.11/hr.
  • Cost: (40 × 0.24) + (60 × 0.11) = $16.20/hr.
  • Annual: $16.20/hr × 6 hrs/day × 250 days = $24,300/year.
  • Savings: $26,700/year (52%).
  • Commitment: Savings Plan = 40 nodes × $0.24/hr × 8,760 hrs = ~$84,000/year of committed spend. Note the commitment bills around the clock, so it only pays off if those nodes run other workloads outside the 6-hour batch window.
  • Risk & benefit: Balanced. Guaranteed baseline; Spot for peak. Best for predictable workloads.
Strategy Annual Cost Savings vs. On-Demand Interruption Risk Upfront Capex
On-Demand (baseline) $51,000 None $0
100% Spot $16,500 68% ($34,500) High $0
70% Spot + 30% On-Demand $26,850 47% ($24,150) Moderate $0
60% Spot + 1-Yr Savings Plan $24,300 52% ($26,700) Low–Moderate ~$84,000
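The scenarios reduce to simple rate arithmetic, so they are easy to reproduce and re-run with your own prices:

```python
HOURS_PER_YEAR = 6 * 250   # 6 hrs/day x 250 working days
NODES = 100
OD, SPOT, SP = 0.34, 0.11, 0.24   # $/hr per c5.2xlarge: on-demand, spot, savings plan

def annual(hourly_fleet_cost):
    """Annualize an hourly fleet cost over the batch window."""
    return hourly_fleet_cost * HOURS_PER_YEAR

baseline = annual(NODES * OD)            # Scenario 0: all on-demand
all_spot = annual(NODES * SPOT)          # Scenario 1: 100% spot
mixed    = annual(70 * SPOT + 30 * OD)   # Scenario 2: 70/30 mix
sp_blend = annual(60 * SPOT + 40 * SP)   # Scenario 3: (40x0.24 + 60x0.11) = $16.20/hr

assert [round(x) for x in (baseline, all_spot, mixed, sp_blend)] == [
    51000, 16500, 26850, 24300
]
```

Swapping in another instance family or region only means changing the three rates, which makes the model handy for the quarterly audits recommended below.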

8 Tactics for Maximising Spot Savings

Tactic 1
Diversify Across Instance Types & Families
Don't rely on a single instance type. Spot capacity is fluid; if c5.xlarge is scarce, c6i.xlarge or c7g.xlarge may be cheaper. Configure Spot Fleets (AWS) or MIGs (GCP) with 6–8 instance family combinations. Increases fulfillment probability by 30–40%.
Tactic 2
Target Off-Peak Windows
Interruption rates spike 8am–10pm local time. Run batch jobs during off-peak (11pm–7am). Interruption probability drops 50–70%. Schedule daily ETL for 2am–8am, not during business hours.
Tactic 3
Implement Graceful Shutdown & Checkpointing
Deploy spot interruption handlers and checkpoint logic. For Kubernetes, use aws-node-termination-handler or Karpenter. For batch, write state every 10–30 minutes. Reduces job failure risk from 20% to 2–3% on 100-node clusters.
Tactic 4
Spread Across Availability Zones
AWS/Azure/GCP capacity constraints are AZ-specific. Deploy across 3+ AZs. If us-east-1a goes scarce, us-east-1b/1c will have available capacity. Multi-AZ Spot Fleets achieve 20–30% higher fulfillment vs. single-AZ.
Tactic 5
Monitor & Right-Size Instance Selection
Use AWS Spot Instance Advisor, Azure pricing API, or GCP insights to track interruption trends by family. Right-size your fleet monthly. Shift workload from high-interrupt families to stable ones (c5/c6i over c4/c3). Saves 5–15% on interruption overhead.
Tactic 6
Combine Spot with On-Demand & Committed Discounts
Use on-demand for 20–30% of capacity (critical baseline), Spot for 60–70% (variable), and optionally Savings Plans/CUDs for predictable baseline. Blended discounts reach 50–60% vs. 100% on-demand, with lower risk than 100% Spot.
Tactic 7
Use Spot Pricing APIs for Real-Time Decisions
AWS, Azure, and GCP expose Spot pricing via APIs. Build a "buy decision" service: if c5.xlarge Spot is >40% of on-demand, use on-demand instead. Prevents overpaying during price spikes. Reduces opex by 8–12%.
Tactic 8
Audit & Optimize Workload Fit Quarterly
Review workload interruption logs quarterly. If batch jobs are failing at 15% rate due to interruptions, increase on-demand mix or reduce batch window. If CI/CD agents are idle 40% of the time, downsize instance types. Iterative optimization saves 5–10% annually.

Frequently Asked Questions

Can I use Spot for production databases?
Not without significant architectural change. Traditional databases (MySQL, PostgreSQL) expect persistent, long-lived storage. Spot interruptions cause ungraceful shutdowns and data corruption. Instead, use Spot for read replicas in non-critical regions, or use managed database services (RDS, Cloud SQL, Azure Database) on-demand with automatic failover.
What's the difference between AWS Spot and Azure Spot?
Both offer 70–90% discounts. Main differences: AWS gives a 2-minute interruption notice vs. Azure's 30-second eviction notice; AWS Spot integrates with EC2/ECS/EKS natively, while Azure Spot integrates tightly with Virtual Machine Scale Sets (VMSS). AWS has the longer history (since 2009) and richer tooling (Spot Instance Advisor, Spot Fleet). Azure Spot is simpler to set up in VMSS but less feature-rich.
How do I avoid vendor lock-in with Spot?
Spot features are cloud-specific (AWS Spot Fleet, Azure VMSS, GCP MIG). To avoid lock-in, use orchestration layers: Kubernetes (multi-cloud), Terraform/Pulumi (IaC), or cloud-agnostic tools like Karpenter. Abstract spot provisioning behind a common interface; migrating to another cloud means changing config, not code.
Should I commit to Reserved Instances if I'm using Spot?
Yes, if your baseline is stable. Reserve 30–50% of capacity for guaranteed baseline (on-demand or RI), then layer Spot for variable demand. This hybrid approach offers 45–60% savings with low risk. For unstable workloads, 100% Spot is viable if checkpointing is solid.
How do I estimate ROI for Spot adoption?
Calculate: (Current monthly on-demand spend) × (Spot discount %) × (% workload suited for Spot) - (cost of engineering effort). Example: $100k/month × 70% discount × 60% suitable workload - $20k engineering = $22k/month saved. ROI payback: 1 month if you hire contractors, 3–6 months with internal team. Include ongoing operational overhead (monitoring, incident response).

Ready to Cut Compute Costs by 50–70%?

Spot instance strategies are proven to deliver massive savings, but execution requires planning. Partner with cloud negotiation experts to implement a multi-cloud Spot strategy tailored to your workloads.