Cloud Spot Strategy — Cost Optimization Guide

Cloud Spot Instance Strategy: Cut Compute Costs by 70–90% in 2026

Master spot and preemptible instances across AWS, Azure, and GCP. Learn fault-tolerant architecture patterns, interruption handling, and Kubernetes integration to maximise savings without sacrificing reliability.

Editorial Disclosure: This guide is based on current cloud provider pricing (as of 2026), publicly available interruption data, and production deployment patterns. Spot pricing fluctuates; always verify rates in your target regions. For enterprise negotiation of committed spend, consider consulting a cloud cost optimization firm.
90%
Max Spot Discount
$0.026/hr
AWS Spot c5.xlarge (us-east-1)
68%
Batch Cluster Cost Saving (100% Spot)
2min
AWS Interruption Notice Window

What Are Spot Instances?

Spot instances are interruptible cloud compute resources sold at a deep discount to regular on-demand pricing. Cloud providers use spot pricing to sell excess capacity: when demand is low, you get deep discounts (up to 90%); when demand spikes, your instance may be reclaimed with minimal notice. Trading availability for cost is the core value proposition.

AWS Spot Instances have been available since 2009 and represent spare EC2 capacity. AWS provides a two-minute warning before reclaiming the instance (via EC2 Spot Instance Interruption Notices).

Azure Spot VMs are Azure's equivalent, generally available since 2020 (succeeding the earlier low-priority VMs). They deliver a 30-second eviction notice via Azure Scheduled Events and are fully integrated with virtual machine scale sets (VMSS) for orchestrated deployments.

GCP Spot VMs and Preemptible VMs are Google's offering. Preemptible VMs provide up to 90% discounts, a hard 24-hour runtime limit, and a 30-second termination notice. Spot VMs (the newer offering) keep the same 30-second notice but remove the 24-hour limit and offer better pricing stability.

Key Insight

The fundamental trade-off: you save 70–90% on compute but accept interruption risk. This makes spot ideal for batch jobs, CI/CD pipelines, and non-critical compute—but unsuitable for stateful databases or real-time customer-facing workloads without significant architectural changes.

Discount Levels Across Clouds

Spot pricing varies by instance family, region, and time. Here's a snapshot of typical discounts as of Q1 2026:

Cloud Provider Instance Type Region On-Demand/hr Spot Price/hr Discount %
AWS c5.xlarge us-east-1 $0.085 $0.026 69%
AWS m5.large us-west-2 $0.096 $0.013 86%
Azure Standard_D4s_v3 East US $0.192 $0.058 70%
Azure Standard_D2s_v3 West Europe $0.096 $0.009 91%
GCP n1-standard-4 us-central1 $0.152 $0.030 80%
GCP n2-standard-2 us-east1 $0.101 $0.011 89%

Regional variation is significant. The same instance family in different regions can have 15–40% variance in both on-demand and spot pricing. Cheaper regions like us-central1 (GCP) or us-east-1 (AWS) offer better spot economics.
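The discount column in the table is just a ratio of the two prices; a small helper makes that explicit and can be pointed at live quotes later (AWS exposes spot price history via the EC2 `describe_spot_price_history` API; the call below is commented out because it needs credentials):

```python
# Compute the effective spot discount from a pair of hourly prices.
def spot_discount(on_demand_hr: float, spot_hr: float) -> int:
    """Return the discount as a whole percentage, e.g. 69 for c5.xlarge."""
    return round((1 - spot_hr / on_demand_hr) * 100)

# Sanity-check rows from the table above:
assert spot_discount(0.085, 0.026) == 69   # AWS c5.xlarge, us-east-1
assert spot_discount(0.096, 0.009) == 91   # Azure Standard_D2s_v3, West Europe
assert spot_discount(0.152, 0.030) == 80   # GCP n1-standard-4, us-central1

# Live AWS quotes can be pulled with boto3 (requires credentials):
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.describe_spot_price_history(InstanceTypes=["c5.xlarge"],
#                                   ProductDescriptions=["Linux/UNIX"])
```

Running the same check across regions each month is a cheap way to catch the 15–40% regional variance described above.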

Understanding Interruption Risk

Interruptions are the cost of the discount. Typical interruption rates vary by instance family and time of day:

Instance Family Typical Interruption Rate Interruption Notice Max Continuous Runtime
General Purpose (t3, t4g, m5, m6i) 2–5% 2 minutes (AWS) Unlimited
Compute Optimized (c5, c6i) 3–6% 2 minutes Unlimited
Memory Intensive (r5, r6i) 4–7% 2 minutes Unlimited
GCP Preemptible (older generation) 15–25% 30 seconds 24 hours hard limit
GCP Spot (newer) 5–15% 30 seconds Unlimited

AWS instances enjoy relatively stable interruption rates (2–6% depending on family), while GCP Preemptible instances experience higher volatility. GCP's newer Spot VMs bridge the gap with unlimited runtime while keeping the 30-second preemption notice.

Interruption Gotcha

Interruption rates spike during peak hours (8am–10pm local time) and during AWS/Azure/GCP maintenance windows. If your batch job must complete within 4 hours, target off-peak times and diversify across AZs to reduce cascade risk.

Workloads Suited for Spot

Ideal Spot Candidates

  • Batch Processing: MapReduce, data ETL, video transcoding. Resume-able jobs can checkpoint progress and restart on new instances.
  • CI/CD Pipelines: Build agents, test runners. Jobs are stateless and expected to be ephemeral.
  • Machine Learning Training: Model training with frequent checkpoints. TensorFlow/PyTorch supports interrupt-aware checkpointing.
  • Stateless Web Tier: Web servers behind load balancers. If one spot instance dies, traffic shifts to others.
  • Data Pipelines: Kafka consumers, stream processing with state stores. Kubernetes jobs handle restart logic.
  • Scheduled Analytics: Hourly/daily aggregations, reporting. Spot clusters can scale down after completion.

Poor Spot Candidates

  • Relational Databases: MySQL, PostgreSQL, Oracle. Data consistency requires graceful shutdown; interruptions cause data loss or corruption.
  • Stateful Applications: Session stores, in-memory caches without replication. Interruption means lost state.
  • Real-Time Customer APIs: APIs with strict latency SLAs. Interruptions cause user-visible failures.
  • Long-Running Transactions: Queries lasting 8+ hours. Interruption rollback is expensive.
  • License-Bound Software: Some enterprise software charges per-instance; frequent restart penalties apply.
Best Practice

Use a mixed strategy: On-Demand for stateful/critical workloads, Spot for resilient batch/compute. Most mature cloud shops run 60–70% Spot for non-prod and 40–50% Spot for prod (with Spot + On-Demand mixed deployments).

AWS Spot Strategy

Spot Fleet & Spot Instance Pools

AWS Spot Fleet is the primary abstraction for managing Spot instances at scale. Instead of requesting individual instances, you define a launch template with a target capacity (in vCPU, memory, or instance count), and Spot Fleet distributes requests across multiple instance types and AZs.

Key advantages: Spot Fleet automatically balances across instance families (e.g., c5.xlarge, c5.2xlarge, c6i.xlarge) to maximize fulfillment probability. If c5.xlarge is scarce, it falls back to c6i.xlarge. This diversification reduces single-family interruption impact.

Spot Instance Advisor

AWS provides the Spot Instance Advisor, a tool that ranks instance types by interruption rate. When configuring a Spot Fleet, reference this data to prioritize stable families. General purpose (t3, m5, m6i) typically have lower interruption rates than older generation instances (t2, m4).

Availability Zone Diversification

Spread requests across 3+ AZs (e.g., us-east-1a, us-east-1b, us-east-1c). AWS capacity constraints are AZ-specific; diversification means if us-east-1a hits capacity, instances in us-east-1b/1c can still launch. Spot Fleet configuration example:

AWS Spot Fleet Best Practice

  • LaunchTemplateConfigs: Define 6–8 instance type + AZ combinations (e.g., c5.xlarge in 1a/1b/1c, m5.xlarge in 1a/1b/1c).
  • AllocationStrategy: Use "capacity-optimized" to place instances in pools with the deepest available capacity.
  • TargetCapacity: Set the cluster's target (e.g., 100 units); the fleet distributes across pools.
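As a concrete sketch of that configuration, here is the shape of a boto3 `request_spot_fleet` payload covering 2 families × 3 AZs (6 pools). The role ARN and launch template ID are placeholders, and note the Spot Fleet API spells the strategy in camelCase:

```python
# Diversified Spot Fleet request: 6 pools (2 instance families x 3 AZs).
INSTANCE_TYPES = ["c5.xlarge", "m5.xlarge"]
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

fleet_config = {
    "IamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",  # placeholder
    "AllocationStrategy": "capacityOptimized",  # camelCase in this API
    "TargetCapacity": 100,  # instances, unless WeightedCapacity is set per override
    "LaunchTemplateConfigs": [{
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-0abc1234def567890",  # placeholder
            "Version": "$Latest",
        },
        # One override per (type, AZ) pair; the fleet picks the best pools.
        "Overrides": [
            {"InstanceType": t, "AvailabilityZone": z}
            for t in INSTANCE_TYPES
            for z in ZONES
        ],
    }],
}

# Submitting requires AWS credentials; shown for reference only:
#   import boto3
#   boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=fleet_config)
```

Adding a third family (e.g., c6i.xlarge) to INSTANCE_TYPES grows the pool count to 9 with no other changes, which is how the 6–8 combination guidance above is usually met.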

Azure Spot Strategy

Azure Spot VMs & Eviction Policies

Azure Spot VMs integrate seamlessly with Virtual Machine Scale Sets (VMSS). When deploying a web tier on VMSS, configure:

  • Priority: Spot – Request Spot capacity.
  • Eviction Policy: Deallocate – On eviction, the VM is stopped rather than deleted; it can be restarted later when Spot capacity returns (attached disks persist and keep accruing storage charges). Useful for workloads that can pause and resume.
  • Eviction Policy: Delete – Immediately terminate. Use for immutable workloads that scale horizontally (e.g., web tier behind a load balancer).
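A minimal sketch of what those settings look like as an ARM-template `virtualMachineProfile` fragment (field names follow Azure's VMSS schema; a maxPrice of -1 means "pay up to the on-demand rate, never evict on price"):

```python
# Spot-related fields of a VMSS virtualMachineProfile, as a plain dict
# in ARM-template shape (the rest of the profile is omitted).
spot_profile = {
    "priority": "Spot",
    "evictionPolicy": "Deallocate",      # or "Delete" for immutable web tiers
    "billingProfile": {"maxPrice": -1},  # -1: cap at the on-demand price
}
```

The same three knobs appear under the same names in Bicep, Terraform's azurerm provider, and the azure-mgmt-compute SDK, so the fragment transfers across tooling.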

VMSS Autoscaling with Spot

Configure VMSS with a mix of Spot and On-Demand instances. For example:

  • 60% Spot (cheaper baseline).
  • 40% On-Demand (stability for critical traffic).

VMSS autoscale rules trigger on CPU/memory metrics. As spot instances are evicted, scale-up rules provision replacement on-demand instances, ensuring service continuity.

Regional Considerations

Azure Spot pricing varies significantly by region. West Europe and UK South typically offer 75–85% discounts; US East offers 65–80%. Storage egress costs differ too; factor into total cost model.

GCP Spot Strategy

Spot VMs vs. Preemptible VMs

Preemptible VMs (older offering) have a 24-hour runtime limit and a guaranteed 30-second termination notice. They offer up to 90% discount.

Spot VMs (launched 2021, recommended) have no runtime limit, the same 30-second preemption notice, and more stable pricing. Spot is the preferred path forward; Google is moving away from Preemptible.

Managed Instance Groups (MIG) & Auto-Healing

Deploy Spot VMs in a Managed Instance Group with autohealing enabled. Configure an instance template, desired capacity (e.g., 100 instances), and healthcheck URL. When a Spot instance is interrupted:

  1. GCP sends a 30-second termination notice.
  2. Shutdown scripts drain connections (graceful termination).
  3. Instance is deleted.
  4. MIG autoscaler provisions a replacement instance from the pool.
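The replacement loop above hinges on the instance template requesting Spot capacity and deleting on preemption. A sketch of the relevant pieces (field names follow the Compute Engine API; the surrounding template, MIG, and the drain script path are illustrative):

```python
# Scheduling fragment for a Compute Engine instance template: request
# Spot capacity and delete on preemption so the MIG replaces the
# instance (step 4 above).
scheduling = {
    "provisioningModel": "SPOT",
    "instanceTerminationAction": "DELETE",
}

# Steps 1-2 (notice + connection draining) run from a shutdown script
# attached to the same template via metadata. The drain.sh path is a
# hypothetical stand-in for your own drain logic.
shutdown_metadata = {
    "key": "shutdown-script",
    "value": "#!/bin/bash\n/usr/local/bin/drain.sh",
}
```

Keep the drain logic well under 30 seconds; GCP does not wait for the script to finish before terminating.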

Pricing Stability & CUD Stacking

GCP Spot pricing is historically more stable than AWS Spot (fewer sudden spikes). You can stack Spot discounts with Committed Use Discounts (CUDs) on non-Spot instances within the same group, further reducing blended costs.

Fault-Tolerant Architecture Patterns for Spot

Checkpoint & Resume Pattern

For long-running batch jobs, implement checkpoint logic:

  1. Periodic Checkpointing: Every 10–30 minutes, write job state to persistent storage (S3, GCS, Azure Blob Storage).
  2. Graceful Shutdown Handler: Listen for termination signals (SIGTERM on Linux). On interruption notice, finalize current work unit and write final checkpoint.
  3. Resume on New Instance: New spot instance reads last checkpoint and continues from that point.

Example: A Spark batch job processing 1 million records. Checkpoint every 100k records. If interrupted at record 250k, restart instance resumes at record 250k, not from 0.
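A runnable sketch of the pattern, with a local file standing in for S3/GCS/Blob Storage. One nuance worth seeing in code: with only periodic checkpoints, a hard interruption at record 250k resumes from the last durable checkpoint at 200k; it is the graceful-shutdown handler (step 2), writing a final checkpoint during the notice window, that gets you back to exactly 250k as in the Spark example:

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 100_000   # records per checkpoint, as in the example

def load_checkpoint(path):
    """Return the record offset to resume from (0 on first run)."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_record"]
    return 0

def process(path, total, interrupt_at=None):
    """Process records with periodic checkpoints; returns last record reached."""
    for record in range(load_checkpoint(path), total):
        if record == interrupt_at:
            return record                    # simulate a hard spot interruption
        if (record + 1) % CHECKPOINT_EVERY == 0:
            with open(path, "w") as f:       # durable progress marker
                json.dump({"next_record": record + 1}, f)
    return total

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
process(ckpt, total=1_000_000, interrupt_at=250_000)  # killed at record 250k
assert load_checkpoint(ckpt) == 200_000               # resume point: last checkpoint
assert process(ckpt, total=1_000_000) == 1_000_000    # second run finishes the job
```

Shrinking CHECKPOINT_EVERY trades extra storage writes for less rework after an ungraceful interruption.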

Graceful Shutdown Handlers

Spot interruption delivers a signal two minutes before termination (AWS) or 30 seconds before (GCP and Azure). Your application should:

  • Stop accepting new requests.
  • Drain in-flight requests (wait for active connections to close).
  • Flush logs, metrics, and state to persistent storage.
  • Exit cleanly.

Kubernetes example: Use a PreStop lifecycle hook that runs a sleep and drain script before the container is killed.
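A minimal sketch of that shutdown sequence. The work and flush functions are stand-ins for your own handlers; in Kubernetes, the same SIGTERM arrives after the PreStop hook completes:

```python
import os
import signal
import time

draining = False

def handle_sigterm(signum, frame):
    """Spot interruption arrives as SIGTERM: flip into drain mode."""
    global draining
    draining = True

signal.signal(signal.SIGTERM, handle_sigterm)

def serve(work_queue, do_work, flush_state):
    """Drain-aware loop: stop taking work, finish in-flight, flush, return."""
    done = 0
    for unit in work_queue:
        if draining:
            break                # 1. stop accepting new work
        do_work(unit)            # 2. finish the in-flight unit
        done += 1
    flush_state()                # 3. flush logs/metrics/state to storage
    return done                  # 4. caller exits cleanly afterwards

# Simulated run: the "interruption notice" fires during the second unit.
log = []
def fake_work(u):
    log.append(u)
    if u == 2:
        os.kill(os.getpid(), signal.SIGTERM)  # simulate the notice
        time.sleep(0)                          # let the handler run

done = serve([1, 2, 3, 4], fake_work, lambda: log.append("flushed"))
assert done == 2 and log == [1, 2, "flushed"]
```

The whole sequence must fit the notice window: budget two minutes on AWS, 30 seconds on GCP and Azure.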

Spot + On-Demand Mixed Deployments

Deploy a service with both Spot and On-Demand instances. For example, a web tier with 100 instances:

  • 60 Spot instances (cost: ~$6/hr at $0.1/hr each).
  • 40 On-Demand instances (cost: ~$40/hr at $1/hr each).
  • Total: ~$46/hr vs. $100/hr (all on-demand). Savings: 54%.

If 5 Spot instances are interrupted, load balancer shifts traffic to the remaining 55 Spot + 40 On-Demand (total 95), maintaining SLA until autoscaling provisions new Spot instances.
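The blend is worth encoding so the spot/on-demand ratio can be tuned as interruption rates change. A sketch using the example's illustrative $0.10/$1.00 rates:

```python
def blended_hourly(n_spot, spot_hr, n_od, od_hr):
    """Hourly cost of a mixed spot + on-demand fleet."""
    return n_spot * spot_hr + n_od * od_hr

all_od = blended_hourly(0, 0.0, 100, 1.0)    # all on-demand baseline
mixed  = blended_hourly(60, 0.10, 40, 1.0)   # the 60/40 split above
saving = round((1 - mixed / all_od) * 100)

assert round(all_od) == 100   # $100/hr
assert round(mixed) == 46     # $46/hr
assert saving == 54           # 54% saving, matching the example
```

Re-running with a 70/30 split shows the marginal saving of each extra spot node, which helps decide how much interruption risk the SLA can absorb.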

Architectural Insight

Spot is not a 100% cost-free discount. You're trading capital savings for operational complexity: monitoring, graceful shutdown, checkpointing, and orchestration. Factor in engineering effort when evaluating ROI.

Kubernetes & Spot

Node Pools for Spot Instances

Most Kubernetes clusters run multiple node pools: one for on-demand (critical workloads) and one for spot (batch jobs, non-critical services). In GKE, create a node pool with:

  • Node Taints: cloud.google.com/gke-spot=true:NoSchedule (or gke-preemptible on older pools) keeps pods without a matching toleration off spot nodes.
  • Node Affinity: Workloads with spot tolerance use nodeAffinity selectors to prefer spot pools.
  • Pod Disruption Budgets (PDB): Limit concurrent pod evictions during node termination.

Spot Interruption Handlers

aws-node-termination-handler (for EKS) monitors EC2 Spot interruption notices and:

  1. Detects the termination event up to 2 minutes before shutdown.
  2. Cordons the node (marks as unschedulable).
  3. Gracefully evicts all pods with a grace period (default 120 seconds).
  4. Drains the node; Kubernetes reschedules pods to other nodes.

Similar handling exists on the other clouds: AKS eviction events surface through Azure Scheduled Events (consumed by community node-termination handlers), and GKE drains Spot nodes automatically via graceful node shutdown.

Karpenter: Next-Gen Spot Orchestration

Karpenter (CNCF project) automates node provisioning for Kubernetes clusters, optimizing for both cost and availability. Key features:

  • Consolidation: Automatically removes underutilized nodes, reducing spend.
  • Spot Diversification: Spreads workloads across instance types and AZs to minimize interruption impact.
  • Fallback to On-Demand: If Spot capacity unavailable, Karpenter automatically falls back to on-demand (with configurable ratios).
  • Right-Sizing: Provisions the smallest instance type that fits pod requirements, saving on per-instance overhead.

Karpenter can replace the Kubernetes Cluster Autoscaler, typically delivering better spot economics with less operational overhead.

Spot + Savings Plans / Reserved Instances Stacking Strategy

Many teams ask: "Can I combine Spot discounts with Savings Plans or Reserved Instances?" The short answer is partially, and it depends on the cloud provider.

AWS: Savings Plans + Spot

AWS Savings Plans provide 20–40% discounts on on-demand pricing, and are independent of Spot. You can layer them:

  • Buy a 1-year Compute Savings Plan for your baseline on-demand workloads.
  • Use Spot for variable/batch workloads on top.
  • Example: 60 vCPU baseline with 1-year Savings Plan (~$12k/year), + 40 vCPU Spot for peaks (~$2k/year). Total: ~$14k vs. $25k (all on-demand). Savings: 44%.

Azure: Reserved Instances + Spot

Azure Reserved Instances provide 25–55% discounts on on-demand pricing. RIs are instance-size specific. You can:

  • Buy RIs for your baseline (guaranteed) workloads.
  • Layer Spot for additional capacity.
  • Blended savings are substantial but depend on RI allocation accuracy. Overprovisioning RIs wastes capital.

GCP: CUDs + Spot

GCP Committed Use Discounts (CUDs) provide 25–70% discounts on on-demand pricing. CUDs and Spot are mutually exclusive on the same instance but can be blended:

  • Buy 1-year CUD for 60 vCPU baseline (e.g., n2-standard-8).
  • Use Spot for additional capacity or bursty workloads.
  • Blended discounts can exceed 60% when optimized.
Commitment + Spot Best Practice

Size commitments conservatively. Overcommitting wastes capital; undercommitting increases on-demand overage costs. Use 12–24 months of historical usage to right-size, then layer Spot for variable demand. Typical maturity: 50% CUD/RI + 30% Spot + 20% On-Demand flex.

Cost Modelling Example: 100-Node Batch Cluster

Scenario: A data engineering team runs a 100-node batch cluster for daily ETL (6 hours/day). Each node is c5.2xlarge in us-east-1.

Current On-Demand Baseline

  • 100 nodes × c5.2xlarge × $0.34/hr × 6 hrs/day × 250 working days/year = $51,000/year.

Scenario 1: 100% Spot

  • Spot price: c5.2xlarge = $0.11/hr (68% discount).
  • 100 nodes × $0.11/hr × 6 hrs/day × 250 days = $16,500/year.
  • Savings: $34,500/year (68%).
  • Risk: Interruptions during batch window could delay jobs. Mitigation: checkpointing, retry logic.

Scenario 2: 70% Spot + 30% On-Demand (Mixed)

  • 70 nodes Spot @ $0.11/hr + 30 nodes On-Demand @ $0.34/hr = (70 × 0.11) + (30 × 0.34) = $17.90/hr.
  • 100 nodes × $0.179/hr average × 6 hrs/day × 250 days = $26,850/year.
  • Savings: $24,150/year (47%).
  • Risk: Lower. If 5–10 Spot nodes are interrupted, on-demand buffer absorbs delay.

Scenario 3: 60% Spot + 1-Year Savings Plan (40% baseline)

  • 40 nodes via Savings Plan @ $0.24/hr (30% discount) + 60 nodes Spot @ $0.11/hr.
  • Cost: (40 × 0.24) + (60 × 0.11) = $16.20/hr.
  • Annual: $16.20/hr × 6 hrs/day × 250 days = $24,300/year.
  • Savings: $26,700/year (52%).
  • Commitment: Savings Plan = 40 nodes × $0.24/hr × 8,760 hrs = ~$84,000/year of committed spend. Note the commitment bills around the clock, so it only pays off if those nodes run other workloads outside the 6-hour batch window.
  • Risk & benefit: Balanced. Guaranteed baseline; Spot for peak. Best for predictable workloads.
Strategy Annual Cost Savings vs. On-Demand Interruption Risk Upfront Capex
On-Demand (baseline) $51,000 None $0
100% Spot $16,500 68% ($34,500) High $0
70% Spot + 30% On-Demand $26,850 47% ($24,150) Moderate $0
60% Spot + 1-Yr Savings Plan $24,300 52% ($26,700) Low–Moderate ~$84,000
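The scenarios reduce to simple rate arithmetic, so they are easy to reproduce and re-run with your own prices:

```python
HOURS_PER_YEAR = 6 * 250   # 6 hrs/day x 250 working days
NODES = 100
OD, SPOT, SP = 0.34, 0.11, 0.24   # $/hr per c5.2xlarge: on-demand, spot, savings plan

def annual(hourly_fleet_cost):
    """Annualize an hourly fleet cost over the batch window."""
    return hourly_fleet_cost * HOURS_PER_YEAR

baseline = annual(NODES * OD)            # Scenario 0: all on-demand
all_spot = annual(NODES * SPOT)          # Scenario 1: 100% spot
mixed    = annual(70 * SPOT + 30 * OD)   # Scenario 2: 70/30 mix
sp_blend = annual(60 * SPOT + 40 * SP)   # Scenario 3: (40x0.24 + 60x0.11) = $16.20/hr

assert [round(x) for x in (baseline, all_spot, mixed, sp_blend)] == [
    51000, 16500, 26850, 24300
]
```

Swapping in another instance family or region only means changing the three rates, which makes the model handy for the quarterly audits recommended below.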

8 Tactics for Maximising Spot Savings

Tactic 1
Diversify Across Instance Types & Families
Don't rely on a single instance type. Spot capacity is fluid; if c5.xlarge is scarce, c6i.xlarge or c7g.xlarge may be cheaper. Configure Spot Fleets (AWS) or MIGs (GCP) with 6–8 instance family combinations. Increases fulfillment probability by 30–40%.
Tactic 2
Target Off-Peak Windows
Interruption rates spike 8am–10pm local time. Run batch jobs during off-peak (11pm–7am). Interruption probability drops 50–70%. Schedule daily ETL for 2am–8am, not during business hours.
Tactic 3
Implement Graceful Shutdown & Checkpointing
Deploy spot interruption handlers and checkpoint logic. For Kubernetes, use aws-node-termination-handler or Karpenter. For batch, write state every 10–30 minutes. Reduces job failure risk from 20% to 2–3% on 100-node clusters.
Tactic 4
Spread Across Availability Zones
AWS/Azure/GCP capacity constraints are AZ-specific. Deploy across 3+ AZs. If us-east-1a goes scarce, us-east-1b/1c will have available capacity. Multi-AZ Spot Fleets achieve 20–30% higher fulfillment vs. single-AZ.
Tactic 5
Monitor & Right-Size Instance Selection
Use AWS Spot Instance Advisor, Azure pricing API, or GCP insights to track interruption trends by family. Right-size your fleet monthly. Shift workload from high-interrupt families to stable ones (c5/c6i over c4/c3). Saves 5–15% on interruption overhead.
Tactic 6
Combine Spot with On-Demand & Committed Discounts
Use on-demand for 20–30% of capacity (critical baseline), Spot for 60–70% (variable), and optionally Savings Plans/CUDs for predictable baseline. Blended discounts reach 50–60% vs. 100% on-demand, with lower risk than 100% Spot.
Tactic 7
Use Spot Pricing APIs for Real-Time Decisions
AWS, Azure, and GCP expose Spot pricing via APIs. Build a "buy decision" service: if c5.xlarge Spot is >40% of on-demand, use on-demand instead. Prevents overpaying during price spikes. Reduces opex by 8–12%.
Tactic 8
Audit & Optimize Workload Fit Quarterly
Review workload interruption logs quarterly. If batch jobs are failing at 15% rate due to interruptions, increase on-demand mix or reduce batch window. If CI/CD agents are idle 40% of the time, downsize instance types. Iterative optimization saves 5–10% annually.

Frequently Asked Questions

Can I use Spot for production databases?
Not without significant architectural change. Traditional databases (MySQL, PostgreSQL) expect persistent, long-lived storage. Spot interruptions cause ungraceful shutdowns and data corruption. Instead, use Spot for read replicas in non-critical regions, or use managed database services (RDS, Cloud SQL, Azure Database) on-demand with automatic failover.
What's the difference between AWS Spot and Azure Spot?
Both offer 70–90% discounts. Main differences: AWS gives a 2-minute interruption notice vs. Azure's 30-second eviction notice; AWS Spot integrates with EC2/ECS/EKS natively, while Azure Spot integrates tightly with Virtual Machine Scale Sets (VMSS). AWS has the longer history (since 2009) and richer tooling (Spot Instance Advisor, Spot Fleet). Azure Spot is simpler to set up in VMSS but less feature-rich.
How do I avoid vendor lock-in with Spot?
Spot features are cloud-specific (AWS Spot Fleet, Azure VMSS, GCP MIG). To avoid lock-in, use orchestration layers: Kubernetes (multi-cloud), Terraform/Pulumi (IaC), or cloud-agnostic tools like Karpenter. Abstract spot provisioning behind a common interface; migrating to another cloud means changing config, not code.
Should I commit to Reserved Instances if I'm using Spot?
Yes, if your baseline is stable. Reserve 30–50% of capacity for guaranteed baseline (on-demand or RI), then layer Spot for variable demand. This hybrid approach offers 45–60% savings with low risk. For unstable workloads, 100% Spot is viable if checkpointing is solid.
How do I estimate ROI for Spot adoption?
Calculate: (Current monthly on-demand spend) × (Spot discount %) × (% workload suited for Spot) - (cost of engineering effort). Example: $100k/month × 70% discount × 60% suitable workload - $20k engineering = $22k/month saved. ROI payback: 1 month if you hire contractors, 3–6 months with internal team. Include ongoing operational overhead (monitoring, incident response).

Ready to Cut Compute Costs by 50–70%?

Spot instance strategies are proven to deliver massive savings, but execution requires planning. Partner with cloud negotiation experts to implement a multi-cloud Spot strategy tailored to your workloads.