Kubernetes Cost Optimization:
Cluster Right-Sizing Guide

Kubernetes clusters routinely waste 40–60% of their provisioned compute through over-provisioned resource requests, idle node capacity, and poorly tuned autoscaling. This guide covers the complete stack of K8s cost optimisation — from pod-level resource right-sizing to node pool strategy, Spot integration, and managed service cost management on EKS, AKS, and GKE.

50%: Typical K8s Compute Waste
60%: Savings with Spot Node Pools
40%: Savings from Right-Sizing
30%: Typical VPA Improvement

This guide is part of the Cloud Cost Optimization: Enterprise FinOps Guide. Kubernetes has become the dominant platform for enterprise container workloads, but its flexible scheduling model creates a systematic cost problem: resource requests are set conservatively, nodes are over-provisioned to ensure pod scheduling succeeds, and idle capacity accumulates invisibly across clusters. Understanding the full K8s cost stack — from individual container resource requests to cluster-level autoscaling and managed service pricing — is a prerequisite for meaningful optimisation. For commitment instrument strategy covering EKS, AKS, and GKE nodes, see the Reserved Instances vs Savings Plans guide.

The K8s Cost Architecture

Kubernetes costs operate at four distinct levels, each requiring separate optimisation strategies. At the container level, CPU and memory requests determine how much of a node's capacity is "reserved" for a pod — even if the pod never actually uses that capacity. At the pod level, replica counts, resource limits, and scheduling constraints determine pack density on nodes. At the node level, instance type selection, autoscaling boundaries, and Spot vs on-demand mix determine the cost of raw compute. At the cluster level, managed control plane fees, networking costs, storage, and observability tooling add overhead that compounds at scale.

The Request vs Actual Usage Gap

The root cause of most Kubernetes cost waste is the gap between resource requests (what pods ask for) and actual resource usage (what pods consume). In typical enterprise clusters, CPU requests exceed actual CPU usage by 3–5x and memory requests exceed actual usage by 2–3x. This gap creates clusters that are "full" according to the scheduler — no new pods can be scheduled — while the underlying nodes sit at 20–30% actual CPU utilisation. Right-sizing resource requests is the single highest-leverage Kubernetes cost optimisation action.

Pod and Container Right-Sizing

Pod right-sizing is the process of setting CPU and memory requests to values that accurately reflect actual resource consumption, with appropriate headroom for traffic spikes and garbage collection events. The standard approach is to collect 14–30 days of Prometheus metrics (or equivalent), identify the P95 CPU and memory utilisation for each container, and set requests to P95 + a safety margin (typically 10–20% for CPU, 20–30% for memory).
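The P95-plus-margin rule above can be sketched as a small script. All inputs here are hypothetical samples standing in for 14–30 days of exported Prometheus metrics; the margins follow the ranges given in the text:

```python
import math

def recommend_requests(cpu_samples_mcores, mem_samples_mib,
                       cpu_margin=0.15, mem_margin=0.25):
    """Derive CPU/memory requests from observed usage: P95 plus a safety margin.

    Margins follow the guide's ranges: 10-20% for CPU, 20-30% for memory.
    Inputs are per-interval usage samples (millicores / MiB).
    """
    def p95(samples):
        ordered = sorted(samples)
        return ordered[min(len(ordered) - 1, math.ceil(0.95 * len(ordered)) - 1)]

    return {
        "cpu_request_mcores": math.ceil(p95(cpu_samples_mcores) * (1 + cpu_margin)),
        "mem_request_mib": math.ceil(p95(mem_samples_mib) * (1 + mem_margin)),
    }

# Hypothetical container: idles around 120m CPU with spikes to 300m,
# and around 450 MiB memory with occasional 700 MiB peaks.
cpu = [120] * 90 + [300] * 10   # millicores
mem = [450] * 95 + [700] * 5    # MiB
print(recommend_requests(cpu, mem))
```

Because the P95 already sits above normal load, the resulting request covers spikes without reserving the 3–5x headroom that conservative hand-set requests typically carry.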


Vertical Pod Autoscaler (VPA)

The Kubernetes Vertical Pod Autoscaler automates resource request right-sizing by analysing historical usage and recommending (or automatically applying) updated CPU and memory request values. VPA's update modes include Off (recommendations only, no automatic updates), Initial (requests set only at pod creation), and Auto (requests updated dynamically, which currently requires evicting and recreating pods). For production workloads, VPA in recommendation mode is safe to run immediately; Auto mode requires careful testing as pod restarts can impact availability.

VPA recommendations typically identify 30–50% CPU over-provisioning in enterprise clusters. The most common over-provisioned containers are Java applications (which are often given 2–4 CPU requests but rarely use more than 0.5–1 CPU) and microservices with conservative initial estimates that were never revisited after deployment.

VPA and HPA Conflict

VPA and Horizontal Pod Autoscaler (HPA) cannot safely operate on the same resource simultaneously. If HPA is scaling replicas based on CPU utilisation, and VPA is simultaneously changing CPU requests, the scaling signals interfere — VPA reduces requests, HPA sees lower utilisation per pod, and the system oscillates. The safe configuration: use HPA for CPU-based horizontal scaling and VPA for memory-based right-sizing only, or use a controller like KEDA that avoids the conflict entirely.
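The interference can be seen in a toy simulation (all numbers hypothetical): VPA trims CPU requests toward observed per-pod usage while HPA scales replicas to hold utilisation at a 50% target, and the two signals chase each other instead of converging:

```python
import math

def simulate_conflict(demand_mcores=2000, request=1000, replicas=4, steps=4):
    """Toy model of the VPA/HPA feedback loop: each step applies an HPA-style
    replica update and a VPA-style request update against fixed total demand.
    Returns (replicas, request, utilisation) per step."""
    history = []
    for _ in range(steps):
        per_pod_usage = demand_mcores / replicas          # total demand is fixed
        utilisation = per_pod_usage / request
        history.append((replicas, request, round(utilisation, 2)))
        replicas = max(1, math.ceil(replicas * utilisation / 0.50))  # HPA step
        request = max(1, round(per_pod_usage))                       # VPA step
    return history

print(simulate_conflict())
```

Replicas climb while requests shrink, even though the workload's total demand never changes; splitting responsibilities (HPA on CPU, VPA on memory) removes the shared signal.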

Autoscaling Strategy

Kubernetes autoscaling operates at two levels: Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on CPU, memory, or custom metrics; Cluster Autoscaler (CA) scales the number of nodes based on pending pod scheduling requirements. Optimising both layers is essential to minimise costs while maintaining availability.

Horizontal Pod Autoscaler Tuning

The most common HPA misconfiguration is setting target CPU utilisation too low — typically 50% — out of fear of latency spikes. This keeps pods running at half their capacity, requiring twice as many replicas (and nodes) as necessary. For most stateless web services and APIs, a target CPU utilisation of 70–80% is appropriate; response time at this level is typically indistinguishable from 50%. Set appropriate minReplicas to ensure availability during scale-up lag, and use KEDA (Kubernetes Event-Driven Autoscaling) for workloads driven by queue depth or custom metrics rather than CPU.
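The replica impact of the target choice falls directly out of the HPA scaling rule, desiredReplicas = ceil(currentReplicas x currentMetric / target). A quick check with illustrative numbers:

```python
import math

def hpa_desired_replicas(current_replicas, current_utilisation, target):
    """The HPA scaling rule: desired = ceil(current * currentMetric / target)."""
    return math.ceil(current_replicas * current_utilisation / target)

# A stateless service whose 10 replicas average 40% CPU:
# a 50% target keeps 8 replicas, while a 75% target needs only 6.
print(hpa_desired_replicas(10, 0.40, 0.50))
print(hpa_desired_replicas(10, 0.40, 0.75))
```

Raising the target from 50% to 75% cuts steady-state replicas (and the nodes behind them) by a quarter in this example, with no change to the workload itself.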

Cluster Autoscaler Optimisation

Cluster Autoscaler (CA) adds nodes when pods are pending due to insufficient capacity and removes nodes that have been underutilised for a configurable period (scale-down-unneeded-time, default 10 minutes). Key tuning parameters for cost optimisation: keep scale-down-utilization-threshold at its default of 0.5 (lowering it leaves lightly loaded nodes running longer); shorten scale-down-delay-after-add below its 10-minute default in environments where bursts subside quickly; and set expander=least-waste so that when CA must add capacity it picks the node group leaving the least unused CPU and memory. Karpenter (AWS-native, with an Azure provider for AKS) offers faster, more cost-aware node provisioning than standard CA, with native Spot instance integration.

Node Pool Optimisation

Node pool design is one of the highest-impact K8s cost levers and one of the most frequently ignored. Most clusters start with a single node pool using a single instance type, which is convenient but suboptimal: different workload profiles (CPU-intensive, memory-intensive, batch, stateful) have different optimal instance types, and mixing them in a single pool means all workloads pay the price of the most conservative choice.


| Workload Type | Recommended Instance Family | Key Characteristic | Pool Strategy |
|---|---|---|---|
| General-purpose web/API | AWS m7i, Azure D-series, GCP N2 | Balanced CPU/memory ratio | On-demand + Spot mix |
| CPU-intensive (ML inference, encoding) | AWS c7i, Azure F-series, GCP C3 | High CPU:memory ratio | On-demand for SLAs |
| Memory-intensive (caching, in-memory DB) | AWS r7i, Azure E-series, GCP M3 | High memory:CPU ratio | On-demand (memory risk) |
| Batch / ML training | AWS p4/p5, Azure NC-series, GCP A3 | GPU acceleration | Spot preferred (70%+ savings) |
| System/monitoring pods | AWS t3/t4g, Azure B-series, GCP E2 | Burstable, low baseline | Small on-demand pool |

Spot and Preemptible Node Pools

Spot Instances (AWS), Azure Spot VMs, and GCP Spot VMs offer 60–91% discounts versus on-demand pricing for the same instance types. For Kubernetes workloads, Spot node pools are one of the most impactful cost levers available — but require architecture accommodations to handle preemption gracefully.

Spot-Compatible Workload Patterns

The following workload types are well-suited to Spot/preemptible nodes: batch processing jobs, ML model training, CI/CD pipeline runners, development and staging environments, stateless microservices with multiple replicas (where losing one node doesn't impact availability), and data processing pipelines with checkpointing. Workloads that are not suitable for Spot: stateful single-instance databases, leader-elected controllers, services with strict single-digit millisecond latency SLAs, and anything where a 30-second shutdown notice cannot be gracefully handled.

The optimal architecture for Spot-enabled clusters uses a mixed node pool strategy: a small on-demand baseline pool (sized for minimum viable capacity) plus a Spot pool for burst and batch capacity. Node affinity and taints/tolerations route appropriate workloads to each pool. Kubernetes Pod Disruption Budgets (PDBs) ensure rolling Spot replacements don't take down more than a defined percentage of a deployment's replicas simultaneously.
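The economics of this mixed-pool design are straightforward to model. The node rate and discount below are hypothetical placeholders, not quoted cloud prices:

```python
def blended_hourly_cost(total_nodes, on_demand_baseline, od_rate, spot_discount):
    """Hourly cost of a mixed pool: an on-demand baseline for minimum viable
    capacity plus Spot nodes for burst and batch capacity.

    od_rate ($/node-hour) and spot_discount (e.g. 0.70 for a 70% discount
    versus on-demand) are illustrative assumptions."""
    spot_nodes = total_nodes - on_demand_baseline
    return on_demand_baseline * od_rate + spot_nodes * od_rate * (1 - spot_discount)

# 20-node cluster with a 6-node on-demand baseline, $0.20/hr nodes, 70% Spot discount:
all_on_demand = 20 * 0.20
mixed = blended_hourly_cost(20, 6, 0.20, 0.70)
print(round(mixed, 2), round(1 - mixed / all_on_demand, 2))
```

Even with nearly a third of capacity kept on-demand for safety, the blended rate lands close to half of an all-on-demand cluster at these assumed prices.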


Namespace Quotas and LimitRanges

ResourceQuotas and LimitRanges are Kubernetes admission control mechanisms that constrain resource consumption at the namespace level. Without them, a single team can inadvertently (or deliberately) consume unlimited cluster resources — degrading performance for all other tenants and causing uncontrolled cost growth.

ResourceQuotas set hard caps on total CPU requests, memory requests, CPU limits, memory limits, and object counts (pods, services, PVCs) per namespace. LimitRanges set default and maximum resource requests and limits for individual containers in a namespace, ensuring that pods without explicit resource declarations are assigned defaults rather than running unconstrained. Together, these mechanisms enforce the resource governance that makes cluster cost attribution and budgeting possible. Without LimitRanges, developers frequently deploy pods with no resource requests — which the scheduler treats as having zero resource requirements, leading to nodes that appear to have spare capacity but are actually overloaded.
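The admission check these mechanisms perform can be sketched in a few lines. The quota values and LimitRange-style defaults below are hypothetical:

```python
def fits_quota(pod_requests, quota):
    """ResourceQuota-style check: the *sum* of requests across a namespace
    must stay under the hard cap. pod_requests is a list of
    (cpu_millicores, memory_mib) tuples."""
    total_cpu = sum(cpu for cpu, _ in pod_requests)
    total_mem = sum(mem for _, mem in pod_requests)
    return total_cpu <= quota["cpu_millicores"] and total_mem <= quota["memory_mib"]

DEFAULTS = (250, 512)  # LimitRange-style defaults applied to pods with no requests

quota = {"cpu_millicores": 4000, "memory_mib": 8192}  # e.g. requests.cpu: 4, requests.memory: 8Gi
pods = [(500, 1024)] * 6 + [DEFAULTS] * 2             # two pods relied on defaults
print(fits_quota(pods, quota))
```

Without the LimitRange defaults, the last two pods would count as zero toward the quota while still consuming real node capacity, which is exactly the invisible-overload failure mode described above.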

EKS, AKS, and GKE Cost Comparison

| Dimension | AWS EKS | Azure AKS | GCP GKE |
|---|---|---|---|
| Control plane cost | $0.10/hr per cluster ($73/mo) | Free (Free tier), $0.10/hr (Standard tier) | Free (1 zonal cluster), $0.10/hr (Standard/Autopilot) |
| Spot node support | Native Spot, Karpenter | Spot node pools, Karpenter | Spot pools, GKE Autopilot Spot |
| Managed autoscaling | Cluster Autoscaler or Karpenter | Cluster Autoscaler or Karpenter | GKE Autopilot (fully managed) |
| RI/commitment coverage | EC2 Savings Plans cover nodes | Azure RIs + Savings Plans | CUDs cover node compute |
| Fargate/serverless nodes | EKS Fargate (higher per-vCPU cost) | AKS Virtual Nodes (ACI) | GKE Autopilot (automated) |
| Cost visibility tooling | Container Insights + Cost Explorer | Container Insights + Cost Management | GKE Cost Allocation (native namespace breakdown) |

K8s Cost Tooling

Several open-source and commercial tools provide Kubernetes-specific cost visibility that cloud-native billing tools lack. OpenCost (CNCF project) provides real-time, namespace-level cost allocation based on node pricing and pod resource consumption — integrating with Prometheus and supporting all three major managed K8s services. Kubecost (commercial; built on OpenCost) adds cost efficiency scores, request right-sizing recommendations, and cluster health metrics. Goldilocks (open-source, by Fairwinds) automates VPA recommendation generation and presents them in a dashboard, making right-sizing analysis accessible without Prometheus expertise.

At the cloud provider level, GKE has the most mature native cost attribution — GKE Cost Allocation in the GCP Billing Console provides namespace and label-based cost breakdown without additional tooling. AWS Container Insights and Azure Monitor provide cluster metrics but require additional work to produce meaningful cost-per-namespace or cost-per-workload views.

Frequently Asked Questions

What is a realistic savings target for Kubernetes cost optimisation?
In enterprise clusters that have not been actively optimised, 30–50% cost reduction is achievable through a combination of resource request right-sizing (15–25% typical), Spot node pool introduction (additional 10–20% depending on eligible workload percentage), and autoscaling tuning (5–15%). Namespace quota enforcement and idle cluster cleanup (dev/staging clusters left running overnight or over weekends) can add another 10–15% on top. The total available savings depend heavily on the current maturity of the cluster's resource management practices.
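One subtlety in stacking these percentages: each optimisation applies to the bill left by the previous one, so the individual figures are not simply additive. A quick check using mid-range figures from the ranges above:

```python
def stacked_savings(reductions):
    """Sequential optimisations compound on the remaining bill,
    so individual percentages are not additive."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1 - r
    return 1 - remaining

# Mid-range figures from the ranges above: right-sizing 20%, Spot 15%,
# autoscaling tuning 10%, quota enforcement and idle cleanup 12%.
print(round(stacked_savings([0.20, 0.15, 0.10, 0.12]), 3))
```

The compounded total comes to roughly 46%, consistent with the 30–50% headline plus the additional cleanup layer.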
Should we use GKE Autopilot or Standard mode for cost optimisation?
GKE Autopilot eliminates node management overhead and charges only for pod-requested resources (not node capacity) — making it naturally more cost-efficient for workloads with highly variable resource requirements. Standard mode gives more control over node pool configuration, Spot integration, and custom autoscaling logic, which typically delivers better optimisation for mature FinOps teams. For teams without dedicated Kubernetes operational expertise, Autopilot is often the lower-waste option. For teams with FinOps maturity and mixed workloads, Standard mode with well-tuned node pools typically delivers lower cost at scale.
How do we apply cloud commitment discounts to Kubernetes node costs?
Cloud commitment instruments (AWS Savings Plans, Azure RIs, GCP CUDs) apply to the underlying EC2/Azure VM/Compute Engine instances running as Kubernetes nodes — not to the K8s construct itself. AWS Compute Savings Plans are particularly well-suited to K8s node pools because they cover any instance type in any region, accommodating Karpenter's dynamic instance selection. For Azure AKS, Azure Savings Plans for Compute or Reserved VM Instances cover node VMs. For GCP GKE Standard, node CUDs apply to the Compute Engine VMs. See the RI vs Savings Plans guide for detailed commitment instrument selection guidance.
What is the cost impact of running multiple small clusters vs one large cluster?
Multiple small clusters increase overhead costs: on AWS EKS, each cluster costs $73/month in control plane fees; even five clusters add $365/month in base overhead. More significantly, small clusters have lower pack efficiency — a cluster with 10 nodes cannot achieve the same bin-packing density as a 100-node cluster. Multi-tenant large clusters with proper namespace isolation (quotas, RBAC, network policies) typically achieve 15–30% better node utilisation than equivalent capacity spread across multiple clusters. The exceptions are regulatory/compliance requirements that mandate cluster isolation, and clusters in different regions for latency or data sovereignty reasons.
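The two effects in that answer, linear control plane overhead and bin-packing loss, can be checked with a short sketch. The utilisation figures are illustrative, not measurements:

```python
import math

EKS_CONTROL_PLANE_MONTHLY = 73.0  # $0.10/hr per cluster, from the comparison table

def overhead(clusters):
    """Control plane fees grow linearly with cluster count."""
    return clusters * EKS_CONTROL_PLANE_MONTHLY

def nodes_needed(node_equivalents_of_work, achievable_utilisation):
    """Lower bin-packing density in small clusters means more nodes
    for the same workload. Utilisation figures here are assumed."""
    return math.ceil(node_equivalents_of_work / achievable_utilisation)

print(overhead(5))                      # five EKS clusters: $365/mo in fees alone
print(nodes_needed(60, 0.55))           # many small clusters at 55% utilisation
print(nodes_needed(60, 0.70))           # one large cluster at 70% utilisation
```

At these assumed utilisation levels, consolidation saves roughly two dozen nodes for the same workload, which typically dwarfs the control plane fee difference.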
