Kubernetes clusters routinely waste 40–60% of their provisioned compute through over-provisioned resource requests, idle node capacity, and poorly tuned autoscaling. This guide covers the complete stack of K8s cost optimisation — from pod-level resource right-sizing to node pool strategy, Spot integration, and managed service cost management on EKS, AKS, and GKE.
This guide is part of the Cloud Cost Optimization: Enterprise FinOps Guide. Kubernetes has become the dominant platform for enterprise container workloads, but its flexible scheduling model creates a systematic cost problem: resource requests are set conservatively, nodes are over-provisioned to ensure pod scheduling succeeds, and idle capacity accumulates invisibly across clusters. Understanding the full K8s cost stack — from individual container resource requests to cluster-level autoscaling and managed service pricing — is a prerequisite to meaningful optimisation. For commitment instrument strategy covering EKS, AKS, and GKE nodes, see the Reserved Instances vs Savings Plans guide.
Kubernetes costs operate at four distinct levels, each requiring separate optimisation strategies. At the container level, CPU and memory requests determine how much of a node's capacity is "reserved" for a pod — even if the pod never actually uses that capacity. At the pod level, replica counts, resource limits, and scheduling constraints determine pack density on nodes. At the node level, instance type selection, autoscaling boundaries, and Spot vs on-demand mix determine the cost of raw compute. At the cluster level, managed control plane fees, networking costs, storage, and observability tooling add overhead that compounds at scale.
The root cause of most Kubernetes cost waste is the gap between resource requests (what pods ask for) and actual resource usage (what pods consume). In typical enterprise clusters, CPU requests exceed actual CPU usage by 3–5x and memory requests exceed actual usage by 2–3x. This gap creates clusters that are "full" according to the scheduler — no new pods can be scheduled — while the underlying nodes sit at 20–30% actual CPU utilisation. Right-sizing resource requests is the single highest-leverage Kubernetes cost optimisation action.
Pod right-sizing is the process of setting CPU and memory requests to values that accurately reflect actual resource consumption, with appropriate headroom for traffic spikes and garbage collection events. The standard approach is to collect 14–30 days of Prometheus metrics (or equivalent), identify the P95 CPU and memory utilisation for each container, and set requests to P95 + a safety margin (typically 10–20% for CPU, 20–30% for memory).
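As a sketch of the end result, suppose metrics show a container's P95 at roughly 400m CPU and 700Mi memory over the collection window; the resulting requests block might look like the following (workload name, image, and values are illustrative):

```yaml
# Illustrative only: requests derived from observed P95 usage plus a safety margin.
# Assumes P95 ≈ 400m CPU / 700Mi memory over a 30-day metrics window.
apiVersion: v1
kind: Pod
metadata:
  name: api-server               # hypothetical workload name
spec:
  containers:
    - name: api
      image: example.com/api:1.0 # placeholder image
      resources:
        requests:
          cpu: 460m              # P95 (400m) + ~15% CPU margin
          memory: 875Mi          # P95 (700Mi) + ~25% memory margin
        limits:
          memory: 1Gi            # memory limit guards against OOM-killing the node
```

Leaving the CPU limit unset while capping memory is a common choice: CPU is compressible (throttling, not eviction), whereas exceeding a memory limit terminates the container.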
The Kubernetes Vertical Pod Autoscaler (VPA) automates resource request right-sizing by analysing historical usage and recommending (or automatically applying) updated CPU and memory request values. VPA supports four update modes: Off (recommendations only, no automatic updates), Initial (requests set only at pod creation), Recreate (pods evicted and recreated with updated requests), and Auto (currently equivalent to Recreate). For production workloads, VPA in Off mode is safe to run immediately; Recreate and Auto modes require careful testing, as the pod evictions they trigger can impact availability.
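A minimal VPA object in recommendation-only mode might look like this (the Deployment name is a placeholder):

```yaml
# Recommendation-only VPA: surfaces right-sizing suggestions without evicting pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server        # hypothetical Deployment to analyse
  updatePolicy:
    updateMode: "Off"       # recommend only; never restart pods
```

Once the recommender has accumulated history, `kubectl describe vpa api-server-vpa` shows per-container target, lower-bound, and upper-bound recommendations that can be applied manually or via CI.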
VPA recommendations typically identify 30–50% CPU over-provisioning in enterprise clusters. The most common over-provisioned containers are Java applications (which are often given CPU requests of 2–4 cores but rarely use more than 0.5–1 core) and microservices whose conservative initial estimates were never revisited after deployment.
VPA and Horizontal Pod Autoscaler (HPA) cannot safely operate on the same resource simultaneously. If HPA is scaling replicas based on CPU utilisation, and VPA is simultaneously changing CPU requests, the scaling signals interfere — VPA reduces requests, HPA sees lower utilisation per pod, and the system oscillates. The safe configuration: use HPA for CPU-based horizontal scaling and VPA for memory-based right-sizing only, or use a controller like KEDA that avoids the conflict entirely.
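The "VPA for memory only" configuration can be expressed with a `resourcePolicy` that restricts which resources VPA controls, leaving CPU entirely to HPA. A sketch, assuming the same hypothetical `api-server` Deployment:

```yaml
# VPA restricted to memory so it cannot interfere with CPU-based HPA scaling.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server                    # hypothetical; HPA scales this on CPU
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"] # CPU requests are left untouched
```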
Kubernetes autoscaling operates at two levels: Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on CPU, memory, or custom metrics; Cluster Autoscaler (CA) scales the number of nodes based on pending pod scheduling requirements. Optimising both layers is essential to minimise costs while maintaining availability.
The most common HPA misconfiguration is setting target CPU utilisation too low — typically 50% — out of fear of latency spikes. This keeps pods running at half their capacity, requiring twice as many replicas (and nodes) as necessary. For most stateless web services and APIs, a target CPU utilisation of 70–80% is appropriate; response time at this level is typically indistinguishable from 50%. Set appropriate minReplicas to ensure availability during scale-up lag, and use KEDA (Kubernetes Event-Driven Autoscaling) for workloads driven by queue depth or custom metrics rather than CPU.
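An HPA manifest reflecting this guidance might look like the following (Deployment name and replica bounds are assumptions for illustration):

```yaml
# HPA targeting ~75% average CPU utilisation rather than an over-cautious 50%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server           # hypothetical Deployment
  minReplicas: 3               # floor covers scale-up lag and zone spread
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # in the 70–80% band discussed above
```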
Cluster Autoscaler (CA) adds nodes when pods are pending due to insufficient capacity and removes nodes that have been underutilised for a configurable period (default: 10 minutes). Key tuning parameters for cost optimisation: consider raising scale-down-utilization-threshold above its 0.5 default so that more lightly loaded nodes qualify for removal; shorten scale-down-unneeded-time below its 10-minute default in environments where workloads are predictably bursty; and set expander=least-waste so CA prefers the node group that leaves the least unused capacity over adding larger nodes. Karpenter (originally AWS-native, now with an Azure provider for AKS) offers faster, more cost-aware node provisioning than standard CA, with native Spot instance integration.
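To illustrate the cost-aware provisioning style, a Karpenter NodePool sketch (v1 API) might constrain capacity type and enable consolidation; the pool name, node class, and limits here are assumptions:

```yaml
# Sketch of a Karpenter NodePool that prefers Spot and consolidates idle capacity.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-spot            # hypothetical pool name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]  # prefer Spot, fall back to on-demand
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default           # assumed pre-existing EC2NodeClass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m        # repack pods onto fewer/cheaper nodes quickly
  limits:
    cpu: "200"                  # hard cap on total vCPUs this pool may provision
```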
Node pool design is one of the highest-impact K8s cost levers and one of the most frequently ignored. Most clusters start with a single node pool using a single instance type, which is convenient but suboptimal: different workload profiles (CPU-intensive, memory-intensive, batch, stateful) have different optimal instance types, and mixing them in a single pool means all workloads pay the price of the most conservative choice.
| Workload Type | Recommended Instance Family | Key Characteristic | Pool Strategy |
|---|---|---|---|
| General-purpose web/API | AWS m7i, Azure D-series, GCP N2 | Balanced CPU/memory ratio | On-demand + Spot mix |
| CPU-intensive (ML inference, encoding) | AWS c7i, Azure F-series, GCP C3 | High CPU:memory ratio | On-demand for SLAs |
| Memory-intensive (caching, in-memory DB) | AWS r7i, Azure E-series, GCP M3 | High memory:CPU ratio | On-demand (memory risk) |
| Batch / ML training | AWS p4/p5, Azure NC-series, GCP A3 | GPU acceleration | Spot preferred (70%+ savings) |
| System/monitoring pods | AWS t3/t4g, Azure B-series, GCP E2 | Burstable, low baseline | Small on-demand pool |
Spot Instances (AWS), Azure Spot VMs, and GCP Spot VMs offer 60–91% discounts versus on-demand pricing for the same instance types. For Kubernetes workloads, Spot node pools are one of the most impactful cost levers available — but require architecture accommodations to handle preemption gracefully.
The following workload types are well-suited to Spot/preemptible nodes: batch processing jobs, ML model training, CI/CD pipeline runners, development and staging environments, stateless microservices with multiple replicas (where losing one node doesn't impact availability), and data processing pipelines with checkpointing. Workloads that are not suitable for Spot: stateful single-instance databases, leader-elected controllers, services with strict single-digit-millisecond latency SLAs, and anything that cannot gracefully handle the short reclamation notice (two minutes on AWS; 30 seconds on Azure and GCP).
The optimal architecture for Spot-enabled clusters uses a mixed node pool strategy: a small on-demand baseline pool (sized for minimum viable capacity) plus a Spot pool for burst and batch capacity. Node affinity and taints/tolerations route appropriate workloads to each pool. Kubernetes Pod Disruption Budgets (PDBs) ensure rolling Spot replacements don't take down more than a defined percentage of a deployment's replicas simultaneously.
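One way to wire this up, assuming the Spot pool's nodes carry a label and matching taint such as `pool=spot` (both names illustrative):

```yaml
# Deployment fragment: route Spot-tolerant replicas onto the tainted Spot pool.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                  # hypothetical stateless service
spec:
  replicas: 6
  selector:
    matchLabels: {app: worker}
  template:
    metadata:
      labels: {app: worker}
    spec:
      nodeSelector:
        pool: spot              # assumed node label on the Spot pool
      tolerations:
        - key: pool
          operator: Equal
          value: spot
          effect: NoSchedule    # matches the assumed Spot-pool taint
      containers:
        - name: worker
          image: example.com/worker:1.0   # placeholder image
---
# PDB: Spot reclaims and node drains may never drop below 4 ready replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels: {app: worker}
```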
ResourceQuotas and LimitRanges are Kubernetes admission control mechanisms that constrain resource consumption at the namespace level. Without them, a single team can inadvertently (or deliberately) consume unlimited cluster resources — degrading performance for all other tenants and causing uncontrolled cost growth.
ResourceQuotas set hard caps on total CPU requests, memory requests, CPU limits, memory limits, and object counts (pods, services, PVCs) per namespace. LimitRanges set default and maximum resource requests and limits for individual containers in a namespace, ensuring that pods without explicit resource declarations are assigned defaults rather than running unconstrained. Together, these mechanisms enforce the resource governance that makes cluster cost attribution and budgeting possible. Without LimitRanges, developers frequently deploy pods with no resource requests — which the scheduler treats as having zero resource requirements, leading to nodes that appear to have spare capacity but are actually overloaded.
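A sketch of both mechanisms for a hypothetical tenant namespace (all names and values illustrative):

```yaml
# Hard caps on aggregate resource consumption for the team-a namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a             # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "200"
---
# Per-container defaults and ceilings within the same namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:           # applied when a container declares no requests
        cpu: 100m
        memory: 128Mi
      default:                  # applied when a container declares no limits
        cpu: 500m
        memory: 512Mi
      max:                      # admission rejects anything above this
        cpu: "4"
        memory: 8Gi
```

Note that once a ResourceQuota covers a resource, every pod in the namespace must declare it or be rejected at admission; the LimitRange defaults satisfy that requirement automatically.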
| Dimension | AWS EKS | Azure AKS | GCP GKE |
|---|---|---|---|
| Control plane cost | $0.10/hr per cluster (~$73/mo) | Free (Free tier), $0.10/hr (Standard tier) | $0.10/hr per cluster; free tier credit covers 1 zonal or Autopilot cluster |
| Spot node support | Native Spot, Karpenter | Spot node pools, Karpenter | Spot pools, GKE Autopilot Spot |
| Managed autoscaling | Cluster Autoscaler or Karpenter | Cluster Autoscaler or Karpenter | GKE Autopilot (fully managed) |
| RI/commitment coverage | EC2 Savings Plans cover nodes | Azure RIs + Savings Plans | CUDs cover node compute |
| Fargate/serverless nodes | EKS Fargate (higher per-vCPU cost) | AKS Virtual Nodes (ACI) | GKE Autopilot (automated) |
| Cost visibility tooling | Container Insights + Cost Explorer | Container Insights + Cost Management | GKE Cost Allocation (native namespace-level breakdown) |
Several open-source and commercial tools provide Kubernetes-specific cost visibility that cloud-native billing tools lack. OpenCost (CNCF project) provides real-time, namespace-level cost allocation based on node pricing and pod resource consumption — integrating with Prometheus and supporting all three major managed K8s services. Kubecost (commercial; built on OpenCost) adds cost efficiency scores, request right-sizing recommendations, and cluster health metrics. Goldilocks (open-source, by Fairwinds) automates VPA recommendation generation and presents them in a dashboard, making right-sizing analysis accessible without Prometheus expertise.
At the cloud provider level, GKE has the most mature native cost attribution — GKE Cost Allocation in the GCP Billing Console provides namespace and label-based cost breakdown without additional tooling. AWS Container Insights and Azure Monitor provide cluster metrics but require additional work to produce meaningful cost-per-namespace or cost-per-workload views.
Connect with an independent cloud cost advisor who can audit your K8s cluster, identify right-sizing opportunities, and design your optimal node pool strategy.