Spot and Preemptible Nodes: When They Save You 70%, and When They Bite
Spot capacity is the biggest single discount in cloud computing, often 60-90% off. It's also the easiest way to cause an outage if you put the wrong workload on it. Here's the dividing line.
By The FeckBills team
Spot nodes (on GCP, the successor to preemptible VMs) are the single most dramatic discount available in cloud computing, routinely 60-91% off on-demand prices. They're also the fastest way to turn a quiet Tuesday into an incident if you put the wrong thing on them. The discount is real; so is the catch.
The deal you're actually signing up for
A spot VM is spare capacity that the cloud can reclaim at any time, with little warning. GCP gives a grace period of roughly 30 seconds before the node vanishes, and legacy preemptible VMs also have a hard 24-hour cap. In exchange, you pay a fraction of the price.
The mental model that matters: spot is for workloads that can lose a node and not care. If a pod dying and rescheduling elsewhere is a non-event, spot is free money. If it's a customer-facing outage, it's a trap.
Great fits for spot
- Stateless web/API workers behind a load balancer with enough replicas that losing one is invisible.
- Batch and data-processing jobs that checkpoint and can resume.
- CI/CD runners and build farms.
- Dev and staging environments, almost always.
- Anything horizontally scalable with a sane PodDisruptionBudget.
Bad fits for spot
- Stateful singletons: a primary database, a leader-elected controller with slow failover.
- Long jobs with no checkpointing that lose hours of work on eviction.
- Latency-critical paths where a reschedule storm causes user-visible errors.
- Anything where "it'll come back in a minute" is unacceptable.
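The two lists above can be condensed into a rough triage function. This is only a sketch: the `Workload` fields, the `spot_fit` name, and the three-replica threshold are illustrative assumptions, not rules from any cloud provider or from FeckBills.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload; all fields are illustrative."""
    name: str
    replicas: int
    stateful_singleton: bool = False  # e.g. a primary database
    long_running_job: bool = False    # batch job that runs for hours
    checkpoints: bool = False         # can a killed job resume from saved progress?
    latency_critical: bool = False    # a reschedule causes user-visible errors


# Assumed threshold: below this, losing one replica is probably noticeable.
MIN_REPLICAS_FOR_SPOT = 3


def spot_fit(w: Workload) -> str:
    """Return 'good', 'bad', or 'review' using the rules of thumb above."""
    if w.stateful_singleton or w.latency_critical:
        return "bad"
    if w.long_running_job:
        # Long jobs are only safe if an eviction does not throw away the work.
        return "good" if w.checkpoints else "bad"
    if w.replicas >= MIN_REPLICAS_FOR_SPOT:
        return "good"
    # Small, stateless, not obviously critical: needs a human decision.
    return "review"
```

Anything that comes back as "review" is usually a question about replica count and failover time, which is worth answering before the workload moves.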
Doing it safely
- Mix node pools. Run a small on-demand pool for the must-stay-up pods and a larger spot pool for everything else. Use nodeSelector/affinity and taints/tolerations to place workloads deliberately.
- Set PodDisruptionBudgets so evictions stay graceful and you never lose too many replicas at once.
- Spread replicas across zones and nodes with topology spread constraints, so one preemption can't take out your whole service.
- Handle the shutdown signal. Drain connections on SIGTERM; don't accept new work during the grace period.
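The last item is the one that lives in application code. A minimal Python sketch of the pattern follows, assuming a worker that pulls jobs from an in-process queue; the names `on_sigterm`, `stop_accepting`, and `drain_aware_worker` are hypothetical, and a real server would apply the same idea to its own accept loop.

```python
import queue
import signal
import threading

# Set once shutdown has been requested; checked before each new job.
stop_accepting = threading.Event()


def on_sigterm(signum, frame):
    """Signal handler: only flips a flag. The work loop does the draining."""
    stop_accepting.set()


def drain_aware_worker(jobs: queue.Queue, handle) -> int:
    """Process jobs until the queue is empty or shutdown has been requested.

    A job that has already started is allowed to finish; no new job is
    picked up once the flag is set. Returns the number of jobs completed.
    (Stopping on an empty queue keeps the sketch short; a long-lived worker
    would keep polling instead.)
    """
    done = 0
    while not stop_accepting.is_set():
        try:
            job = jobs.get(timeout=0.1)
        except queue.Empty:
            break
        handle(job)
        done += 1
    return done


if __name__ == "__main__":
    # Handlers must be registered from the main thread.
    signal.signal(signal.SIGTERM, on_sigterm)
```

Each job needs to finish well inside the grace period, so it is worth checking how much of that window your platform actually passes on to pods before relying on it.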
The honest economics
Don't model spot at the full headline discount across your whole cluster, because you can't put everything on it. Model it on the fraction of your workloads that genuinely tolerate interruption. Even at 40-50% of your compute moved to spot, the savings on a five-figure monthly bill are substantial, and you've added resilience by forcing yourself to fix PDBs and replica spread along the way.
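As a rough worked example of that modelling, the sketch below multiplies the bill by the fraction moved and by the discount. The $20,000 bill, the 45% fraction, and the 70% discount are made-up inputs, and the function assumes the whole bill is compute, which is rarely exactly true.

```python
def blended_savings(monthly_bill: float, spot_fraction: float, spot_discount: float) -> float:
    """Monthly savings when `spot_fraction` of compute moves to spot.

    Both `spot_fraction` and `spot_discount` are fractions between 0 and 1.
    Assumes the entire bill is compute priced at on-demand rates.
    """
    if not (0.0 <= spot_fraction <= 1.0 and 0.0 <= spot_discount <= 1.0):
        raise ValueError("spot_fraction and spot_discount must be between 0 and 1")
    return monthly_bill * spot_fraction * spot_discount


if __name__ == "__main__":
    saved = blended_savings(20_000, spot_fraction=0.45, spot_discount=0.70)
    print(f"${saved:,.0f} per month")  # prints "$6,300 per month"
```

That works out to about 31.5% off the whole bill rather than the 70% headline figure, which is the gap the paragraph above warns about.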
How FeckBills helps
FeckBills helps you find the workloads that are already behaving like spot candidates (horizontally scaled, stateless, interruption-tolerant), so you can move them with confidence instead of guessing. It quantifies the reclaimable spend so you can prioritise the migrations that actually move the needle.
Scan your cluster and see which workloads are ready for spot.