One of my frequent headaches with Karpenter has been node disruption events — or, more specifically, the lack of them: nodes waiting forever to be terminated and burning money while they sit underutilized.
There’s a surprisingly steep learning curve with Kubernetes disruptions, especially when you deploy Helm charts made by someone else, which often include a PodDisruptionBudget (from now on: PDB) resource by default (which is, in general, a good thing).
However, in the case of Karpenter, there are many footguns here: a node cannot be terminated while there are still pods running on it that cannot be evicted, because terminating them would violate their so-called disruption budget:
a contract with the Kubernetes control plane about the availability of an application, a guarantee that, no matter what, a certain number of healthy, traffic-serving replicas will always stay online when planned or (on a best-effort basis) unplanned events happen in the cluster.
A PDB can only be defined on pods that have a controller ensuring their state: a Deployment (which also means a ReplicaSet or ReplicationController works as well) or a StatefulSet. Individually created, standalone Pods cannot be referenced in a PDB manifest.
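For illustration, here is a minimal sketch (the my-app name, labels and image are made up for this example) of how a PDB selects the pods managed by a Deployment via their labels:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app          # the PDB below matches pods on this label
    spec:
      containers:
        - name: my-app
          image: nginx:1.27
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app            # must match the Deployment's pod labels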
As usual, the upstream Kubernetes documentation includes great detail about the different scenarios, use cases and the background behind disruption events. What I have been missing, however, is a hands-on reference that I can always just open up and (blindly) copy-paste into my Helm values files or my applications without needing to learn and understand all the different Kubernetes concepts. My goal with this article is to have all of this noted down “for future reference”, hence the not-so-humble title for this article.
The basics
Let’s cut to the chase! I’m going to cover the basic concepts here that work with any reasonably new Kubernetes version (the PDB API has been stable since 1.21, but was available much earlier). As the K8s ecosystem evolves, more and more “addons”, new features and optional parameters keep being added; I’m going to mention one of them separately later, and you can cherry-pick the best practices that fit your use case and are compatible with your control plane version.
A PDB expects 2 main things to be configured:

- a selector referencing the app (the pods) that you want to protect
- the disruption contract, set by one of 2 mutually exclusive fields:
  - minAvailable
  - OR maxUnavailable

You can’t set both, so you need to understand which one is the best for your use case.
As its name suggests, minAvailable sets an absolute minimum number of replicas of your application that the control plane will guarantee to keep running, no matter what happens (planned or unplanned events): this can be useful if your app can stay stable with a static replica count even if you have horizontal autoscaling.

For basically every other use case where you need to consider autoscaling, you should use maxUnavailable: this can also be a static number, as in “I don’t want more than 1 unhealthy replica at any given moment”, so whether your app is at 5 replicas or 37, a mass eviction / disruption will always happen one by one.

Both parameters also accept percentages, so you can say “I want at least 51% of my replicas up all the time” or “proportional to my horizontal autoscaling, you can terminate up to 10% at the same time”.
Some valid examples:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: "25%"
  selector:
    matchLabels:
      app: zookeeper

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: 1
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app: zookeeper
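And, as a sketch of the percentage form mentioned above (the 10% is just an example value):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: "10%"
  selector:
    matchLabels:
      app: zookeeper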
Do rolling restarts count as a disruption?
The K8s documentation is very clear about this but it’s often missed by people:
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades.
So it means that if you do a Deployment/StatefulSet rolling restart, it will ignore the PDB and proceed as quickly as possible (just always waiting for the new replicas to become healthy);
however, if you have a rolling restart in progress AND at the same time you drain a node, the eviction will respect the total number of healthy replicas according to your PDB settings.
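To make the distinction concrete, here is a minimal sketch (the app name and the numbers are placeholders): the rollout itself is paced by the Deployment’s own strategy fields, while node drains and other evictions are what the PDB gates.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # paces the rolling update; the PDB is not consulted here
      maxSurge: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.27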
Use cases
The tl;dr reference
The Bing Image Creator’s attempt to illustrate this article. Oil on canvas, gibberish text.
Hope you will find this article useful and actually read all the explanations below — whether an LLM parses this article in the future (it will probably hate me for it, because Medium can’t handle tables in Markdown, so the reference table is a picture) or you come back for a quick reference, here’s the gist of it:
Additionally, a best practice is to set unhealthyPodEvictionPolicy: AlwaysAllow on K8s 1.27 and above. See the explanation below.
Application with static 1 replica
Be it a Deployment or a StatefulSet (I won’t treat them separately from now on, the same rules apply): while technically you can deploy a PodDisruptionBudget for it, it doesn’t make much sense, because you can’t have both 100% availability and a maximum of 1 replica; you have to pick one condition and hurt the other. You will either have downtime, or you would have to run a new replica in parallel while the old one is being terminated (which can cause conflicts, depending on your app, if both of them receive traffic).
tl;dr never set a disruption budget if you will only have 1 replica running
Application with static 2 replicas
Here you’re good to go with either minAvailable: 1 or maxUnavailable: 1; they will both have the same effect: you will always have 1 of the 2 pods serving traffic.

I tend to use maxUnavailable here, because if for whatever reason I end up setting the replica count to a static 1 half a year from now, I don’t have to remember to adapt the PDB, and it won’t cause any unexpected surprises.
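A minimal sketch of what I would deploy in this case (the name and labels are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1       # with 2 replicas, this always keeps 1 pod serving traffic
  selector:
    matchLabels:
      app: my-app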
Edge case: Horizontal autoscaling between 1 and 2 replicas
This is a mix of the previous 2 scenarios, and there isn’t a really good solution here with these primitives. If your app is currently scaled down to 1 replica and you set minAvailable: 1, your app will never be evicted and will burn money forever on its currently allocated node. If you set maxUnavailable: 1, you can end up with 0 healthy replicas and lose traffic.
Application with minimum 2 replicas and horizontal autoscaling
This is when things start to get interesting. 🙂 You can choose what to optimize for:

- the quickest possible remediation (= eviction of all replicas, for example during a rolling restart), even if it temporarily hurts performance
- OR stability/availability

In the first case you might want to set minAvailable: 1 (or whatever reasonably small number your app needs to keep serving traffic, even if it gets slow for a few minutes). This allows Karpenter (Cluster Autoscaler, etc.) to kick out and shuffle your pods in the shortest amount of time possible, in exchange for hurting performance (or even risking an OOM in the only remaining healthy replica, etc.).

When you optimize for availability, you only allow a slow, controlled pod eviction, such as with maxUnavailable: 1 — your traffic probably won’t really notice it, but it can mean a very slow node replacement flow.
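As a sketch, the two options side by side (placeholder names, and the numbers are just examples):

# Optimize for fast remediation: only guarantee a small floor of replicas
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app

# Optimize for availability: evict at most one pod at a time
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app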
Unhealthy Pod Eviction Policy
This is a cool feature that was introduced in K8s 1.27, became stable in K8s 1.31, and I never set a PDB without it anymore.
You can configure how the PDB should behave when you don’t have the desired replica count in a healthy state: for example, you have an app with 2 replicas and you want to evict a pod that’s stuck in CrashLoopBackOff forever (due to an OOM event or a node issue). By default, the PDB will never allow that eviction, because with only 1 healthy pod left your minAvailable: 1 budget has zero disruptions to spare, so evicting any covered pod, even the crashing one, is blocked.
This is when unhealthyPodEvictionPolicy: AlwaysAllow comes to the rescue: it tells the PDB to ignore crashed pods that keep burning money on your infrastructure without actually serving traffic and being useful.