Kubernetes v1.33: Job's Backoff Limit Per Index Goes GA

In Kubernetes v1.33, the Backoff Limit Per Index feature reaches general
availability (GA). This blog post describes the feature and its benefits.

About backoff limit per index

When you run workloads on Kubernetes, you must consider scenarios where Pod
failures can affect the completion of your workloads. Ideally, your workload
should tolerate transient failures and continue running.

To achieve failure tolerance in a Kubernetes Job, you can set the
spec.backoffLimit field. This field specifies the total number of tolerated
failures.
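
As a minimal sketch, a Job that tolerates up to three Pod failures in total might look like the following. The Job name, image, and command are placeholders chosen for illustration and do not come from the feature itself.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-backoff-limit    # hypothetical name
spec:
  backoffLimit: 3                # total number of tolerated Pod failures for the whole Job
  template:
    spec:
      restartPolicy: Never       # a failed Pod is replaced by a new Pod
      containers:
      - name: main
        image: busybox:1.36      # placeholder image
        command: ["sh", "-c", "echo working && exit 0"]
```

Note that the budget is shared: every failed Pod counts against the same limit, regardless of which part of the workload it belonged to.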

However, for workloads where every index is considered independent, such as
embarrassingly parallel workloads, the spec.backoffLimit field is often not
flexible enough.
For example, you may choose to run multiple suites of integration tests by
representing each suite as an index within an Indexed Job.
In that setup, a fast-failing index (test suite) is likely to consume your
entire budget for tolerating Pod failures, and you might not be able to run the
other indexes.

To address this limitation, Kubernetes introduced backoff limit per index,
which allows you to control the number of retries per index.

How backoff limit per index works

To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated
Pod failures per index with the spec.backoffLimitPerIndex field. When you set
this field, the Job executes all indexes by default.
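
A minimal manifest using this field could be sketched as follows. The Job name, image, and shell script are hypothetical; the script simply makes odd-numbered indexes fail so that the per-index behavior is visible.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-backoff-limit-per-index   # hypothetical name
spec:
  completions: 4
  parallelism: 2
  completionMode: Indexed        # backoffLimitPerIndex applies to Indexed Jobs
  backoffLimitPerIndex: 1        # each index tolerates one Pod failure
  template:
    spec:
      restartPolicy: Never       # required when backoffLimitPerIndex is set
      containers:
      - name: main
        image: busybox:1.36      # placeholder image
        command:
        - sh
        - -c
        - |
          # JOB_COMPLETION_INDEX is set by Kubernetes for Pods of an Indexed Job
          echo "running index ${JOB_COMPLETION_INDEX}"
          # placeholder workload: odd indexes always fail with exit code 1
          exit $((JOB_COMPLETION_INDEX % 2))
```

With this sketch, indexes 0 and 2 are expected to succeed on the first attempt, while indexes 1 and 3 are each retried once before being marked as failed. The failures of one index do not reduce the retries available to the others.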

Additionally, to fine-tune the error handling:

Specify the cap on the total number of failed indexes by setting the
spec.maxFailedIndexes field. When the limit is exceeded, the entire Job is
terminated.
Define a short-circuit to detect a failed index by using the FailIndex action in the
Pod Failure Policy mechanism.

When the number of tolerated failures is exceeded, the Job marks that index as
failed and lists it in the Job’s status.failedIndexes field.
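
For illustration, the status of a finished Job might look roughly like the fragment below. It assumes a hypothetical Indexed Job with completions: 4 and backoffLimitPerIndex: 1 in which indexes 1 and 3 always fail; the values are illustrative, and the exact set of fields shown can differ between Kubernetes versions.

```yaml
status:
  completedIndexes: "0,2"    # indexes that succeeded
  failedIndexes: "1,3"       # indexes that exceeded their per-index limit
  succeeded: 2               # number of succeeded Pods
  failed: 4                  # number of failed Pods (two attempts for each failed index)
  conditions:
  - type: Failed
    status: "True"
    reason: FailedIndexes
    message: Job has failed indexes
```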

Example

The following Job spec snippet is an example of how to combine backoff limit per
index with the Pod Failure Policy feature:

completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailIndex
    onExitCodes:
      operator: In
      values: [ 42 ]

In this example, the Job handles Pod failures as follows:

Ignores any failed Pods that have the built-in disruption condition,
called DisruptionTarget. These Pods don’t count towards Job backoff limits.
Fails the index corresponding to the failed Pod if any of the failed Pod’s
containers finished with the exit code 42, based on the matching FailIndex
rule.
Retries the first failure of any index, unless the index failed due to the
matching FailIndex rule.
Fails the entire Job if the number of failed indexes exceeds 5 (set by the
spec.maxFailedIndexes field).
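
To try this out, the spec fields from this example can be placed in a complete Job manifest. The following is a sketch under stated assumptions: the Job name, image, and shell script are hypothetical, and the script makes index 0 exit with code 42 so that the FailIndex rule matches.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-failindex        # hypothetical name
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1
  maxFailedIndexes: 5
  podFailurePolicy:
    rules:
    - action: Ignore             # disrupted Pods do not count towards backoff limits
      onPodConditions:
      - type: DisruptionTarget
    - action: FailIndex          # exit code 42 fails the index without a retry
      onExitCodes:
        operator: In
        values: [ 42 ]
  template:
    spec:
      restartPolicy: Never       # required for podFailurePolicy and backoffLimitPerIndex
      containers:
      - name: main
        image: busybox:1.36      # placeholder image
        command:
        - sh
        - -c
        - |
          # hypothetical workload: index 0 reports a non-retriable error
          if [ "${JOB_COMPLETION_INDEX}" = "0" ]; then exit 42; fi
          echo "index ${JOB_COMPLETION_INDEX} done"
```

In this sketch, index 0 is expected to be marked as failed after its first Pod exits, without using its one retry, while the remaining indexes run to completion.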

Learn more

Read the blog post on the closely related Pod Failure Policy feature: “Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA”
For a hands-on guide to using Pod failure policy, including the use of FailIndex, see
“Handling retriable and non-retriable pod failures with Pod failure policy”
Read the documentation for Backoff limit per index and Pod failure policy
Read the KEP for Backoff Limits Per Index For Indexed Jobs

Get involved

This work was sponsored by the Kubernetes batch working group in close
collaboration with the SIG Apps community.

If you are interested in working on new features in this space, we recommend
subscribing to our Slack channel and attending the regular community meetings.
