Kubefeeds Team A dedicated and highly skilled team at Kubefeeds, driven by a passion for Kubernetes and Cloud-Native technologies, delivering innovative solutions with expertise and enthusiasm.

Kubernetes v1.33: Job’s Backoff Limit Per Index Goes GA

In Kubernetes v1.33, the Backoff Limit Per Index feature reaches general
availability (GA). This blog describes the Backoff Limit Per Index feature and
its benefits.

About backoff limit per index

When you run workloads on Kubernetes, you must consider scenarios where Pod
failures can affect the completion of your workloads. Ideally, your workload
should tolerate transient failures and continue running.

To achieve failure tolerance in a Kubernetes Job, you can set the
spec.backoffLimit field. This field specifies the total number of tolerated
failures.
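
As a minimal sketch of where this field sits, the manifest below sets spec.backoffLimit on an ordinary Job. The Job name, image, and command are placeholders chosen for illustration and are not taken from the feature announcement.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job            # hypothetical name
spec:
  backoffLimit: 3              # number of retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never     # a failed Pod is replaced by a new Pod
      containers:
      - name: main
        image: busybox         # placeholder image
        command: ["sh", "-c", "echo working; exit 0"]
```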

However, for workloads where every index is considered independent, such as
embarrassingly parallel workloads, the spec.backoffLimit field is often not
flexible enough.
For example, you may choose to run multiple suites of integration tests by
representing each suite as an index within an Indexed Job.
In that setup, a fast-failing index (test suite) is likely to consume your
entire budget for tolerating Pod failures, and you might not be able to run the
other indexes.
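
That scenario could look roughly like the following spec fragment. The values are illustrative, and run-suite.sh is a hypothetical test runner; JOB_COMPLETION_INDEX is the environment variable that an Indexed Job exposes to each Pod.

```yaml
# Fragment of a Job spec; all values are illustrative.
completions: 10          # ten test suites, one per index
parallelism: 10
completionMode: Indexed
backoffLimit: 6          # one failure budget shared by all indexes
template:
  spec:
    restartPolicy: Never
    containers:
    - name: suite
      image: example.com/integration-tests:latest   # placeholder image
      # run-suite.sh is a hypothetical script that picks a suite by index
      command: ["sh", "-c", "./run-suite.sh \"$JOB_COMPLETION_INDEX\""]
```

If the suite at one index crashes immediately on every attempt, its retries alone can exhaust the shared budget, and the Job fails even though the other suites may be healthy.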

In order to address this limitation, Kubernetes introduced backoff limit per index,
which allows you to control the number of retries per index.

How backoff limit per index works

To use Backoff Limit Per Index for Indexed Jobs, specify the number of tolerated
Pod failures per index with the spec.backoffLimitPerIndex field. When you set
this field, the Job executes all indexes by default.
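
A minimal sketch of a spec fragment that uses this field is shown below; the numbers, image, and command are illustrative. The API accepts backoffLimitPerIndex only on Indexed Jobs whose Pod template uses restartPolicy: Never.

```yaml
# Fragment of a Job spec; values are illustrative.
completions: 10
parallelism: 10
completionMode: Indexed      # required for backoffLimitPerIndex
backoffLimitPerIndex: 2      # each index has its own budget of 2 retries
template:
  spec:
    restartPolicy: Never     # required when backoffLimitPerIndex is set
    containers:
    - name: worker
      image: busybox         # placeholder image
      command: ["sh", "-c", "echo processing index $JOB_COMPLETION_INDEX"]
```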

Additionally, to fine-tune the error handling:

  • Specify the cap on the total number of failed indexes by setting the
    spec.maxFailedIndexes field. When the limit is exceeded, the entire Job is
    terminated.
  • Define a short-circuit to detect a failed index by using the FailIndex
    action in the Pod Failure Policy mechanism.

When the number of tolerated failures is exceeded, the Job marks that index as
failed and lists it in the Job’s status.failedIndexes field.
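
For illustration, the status of a finished ten-index Job in which indexes 1 and 3 ran out of retries might contain fields like the ones below. The values are made up; the index lists use a compact text format of comma-separated numbers and ranges.

```yaml
# Hypothetical excerpt of the output of `kubectl get job <name> -o yaml`
status:
  completedIndexes: "0,2,4-9"   # indexes that succeeded
  failedIndexes: "1,3"          # indexes that exceeded their per-index limit
  succeeded: 8
```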

Example

The following Job spec snippet is an example of how to combine backoff limit per
index with the Pod Failure Policy feature:

completions: 10
parallelism: 10
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
podFailurePolicy:
  rules:
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
  - action: FailIndex
    onExitCodes:
      operator: In
      values: [ 42 ]

In this example, the Job handles Pod failures as follows:

  • Ignores any failed Pods that have the built-in disruption condition,
    called DisruptionTarget. These Pods don’t count towards Job backoff limits.
  • Fails the index corresponding to the failed Pod if any of the failed Pod’s
    containers finished with the exit code 42, based on the matching FailIndex
    rule.
  • Retries the first failure of any index, unless the index failed due to the
    matching FailIndex rule.
  • Fails the entire Job if the number of failed indexes exceeds 5 (set by the
    spec.maxFailedIndexes field).
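
To show where the snippet fits, here is a sketch of a complete manifest built around it. Everything outside the fields from the example (the Job name, the image, and the run-suite.sh script) is hypothetical; the script is assumed to exit with code 42 when a failure is not worth retrying. Pod failure policies require restartPolicy: Never on the Pod template.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: integration-tests          # hypothetical name
spec:
  completions: 10
  parallelism: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1
  maxFailedIndexes: 5
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    - action: FailIndex
      onExitCodes:
        operator: In
        values: [ 42 ]
  template:
    spec:
      restartPolicy: Never         # required by podFailurePolicy
      containers:
      - name: suite
        image: example.com/integration-tests:latest   # placeholder image
        # hypothetical runner; assumed to exit 42 on non-retriable failures
        command: ["sh", "-c", "./run-suite.sh \"$JOB_COMPLETION_INDEX\""]
```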

Get involved

This work was sponsored by the Kubernetes batch working group in close
collaboration with the SIG Apps community.

If you are interested in working on new features in this space, we recommend
subscribing to our Slack channel and attending the regular community meetings.
