Introduction
With the release of Kubernetes 1.31, the Pod Failure Policy for Jobs has graduated to General Availability (GA). The feature gives you finer-grained control over how Kubernetes handles pod failures within Jobs, allowing for more efficient and cost-effective management of workloads.
Understanding Pod Failure Policy
Running workloads on Kubernetes can lead to pod failures for various reasons. For workloads such as Jobs, it is essential to handle transient, retriable failures without halting the entire process. Traditionally, the backoffLimit field in Kubernetes Jobs allowed you to specify the number of pod failures to tolerate before stopping the Job. However, setting a high backoffLimit can result in excessive restarts and increased operating costs, especially for large-scale Jobs.
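For context, here is a minimal sketch of a Job that relies only on backoffLimit; the Job name, image, and command are placeholders for illustration:
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-only-job          # placeholder name
spec:
  backoffLimit: 6                 # tolerate up to 6 pod failures before the Job fails
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main                # placeholder container name
        image: busybox            # placeholder image
        command: ["sh", "-c", "exit 1"]   # always fails, so the Job retries until backoffLimit is reached
With only backoffLimit available, every failure counts the same, whether it is retriable or not.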
The Pod Failure Policy extends the backoff limit mechanism, offering more granular control to immediately terminate a Job upon a non-retriable pod failure and to ignore retriable errors without inflating the backoffLimit value. This policy is particularly useful for scenarios like running workloads on cost-effective spot instances, where pod failures due to node shutdowns can be gracefully ignored.
How Pod Failure Policy Works
A Pod Failure Policy is defined within the Job specification as a list of rules. Each rule specifies conditions and corresponding actions based on container exit codes or pod conditions. The actions can be:
- Ignore: the failure does not count towards the backoffLimit.
- FailJob: terminates the entire Job and all of its running pods.
- FailIndex: fails only the index of the failed pod, useful with the backoff limit per index feature (see the sketch after this list).
- Count: the failure counts towards the backoffLimit (the default behavior).
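As a sketch of the FailIndex action: it only applies to Indexed Jobs with backoffLimitPerIndex set, and the name, image, and exit code below are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job-example       # placeholder name
spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed         # FailIndex only applies to Indexed Jobs
  backoffLimitPerIndex: 1         # each index may fail once before it is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main                # placeholder container name
        image: example-image      # placeholder image
  podFailurePolicy:
    rules:
    - action: FailIndex           # fail only the index whose container exited with code 2
      onExitCodes:
        operator: In
        values: [2]
Here, a pod exiting with code 2 marks just its own index as failed, while the other indexes keep running.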
Example Specification
Below is an example of a Pod Failure Policy:
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example-container
        image: example-image
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob
      onPodConditions:
      - type: ConfigIssue
    - action: FailJob
      onExitCodes:
        operator: In
        values: [42]
In this example:
- Pods with the DisruptionTarget condition are ignored and do not count towards the Job’s backoff limit.
- The Job fails if a pod has a ConfigIssue condition, which might be added by a custom controller or webhook.
- The Job also fails if any container exits with code 42 (a quick way to try this out is sketched below).
- All other pod failures count towards the backoffLimit.
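To exercise the exit-code rule, you could swap the containers section of the pod template for one that deliberately exits with code 42; the busybox image and the command here are assumptions for testing only.
      containers:
      - name: example-container
        image: busybox                     # stand-in image for testing
        command: ["sh", "-c", "exit 42"]   # exits with code 42, matching the FailJob rule above
With this in place, the first matching failure terminates the whole Job rather than consuming retries from backoffLimit.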
Important Note
When using the Pod Failure Policy, ensure the Job’s pod template sets restartPolicy: Never. This setting prevents race conditions between the kubelet and the Job controller when counting pod failures.