With the release of Kubernetes 1.31, the Pod Failure Policy for Jobs has graduated to General Availability (GA). This feature provides enhanced control over how Kubernetes handles pod failures within Jobs, allowing for more efficient and cost-effective management of workloads.
Running workloads on Kubernetes can lead to pod failures for various reasons. For workloads such as Jobs, it is essential to handle transient, retriable failures without halting the entire process. Traditionally, the backoffLimit field in a Job spec let you specify the number of pod failures to tolerate before the Job is marked as failed. However, setting a high backoffLimit can result in excessive restarts and increased operating costs, especially for large-scale Jobs.
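For reference, the traditional approach relies on backoffLimit alone. The following is a minimal sketch of such a Job; the Job name, container name, image, and failure count are illustrative, not from the original article:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: retry-only-job        # hypothetical name for illustration
spec:
  backoffLimit: 6             # tolerate up to 6 pod failures before failing the Job
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker          # illustrative container name
        image: example-image  # placeholder image
```

With only backoffLimit, every pod failure counts the same, whether it was a transient node disruption or a fatal application bug.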
The Pod Failure Policy extends the backoff limit mechanism, offering more granular control: you can immediately terminate a Job upon a non-retriable pod failure, and ignore retriable errors without inflating the backoffLimit value. This policy is particularly useful for scenarios like running workloads on cost-effective spot instances, where pod failures due to node shutdowns can be gracefully ignored.
A Pod Failure Policy is defined within the Job specification as a list of rules. Each rule specifies conditions and a corresponding action based on container exit codes or pod conditions. The actions can be:

- FailJob: terminate the entire Job immediately, without further retries.
- Ignore: do not count the failure towards the backoffLimit.
- Count: count the failure towards the backoffLimit (default behavior).

Below is an example of a Pod Failure Policy:
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example-container
        image: example-image
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob
      onPodConditions:
      - type: ConfigIssue
    - action: FailJob
      onExitCodes:
        operator: In
        values: [42]
In this example:

- Pod failures with the DisruptionTarget condition (added by Kubernetes on disruptions such as node drains or spot instance shutdowns) are ignored and do not count towards the Job's backoff limit.
- The Job fails immediately when a pod has the ConfigIssue condition, which might be added by a custom controller or webhook.
- The Job fails immediately when any container exits with code 42, regardless of the backoffLimit.

When using the Pod Failure Policy, ensure the Job's pod template sets restartPolicy: Never. This setting prevents race conditions between the kubelet and the Job controller when counting pod failures.
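Rules can also match exit codes of a specific container and use the NotIn operator for catch-all behavior. The fragment below is a sketch under assumed names: the container name main-worker and the exit codes 10 and 11 are illustrative, not from the original article:

```yaml
podFailurePolicy:
  rules:
  # Transient application errors (illustrative codes 10 and 11) are
  # retried and counted against backoffLimit as usual.
  - action: Count
    onExitCodes:
      containerName: main-worker   # hypothetical container name
      operator: In
      values: [10, 11]
  # Any other non-zero exit code from that container is treated as
  # non-retriable and fails the Job immediately.
  - action: FailJob
    onExitCodes:
      containerName: main-worker
      operator: NotIn
      values: [10, 11]
```

Rules are evaluated in order, and the first matching rule determines the action taken for a given pod failure.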
Tags: Kubernetes · Pod Failure Policy
Gineesh Madapparambath
Gineesh Madapparambath is the founder of techbeatly, the co-author of The Kubernetes Bible, Second Edition, and the author of Ansible for Real-Life Automation.
He has worked as a Systems Engineer, Automation Specialist, and content author. His primary focus is on Ansible Automation, Containerisation (OpenShift & Kubernetes), and Infrastructure as Code (Terraform).
(aka Gini Gangadharan - iamgini.com)