How to Use Pod Failure Policy in Kubernetes


With the release of Kubernetes 1.31, the Pod Failure Policy for Jobs graduated to General Availability (GA). The feature gives you finer control over how Kubernetes handles pod failures within Jobs, which can make workloads more efficient and less costly to run.

Understanding Pod Failure Policy

Pods running on Kubernetes can fail for many reasons. For workloads such as Jobs, it is essential to handle transient, retriable failures without halting the entire process. Traditionally, the backoffLimit field of a Job specifies how many pod failures to tolerate before the Job is marked as failed. However, a high backoffLimit can result in excessive restarts and increased operating costs, especially for large-scale Jobs.
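For comparison, a Job that relies on backoffLimit alone might look like the sketch below. The Job name, container name, and image are placeholders chosen for illustration, not values from a real deployment.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: retry-only-job          # hypothetical name
spec:
  backoffLimit: 6               # every pod failure counts toward this limit, whatever its cause
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker            # hypothetical container name
        image: example-image    # placeholder image
```

With only this field, a pod killed by a node shutdown and a pod that crashed because of a bug in the application consume the same retry budget.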

The Pod Failure Policy extends the backoff limit mechanism, offering more granular control to immediately terminate a Job upon a non-retriable pod failure and ignore retriable errors without inflating the backoffLimit value. This policy is particularly useful for scenarios like running workloads on cost-effective spot instances, where pod failures due to node shutdowns can be gracefully ignored.

How Pod Failure Policy Works

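A Pod Failure Policy is defined within the Job specification as a list of rules. Each rule pairs a condition, based on container exit codes or pod conditions, with an action to take when a failed pod matches it. The supported actions are FailJob (mark the whole Job as failed and terminate its running pods), Ignore (do not count the failure toward backoffLimit and create a replacement pod), Count (handle the failure in the default way, counting it toward backoffLimit), and FailIndex (for Indexed Jobs that set backoffLimitPerIndex, fail only the index of the failed pod).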
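A single rule combines one action with exactly one matcher, either onExitCodes or onPodConditions. The fragment below is a minimal sketch of both shapes as they would appear under a Job's spec; the container name main and the exit codes 1 and 2 are hypothetical.

```yaml
podFailurePolicy:
  rules:
  # Exit-code matcher, scoped to a single container (name is hypothetical).
  - action: FailJob
    onExitCodes:
      containerName: main
      operator: In              # the other supported operator is NotIn
      values: [1, 2]
  # Pod-condition matcher; status defaults to "True" when omitted.
  - action: Ignore
    onPodConditions:
    - type: DisruptionTarget
      status: "True"
```

Omitting containerName makes an exit-code rule apply to every container in the pod.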

Example Specification

Below is an example of a Pod Failure Policy:

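    apiVersion: batch/v1
    kind: Job
    metadata:
      name: example-job
    spec:
      template:
        spec:
          restartPolicy: Never        # required when podFailurePolicy is set
          containers:
          - name: example-container
            image: example-image
      backoffLimit: 3
      podFailurePolicy:
        rules:
        - action: Ignore              # disruptions do not count toward backoffLimit
          onPodConditions:
          - type: DisruptionTarget
        - action: FailJob             # fail fast on a configuration problem
          onPodConditions:
          - type: ConfigIssue
        - action: FailJob             # fail fast on a known non-retriable exit code
          onExitCodes:
            operator: In
            values: [42]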

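In this example, the rules are evaluated in order and the first rule that matches a failed pod decides the outcome. The first rule ignores any failed pod that carries the DisruptionTarget condition, which Kubernetes sets on pods terminated by disruptions such as preemption, eviction, or node shutdown, so those failures do not count toward backoffLimit. The second rule fails the Job as soon as a pod has the ConfigIssue condition; this is not a built-in condition, so a custom controller or webhook would have to set it. The third rule fails the Job if any container exits with code 42. A failure that matches none of the rules is handled in the default way and counts toward backoffLimit.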

Important Note

When using the Pod Failure Policy, ensure the Job’s pod template is set with restartPolicy: Never. This setting prevents race conditions between the kubelet and Job controller when counting pod failures.
