Introduction
With the release of Kubernetes 1.31, the Pod Failure Policy for Jobs has graduated to General Availability (GA). The feature gives you finer-grained control over how Kubernetes handles pod failures within Jobs, allowing for more efficient and cost-effective management of workloads.
Understanding Pod Failure Policy
Running workloads on Kubernetes can lead to pod failures for various reasons. For workloads such as Jobs, it is essential to handle transient, retriable failures without halting the entire process. Traditionally, the backoffLimit field in Kubernetes Jobs allowed you to specify the number of pod failures to tolerate before stopping the Job. However, setting a high backoffLimit can result in excessive restarts and increased operating costs, especially for large-scale Jobs.
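For context, here is a minimal sketch of a Job that relies only on backoffLimit; the Job name, image, and command are placeholders for illustration:
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-only-job          # placeholder name
spec:
  backoffLimit: 6                 # tolerate up to 6 pod failures before the Job fails
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main                # placeholder container name
        image: busybox            # placeholder image
        command: ["sh", "-c", "exit 1"]   # always fails, so the Job retries until backoffLimit is reached
With only backoffLimit available, every failure counts the same, whether it is retriable or not.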
The Pod Failure Policy extends the backoff limit mechanism, offering more granular control to immediately terminate a Job upon a non-retriable pod failure and to ignore retriable errors without inflating the backoffLimit value. This policy is particularly useful for scenarios like running workloads on cost-effective spot instances, where pod failures due to node shutdowns can be gracefully ignored.
How Pod Failure Policy Works
A Pod Failure Policy is defined within the Job specification as a list of rules. Each rule specifies conditions and corresponding actions based on container exit codes or pod conditions. The actions can be:
- Ignore: the failure does not count towards the backoffLimit.
- FailJob: terminates the entire Job and all of its running pods.
- FailIndex: fails only the index of the failed pod, useful with the backoff limit per index feature (see the sketch after this list).
- Count: the failure counts towards the backoffLimit (the default behavior).
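As a sketch of the FailIndex action: it only applies to Indexed Jobs with backoffLimitPerIndex set, and the name, image, and exit code below are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-job-example       # placeholder name
spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed         # FailIndex only applies to Indexed Jobs
  backoffLimitPerIndex: 1         # each index may fail once before it is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main                # placeholder container name
        image: example-image      # placeholder image
  podFailurePolicy:
    rules:
    - action: FailIndex           # fail only the index whose container exited with code 2
      onExitCodes:
        operator: In
        values: [2]
Here, a pod exiting with code 2 marks just its own index as failed, while the other indexes keep running.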
Example Specification
Below is an example of a Pod Failure Policy:
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: example-container
        image: example-image
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    - action: FailJob
      onPodConditions:
      - type: ConfigIssue
    - action: FailJob
      onExitCodes:
        operator: In
        values: [42]
In this example:
- Pods with the DisruptionTarget condition are ignored and do not count towards the Job’s backoff limit.
- The Job fails if a pod has a ConfigIssue condition, which might be added by a custom controller or webhook.
- The Job also fails if any container exits with code 42 (a quick way to try this out is sketched below).
- All other pod failures count towards the backoffLimit.
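To exercise the exit-code rule, you could swap the containers section of the pod template for one that deliberately exits with code 42; the busybox image and the command here are assumptions for testing only.
      containers:
      - name: example-container
        image: busybox                     # stand-in image for testing
        command: ["sh", "-c", "exit 42"]   # exits with code 42, matching the FailJob rule above
With this in place, the first matching failure terminates the whole Job rather than consuming retries from backoffLimit.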
Important Note
When using the Pod Failure Policy, ensure the Job’s pod template sets restartPolicy: Never. This setting prevents race conditions between the kubelet and the Job controller when counting pod failures.