A Practical Guide to Modifying Pod Resources in Suspended Kubernetes Jobs (Beta)

Introduction

Batch processing and machine learning workloads often require dynamic resource adjustments based on cluster availability. Kubernetes v1.36 brings a much-anticipated beta feature: the ability to modify container resource requests and limits in the pod template of a suspended Job. This capability, first introduced as alpha in v1.35, gives queue controllers and cluster administrators the flexibility to adjust CPU, memory, GPU, and extended resource specifications on a Job while it is suspended—before it starts or resumes running. In this guide, we'll walk you through how to leverage this feature to make your batch operations more resilient and efficient.

What You Need

  • A Kubernetes cluster running version v1.36 or later (beta) with the MutableJobPodTemplateResources feature gate enabled (it is enabled by default in v1.36).
  • kubectl command-line tool configured to communicate with your cluster.
  • A basic understanding of Kubernetes Jobs and resource requests/limits.
  • Optional: A queue controller (like Kueue) that can manage Job resource adjustments automatically.

Step-by-Step Guide

Step 1: Create a Suspended Job with Initial Resource Requirements

Start by defining a Job that is paused from the beginning. Set spec.suspend: true to keep it from spawning pods until you decide the right resource configuration. Here's an example YAML for a machine learning training job that initially requests 4 GPUs:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
      restartPolicy: Never

Save the file as job-suspended.yaml and create it with kubectl apply -f job-suspended.yaml. The Job is now registered but no pods are running; you can confirm the suspended state with kubectl get job training-job-example-abcd123 -o jsonpath='{.spec.suspend}'.

Step 2: Assess Cluster Capacity and Decide on Resource Changes

Once the Job is suspended, you or your queue controller can evaluate the current cluster state. For instance, if only 2 GPUs are available instead of the requested 4, you can adjust the resource specifications without losing the Job's metadata, status, or history. This is a major improvement over the previous immutable behavior, which would have required deleting and recreating the Job.
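The decision a controller makes here is simple arithmetic. The sketch below is a hypothetical helper (it is not part of any Kubernetes library, and it works on plain integers rather than real Kubernetes Quantity strings) that scales CPU and memory requests proportionally to the GPU count actually available:

```python
def scale_resources(requested, available_gpus, gpu_key="example-hardware-vendor.com/gpu"):
    """Scale CPU/memory proportionally to the GPUs we can actually get.

    `requested` maps resource names to integer quantities (CPU in whole
    cores, memory in GiB, GPUs in devices). Hypothetical helper for
    illustration; a real controller would parse Quantity strings.
    """
    want = requested[gpu_key]
    grant = min(want, available_gpus)
    factor = grant / want
    # Scale every resource by the same factor, never dropping below 1.
    scaled = {name: max(1, int(qty * factor)) for name, qty in requested.items()}
    scaled[gpu_key] = grant
    return scaled

# The Job above asks for 8 CPU / 32 GiB / 4 GPUs; suppose only 2 GPUs are free.
print(scale_resources({"cpu": 8, "memory": 32, "example-hardware-vendor.com/gpu": 4}, 2))
```

With 2 of 4 GPUs available, the factor is 0.5, yielding 4 CPUs and 16 GiB of memory: exactly the values patched in Step 3 below.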

Step 3: Modify Pod Resource Requests and Limits in the Suspended Job

To update the resource values, use kubectl patch or edit the Job resource directly. For example, to reduce GPU count from 4 to 2 (and adjust CPU and memory accordingly), run:

kubectl patch job training-job-example-abcd123 --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "16Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "16Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/example-hardware-vendor.com~1gpu", "value": "2"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/example-hardware-vendor.com~1gpu", "value": "2"}
]'

Note: the ~1 sequence in the patch paths is JSON Pointer escaping (RFC 6901): a / inside a key is written as ~1 and a ~ as ~0, which is why the extended resource name example-hardware-vendor.com/gpu appears as example-hardware-vendor.com~1gpu. Alternatively, you can use kubectl edit job training-job-example-abcd123 and modify the YAML directly.
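If the escaping is unfamiliar, the mechanics can be reproduced with a few lines of stdlib Python. This is an illustrative sketch of how a single JSON Patch "replace" op resolves its path, not something you run against the cluster:

```python
def unescape(token):
    # RFC 6901: decode ~1 to "/" first, then ~0 to "~" (this order matters,
    # otherwise "~01" would incorrectly decode to "/").
    return token.replace("~1", "/").replace("~0", "~")

def json_pointer_replace(doc, pointer, value):
    """Apply a single JSON Patch 'replace' op to a nested dict/list in place."""
    parts = [unescape(p) for p in pointer.lstrip("/").split("/")]
    target = doc
    for part in parts[:-1]:
        target = target[int(part)] if isinstance(target, list) else target[part]
    last = parts[-1]
    if isinstance(target, list):
        target[int(last)] = value
    else:
        target[last] = value
    return doc

spec = {"containers": [{"resources": {"requests": {"example-hardware-vendor.com/gpu": "4"}}}]}
json_pointer_replace(spec, "/containers/0/resources/requests/example-hardware-vendor.com~1gpu", "2")
print(spec["containers"][0]["resources"]["requests"])  # GPU request is now "2"
```

The API server performs the equivalent resolution when it receives the kubectl patch shown above.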

Step 4: Verify the Resource Modifications

Check that the changes were applied correctly by describing the Job:

kubectl describe job training-job-example-abcd123

You should see the updated resource values under the Pod Template section. To print just the resource block, you can also run kubectl get job training-job-example-abcd123 -o jsonpath='{.spec.template.spec.containers[0].resources}'. The Job remains suspended at this point, so no pods have been created yet.

Step 5: Resume the Job to Launch Pods with the Adjusted Resources

Once the resource fields match the current cluster capacity, set spec.suspend to false to start the Job:

kubectl patch job training-job-example-abcd123 --type='json' -p='[{"op": "replace", "path": "/spec/suspend", "value": false}]'

Now the Job will create its pods using the updated resource requests and limits. You can monitor pod creation with kubectl get pods -l job-name=training-job-example-abcd123 (add -w to watch continuously).

Step 6 (Optional): Automate with a Queue Controller

For larger deployments, consider using a queue controller like Kueue to automatically adjust resources based on cluster state. Such controllers can integrate with the Kubernetes API to modify suspended Jobs before resuming them. The architecture remains the same: suspend the Job, modify resources, then resume. The controller handles the decision logic and API calls.
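The suspend-modify-resume cycle from Steps 1-5 can be sketched as controller logic. Everything below is a hypothetical sketch: the dict shape mirrors the Job manifest, and a real controller such as Kueue would issue PATCH requests through a generated API client rather than mutating a local dict:

```python
def reconcile(job, available_gpus, gpu_key="example-hardware-vendor.com/gpu"):
    """Fit a suspended Job's GPU request to capacity, then resume it.

    `job` is a plain dict shaped like the Job manifest; it is mutated in
    place. Hypothetical sketch of the suspend -> modify -> resume flow.
    """
    if not job["spec"].get("suspend"):
        return job  # only a suspended Job may have its template resources changed
    resources = job["spec"]["template"]["spec"]["containers"][0]["resources"]
    requested = int(resources["requests"][gpu_key])
    granted = min(requested, available_gpus)
    for section in ("requests", "limits"):
        resources[section][gpu_key] = str(granted)
    job["spec"]["suspend"] = False  # resume with the adjusted resources
    return job

job = {"spec": {"suspend": True, "template": {"spec": {"containers": [
    {"resources": {"requests": {"example-hardware-vendor.com/gpu": "4"},
                   "limits": {"example-hardware-vendor.com/gpu": "4"}}}]}}}}
reconcile(job, 2)
print(job["spec"]["suspend"])  # False: the Job resumes with 2 GPUs requested
```

The point of the sketch is the ordering: capacity is checked and the template patched while spec.suspend is still true, and flipping it to false is the last step.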

Tips for Using Mutable Pod Resources in Suspended Jobs

  • Most fields remain immutable: this feature makes resource requests and limits in the pod template mutable while the Job is suspended. Fields such as the container image and command are still immutable after creation.
  • Resource changes only take effect before pods are created: Once the Job is resumed and pods start, further modifications to the pod template will not affect already-running pods. If you need to change resources mid-execution, consider using Vertical Pod Autoscaler (VPA) in update mode.
  • Extended resources work too: You can modify any resource type, including custom extended resources (like GPUs) that your cluster advertises, as long as they are properly registered.
  • CronJob integration: For periodic workloads managed by a CronJob, each Job instance can be individually adjusted before it runs. This allows you to scale down resource usage during heavy cluster load instead of failing the run.
  • Version compatibility: this feature is beta and enabled by default in Kubernetes v1.36. On v1.35 it is alpha, so you must enable the MutableJobPodTemplateResources feature gate explicitly; earlier versions do not support it at all. Check your cluster version with kubectl version.
  • Monitor job history: Because you no longer have to delete and recreate Jobs, you preserve the Job's history and status—useful for audit trails and debugging.

By following these steps, you can make your Kubernetes batch processing more adaptive to fluctuating cluster resources. Whether you're managing a small cluster manually or leveraging automated queue controllers, this beta feature saves time and reduces operational complexity.
