This guide covers deploying PRIME-RL training infrastructure on Kubernetes clusters using the provided Helm chart.

Prerequisites

  • Kubernetes cluster with GPU nodes
  • NVIDIA GPU Operator installed
  • Helm 3.x installed
  • Storage class that supports ReadWriteMany (e.g., NFS, CephFS, or cloud provider storage)

Verify Prerequisites

# Check Helm installation
helm version

# Check GPU operator
kubectl get pods -n gpu-operator

# Check available storage classes
kubectl get storageclass
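
To confirm that the GPUs are actually advertised to the scheduler, you can also check the node resources; this is a quick sanity check and assumes the standard nvidia.com/gpu resource exposed by the GPU Operator:
# Check that nodes advertise GPU capacity
kubectl describe nodes | grep -i 'nvidia.com/gpu'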

Quick Start

1. Deploy

# Deploy with a release name
helm install my-exp ./k8s/prime-rl -f ./k8s/prime-rl/examples/reverse-text.yaml

# Or with defaults (no example-specific config)
helm install my-exp ./k8s/prime-rl --set trainer.replicas=3 --set inference.replicas=2
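
Before checking pods, standard Helm commands confirm the release was created (nothing chart-specific here):
# Show release status and list installed releases
helm status my-exp
helm list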

2. Verify deployment

# Check pod status
kubectl get pods -l app.kubernetes.io/instance=my-exp

# Should show 3 pods (one per component when using the reverse-text example):
# my-exp-orchestrator-0
# my-exp-inference-0
# my-exp-trainer-0
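
If you want to block until everything is up, for example in a script, kubectl wait works with the same label selector (the 10-minute timeout is an arbitrary choice):
# Wait until all pods of the release are Ready
kubectl wait --for=condition=Ready pod -l app.kubernetes.io/instance=my-exp --timeout=10m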

3. Run training

# Exec into trainer
kubectl exec -it my-exp-trainer-0 -- bash

# Inside the pod, run training
cd /data
uv run trainer @ /app/examples/reverse_text/configs/train.toml
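
The same command can also be launched without an interactive shell, which is convenient for scripting; this is just the two steps above combined into a single kubectl exec call (the process stops if the exec session disconnects):
# Launch training non-interactively
kubectl exec my-exp-trainer-0 -- bash -lc 'cd /data && uv run trainer @ /app/examples/reverse_text/configs/train.toml'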

4. Monitor progress

# Get logs
kubectl logs my-exp-trainer-0

# Follow logs in real-time
kubectl logs -f my-exp-trainer-0

Available Examples

The chart includes pre-configured values for each example:

reverse-text (Small - 1 GPU)

helm install my-exp ./k8s/prime-rl -f ./k8s/prime-rl/examples/reverse-text.yaml
  • Model: Qwen3-0.6B
  • GPUs: 1 per component
  • Runs on consumer GPUs (RTX 3090/4090)
  • Note: You can use any release name; the chart automatically configures service URLs (see the example below)
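For example, a second copy of the same experiment under a different release name (here the hypothetical another-exp) gets its own, correctly wired pods and services:
# Deploy a second instance under a different release name
helm install another-exp ./k8s/prime-rl -f ./k8s/prime-rl/examples/reverse-text.yaml
kubectl get pods -l app.kubernetes.io/instance=another-exp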

Configuration

Storage Configuration

By default, the chart creates a 1TB PVC with NFS storage. To customize:
# custom-values.yaml
storage:
  storageClassName: my-storage-class
  size: 500Gi
Deploy with custom storage:
helm install my-release ./k8s/prime-rl -f custom-values.yaml

GPU Configuration

Adjust GPU count per component:
# custom-gpu.yaml
inference:
  gpu:
    count: 4  # Use 4 GPUs for inference

trainer:
  gpu:
    count: 2  # Use 2 GPUs for training
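Deploy with the custom GPU counts, mirroring the storage example above (custom-gpu.yaml is the file from this snippet):
helm install my-release ./k8s/prime-rl -f custom-gpu.yaml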

Resource Limits

Customize memory and CPU:
# custom-resources.yaml
trainer:
  resources:
    requests:
      memory: "64Gi"
      cpu: "16"
    limits:
      memory: "128Gi"
      cpu: "32"

Secrets (Optional)

For W&B and HuggingFace authentication:
# Create secret
kubectl create secret generic prime-rl-secrets \
  --from-literal=wandb-api-key=YOUR_WANDB_KEY \
  --from-literal=hf-token=YOUR_HF_TOKEN

# Enable in values
helm install my-release ./k8s/prime-rl \
  --set config.secrets.enabled=true \
  --set config.secrets.name=prime-rl-secrets
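
To sanity-check the secret before installing, list its keys without printing the values (kubectl describe shows key names and sizes only):
# Verify the secret exists and has the expected keys
kubectl describe secret prime-rl-secrets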

Common Operations

Deploy a new experiment

# With example config
helm install my-exp ./k8s/prime-rl -f ./k8s/prime-rl/examples/reverse-text.yaml

# With custom settings
helm install my-exp ./k8s/prime-rl --set trainer.replicas=10 --set inference.replicas=5

Exec into pods

# Exec into trainer-0
kubectl exec -it my-exp-trainer-0 -- bash

# Exec into specific trainer pod
kubectl exec -it my-exp-trainer-3 -- bash

# Exec into inference
kubectl exec -it my-exp-inference-0 -- bash
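
kubectl exec can also run one-off commands without opening a shell, which is handy for quick checks; nvidia-smi is assumed to be available in the GPU images:
# Check GPU visibility inside the trainer pod
kubectl exec my-exp-trainer-0 -- nvidia-smi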

View logs

# Get logs from trainer-0
kubectl logs my-exp-trainer-0

# Follow logs in real-time
kubectl logs -f my-exp-trainer-2

# Get logs from all trainers
kubectl logs -l app.kubernetes.io/instance=my-exp,role=trainer
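
When following several trainers at once, prefixing each line with the pod name keeps the streams distinguishable; --prefix and --max-log-requests are standard kubectl flags, and 10 is an arbitrary cap:
# Follow all trainer logs with per-pod prefixes
kubectl logs -f -l app.kubernetes.io/instance=my-exp,role=trainer --prefix --max-log-requests=10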

List all pods

# List pods for specific experiment
kubectl get pods -l app.kubernetes.io/instance=my-exp

# List all prime-rl pods
kubectl get pods -l app=prime-rl

Architecture

Components

The chart deploys three main components (all using StatefulSets):
  1. Orchestrator (StatefulSet) - Coordinates training workflow
    • Always 1 replica: prime-rl-orchestrator-0
    • No GPU required
    • Communicates with trainer and inference
  2. Inference (StatefulSet) - Runs vLLM inference server
    • Scalable replicas with stable pod names: prime-rl-inference-0, prime-rl-inference-1, …
    • Each pod gets predictable DNS: prime-rl-inference-0.prime-rl-inference-headless.default.svc.cluster.local
    • Requires GPU(s)
    • Serves model predictions
  3. Trainer (StatefulSet) - Runs SFT or RL training
    • Scalable replicas with stable pod names: prime-rl-trainer-0, prime-rl-trainer-1, …
    • Each pod gets predictable DNS: prime-rl-trainer-0.prime-rl-trainer-headless.default.svc.cluster.local
    • Requires GPU(s)
    • Updates model weights on shared storage
Why StatefulSets for all components?
  • Consistent naming: All pods have predictable names (orchestrator-0, trainer-0, trainer-1, …)
  • Stable networking: Each pod gets its own DNS hostname via headless service
  • Required for distributed training: PyTorch/vLLM need to discover peers by stable hostname
  • Clean naming: No random pod suffixes, easier to identify and debug

Shared Storage

All components mount the same PVC at /data for:
  • Model checkpoint sharing
  • Training data
  • Experiment outputs
This is required for coordinating weight updates between trainer and inference.
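A quick way to confirm the mount really is shared is to write a file from one component and read it from another (the file name here is arbitrary):
# Write from the trainer, read from inference, then clean up
kubectl exec my-exp-trainer-0 -- touch /data/rw-test
kubectl exec my-exp-inference-0 -- ls -l /data/rw-test
kubectl exec my-exp-trainer-0 -- rm /data/rw-test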

Environment Variables

Each pod has these K8s environment variables set:
  • $POD_NAME - Full pod name (e.g., my-exp-trainer-3)
  • $POD_IP - Pod IP address
  • $STATEFUL_REPLICAS - Total number of replicas for that component
  • $HEADLESS_SERVICE - DNS name for peer discovery (e.g., my-exp-trainer-headless.default.svc.cluster.local)
  • $INFERENCE_URL - Full URL to the first inference pod (available in orchestrator and trainer pods)
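To see what a given pod actually received, print the variables directly; the grep pattern just filters for the names listed above:
# Inspect the injected environment variables
kubectl exec my-exp-trainer-0 -- env | grep -E 'POD_NAME|POD_IP|STATEFUL_REPLICAS|HEADLESS_SERVICE|INFERENCE_URL'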
For distributed training, extract the rank from the pod name:
# Extract ordinal from pod name
RANK=$(echo $POD_NAME | grep -o '[0-9]*$')  # e.g., "my-exp-trainer-3" -> "3"

# Use in torchrun
torchrun \
  --nnodes=$STATEFUL_REPLICAS \
  --node-rank=$RANK \
  --nproc-per-node=8 \
  --rdzv-endpoint=my-exp-trainer-0.$HEADLESS_SERVICE:29501 \
  src/prime_rl/trainer/sft/train.py @ configs/train.toml

Troubleshooting

Can’t access shared storage

Verify PVC is bound:
kubectl get pvc prime-rl-shared-data
# STATUS should be "Bound"
Check mount inside pod:
kubectl exec -it my-exp-trainer-0 -- df -h /data

Pod stuck in Pending

Check if GPU resources are available:
kubectl describe pod my-exp-trainer-0
Look for events like Insufficient nvidia.com/gpu.
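If the pod's own events are not conclusive, the namespace-wide event stream sorted by time can also help:
# Show recent events, newest last
kubectl get events --sort-by='.lastTimestamp' | grep my-exp-trainer-0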

Inference server not responding

Check if the inference pod is ready:
kubectl get pods -l role=inference
kubectl logs my-exp-inference-0
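
If the pod looks healthy but requests still fail, port-forwarding and querying the server locally isolates networking issues from the server itself; port 8000 and the /v1/models path are vLLM defaults and may differ if the chart overrides them:
# Forward the inference port (in a second terminal) and query the OpenAI-compatible API
kubectl port-forward my-exp-inference-0 8000:8000
curl http://localhost:8000/v1/models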

Uninstalling

# Remove the Helm release
helm uninstall my-exp

# Delete PVC (data will be lost!)
kubectl delete pvc prime-rl-shared-data