## Prerequisites
- Kubernetes cluster with GPU nodes
- NVIDIA GPU Operator installed
- Helm 3.x installed
- Storage class that supports ReadWriteMany (e.g., NFS, CephFS, or cloud-provider storage)
### Verify Prerequisites
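A quick sketch for checking each prerequisite from the command line; the `gpu-operator` namespace is the NVIDIA GPU Operator's default and may differ in your cluster:

```bash
# GPU nodes should report an allocatable nvidia.com/gpu resource
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# GPU Operator pods should all be Running (default namespace shown)
kubectl get pods -n gpu-operator

# At least one storage class must support ReadWriteMany (e.g., NFS or CephFS)
kubectl get storageclass

# Helm 3.x
helm version --short
```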
## Quick Start
1. Deploy
2. Verify deployment
3. Run training
4. Monitor progress
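A minimal sketch of these four steps, assuming the chart lives at `./prime-rl`, the release is named `prime-rl`, the example values file is `examples/reverse-text.yaml`, and training starts automatically once all pods are Running; adjust names and paths to your setup:

```bash
# 1. Deploy the chart with one of the example values files
helm install prime-rl ./prime-rl -f examples/reverse-text.yaml

# 2. Verify deployment - orchestrator, trainer, and inference pods should reach Running
kubectl get pods

# 3. Run training - follow the orchestrator, which coordinates the workflow
kubectl logs -f prime-rl-orchestrator-0

# 4. Monitor progress - watch pod status and the trainer logs
kubectl get pods -w
kubectl logs -f prime-rl-trainer-0
```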
## Available Examples
The chart includes pre-configured values for each example:

### reverse-text (Small - 1 GPU)
- Model: Qwen3-0.6B
- GPUs: 1 per component
- Runs on consumer GPUs (RTX 3090/4090)
- Note: You can use any release name - the chart automatically configures service URLs
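For example, to deploy the reverse-text example under a custom release name (the chart path and example values file location are assumptions):

```bash
# Pods will be named my-exp-orchestrator-0, my-exp-trainer-0, my-exp-inference-0, …
helm install my-exp ./prime-rl -f examples/reverse-text.yaml
```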
## Configuration
### Storage Configuration
By default, the chart creates a 1TB PVC with NFS storage. To customize:
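A sketch of overriding the defaults at install time; the value keys (`storage.size`, `storage.storageClassName`) are assumptions, so check them against the chart's `values.yaml`:

```bash
helm install prime-rl ./prime-rl \
  --set storage.size=500Gi \
  --set storage.storageClassName=cephfs   # any ReadWriteMany-capable class
```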
### GPU Configuration

Adjust the GPU count per component:
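For example (the `trainer.gpus`, `trainer.replicas`, and `inference.gpus` keys are assumptions about the chart's values schema):

```bash
helm install prime-rl ./prime-rl \
  --set trainer.gpus=2 \
  --set trainer.replicas=2 \
  --set inference.gpus=1
```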
### Resource Limits

Customize memory and CPU limits:
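A sketch using standard Kubernetes resource fields; the exact nesting under each component is an assumption:

```bash
helm install prime-rl ./prime-rl \
  --set trainer.resources.limits.cpu=16 \
  --set trainer.resources.limits.memory=128Gi \
  --set inference.resources.limits.memory=64Gi
```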
### Secrets (Optional)

For W&B and HuggingFace authentication:
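One way to provide the tokens is to create Kubernetes secrets and point the chart at them; the secret and key names below are illustrative, and how the chart consumes them should be checked in `values.yaml`:

```bash
# Assumes WANDB_API_KEY and HF_TOKEN are exported in your shell
kubectl create secret generic wandb-secret --from-literal=WANDB_API_KEY="$WANDB_API_KEY"
kubectl create secret generic hf-secret --from-literal=HF_TOKEN="$HF_TOKEN"
```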
## Common Operations

### Deploy a new experiment
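Each experiment is its own Helm release, so deploying a new one is another `helm install` (chart path and values file are illustrative):

```bash
helm install exp-2 ./prime-rl -f examples/reverse-text.yaml
helm list   # shows all experiments currently deployed
```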
### Exec into pods
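For example (pod names assume a release called `prime-rl`, and the images are assumed to ship with bash):

```bash
kubectl exec -it prime-rl-trainer-0 -- /bin/bash
kubectl exec -it prime-rl-orchestrator-0 -- /bin/bash
```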
### View logs
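For example; the `app.kubernetes.io/component` label used for the multi-pod case is an assumption about the chart's labels:

```bash
kubectl logs -f prime-rl-orchestrator-0
kubectl logs --tail=100 prime-rl-trainer-0

# All inference replicas at once, prefixed with the pod name
kubectl logs -f -l app.kubernetes.io/component=inference --prefix
```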
### List all pods
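For example; the `app.kubernetes.io/instance` label is the standard Helm convention and is assumed to be set by the chart:

```bash
kubectl get pods -l app.kubernetes.io/instance=prime-rl -o wide
```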
## Architecture
### Components
The chart deploys three main components (all using StatefulSets):

- Orchestrator (StatefulSet) - Coordinates the training workflow
  - Always 1 replica: `prime-rl-orchestrator-0`
  - No GPU required
  - Communicates with the trainer and inference components

- Inference (StatefulSet) - Runs the vLLM inference server
  - Scalable replicas with stable pod names: `prime-rl-inference-0`, `prime-rl-inference-1`, …
  - Each pod gets a predictable DNS name: `prime-rl-inference-0.prime-rl-inference-headless.default.svc.cluster.local`
  - Requires GPU(s)
  - Serves model predictions

- Trainer (StatefulSet) - Runs SFT or RL training
  - Scalable replicas with stable pod names: `prime-rl-trainer-0`, `prime-rl-trainer-1`, …
  - Each pod gets a predictable DNS name: `prime-rl-trainer-0.prime-rl-trainer-headless.default.svc.cluster.local`
  - Requires GPU(s)
  - Updates model weights on shared storage

StatefulSets are used because they provide:

- Consistent naming: all pods have predictable names (`orchestrator-0`, `trainer-0`, `trainer-1`, …)
- Stable networking: each pod gets its own DNS hostname via a headless service
- Required for distributed training: PyTorch/vLLM need to discover peers by stable hostname
- Clean naming: no random pod suffixes, easier to identify and debug
### Shared Storage
All components mount the same PVC at `/data` for:
- Model checkpoint sharing
- Training data
- Experiment outputs
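A quick check that all components see the same volume (pod names assume a release called `prime-rl`):

```bash
# The same checkpoints and outputs should be visible from any pod
kubectl exec prime-rl-trainer-0 -- ls /data
kubectl exec prime-rl-inference-0 -- ls /data
```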
### Environment Variables
Each pod has these Kubernetes environment variables set:

- `$POD_NAME` - Full pod name (e.g., `my-exp-trainer-3`)
- `$POD_IP` - Pod IP address
- `$STATEFUL_REPLICAS` - Total number of replicas for that component
- `$HEADLESS_SERVICE` - DNS name for peer discovery (e.g., `my-exp-trainer-headless.default.svc.cluster.local`)
- `$INFERENCE_URL` - Full URL to the first inference pod (available in orchestrator and trainer pods)
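To inspect the injected values from a running pod (pod name assumes a release called `prime-rl`):

```bash
kubectl exec prime-rl-trainer-0 -- env | grep -E 'POD_NAME|POD_IP|STATEFUL_REPLICAS|HEADLESS_SERVICE|INFERENCE_URL'
```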
## Troubleshooting
### Can’t access shared storage
Verify the PVC is bound:
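For example:

```bash
# The chart's PVC should show STATUS=Bound; if it stays Pending, check that the
# storage class supports ReadWriteMany
kubectl get pvc
kubectl describe pvc
```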
### Pod stuck in Pending

Check whether GPU resources are available; a common cause is `Insufficient nvidia.com/gpu` in the pod's scheduling events:
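For example (the trainer pod name assumes a release called `prime-rl`):

```bash
# The pod's events show why scheduling fails (e.g., "Insufficient nvidia.com/gpu")
kubectl describe pod prime-rl-trainer-0

# Compare allocatable GPUs vs. what is already requested on each node
kubectl describe nodes | grep -A 8 'Allocated resources'
```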