The MultiRunManager object is a global singleton that manages the parameters and components for multiple concurrent training runs within a single trainer process.
This allows multiple orchestrator deployments to share the same trainer.
When max_concurrent_runs > 1, the trainer can train multiple runs in parallel. Each run:
- Has its own LoRA adapter parameters
- Has its own optimizer and scheduler
- Saves its own checkpoints
- Tracks its own training progress (step, tokens, samples)
- Loads its own orchestrator configuration
The MultiRunManager object provides:
- Bidirectional mapping between run IDs (e.g., run_abc123) and run indices (0, 1, 2, …)
- Progress tracking per run (step count, total tokens, total samples)
- Configuration management for orchestrator configs
- Distributed synchronization across ranks via the PyTorch distributed store
- LoRA module registration for multi-adapter parameter management
- Creation hooks for initializing per-run resources (optimizers, schedulers)
- Run eviction for removing runs that are misbehaving
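The ID-to-index mapping can be pictured as two dictionaries plus a pool of free slots. The following is a minimal sketch under stated assumptions, not the trainer's actual code: the class name RunIndexMap and its methods are hypothetical, and handing out the lowest free index first is an illustrative choice.

```python
from __future__ import annotations


class RunIndexMap:
    """Toy bidirectional map between run IDs and a fixed number of index slots."""

    def __init__(self, max_runs: int) -> None:
        self.id_to_idx: dict[str, int] = {}
        self.idx_to_id: dict[int, str] = {}
        # Free slots, kept sorted so the lowest index is handed out first.
        self.unused: list[int] = list(range(max_runs))

    def add(self, run_id: str) -> int | None:
        """Assign the lowest free index, or return None when every slot is taken."""
        if run_id in self.id_to_idx:
            return self.id_to_idx[run_id]
        if not self.unused:
            return None
        idx = self.unused.pop(0)
        self.id_to_idx[run_id] = idx
        self.idx_to_id[idx] = run_id
        return idx

    def remove(self, run_id: str) -> None:
        """Drop the run and return its index to the unused pool."""
        idx = self.id_to_idx.pop(run_id)
        del self.idx_to_id[idx]
        self.unused.append(idx)
        self.unused.sort()
```

With max_runs=2, a third run is rejected until one of the first two is removed, after which it reuses the freed index.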
Initialization and run discovery
The MultiRunManager singleton is set up at the start of training:
Runs are discovered by scanning the output directory for directories matching run_*. Each run must contain a valid orchestrator config at {run_dir}/control/orch.toml before it is added to the active runs; otherwise it is ignored. When the maximum number of runs is reached, new run_* directories are not picked up until old ones are deleted.
The discover_runs() method (master only):
- Scans the output directory for run_* directories
- Filters out evicted runs (those with control/evicted.txt)
- Detects new runs and deleted runs
- Calls forgotten_hook for deleted runs (master only)
- Loads and validates the orchestrator config for each new run
- Updates internal mappings and data structures
- Calls discovered_hook for new runs (master only)
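The scan-and-diff part of these steps could look roughly like the sketch below. It is an illustration only: scan_runs is a hypothetical helper, it checks only that orch.toml exists rather than loading and validating it, and it leaves out the hook calls.

```python
from __future__ import annotations

from pathlib import Path


def scan_runs(output_dir: Path, active: set[str], max_runs: int) -> tuple[set[str], set[str]]:
    """Return (new_runs, deleted_runs) relative to the currently active run IDs."""
    on_disk: set[str] = set()
    for run_dir in sorted(output_dir.glob("run_*")):
        if not run_dir.is_dir():
            continue
        control = run_dir / "control"
        if (control / "evicted.txt").exists():
            continue  # evicted runs are filtered out
        if not (control / "orch.toml").exists():
            continue  # no orchestrator config yet, so the run is ignored
        on_disk.add(run_dir.name)

    deleted = active - on_disk
    # Only admit as many new runs as there are free slots after deletions.
    free_slots = max_runs - (len(active) - len(deleted))
    new = set(sorted(on_disk - active)[: max(free_slots, 0)])
    return new, deleted
```

Runs that do not fit into a free slot are simply left for a later scan, which matches the behavior described above where new directories wait until old ones are deleted.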
The synchronize_state() method (all ranks):
- Master broadcasts run state to all ranks via the distributed store
- Non-master ranks catch up by calling internal _delete_run_data / _create_run_data
- All ranks execute deletion_hook for deleted runs
- All ranks execute creation_hook for new runs (e.g., optimizer setup, LoRA parameter reset)
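As a rough illustration of the broadcast-then-catch-up pattern, the sketch below uses a plain dict as a stand-in for the PyTorch distributed store. The functions publish and catch_up, the "runs" key, and the (index, run ID) hook signature are all assumptions made for this example, not the manager's real interface.

```python
from __future__ import annotations

import json
from typing import Callable

Hook = Callable[[int, str], None]


def publish(store: dict, runs: dict[int, str]) -> None:
    """Master side: serialize the index -> run ID mapping into the shared store."""
    store["runs"] = json.dumps(runs)


def catch_up(store: dict, local: dict[int, str], deletion_hook: Hook, creation_hook: Hook) -> dict[int, str]:
    """Any rank: diff the local view against the published state and run hooks."""
    # JSON turns integer keys into strings, so convert them back.
    target = {int(k): v for k, v in json.loads(store["runs"]).items()}
    deleted = sorted(i for i, rid in local.items() if target.get(i) != rid)
    created = sorted(i for i, rid in target.items() if local.get(i) != rid)
    # Deletions run first so a reused index is cleaned up before it is reinitialized.
    for i in deleted:
        deletion_hook(i, local[i])
    for i in created:
        creation_hook(i, target[i])
    return target
```

If index 1 switches from run_b to run_c between two synchronizations, the deletion hook fires for run_b and then the creation hook fires for run_c on the same index.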
Run Eviction
The master process on the trainer can evict a run using the evict_run(idx: int, reason: str) method.
This is useful when the trainer detects an issue with a run that requires it to be stopped (e.g., invalid data, resource constraints, or policy violations).
The evict_run() method (master only):
- Writes the eviction reason to {run_dir}/control/evicted.txt
- Logs a warning with the eviction details
- The run is not immediately removed from the manager’s data structures
- The next discover_runs() call will filter out the evicted run (it checks for evicted.txt)
- The run will then be treated as deleted, triggering forgotten/deletion hooks
- The run index is returned to the unused pool
- The orchestrator checks for evicted.txt at the start of each iteration in its main loop
- If found, it raises a RuntimeError with the eviction reason, causing the orchestrator to exit
- This surfaces the eviction reason to the user
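The file-based handshake between trainer and orchestrator can be sketched as follows. Both function names are hypothetical, and the helper takes a run directory rather than a run index to keep the example self-contained; only the evicted.txt location and the RuntimeError come from the description above.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def write_eviction_marker(run_dir: Path, reason: str) -> None:
    """Trainer side: record why the run was evicted and log a warning."""
    control = run_dir / "control"
    control.mkdir(parents=True, exist_ok=True)
    (control / "evicted.txt").write_text(reason)
    logger.warning("Evicting run %s: %s", run_dir.name, reason)


def check_evicted(run_dir: Path) -> None:
    """Orchestrator side: call at the start of each loop iteration."""
    marker = run_dir / "control" / "evicted.txt"
    if marker.exists():
        raise RuntimeError(f"Run {run_dir.name} was evicted: {marker.read_text().strip()}")
```

Because the marker is just a file, the orchestrator needs no direct connection to the trainer to learn that it has been evicted and why.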
LoRA Module Registration
LoRA modules register themselves with MultiRunManager for parameter management:
The MultiRunManager object then exposes:
Hooks
The MultiRunManager object supports several types of hooks for different lifecycle events.
Deletion hooks are always called before creation hooks.
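One way to picture the ordering guarantee is a small registry that always drains deletion hooks before creation hooks. This is a sketch under assumptions: HookRegistry, its event names, and the (index, run ID) hook signature are invented for illustration and may differ from the real manager.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable

Hook = Callable[[int, str], None]


class HookRegistry:
    """Toy registry that runs deletion hooks before creation hooks."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Hook]] = defaultdict(list)

    def register(self, event: str, hook: Hook) -> None:
        """Attach a hook to an event name such as "deletion" or "creation"."""
        self._hooks[event].append(hook)

    def fire(self, deleted: list[tuple[int, str]], created: list[tuple[int, str]]) -> None:
        # Deletions first: a new run may reuse the index a deleted run just freed.
        for idx, run_id in deleted:
            for hook in self._hooks["deletion"]:
                hook(idx, run_id)
        for idx, run_id in created:
            for hook in self._hooks["creation"]:
                hook(idx, run_id)
```

Registration order does not affect this guarantee; a creation hook registered before a deletion hook still runs after it.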