Distributed Training Setup AI Prompt
This prompt converts a single-GPU PyTorch script into a distributed training setup using DistributedDataParallel. It covers launch configuration, process group initialization, model wrapping, distributed sampling, checkpointing, and rank-aware logging for scalable multi-GPU or multi-node training.
Convert this single-GPU training script to distributed training using PyTorch DDP (DistributedDataParallel).

1. Launcher setup:
   - Use torchrun (not the deprecated torch.distributed.launch)
   - Support both single-node multi-GPU and multi-node setups
   - Environment variable initialization (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE)
2. Process group initialization:
   - dist.init_process_group with the nccl backend (GPU) or gloo (CPU)
   - Set the device based on LOCAL_RANK
3. Model wrapping:
   - Move the model to its device before wrapping with DDP
   - Use find_unused_parameters=False if possible (faster)
   - Sync BatchNorm: convert_sync_batchnorm if using BatchNorm layers
4. DataLoader modifications:
   - DistributedSampler with shuffle=True for training, shuffle=False for validation
   - Call sampler.set_epoch(epoch) each epoch for proper shuffling
   - Divide the effective batch size by world_size
5. Gradient synchronization:
   - Gradients are automatically synced by DDP — do not manually all_reduce
   - Use the model.no_sync() context manager for gradient accumulation
6. Checkpointing:
   - Save/load only on rank 0
   - Unwrap the model with model.module before saving the state_dict
7. Logging:
   - Log only on rank 0 to avoid duplicate output

Return: a full DDP training script with a torchrun launch command example.
When to use this prompt
when you need a correct torchrun-based DDP template
when training must scale across nodes with proper samplers and rank handling
when checkpointing and logging should only happen on rank 0
What the AI should return
A full DDP training script with launcher assumptions, distributed samplers, checkpoint logic, and an example torchrun command.
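For reference, the launch commands below show the torchrun invocations the returned script is expected to assume. The script name (`train.py`), GPU counts, and the master address are illustrative placeholders.

```shell
# Single node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py

# Two nodes, 4 GPUs each: run on every node, varying --node_rank (0 or 1)
torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
         --master_addr=10.0.0.1 --master_port=29500 train.py
```

torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each spawned process, which is why the script reads them from the environment instead of parsing CLI flags.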
How to use this prompt
Open your data context
Load your dataset, notebook, or working environment so the AI can operate on the actual project context.
Copy the prompt text
Use the copy button above and paste the prompt into the AI assistant or prompt input area.
Review the output critically
Check whether the result matches your data, assumptions, and desired format before moving on.
Chain into the next prompt
Once you have the first result, continue deeper with related prompts in Training Pipelines.
Frequently asked questions
What does the Distributed Training Setup prompt do?
It gives you a structured starting point for ML engineering work on training pipelines and helps you move faster without starting from a blank page.
Who is this prompt for?
It is designed for ML engineer workflows and marked as intermediate, so it works well as a guided starting point for that level of experience.
What type of prompt is this?
Distributed Training Setup is a single prompt. You can copy it as-is, adapt it, or use it as one step inside a larger workflow.
Can I use this outside MLJAR Studio?
Yes. The prompt text works in other AI tools too, but MLJAR Studio is the best fit when you want local execution, visible Python code, and reusable notebooks.
What should I open next?
Natural next steps from here are Custom Loss Function, Dataset Pipeline Builder, and Experiment Tracking Setup.