ML Engineer · Training Pipelines · Intermediate · Single prompt

Distributed Training Setup AI Prompt

This prompt converts a single-GPU PyTorch script into a distributed training setup using DistributedDataParallel. It covers launch configuration, process group initialization, model wrapping, distributed sampling, checkpointing, and rank-aware logging for scalable multi-GPU or multi-node training.

Prompt text
Convert this single-GPU training script to distributed training using PyTorch DDP (DistributedDataParallel).

1. Launcher setup:
   - Use torchrun (not the deprecated torch.distributed.launch)
   - Support both single-node multi-GPU and multi-node setups
   - Environment variable initialization (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE)
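For illustration only (the script name train.py, GPU counts, and master address/port below are placeholders, not part of the prompt), the launchers in step 1 typically look like:

```shell
# Single node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py

# Two nodes, 4 GPUs each — run once per node with the matching --node_rank
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=4 \
  --master_addr=10.0.0.1 --master_port=29500 train.py
```

torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for every spawned process, so the script itself only needs to read them.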

2. Process group initialization:
   - dist.init_process_group with nccl backend (GPU) or gloo (CPU)
   - Set device based on LOCAL_RANK
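A minimal sketch of step 2, assuming torchrun has already exported the usual environment variables (the helper names are illustrative):

```python
import os

import torch
import torch.distributed as dist


def pick_backend(cuda_available: bool) -> str:
    # NCCL for GPU collectives, Gloo as the CPU fallback
    return "nccl" if cuda_available else "gloo"


def setup_distributed() -> int:
    # torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK for each process;
    # the default env:// init method reads RANK/WORLD_SIZE itself
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend=pick_backend(torch.cuda.is_available()))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    return local_rank
```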

3. Model wrapping:
   - Move model to device before wrapping with DDP
   - Use find_unused_parameters=False if possible (faster)
   - Sync BatchNorm: apply nn.SyncBatchNorm.convert_sync_batchnorm if the model uses BatchNorm layers
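Step 3 might be sketched like this (wrap_model is a hypothetical helper; the device_ids handling assumes the usual one-process-per-GPU layout):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_model(model: nn.Module, local_rank: int) -> nn.Module:
    # Convert BatchNorm layers to SyncBatchNorm before wrapping
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = model.to(device)  # move to device BEFORE wrapping with DDP
    return DDP(
        model,
        device_ids=[local_rank] if torch.cuda.is_available() else None,
        find_unused_parameters=False,  # faster when every parameter gets a gradient
    )
```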

4. DataLoader modifications:
   - DistributedSampler with shuffle=True for training, shuffle=False for val
   - Call sampler.set_epoch(epoch) at the start of each epoch so shuffling differs across epochs
   - Divide the global (effective) batch size by world_size to get the per-rank batch size
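One possible shape for step 4 (make_loaders is an illustrative helper; it assumes the global batch size divides evenly by world_size):

```python
from torch.utils.data import DataLoader, DistributedSampler


def make_loaders(train_ds, val_ds, global_batch: int, world_size: int, rank: int):
    # Each rank gets a disjoint shard; per-rank batch * world_size = global batch
    per_rank_batch = global_batch // world_size
    train_sampler = DistributedSampler(
        train_ds, num_replicas=world_size, rank=rank, shuffle=True
    )
    val_sampler = DistributedSampler(
        val_ds, num_replicas=world_size, rank=rank, shuffle=False
    )
    train_loader = DataLoader(train_ds, batch_size=per_rank_batch, sampler=train_sampler)
    val_loader = DataLoader(val_ds, batch_size=per_rank_batch, sampler=val_sampler)
    return train_loader, val_loader, train_sampler
```

The training sampler is returned so the loop can call its set_epoch(epoch) at the top of every epoch.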

5. Gradient synchronization:
   - Gradients automatically synced by DDP — do not manually all_reduce
   - Use model.no_sync() context manager for gradient accumulation
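The no_sync() pattern in step 5 can be sketched like this (should_sync and backward_context are hypothetical names; accum_steps is whatever accumulation factor you choose):

```python
import contextlib


def should_sync(step: int, accum_steps: int) -> bool:
    # True on the last micro-batch of each accumulation window
    return (step + 1) % accum_steps == 0


def backward_context(ddp_model, step: int, accum_steps: int):
    # ddp_model.no_sync() suppresses DDP's gradient all-reduce for this
    # backward pass; gradients accumulate locally until the sync step
    if should_sync(step, accum_steps):
        return contextlib.nullcontext()
    return ddp_model.no_sync()
```

In the loop, wrap each backward pass in `with backward_context(ddp_model, step, accum_steps):` and call optimizer.step() and optimizer.zero_grad() only when should_sync(step, accum_steps) is true.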

6. Checkpointing:
   - Save only on rank 0; when resuming, load the checkpoint before wrapping with DDP (DDP broadcasts rank 0's weights at construction) or load on every rank with an appropriate map_location
   - Unwrap model with model.module before saving state_dict
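A sketch of step 6 (unwrap and save_checkpoint are illustrative helpers; save_checkpoint assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist


def unwrap(model):
    # DDP stores the original model under .module; pass plain models through
    return model.module if hasattr(model, "module") else model


def save_checkpoint(ddp_model, optimizer, epoch: int, path: str) -> None:
    # Only rank 0 writes to disk
    if dist.get_rank() == 0:
        torch.save(
            {
                "model": unwrap(ddp_model).state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch,
            },
            path,
        )
    dist.barrier()  # other ranks wait until the file exists
```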

7. Logging:
   - Only log on rank 0 to avoid duplicate output
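Step 7 reduces to a small guard; this sketch reads RANK from the environment so the same script still works outside torchrun (the helper names are illustrative):

```python
import os


def is_main_process() -> bool:
    # torchrun sets RANK; default to 0 so the script also runs single-process
    return int(os.environ.get("RANK", "0")) == 0


def log(msg: str) -> None:
    # Print only on rank 0 to avoid one copy of every line per process
    if is_main_process():
        print(msg)
```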

Return: full DDP training script with a torchrun launch command example.

When to use this prompt

Use case 01

When moving from single-GPU to multi-GPU PyTorch training

Use case 02

When you need a correct torchrun-based DDP template

Use case 03

When training must scale across nodes with proper samplers and rank handling

Use case 04

When checkpointing and logging should only happen on rank 0

What the AI should return

A full DDP training script with launcher assumptions, distributed samplers, checkpoint logic, and an example torchrun command.

How to use this prompt

1

Open your data context

Load your dataset, notebook, or working environment so the AI can operate on the actual project context.

2

Copy the prompt text

Use the copy button above and paste the prompt into the AI assistant or prompt input area.

3

Review the output critically

Check whether the result matches your data, assumptions, and desired format before moving on.

4

Chain into the next prompt

Once you have the first result, continue deeper with related prompts in Training Pipelines.

Frequently asked questions

What does the Distributed Training Setup prompt do?

It gives you a structured Training Pipelines starting point for ML engineer work and helps you move faster without starting from a blank page.

Who is this prompt for?

It is designed for ML engineer workflows and is marked as intermediate, so it works well as a guided starting point for that level of experience.

What type of prompt is this?

Distributed Training Setup is a single prompt. You can copy it as-is, adapt it, or use it as one step inside a larger workflow.

Can I use this outside MLJAR Studio?

Yes. The prompt text works in other AI tools too, but MLJAR Studio is the best fit when you want local execution, visible Python code, and reusable notebooks.

What should I open next?

Natural next steps from here are Custom Loss Function, Dataset Pipeline Builder, and Experiment Tracking Setup.