GPU Training Clusters

1. Multi-GPU Parallel Training Approaches

| Approach | How it works | Memory footprint | Communication volume |
|---|---|---|---|
| DataParallel (DP) | Single process; full model replica on every GPU, gradients gathered on one device | High (N full copies, extra overhead on GPU 0) | High (all gradients funneled through one GPU) |
| DistributedDataParallel (DDP) | One process per GPU; full model replica, gradients synchronized via AllReduce | High (full replica per GPU) | Moderate (AllReduce overlapped with backward) |
| DeepSpeed ZeRO | Shards optimizer states / gradients / parameters across GPUs (stages 1-3) | Low (shrinks with GPU count) | Moderate to high, depending on stage |
| FSDP (FullyShardedDataParallel) | Parameter sharding with communication/computation overlap | Lowest (comparable to ZeRO-3) | High (all-gather and reduce-scatter each step) |
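
The memory column can be made concrete with the ZeRO paper's standard accounting: mixed-precision Adam holds roughly 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of optimizer state (fp32 master weights plus Adam's m and v) per parameter, and each ZeRO stage shards one more of these across the data-parallel group. A back-of-envelope sketch (the 7B model size and 8-GPU group are illustrative, not from this section):

```python
def zero_memory_per_gpu(num_params, num_gpus, stage):
    """Approximate per-GPU training memory (GB) for mixed-precision Adam,
    following the ZeRO paper's 2 + 2 + 12 bytes-per-parameter accounting."""
    P, N = num_params, num_gpus
    params = 2 * P   # fp16 parameters
    grads = 2 * P    # fp16 gradients
    optim = 12 * P   # fp32 master weights + Adam m and v
    if stage >= 1:   # ZeRO-1 shards optimizer states
        optim /= N
    if stage >= 2:   # ZeRO-2 additionally shards gradients
        grads /= N
    if stage >= 3:   # ZeRO-3 additionally shards parameters
        params /= N
    return (params + grads + optim) / 1e9

# 7B-parameter model on 8 GPUs
for s in range(4):
    print(f"ZeRO-{s}: {zero_memory_per_gpu(7e9, 8, s):.2f} GB")
# ZeRO-0: 112.00, ZeRO-1: 38.50, ZeRO-2: 26.25, ZeRO-3: 14.00 GB per GPU
```

Activations, fragmentation, and communication buffers come on top of this, but the scaling trend matches the table above.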

2. DeepSpeed in Practice

2.1 Installation and Setup


pip install deepspeed
ds_report   # sanity check: CUDA/compiler versions and which DeepSpeed ops are available

2.2 ds_config.json


{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": {"enabled": "auto"},
  "bf16": {"enabled": "auto"},
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {"device": "cpu"},
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "gradient_clipping": 1.0,
  "steps_per_print": 10
}

Note: offload_param only takes effect under ZeRO stage 3; with "stage": 2, only optimizer states (not parameters) can be offloaded to CPU, so no offload_param block is used here.
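
The three "auto" batch fields are not independent: DeepSpeed asserts at startup that train_batch_size equals micro-batch per GPU × gradient accumulation steps × world size (with "auto", the Hugging Face Trainer integration solves for the missing value). A small sketch of that identity, with illustrative numbers:

```python
def check_batch_config(train_batch_size, micro_batch_per_gpu,
                       grad_accum_steps, world_size):
    # DeepSpeed enforces this identity when the engine initializes.
    effective = micro_batch_per_gpu * grad_accum_steps * world_size
    if train_batch_size != effective:
        raise ValueError(
            f"train_batch_size={train_batch_size} but "
            f"{micro_batch_per_gpu} x {grad_accum_steps} x {world_size} = {effective}")
    return effective

# 8 GPUs, micro-batch 4, accumulating 8 steps -> global batch 256
print(check_batch_config(256, 4, 8, 8))  # 256
```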

2.3 Launching Training


deepspeed --num_gpus=8 \
  train.py \
  --model_name_or_path deepseek-ai/DeepSeek-V3 \
  --dataset train_data.jsonl \
  --deepspeed ds_config.json \
  --output_dir outputs
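
The launcher spawns one train.py process per GPU and passes it a --local_rank argument, so the script must accept that flag alongside the ones above. A hypothetical sketch of the argument parsing (train.py itself is not part of this section):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_name_or_path")
parser.add_argument("--dataset")
parser.add_argument("--deepspeed", help="path to ds_config.json")
parser.add_argument("--output_dir")
parser.add_argument("--local_rank", type=int, default=-1)  # injected by the DeepSpeed launcher

# Simulate the flags one worker process would receive
args = parser.parse_args(
    ["--deepspeed", "ds_config.json", "--output_dir", "outputs", "--local_rank", "0"])
print(args.deepspeed, args.local_rank)  # ds_config.json 0
```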

3. NCCL Configuration (Multi-Node, Multi-GPU)


# Environment variables
export NCCL_DEBUG=INFO            # verbose NCCL logs for troubleshooting
export NCCL_IB_DISABLE=0          # keep InfiniBand enabled (1 would force TCP)
export NCCL_NET_GDR_LEVEL=PIX     # use GPUDirect RDMA when NIC and GPU share a PCIe switch
export NCCL_IB_TIMEOUT=22         # raise the InfiniBand timeout for large clusters

# Multi-node training (2 nodes x 8 GPUs); run this on node 0,
# and the same command with --node_rank=1 on the second node
deepspeed --num_gpus=8 \
  --num_nodes=2 \
  --node_rank=0 \
  --master_addr=10.0.0.1 \
  train.py \
  --deepspeed ds_config.json
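
The gradient AllReduce that NCCL runs each step moves a predictable amount of data: in a ring AllReduce, each GPU sends and receives 2(N-1)/N of the buffer size. A quick estimate (the 7B-parameter fp16 model and 8 GPUs are illustrative assumptions):

```python
def ring_allreduce_bytes(tensor_bytes, num_gpus):
    """Per-GPU traffic (sent, and the same received) for one ring AllReduce."""
    return 2 * (num_gpus - 1) / num_gpus * tensor_bytes

# Gradients of a 7B-parameter model in fp16 (2 bytes each), 8 GPUs:
gb = ring_allreduce_bytes(7e9 * 2, 8) / 1e9
print(f"{gb:.1f} GB per GPU per step")  # 24.5 GB per GPU per step
```

This is why interconnect bandwidth (NVLink within a node, InfiniBand across nodes) dominates multi-node scaling.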

4. GPU Monitoring


# Per-GPU status
nvidia-smi

# Detailed metrics (DCGM); field IDs 150/203/252 = temperature, GPU utilization, framebuffer used
dcgmi dmon -e 150,203,252

# Memory-leak check: allocated memory should return to ~0 each
# iteration; steady growth across iterations indicates a leak
python -c "import torch
for i in range(10):
    model = torch.nn.Linear(1024, 1024).cuda()
    del model
    torch.cuda.empty_cache()
    print(f'Iteration {i}: {torch.cuda.memory_allocated()/1e9:.2f} GB')"
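
When the readings do grow, distinguishing a real leak from one-off allocator warm-up is the hard part. A hypothetical helper (the threshold and majority-growth heuristic are assumptions, not a CUDA API) that classifies a series of readings like the ones printed above:

```python
def looks_like_leak(readings_gb, min_growth_gb=0.1):
    """Flag a leak when memory trends upward across iterations:
    net growth exceeds min_growth_gb and most steps are increases."""
    if len(readings_gb) < 2:
        return False
    net_growth = readings_gb[-1] - readings_gb[0]
    ups = sum(1 for a, b in zip(readings_gb, readings_gb[1:]) if b > a)
    return net_growth > min_growth_gb and ups > len(readings_gb) // 2

print(looks_like_leak([0.0] * 10))                          # False: flat, healthy
print(looks_like_leak([0.5 + 0.2 * i for i in range(10)]))  # True: steady growth
```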

5. Next Steps