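LLM Training Troubleshooting Handbook: 30 Real-World Cases and Solutions

Troubleshooting is an essential skill in large-scale model training. This article collects 30 real-world cases covering GPU memory, communication, numerical stability, and other common problems.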

I. GPU Memory Issues

Case 1: OOM Error

Symptom:

RuntimeError: CUDA out of memory. 
Tried to allocate 2.00 GiB (GPU 0; 80.00 GiB total capacity)

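Troubleshooting steps:

Check whether the batch size is too large.

Check whether the full model is loaded onto the GPU.

Check for a GPU memory leak.

Solutions: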

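python
# 1. Halve the batch size
batch_size = batch_size // 2

# 2. Use gradient accumulation to keep the effective batch size
#    (effective batch size = batch_size * accumulation_steps)
accumulation_steps = 4

# 3. Enable activation recomputation (gradient checkpointing);
#    this method is available on Hugging Face Transformers models
model.gradient_checkpointing_enable()

# 4. Use mixed precision
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()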
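Gradient accumulation is only a variable assignment in the list above, so a minimal sketch of the full loop may help. It runs on the CPU with a toy `nn.Linear` model and random micro-batches; the model, data, and learning rate are all hypothetical stand-ins, not values from a real run.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # toy stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accumulation_steps = 4
# Eight small micro-batches instead of two large batches
micro_batches = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(8)]

optimizer_steps = 0
optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(micro_batches, start=1):
    # Divide so the accumulated gradient matches the mean over the large batch
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()  # gradients add up in param.grad across micro-batches
    if step % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        optimizer_steps += 1

print(optimizer_steps)
```

Peak activation memory is set by the micro-batch size, while the optimizer sees gradients for `accumulation_steps` micro-batches at a time.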

Case 2: GPU Memory Leak

Symptom: GPU memory usage keeps growing during training.

Diagnosis:

python
# Monitor GPU memory (log at the same point in every step and compare)
import torch
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Reserved: {torch.cuda.memory_reserved()/1e9:.2f} GB")

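Common causes:

Tensors accumulated without being detached, which keeps their computation graphs alive
Loss values summed as tensors instead of through .item()
Problems with the DataLoader's pin_memory setting (it pins host memory, so check host RAM as well as GPU memory)

Solution: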

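python
# Correct approach (assumes an optimizer has been created for the model)
total_loss = 0.0
for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)  # release the previous step's gradients
    loss = model(batch)
    loss.backward()
    optimizer.step()
    # .item() returns a Python float, so no computation graph is retained
    total_loss += loss.item()

# Avoid calling torch.cuda.empty_cache() on every step: it only returns cached
# blocks to the driver, does not fix a leak, and slows training down.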
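To separate a real leak from normal allocator jitter, the logged memory readings can be checked for a sustained upward trend. The sketch below is one possible heuristic, assuming the readings (in GB) were collected from `torch.cuda.memory_allocated()` at the same point in each step; the function name and both thresholds are hypothetical and should be tuned per workload.

```python
def looks_like_leak(readings_gb, min_growth_gb=0.5, min_rising_fraction=0.8):
    """Heuristic: memory rose in most intervals and grew by a meaningful amount."""
    if len(readings_gb) < 3:
        return False  # too few samples to judge a trend
    deltas = [b - a for a, b in zip(readings_gb, readings_gb[1:])]
    rising = sum(1 for d in deltas if d > 0)
    total_growth = readings_gb[-1] - readings_gb[0]
    return (rising / len(deltas) >= min_rising_fraction
            and total_growth >= min_growth_gb)


steady = [10.2, 10.3, 10.2, 10.3, 10.2, 10.3]   # jitter around a plateau
leaking = [10.2, 10.6, 11.0, 11.5, 11.9, 12.4]  # grows on every step

print(looks_like_leak(steady))   # False
print(looks_like_leak(leaking))  # True
```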

II. Gradient Issues

Case 3: Exploding Gradients

Symptom:

Loss becomes NaN or Inf

Diagnosis:

python
# Check per-parameter gradient norms (run after loss.backward())
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        if grad_norm > 1000:
            print(f"{name}: {grad_norm}")

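Solutions:

python
# 1. Clip gradients (after backward(), before optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# 2. Lower the learning rate for every parameter group
for group in optimizer.param_groups:
    group['lr'] *= 0.1

# 3. Use gradient accumulation; a larger effective batch reduces gradient noise
accumulation_steps = 8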
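Clipping acts on the global norm across all parameters, not on each parameter separately, so it helps to log that global value before and after. The following is a small CPU sketch with a toy model and a deliberately inflated loss; `global_grad_norm` is a hypothetical helper, not a PyTorch function.

```python
import torch
import torch.nn as nn


def global_grad_norm(model: nn.Module) -> float:
    """L2 norm over all gradients, the quantity compared against max_norm."""
    norms = [p.grad.detach().norm(2) for p in model.parameters()
             if p.grad is not None]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), 2).item()


torch.manual_seed(0)
model = nn.Linear(4, 1)
loss = (model(torch.randn(16, 4)) * 1e3).pow(2).mean()  # deliberately large loss
loss.backward()

norm_before = global_grad_norm(model)
# clip_grad_norm_ returns the total norm measured before clipping
returned = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = global_grad_norm(model)

print(f"before={norm_before:.3e} after={norm_after:.3f}")
```

Logging the value returned by `clip_grad_norm_` on every step gives a cheap early warning: a norm that climbs steadily over many steps usually precedes the NaN.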

Case 4: Vanishing Gradients

Symptom: gradients are close to 0 and the model does not converge.

Diagnosis:

python
# Inspect the mean absolute gradient of each layer (run after loss.backward())
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: {param.grad.abs().mean().item()}")

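Solutions:

python
import torch.nn as nn

# 1. Use LayerNorm; it normalizes activations rather than wrapping a model
norm = nn.LayerNorm(hidden_size)  # hidden_size: the feature dimension
x = norm(x)

# 2. Residual connection
output = x + sublayer(x)

# 3. Adjust initialization (Xavier init requires tensors with at least 2 dims)
for param in model.parameters():
    if param.dim() >= 2:
        nn.init.xavier_uniform_(param)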
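The three remedies are usually combined inside one block. Below is a minimal sketch of a pre-LayerNorm feed-forward block with a residual connection and Xavier initialization; the class name, layer sizes, and depth are hypothetical choices for illustration. It stacks 24 blocks on the CPU and checks that the gradient reaching the input has not collapsed.

```python
import torch
import torch.nn as nn


class PreNormResidualBlock(nn.Module):
    """Hypothetical pre-LayerNorm feed-forward block with a residual connection."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.ff = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        for module in self.ff:
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                nn.init.zeros_(module.bias)

    def forward(self, x):
        # The identity path lets gradients bypass the sublayer
        return x + self.ff(self.norm(x))


torch.manual_seed(0)
stack = nn.Sequential(*[PreNormResidualBlock(32) for _ in range(24)])
x = torch.randn(4, 32, requires_grad=True)
stack(x).sum().backward()
input_grad_mean = x.grad.abs().mean().item()
print(f"mean |grad| at the input: {input_grad_mean:.4f}")
```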

III. Distributed Training Issues

Case 5: NCCL Timeout

Symptom:

RuntimeError: NCCL error: unhandled system error

Diagnosis:

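bash
# Check network connectivity
ping -c 4 node2
# Run the all-reduce benchmark from the nccl-tests repository
# (assumes it has been built; set -g to the number of GPUs per node)
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8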

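Solutions:

bash
# 1. Increase the timeout. NCCL_TIMEOUT does not appear to be a documented
#    NCCL variable; in PyTorch, pass timeout=timedelta(...) to
#    torch.distributed.init_process_group() instead.
#    Enable NCCL logging to find the underlying system error:
export NCCL_DEBUG=INFO

# 2. Disable InfiniBand (for diagnosis only; NCCL falls back to slower sockets)
export NCCL_IB_DISABLE=1

# 3. Specify the network interface
export NCCL_SOCKET_IFNAME=eth0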
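On the PyTorch side, the collective timeout is normally set through the `timeout` argument of `torch.distributed.init_process_group`. A minimal sketch follows; the 60-minute value and the `init_distributed` wrapper are hypothetical, and the function assumes a launcher such as `torchrun` has already exported the rendezvous variables.

```python
from datetime import timedelta

import torch.distributed as dist

# Hypothetical value: long enough for a slow checkpoint or evaluation on one rank
COLLECTIVE_TIMEOUT = timedelta(minutes=60)


def init_distributed(backend: str = "nccl") -> None:
    """Initialize the default process group with an explicit collective timeout.

    Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are set in the
    environment, as torchrun does.
    """
    dist.init_process_group(backend=backend, timeout=COLLECTIVE_TIMEOUT)
```

A longer timeout only hides the symptom when one rank has genuinely hung, so it is best paired with the connectivity checks above.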

Case 6: Gradients Out of Sync

Symptom: gradients differ across GPUs.

Diagnosis:

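python
# Check gradient synchronization (call on every rank after loss.backward())
import torch
import torch.distributed as dist

def check_gradient_sync(model):
    for name, param in model.named_parameters():
        if param.grad is None:
            # Assumes the same parameters lack gradients on every rank;
            # otherwise the all_reduce calls below would not line up.
            continue
        # Average the gradient across ranks and compare with the local copy
        grad = param.grad.clone()
        dist.all_reduce(grad)  # the default reduction is SUM
        grad /= dist.get_world_size()

        if not torch.allclose(param.grad, grad):
            print(f"Gradient mismatch: {name}")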

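Solution:

python
# Make sure the model is wrapped in DDP
from torch.nn.parallel import DistributedDataParallel

model = DistributedDataParallel(
    model,
    device_ids=[local_rank],
    # False assumes every parameter receives a gradient on each step;
    # set it to True if some parameters are unused in the forward pass
    find_unused_parameters=False
)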

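IV. Numerical Stability Issues

Case 7: Loss Is NaN

Symptom: the loss becomes NaN after a few training steps.

Diagnostic order:

Check whether the input data contains NaN.

Check whether the learning rate is too high.

Check for division by zero.

Solutions: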

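python
# 1. Check the input
assert not torch.isnan(input).any(), "Input contains NaN"

# 2. Lower the learning rate
lr = 1e-5

# 3. Add a numerical stability term (1e-8 underflows to 0 in FP16,
#    so compute this in FP32 or apply log_softmax to the logits instead)
loss = -torch.log(probs + 1e-8)

# 4. Use loss scaling with mixed precision
scaler = GradScaler()
with autocast():
    loss = model(input)
scaler.scale(loss).backward()
scaler.step(optimizer)  # skips the update if gradients contain Inf/NaN
scaler.update()         # adjusts the scale for the next step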

Case 8: Precision Overflow

Symptom: Inf values appear during FP16 training.

Solutions:

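python
# 1. Adjust loss scaling (the default init_scale is 2**16)
scaler = GradScaler(init_scale=2.**15, growth_factor=2.0)

# 2. Keep critical layers in FP32
model.layer_norm.to(torch.float32)

# 3. Disable mixed precision for specific layers
with torch.cuda.amp.autocast(enabled=False):
    # Cast the input to FP32; it may be FP16 from the enclosing autocast region
    output = sensitive_layer(input.float())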
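The trade-off behind the loss scale can be shown without a GPU. The sketch below uses the standard library's half-precision `struct` format to round values to FP16; `to_fp16` and `overflows` are hypothetical helpers, and the gradient values are made up for illustration. A tiny gradient underflows to zero unless it is scaled up, while a scale that is too large pushes ordinary gradients past the FP16 maximum of 65504.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def overflows(value: float, scale: float) -> bool:
    """True if value * scale exceeds the finite FP16 range."""
    return abs(value * scale) > FP16_MAX


small_grad = 1e-8
loss_scale = 2.0 ** 15

unscaled = to_fp16(small_grad)              # below the FP16 range: becomes 0.0
scaled = to_fp16(small_grad * loss_scale)   # representable after scaling
recovered = scaled / loss_scale             # unscale in higher precision

print(unscaled, recovered)
print(overflows(1.0, loss_scale), overflows(4.0, loss_scale))
```

With a scale of 2**15, any gradient above roughly 2.0 overflows, which is why a dynamic scaler lowers the scale when it sees Inf and why lowering `init_scale` can shorten the initial run of skipped steps.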

V. Performance Issues

Case 9: Low GPU Utilization

Symptom: GPU utilization is below 50%.

Diagnosis:

bash
# Monitor with nvidia-smi, refreshing every second
watch -n 1 nvidia-smi

# Use nvtop to view utilization history
nvtop

Common causes:

Slow data loading on the CPU
Long waits on communication
Batch size too small
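To tell which of these causes applies, one option is to time how long each step waits for data versus how long it spends computing. This is a minimal sketch using only the standard library; `time_data_vs_compute` and the toy loader are hypothetical. On a GPU, `train_step` should end with `torch.cuda.synchronize()`, because CUDA kernels launch asynchronously and the compute time would otherwise be underestimated.

```python
import time


def time_data_vs_compute(batches, train_step, max_steps=50):
    """Return (seconds waiting for data, seconds inside train_step)."""
    data_time = compute_time = 0.0
    iterator = iter(batches)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # blocks while the loader prepares a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    return data_time, compute_time


def slow_loader(n):
    """Toy stand-in for a DataLoader with slow CPU preprocessing."""
    for i in range(n):
        time.sleep(0.01)
        yield i


data_s, compute_s = time_data_vs_compute(slow_loader(5), lambda batch: None)
print(f"data: {data_s:.3f}s, compute: {compute_s:.3f}s")
```

If the data time dominates, the DataLoader settings are the first thing to tune; if compute dominates but utilization is still low, communication waits or a small batch size are more likely.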

Solutions:

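python
from torch.utils.data import DataLoader

# 1. Tune the DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,  # load data in multiple worker processes
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4  # batches prefetched per worker; requires num_workers > 0
)

# 2. Increase the batch size
batch_size = 128

# 3. If memory limits the batch size, use gradient accumulation
accumulation_steps = 4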

Case 10: Slow Training

Symptom: each training step takes far longer than expected.

Diagnosis:

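python
import torch

# Use the PyTorch Profiler; record CPU activity as well so that
# data loading and kernel-launch overhead show up in the table
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True
) as prof:
    for _ in range(10):
        train_step()

print(prof.key_averages().table(sort_by="cuda_time_total"))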

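Optimization directions:

Fuse operators (for example with torch.compile)

Use more efficient implementations (for example fused attention kernels)

Remove unnecessary host-device synchronization (for example .item() or .cpu() calls inside the training loop)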

VI. Data Issues

Case 11: Imbalanced Data Distribution

Symptom: accuracy is low for some classes.

Diagnosis:

python
# Inspect the class distribution
from collections import Counter
label_counts = Counter(labels)
print(label_counts)

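Solutions:

python
# 1. Resampling
from torch.utils.data import WeightedRandomSampler

# Weight each sample by the inverse frequency of its class
weights = [1.0 / label_counts[label] for label in labels]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

# 2. Class weights (class_weights: 1-D float tensor with one entry per class)
criterion = nn.CrossEntropyLoss(weight=class_weights)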

Case 12: Slow Data Preprocessing

Symptom: data loading becomes the bottleneck.

Solutions:

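python
from functools import lru_cache
from torch.utils.data import DataLoader

# 1. Use multiple worker processes
dataloader = DataLoader(
    dataset,
    num_workers=8,
    prefetch_factor=4
)

# 2. Cache preprocessing results (each DataLoader worker keeps its own cache)
@lru_cache(maxsize=1000)
def preprocess(text):
    # Preprocessing logic
    pass

# 3. Use a faster library (albumentations targets image augmentation)
import albumentations as A
transform = A.Compose([...])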
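Whether caching pays off depends on how often inputs repeat, which `cache_info()` reports directly. The sketch below fills in a hypothetical preprocessing body (lowercasing and whitespace splitting) purely to make the example runnable.

```python
from functools import lru_cache


@lru_cache(maxsize=1000)
def preprocess(text: str) -> tuple:
    # Hypothetical preprocessing: lowercase and split on whitespace.
    # Returning a tuple keeps the cached value immutable.
    return tuple(text.lower().split())


samples = ["Hello World", "foo bar", "Hello World", "Hello World"]
tokens = [preprocess(s) for s in samples]

info = preprocess.cache_info()
print(info.hits, info.misses)  # 2 2
```

A hit rate near zero means the cache only adds memory overhead, in which case preprocessing the dataset once offline is usually the better fix.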

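VII. Checklists

Pre-training checks

[ ] GPU driver and CUDA versions are compatible
[ ] GPU memory is sufficient (model + gradients + optimizer states)
[ ] Data has been preprocessed and validated
[ ] Learning rate scheduler is configured correctly
[ ] Checkpoint directory is writable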
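The memory item in this checklist can be estimated before launching. The sketch below assumes mixed-precision training with Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer state (FP32 master weights plus two FP32 moment estimates). These byte counts are assumptions about the setup; activations and temporary buffers are excluded, so the result is a lower bound.

```python
def estimate_training_memory_gb(num_params: float,
                                weight_bytes: int = 2,
                                grad_bytes: int = 2,
                                optimizer_bytes: int = 12) -> float:
    """Rough lower bound in GB for weights + gradients + optimizer states."""
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9


# A 7-billion-parameter model under the assumptions above
print(estimate_training_memory_gb(7e9))  # 112.0
```

Under these assumptions a 7B model already exceeds a single 80 GB GPU before any activations are stored, which is the usual point at which optimizer-state sharding or offloading becomes necessary.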

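Monitoring during training

[ ] Loss curve is decreasing as expected
[ ] GPU utilization is above 70%
[ ] GPU memory usage is stable
[ ] Gradient norm stays within a reasonable range
[ ] No NaN/Inf values appear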

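Troubleshooting workflow

Reproduce the problem: determine whether it can be reproduced reliably.

Narrow the scope: reproduce the problem with the smallest possible code.

Read the logs: check the error messages and stack traces.

Compare against a baseline: look for differences from a healthy training run.

Check causes one by one: work through the likely causes in priority order.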

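VIII. Recommended Tools

Tool	Purpose	Source
nvidia-smi	GPU monitoring	NVIDIA (official)
nvtop	GPU monitoring with history	Open source
PyTorch Profiler	Performance profiling	Built into PyTorch
Weights & Biases	Experiment tracking	wandb.ai
DeepSpeed	Training optimization	Open source (Microsoft)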

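Key takeaway: troubleshooting requires a systematic method and solid experience. Building checklists and a monitoring setup can greatly reduce the time spent on diagnosis.

_This handbook is updated continuously; additional cases are welcome._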