Tips and caveats for using torch.utils.checkpoint

Published: 2024-01-05 01:15:16

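torch.utils.checkpoint is a PyTorch utility for reducing GPU memory usage (the technique is often called activation checkpointing or gradient checkpointing). In a large model, the forward pass produces many intermediate activations that are normally kept until the backward pass, so memory usage can become very large. checkpoint does not store the intermediate activations of the part of the forward pass it wraps. It keeps only that part's inputs and recomputes the activations during the backward pass, trading extra computation for lower memory usage.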
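
The trade described above does not change the result of training: a checkpointed function should produce the same gradients as the plain one. The following is a minimal sketch of that check, assuming a PyTorch version whose checkpoint accepts the use_reentrant argument. The function name block and the tensor sizes are made up for illustration.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
w1 = torch.randn(8, 8, requires_grad=True)
w2 = torch.randn(8, 8, requires_grad=True)

def block(x):
    # The tanh result is an intermediate activation needed by backward
    return torch.tanh(x @ w1) @ w2

x = torch.randn(4, 8)

# Plain forward/backward: intermediates are stored until backward
block(x).sum().backward()
plain_grads = (w1.grad.clone(), w2.grad.clone())
w1.grad = None
w2.grad = None

# Checkpointed forward/backward: intermediates are recomputed in backward
checkpoint(block, x, use_reentrant=False).sum().backward()
ckpt_grads = (w1.grad.clone(), w2.grad.clone())

grads_match = all(torch.allclose(a, b) for a, b in zip(plain_grads, ckpt_grads))
print("gradients match:", grads_match)
```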

Usage tips and caveats for checkpoint:

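1. Call torch.utils.checkpoint.checkpoint inside forward: checkpoint is an ordinary function, not a decorator. Pass it the function (or module) to run, followed by that function's arguments, and it returns the same outputs the function would. For example: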

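import torch
import torch.utils.checkpoint as cp

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder layers so the example runs; the sizes are arbitrary
        self.layer1 = torch.nn.Linear(16, 16)
        self.layer2 = torch.nn.Linear(16, 16)
        self.layer3 = torch.nn.Linear(16, 16)

    def _forward_impl(self, x):
        # The ordinary forward computation
        x = x + 1
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        return x

    def forward(self, x):
        # Plain forward pass: all intermediate activations are kept for backward
        return self._forward_impl(x)

    def forward_with_checkpoint(self, x):
        # Checkpointed forward pass: activations inside _forward_impl are
        # dropped after forward and recomputed during backward.
        # use_reentrant=False selects the non-reentrant implementation
        # (needs a PyTorch version that accepts this argument).
        return cp.checkpoint(self._forward_impl, x, use_reentrant=False)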
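
In practice checkpointing is often made switchable, because it only helps when gradients are being computed. The sketch below shows one way to do that. The class name Block and the use_checkpoint flag are hypothetical names chosen for this illustration, not part of PyTorch, and use_reentrant=False assumes a PyTorch version that accepts it.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=16, use_checkpoint=True):
        super().__init__()
        self.use_checkpoint = use_checkpoint  # hypothetical switch
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        # Checkpoint only when training with autograd enabled;
        # during inference there is nothing to recompute.
        if self.use_checkpoint and self.training and torch.is_grad_enabled():
            return checkpoint(self.body, x, use_reentrant=False)
        return self.body(x)

torch.manual_seed(0)
model = Block()
model.train()
x = torch.randn(2, 16)

out_ckpt = model(x)             # checkpointed path
model.use_checkpoint = False
out_plain = model(x)            # plain path
outputs_match = torch.allclose(out_ckpt, out_plain)

model.use_checkpoint = True
model(x).sum().backward()       # backward through the checkpointed path
has_grads = all(p.grad is not None for p in model.parameters())
print(outputs_match, has_grads)
```

Keeping the switch on the module makes it easy to compare speed and memory with and without checkpointing on the same model.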

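2. Return the intermediate values you need: inside a checkpointed function, only the inputs and the returned outputs are kept. Every other intermediate tensor is freed after the forward pass and recomputed during backward. If a later part of the model needs an intermediate value, return it from the checkpointed function as an extra output. For example: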

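import torch
import torch.utils.checkpoint as cp

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder conv layers so the example runs; the sizes are arbitrary
        self.layer1 = torch.nn.Conv2d(3, 8, 3, padding=1)
        self.layer2 = torch.nn.Conv2d(8, 8, 3, padding=1)
        self.layer3 = torch.nn.Conv2d(8, 8, 3, padding=1)
        self.avgpool = torch.nn.AdaptiveAvgPool2d(1)

    def _forward_impl(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        # Intermediate value needed later, so it is returned as an output
        feats = self.avgpool(x)
        x = self.layer3(x)
        return feats, x

    def forward(self, x):
        return self._forward_impl(x)

    def forward_with_checkpoint(self, x):
        # Both returned tensors stay available; everything else inside
        # _forward_impl is recomputed during backward.
        return cp.checkpoint(self._forward_impl, x, use_reentrant=False)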
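
To see that an extra returned value behaves like any other output, the sketch below checkpoints a function that returns a tuple and uses both elements in the loss. The names stage, conv1, conv2 and the tensor sizes are assumptions for this illustration only.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 4, 3, padding=1)
conv2 = nn.Conv2d(4, 4, 3, padding=1)
pool = nn.AdaptiveAvgPool2d(1)

def stage(x):
    h = torch.relu(conv1(x))   # intermediate, recomputed when checkpointed
    feats = pool(h)            # intermediate value we want to keep
    out = conv2(h)
    return feats, out          # returned, so both stay available

x = torch.randn(2, 3, 8, 8)

def conv1_grads(use_ckpt):
    for p in list(conv1.parameters()) + list(conv2.parameters()):
        p.grad = None
    if use_ckpt:
        feats, out = checkpoint(stage, x, use_reentrant=False)
    else:
        feats, out = stage(x)
    # The loss depends on both outputs
    (feats.sum() + out.mean()).backward()
    return tuple(feats.shape), [p.grad.clone() for p in conv1.parameters()]

plain_shape, plain = conv1_grads(False)
ckpt_shape, ckpt = conv1_grads(True)
same = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(plain, ckpt))
print(ckpt_shape, same)
```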

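3. Use torch.utils.checkpoint.checkpoint_sequential for sequential models: when the model is a torch.nn.Sequential (or a list of modules run in order), checkpoint_sequential(functions, segments, input) splits the layers into the given number of segments and checkpoints every segment except the last. More segments means less memory and more recomputation. Like checkpoint, it is called as a function rather than used as a decorator. For example: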

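import torch
import torch.utils.checkpoint as cp

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder layers so the example runs; the sizes are arbitrary
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(16, 16),
            torch.nn.Linear(16, 16),
            torch.nn.Linear(16, 16),
        )

    def forward(self, x):
        x = self.layers(x)
        return x

    def forward_with_checkpoint(self, x):
        # Split self.layers into 2 segments; all but the last are checkpointed.
        # The segment count is a tunable trade-off, not a required value.
        x = cp.checkpoint_sequential(self.layers, 2, x, use_reentrant=False)
        return x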
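
As a sanity check, a run through checkpoint_sequential should give the same gradients as running the Sequential directly. This is a minimal sketch, assuming a recent PyTorch release in which checkpoint_sequential accepts use_reentrant. The layer stack and the helper first_layer_grad are invented for the example.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

torch.manual_seed(0)
# Four Linear+ReLU pairs, i.e. 8 modules run in order
layers = nn.Sequential(*[m for _ in range(4) for m in (nn.Linear(16, 16), nn.ReLU())])
x = torch.randn(8, 16)

def first_layer_grad(run):
    # Clear old gradients, run forward/backward, return the first layer's gradient
    layers.zero_grad(set_to_none=True)
    run(x).sum().backward()
    return layers[0].weight.grad.clone()

plain = first_layer_grad(layers)
ckpt = first_layer_grad(
    lambda inp: checkpoint_sequential(layers, 4, inp, use_reentrant=False)
)
seq_grads_match = torch.allclose(plain, ckpt)
print("gradients match:", seq_grads_match)
```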

Caveats:

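1. Checkpointing increases run time: the checkpointed parts of the forward pass are executed a second time during backward, so every training step does extra computation. On a small model the memory saved is usually too small to justify that cost, so decide case by case whether to use checkpoint.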

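2. Take care with data-parallel wrappers: checkpoint changes how the backward pass is executed, which can interact badly with wrappers such as DataParallel and DistributedDataParallel, particularly with the older reentrant implementation (use_reentrant=True). Test the combination on your PyTorch version before relying on it, and prefer use_reentrant=False where it is available.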

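3. No extra library is needed: torch.utils.checkpoint ships with PyTorch itself, so importing torch.utils.checkpoint works on a standard PyTorch installation. NVIDIA Apex is a separate project and does not need to be installed to use checkpoint.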
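
The extra run time comes from recomputation, and that can be observed directly by counting how often the wrapped function runs. The sketch below is illustrative only: the counter dictionary and the function block are made-up names, and use_reentrant=False assumes a PyTorch version that accepts it.

```python
import torch
from torch.utils.checkpoint import checkpoint

counter = {"calls": 0}
w = torch.randn(8, 8, requires_grad=True)

def block(x):
    counter["calls"] += 1        # count every execution of the forward code
    return torch.tanh(x @ w)

x = torch.randn(4, 8)

out = checkpoint(block, x, use_reentrant=False)
calls_after_forward = counter["calls"]    # the forward pass itself

out.sum().backward()
calls_after_backward = counter["calls"]   # includes the recomputation in backward

print(calls_after_forward, calls_after_backward)
```

The count rises from one call after the forward pass to two after backward: the second call is the recomputation that replaces the stored activations.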

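In summary, torch.utils.checkpoint is a tool for reducing GPU memory usage. On large models it can cut memory use substantially, which makes it possible to train on devices with less memory. The price is longer run time because of recomputation, and combinations with data-parallel wrappers should be tested before use.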