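Recording model training traces with torch.profiler

Use torch.profiler to record a model training trace and analyze it visually in TensorBoard. The workflow: import the required libraries, prepare the model and dataset, configure the profiler, generate the JSON trace files, and visualize them in TensorBoard.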

Steps

  1. Prepare the data and model
  2. Use profiler to record execution events
  3. Run the profiler
  4. Use TensorBoard to view results and analyze model performance
  5. Improve performance with the help of profiler
  6. Analyze performance with other advanced features
  7. Additional Practices: Profiling PyTorch on AMD GPUs

1. Prepare the data and model

Import the required libraries:

import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

Prepare the dataset:

transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
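
As a quick check on the Normalize step: it computes (value - mean) / std per channel, so with mean 0.5 and std 0.5 the [0, 1] range produced by ToTensor maps to [-1, 1]. The following is a minimal pure-Python sketch of that arithmetic only; the helper name `normalize` is hypothetical and does not use torchvision.

```python
def normalize(value, mean=0.5, std=0.5):
    # Same per-channel formula that T.Normalize applies: (value - mean) / std
    return (value - mean) / std


# Darkest, middle, and brightest pixel values after ToTensor
low, mid, high = normalize(0.0), normalize(0.5), normalize(1.0)
print(low, mid, high)
```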

Define the model:

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

Define the training step:

def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

2. Use profiler to record execution events

Some useful parameters are as follows:

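schedule: for example, with wait=1, warmup=1, active=3, repeat=1, the profiler skips the first step/iteration, warms up on the second, and records the following three. In total, the cycle repeats once. Each cycle is called a "span" in the TensorBoard plugin.

During the wait phase the profiler is inactive. During the warmup phase the profiler starts tracing but discards the results. The warmup exists because the overhead at the start of profiling is high and would otherwise skew the recorded results.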
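
To make the phases concrete, here is a small pure-Python model of the schedule described above. It is a simplified illustration, not the torch.profiler implementation; the function name `phase_at` and the "done" label are made up for this sketch.

```python
def phase_at(step, wait=1, warmup=1, active=3, repeat=1):
    """Return the phase for a 0-based step (simplified model, not the real profiler)."""
    cycle = wait + warmup + active
    # With repeat > 0, the profiler goes idle once all cycles have finished.
    if repeat > 0 and step >= cycle * repeat:
        return "done"
    pos = step % cycle
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"


phases = [phase_at(s) for s in range(6)]
print(phases)
# ['wait', 'warmup', 'active', 'active', 'active', 'done']
```

Under this model, one cycle needs wait + warmup + active = 5 steps, which matches the `1 + 1 + 3` bound used in the profiling loops.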

on_trace_ready: called at the end of each cycle. For example, torch.profiler.tensorboard_trace_handler generates the result files used by TensorBoard. After profiling, the result files are stored in ./log/resnet18.
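
on_trace_ready accepts any callable that takes the profiler object, so a custom handler can be used in place of the TensorBoard one. The following is a minimal sketch, assuming a CPU-only run and a temporary output directory; the handler name `print_and_export` and the small matmul workload are illustrative and not part of the original example.

```python
import os
import tempfile

import torch
import torch.profiler

trace_dir = tempfile.mkdtemp()  # assumed output location for this sketch
trace_paths = []


def print_and_export(prof):
    # Invoked by the profiler when a cycle's active steps have finished.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
    path = os.path.join(trace_dir, f"trace_step{prof.step_num}.json")
    prof.export_chrome_trace(path)
    trace_paths.append(path)


with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=print_and_export,
) as prof:
    for _ in range(4):
        torch.matmul(torch.randn(64, 64), torch.randn(64, 64))
        prof.step()  # mark the step boundary

print(trace_paths)
```

The exported JSON file is a Chrome trace and can be opened in a trace viewer instead of TensorBoard.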

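record_shapes: whether to record the shapes of the input tensors.

profile_memory: tracks tensor memory allocation and deallocation.

with_stack: records source information (file and line number) for the operators. If TensorBoard is launched inside VS Code, clicking a stack frame jumps to the specific line.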

https://code.visualstudio.com/docs/datascience/pytorch-support#_tensorboard-integration
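
The effect of record_shapes can also be inspected without TensorBoard by printing the aggregated operator table. The following is a minimal sketch, assuming a CPU-only run with a small illustrative matmul instead of the ResNet model; the tensor sizes are arbitrary.

```python
import torch
import torch.profiler

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    record_shapes=True,    # keep the operator input shapes
    profile_memory=True,   # track tensor allocations and frees
) as prof:
    x = torch.randn(128, 64)
    w = torch.randn(64, 32)
    y = x @ w

# Group identical operators by their input shapes and show the slowest ones.
shape_table = prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=5
)
print(shape_table)
```

No schedule is passed here, so the profiler records everything inside the `with` block.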

Start/stop with a context manager:

with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        prof.step()  # Need to call this at each step to notify profiler of steps' boundary.
        if step >= 1 + 1 + 3:
            break
        train(batch_data)

Alternatively, start/stop without a context manager:

prof = torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        with_stack=True)
prof.start()
for step, batch_data in enumerate(train_loader):
    prof.step()
    if step >= 1 + 1 + 3:
        break
    train(batch_data)
prof.stop()

3. Run the profiler

Run the code above. The profiling results are saved under the ./log/resnet18 directory.
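
Once the profiling loop has run, it can help to confirm that trace files were actually written before starting TensorBoard. The following is a small sketch, assuming the trace handler names its output files with a `.pt.trace.json` suffix (optionally gzipped); the helper `find_trace_files` is hypothetical and not part of torch.profiler.

```python
import os


def find_trace_files(logdir):
    """Return sorted paths of profiler trace files found under logdir."""
    # Assumed naming convention for TensorBoard profiler traces.
    suffixes = (".pt.trace.json", ".pt.trace.json.gz")
    found = []
    for root, _dirs, files in os.walk(logdir):
        for name in files:
            if name.endswith(suffixes):
                found.append(os.path.join(root, name))
    return sorted(found)


# An empty list means nothing was recorded (or the directory does not exist).
traces = find_trace_files("./log")
print(traces)
```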

4. Use TensorBoard to view results and analyze model performance

Install the PyTorch Profiler TensorBoard plugin:

pip install torch_tb_profiler

Launch TensorBoard:

tensorboard --logdir=./log

Open the TensorBoard profile URL in a browser:

http://localhost:6006/#pytorch_profiler