Train a Model Faster with torch.compile and Gradient Accumulation

Training a language model with a deep transformer architecture is time-consuming. However, there are techniques you can use to accelerate training. In this article, you will learn about:

Using torch.compile() to speed up the model
Using gradient accumulation to train a model with a larger effective batch size

Let’s get started!

Train a Model Faster with torch.compile and Gradient Accumulation
Photo by François Genon. Some rights reserved.

Overview

This article is divided into two parts; they are:

Using torch.compile()
Gradient Accumulation

Using torch.compile

When you write your model code and run it with PyTorch, the code is executed in eager mode. This means the code is executed line by line, and the results are stored in memory. This is native to Python since it is an interpreted language. You know this is the case because when you make a mistake in your code, you will not see the error until you run that line of code.

Running a model in eager mode is slow. Starting with PyTorch 2.0, you can use torch.compile() to compile a model for improved performance. This generates a new model object that is optimized. It is not the same model object you created using nn.Module, but it shares the same tensors with the original model. You can use this compiled model for forward pass, backward pass, and optimizer updates as usual.

Building a model and compiling it as a computation graph is how TensorFlow 1.0 was supposed to work. This makes debugging harder, since the model you execute cannot match line by line with the code you wrote. Therefore, you should not compile your model until you have run a trial and confirmed that it is error-free.

Not all models can be compiled. However, if your model supports compilation, you immediately benefit from the speedup. To compile a model, all you need to do is replace the model object right before you are ready to use it:

… model = LlamaForPretraining(model_config).to(device) model.load_state_dict(checkpoint) model = torch.compile(model) …

...

model = LlamaForPretraining(model_config).to(device)

model.load_state_dict(checkpoint)

model = torch.compile(model)

...

Do not load the model weights after compilation. This is because the compiled model is an object that shares the same weights as the original model. During compilation, the computation graph is built referencing the weight tensors of the original model. If you load the weights after compilation, the model may not work as expected.

Similarly, to save the compiled model, you should refer to the original model’s state dict, as follows:

torch.save(getattr(model, “_orig_mod”, model).state_dict(), “model.pth”)

torch.save(getattr(model, “_orig_mod”, model).state_dict(), “model.pth”)

The original model can be accessed from the compiled model using model._orig_mod. In the code above, we use getattr(model, "_orig_mod", model) to get the original model if it exists, or use model itself if it does not. This line of code works for both compiled and original models.

Gradient Accumulation

When you train a model, you likely spend two to three times more time on the backward pass than the forward pass. This is because the backward pass is more computationally intensive and uses more memory.

One easy trick to speed up training is to perform fewer backward passes. This can be achieved by increasing the batch size: with the same number of data samples, a larger batch size means fewer batches to process.

However, a larger batch size requires more memory. In a memory-constrained environment, you can mimic a larger batch size by running multiple forward passes and accumulating the gradients. This is called gradient accumulation.

It is easier to explain this idea with code:

.. accumulate_steps = 4 for epoch in range(num_epochs): optimizer.zero_grad() for i, batch in enumerate(dataloader): # get batched data input_ids, target_ids = batch # create attention mask: causal mask + padding mask attn_mask = create_causal_mask(input_ids.shape[1], device) + \ create_padding_mask(input_ids, PAD_TOKEN_ID, device) # extract output from model logits = model(input_ids, attn_mask) # compute loss: cross-entropy between logits and target, ignoring padding tokens loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1)) loss = loss / accumulate_steps # Run backward, but update only once every `accumulate_steps` steps loss.backward() if (i + 1) % accumulate_steps == 0: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) optimizer.step() optimizer.zero_grad() scheduler.step()

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

..

accumulate_steps = 4

for epoch in range(num_epochs):

optimizer.zero_grad()

for i, batch in enumerate(dataloader):

# get batched data

input_ids, target_ids = batch

# create attention mask: causal mask + padding mask

attn_mask = create_causal_mask(input_ids.shape[1], device) + \

create_padding_mask(input_ids, PAD_TOKEN_ID, device)

# extract output from model

logits = model(input_ids, attn_mask)

# compute loss: cross-entropy between logits and target, ignoring padding tokens

loss = loss_fn(logits.view(–1, logits.size(–1)), target_ids.view(–1))

loss = loss / accumulate_steps

# Run backward, but update only once every `accumulate_steps` steps

loss.backward()

if (i + 1) % accumulate_steps == 0:

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

optimizer.step()

optimizer.zero_grad()

scheduler.step()

The training loop above is an excerpt from the previous article for training a Llama model on your local GPU.

Normally, when you run a forward pass, you calculate the loss. Then you call loss.backward() to backpropagate the loss gradient through the model parameters. In PyTorch, the backward() method is cumulative, meaning gradients are added up. Therefore, you need to call optimizer.zero_grad() explicitly to clear the gradients before running the backward pass.

In the code above, you deliberately do not call optimizer.zero_grad() in every iteration. Instead, you run backpropagation for the loss divided by accumulate_steps. This way, the gradients are scaled down but accumulated over accumulate_steps iterations. Once every accumulate_steps iterations, you run the optimizer to adjust the model parameters.

This approach yields results comparable to using a larger batch size. However, since you run fewer optimizer updates, the learning rate schedule should be adjusted accordingly. This means you need to initialize the scheduler with a different number of steps:

… num_training_steps = (len(dataloader) // accumulate_steps) * num_epochs cosine_scheduler = lr_scheduler.CosineAnnealingLR( optimizer, T_max=num_training_steps – num_warmup_steps, eta_min=0 )

...

num_training_steps = (len(dataloader) // accumulate_steps) * num_epochs

cosine_scheduler = lr_scheduler.CosineAnnealingLR(

optimizer,

T_max=num_training_steps – num_warmup_steps,

eta_min=0

)

Summary

In this article, you learned that using torch.compile() can help you speed up the model by compiling the computation graph. You also learned that gradient accumulation is a technique for training with a larger effective batch size by accumulating gradients from multiple mini-batches. Since you run fewer optimizer updates this way, you save time on backward passes and parameter updates.

Source link