The best way to scale training on multiple GPUs

Muthukumaraswamy
Published in Searce
May 6, 2022


Using multiple GPUs to train a PyTorch model
Modern Deep Learning models are too large, and take too long, to train on a single GPU. This is one of the biggest problems practitioners face: to train models quickly, multiple GPUs must be used.

We need to scale training methods to make use of hundreds or even thousands of GPUs. By using more than 100 GPUs, researchers have reduced ImageNet training time from 2 weeks to 18 minutes, and trained the largest Transformer-XL models in 2 weeks instead of an estimated 4 years.

Training iteration speed is very important to us, so we have scaled our training across multiple GPUs to increase it. In this blog post, I will explain how to scale up training with PyTorch. We have also built models in TensorFlow and scaled that training using Horovod, a tool created by Uber Engineering; if you go down that route, I would recommend using their Docker image.

In my opinion, PyTorch strikes an optimal balance between control and ease of use, without sacrificing performance. PyTorch provides two ways to implement distributed training on multiple GPUs: nn.DataParallel and nn.DistributedDataParallel. Both are easy ways of wrapping your model so that, with small code changes, the network can be trained on multiple GPUs.

nn.DataParallel is the simpler of the two to use, but it is limited to a single machine and is generally slower: in each batch, it runs a single process that computes the model weights and then distributes them to each GPU.

We will look at nn.DataParallel and nn.DistributedDataParallel in more detail in this blog post. I will discuss the main differences between these two types of training, as well as how training in multiple GPUs works. I will begin by explaining how a neural network is trained.

Training loop

We’ll start by going over how neural networks are usually trained. Each iteration of the training loop consists of four main steps:

  1. The forward pass, in which the neural network processes the input.
  2. The loss is calculated by comparing the predicted output to the ground-truth label.
  3. A backward pass, which uses the loss to calculate the gradient of each parameter.
  4. The parameters are then updated using the gradients.

For batch sizes greater than one, we may also want to apply batch normalization during training.
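To make these four steps concrete, here is a minimal sketch of one training iteration in PyTorch (the tiny linear model and random data are just stand-ins):

```python
import torch
import torch.nn as nn

# Toy setup: a small linear model on random data, just to make the loop runnable.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(32, 10)          # one batch of 32 examples
labels = torch.randint(0, 2, (32,))   # ground-truth labels

# 1. Forward pass: the network processes the input
outputs = model(inputs)
# 2. Compute the loss by comparing predictions to the ground truth
loss = criterion(outputs, labels)
# 3. Backward pass: compute the gradient of every parameter
optimizer.zero_grad()
loss.backward()
# 4. Update the parameters using the gradients
optimizer.step()
```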

DataParallel

DataParallel distributes training across multiple GPUs on a single machine. Let’s take a look at how it works. Whenever a neural network is trained with DataParallel, the following steps occur:

  1. The mini-batch is split on GPU:0.
  2. The splits are distributed and moved to all GPUs.
  3. The model is copied to all GPUs.
  4. The forward pass runs on each GPU.
  5. The loss is computed with respect to the network outputs on GPU:0 and scattered back to the GPUs; each GPU then calculates its gradients.
  6. The gradients are summed on GPU:0, and the optimizer updates the model on GPU:0.

A simple example

Let’s code this up. First, we import everything we need.

For classifying MNIST digits, we define a simple convolutional model.

Lines 4–14: We define the layers of the neural network.

Lines 16–21: We define the forward pass.
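As a rough sketch (the exact layer sizes are assumptions, not necessarily the ones in the original snippet), such a convolutional model could look like this:

```python
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        # Two convolutional blocks followed by a fully connected classifier
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7 * 7 * 32, num_classes)

    def forward(self, x):
        # Forward pass: conv blocks, flatten, then classify
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        return self.fc(out)
```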

The main() function will take in some arguments and run the training function:

Lines 2–6: We instantiate the model and wrap it with DataParallel, so that each batch is split and processed in parallel across multiple GPUs.
Lines 9–23: Here, we define the loss function (criterion) and the optimizer (in this case, SGD), as well as the training dataset (MNIST) and its data loader.
Lines 24–45: This is the training loop for the neural network. We load the inputs and the expected outputs, run the forward and backward passes, and step the optimizer.
I know there is some extra machinery in here (the number of GPUs and nodes, for example) that we don’t need yet, but it’s helpful to have the skeleton in place.
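A sketch of what such a training function could look like with nn.DataParallel, reusing the ConvNet sketched above (the dataset location and hyper-parameters are illustrative, not the original values):

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms

def train(num_epochs=2, batch_size=100):
    model = ConvNet()
    # Wrap the model so each batch is split across all visible GPUs
    model = nn.DataParallel(model).cuda()

    criterion = nn.CrossEntropyLoss().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    train_dataset = datasets.MNIST(root='./data', train=True,
                                   transform=transforms.ToTensor(),
                                   download=True)
    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=batch_size,
                                               shuffle=True)

    for epoch in range(num_epochs):
        for i, (images, labels) in enumerate(train_loader):
            images, labels = images.cuda(), labels.cuda()

            outputs = model(images)            # forward pass
            loss = criterion(outputs, labels)

            optimizer.zero_grad()
            loss.backward()                    # backward pass
            optimizer.step()                   # parameter update

            if (i + 1) % 100 == 0:
                print(f'Epoch [{epoch + 1}/{num_epochs}], '
                      f'Step [{i + 1}], Loss: {loss.item():.4f}')
```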

Distributed DataParallel

With nn.DistributedDataParallel, there is one process per GPU, and each process controls its own copy of the model. The GPUs can all be on the same node or spread across multiple nodes. Only gradients are passed between the processes.

Figure: only gradients are passed between the processes/GPUs.

Each process loads its own mini-batch from disk and passes it to its GPU. Each GPU does its forward and backward pass, and then the gradients are all-reduced across the GPUs. To further alleviate the networking bottleneck, the gradients for each layer are all-reduced concurrently with the backward pass. At the end of the backward pass, every node has the averaged gradients, ensuring that the model weights stay synchronized.

Tutorial

For multiprocessing, we need a script that launches one process per GPU. Each process needs to know which GPU to use and its rank among all the processes, and the script must be run on every node.

Let’s go through what has changed in each function. To make the new code easy to spot, I have fenced it off.

Let’s review the arguments of the main function:
args.nodes is the total number of nodes (machines) we are using.
args.gpus is the number of GPUs on each node (machine).
args.nr is the rank of the current node (machine) among all the nodes, ranging from 0 to args.nodes - 1.
Here are the new changes:

Line 12: We calculate the world_size: the total number of processes to run, which equals the number of nodes multiplied by the number of GPUs per node.
Line 13: This tells the multiprocessing module which IP address to look at for process 0. It is needed so that all the processes can sync up, and it should be the same on every node.
Line 14: Similarly, this is the port to use when looking for process 0.
Line 15: Instead of running train once, we spawn args.gpus processes, each of which runs train(i, args), where i goes from 0 to args.gpus - 1. Remember that main() runs on every node, so there are args.nodes * args.gpus = args.world_size processes in total.
Instead of lines 13 and 14, I could have run export MASTER_ADDR=10.57.23.164 and export MASTER_PORT=8888 in the terminal.
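A sketch of a main() along these lines (the argument names and defaults are assumptions that match the description above; train is the training function discussed next):

```python
import os
import argparse
import torch.multiprocessing as mp

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int,
                        help='number of nodes (machines)')
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of GPUs per node')
    parser.add_argument('-nr', '--nr', default=0, type=int,
                        help='rank of this node among all nodes')
    parser.add_argument('--epochs', default=2, type=int)
    args = parser.parse_args()

    # Total number of processes = GPUs per node * number of nodes
    args.world_size = args.gpus * args.nodes
    # Address and port of the node hosting process 0
    os.environ['MASTER_ADDR'] = '10.57.23.164'
    os.environ['MASTER_PORT'] = '8888'
    # Launch one training process per GPU on this node
    mp.spawn(train, nprocs=args.gpus, args=(args,))

if __name__ == '__main__':
    main()
```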

Next, let’s talk about how to modify training.

Line 3: The process’s global rank within all of the processes. We’ll use it on line 6.

Lines 4–6: Initialize the process group, joining this process with all the others. The call is “blocking”: no process continues until all of them have joined. I’m using the NCCL backend because it’s the fastest. init_method tells the process group where to find its settings; env:// means it looks them up in environment variables, in this case the MASTER_ADDR and MASTER_PORT we set in main(). We could also have set WORLD_SIZE there as an environment variable instead of passing world_size explicitly.

Line 23: Wrap the model with DistributedDataParallel. This is what keeps the model replicas on the different GPUs in sync.

Lines 35–39: When the data is loaded, torch.utils.data.distributed.DistributedSampler ensures that each process gets a different, non-overlapping slice of the training data. To debug and verify that each GPU is loading the right data, you can calculate the SHA of the tensors loaded onto each GPU.

Lines 46 and 51: Because the DistributedSampler handles the shuffling, shuffle is set to False in the DataLoader.
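Putting those changes together, a sketch of the distributed train() function might look like this (ConvNet is the model sketched earlier; hyper-parameters are illustrative):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torchvision import datasets, transforms

def train(gpu, args):
    # Global rank of this process across all nodes
    rank = args.nr * args.gpus + gpu
    # Join the process group; blocks until all world_size processes have joined
    dist.init_process_group(backend='nccl', init_method='env://',
                            world_size=args.world_size, rank=rank)

    torch.cuda.set_device(gpu)
    model = ConvNet().cuda(gpu)
    # Wrap the model; gradients are all-reduced across processes during backward
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

    batch_size = 100
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    train_dataset = datasets.MNIST(root='./data', train=True,
                                   transform=transforms.ToTensor(),
                                   download=True)
    # Each process gets a distinct, non-overlapping slice of the dataset
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset, num_replicas=args.world_size, rank=rank)
    train_loader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=batch_size,
                                               shuffle=False,  # sampler shuffles
                                               sampler=train_sampler)

    for epoch in range(args.epochs):
        train_sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for images, labels in train_loader:
            images, labels = images.cuda(gpu), labels.cuda(gpu)
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```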

To run on, say, 4 nodes with 8 GPUs each, the application needs 4 terminals (one per node). On node 0 (the node whose address we set on line 13 in main):

Next, on the other nodes:

for i = 1, 2, 3. In other words, we run the script on every node, and on each node it launches args.gpus processes that sync with each other before training begins.
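For example, if the script were saved as mnist-distributed.py (a hypothetical name), node 0 might run `python mnist-distributed.py -n 4 -g 8 -nr 0`, and node i would run `python mnist-distributed.py -n 4 -g 8 -nr i` for i = 1, 2, 3, with the same MASTER_ADDR and MASTER_PORT everywhere (the flag names follow the main() sketch above).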

Note that the effective batch size is now the per-GPU batch size (the value set in the script) multiplied by the total number of GPUs (the world size). For example, with a per-GPU batch size of 100 on 4 nodes with 8 GPUs each, the effective batch size is 3,200.

Challenges

When running the same model on several GPUs rather than just one, a few problems can occur. The biggest one is running out of memory on the main GPU, because GPU:0 has to hold the outputs from every GPU in order to calculate the loss.

Figure: calculating the loss on GPU:0.

When you train the network, you will see a message like the following on the console: “ran out of memory trying to allocate 2.59GiB”.

Here are two techniques we use to reduce memory usage and thus solve this problem:

  1. Reduce the batch_size
  2. Use Apex for mixed precision

The first technique is straightforward: it usually involves changing just a single hyper-parameter.

With the second technique, we decrease the precision of some of the numbers used in the neural network and thereby use less memory. Mixed precision means using 16-bit floating point (FP16) for some values while keeping 32-bit floating point (FP32) for others, such as the weights. It helps to understand the difference between FP16 and FP32 when applying this to deep learning.
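As a quick, toy illustration of the memory saving (not from the original post):

```python
import torch

fp32 = torch.ones(1024, 1024, dtype=torch.float32)
fp16 = torch.ones(1024, 1024, dtype=torch.float16)

# Each FP32 element takes 4 bytes, each FP16 element takes 2 bytes,
# so the same tensor needs half the memory at half precision.
print(fp32.element_size() * fp32.nelement())  # 4194304 bytes (~4 MB)
print(fp16.element_size() * fp16.nelement())  # 2097152 bytes (~2 MB)
```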

Mixed precision with Apex

To avoid running out of memory, we recommend using lower-precision numbers. This also lets us use larger batch sizes and take advantage of NVIDIA Tensor Cores for faster computation.

The code needs to change in two places. The first is inside the training loop:

Training step

Line 18: The amp.initialize function wraps the model and optimizer for mixed-precision training. The model must already be on the correct GPU before amp.initialize is called. The opt_level ranges from O0, which uses only full-precision floats, to O3, which uses half precision throughout. O1 and O2 are intermediate degrees of mixed precision; the details are in the Apex documentation.

Line 20: apex.parallel.DistributedDataParallel is a drop-in replacement for nn.DistributedDataParallel. Since Apex assumes one GPU per process, we no longer need to specify the GPUs when wrapping the model. It does assume that the script calls torch.cuda.set_device(local_rank) (line 10) before moving the model to the GPU.

Lines 37–38: To prevent the gradients from underflowing during mixed-precision training, the loss must be scaled; Apex does this automatically.

To work around a bug in AMP’s implementation, make sure you set opt_level=O1 whenever you initialize it.
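Here is a sketch of how those pieces fit into the training function (the surrounding setup, such as the process group and data loader, is assumed to be the same as in the earlier DistributedDataParallel sketch):

```python
import torch
import torch.nn as nn
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

def train_amp(gpu, args, train_loader):
    # One GPU per process, selected before moving the model over
    torch.cuda.set_device(gpu)
    model = ConvNet().cuda(gpu)
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    # Wrap model and optimizer for mixed precision; O1 patches most ops to FP16
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
    # Apex's DDP assumes one GPU per process, so no device_ids are needed
    model = DDP(model)

    for epoch in range(args.epochs):
        for images, labels in train_loader:
            images, labels = images.cuda(gpu), labels.cuda(gpu)
            outputs = model(images)
            loss = criterion(outputs, labels)

            optimizer.zero_grad()
            # Scale the loss so FP16 gradients do not underflow, then backprop
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()
```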

Checkpoint

Whenever we use Apex, we need to change the way we save and load models (see the related Apex issue). In particular, we need to change the way checkpoints are saved and loaded:

Line 5: amp.state_dict() is added to the checkpoint.

Line 19: Here we load the state dict back into amp.
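A sketch of saving and restoring such a checkpoint (the file name and dictionary keys are illustrative):

```python
import torch
from apex import amp

# Saving: include the amp state alongside the model and optimizer state
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pt')

# Loading: re-create the model and optimizer, initialize amp with the same
# opt_level, and only then restore all three state dicts
model = ConvNet().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])
```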

Once you have these pieces in place, your model can be trained on multiple GPUs. We recommend starting with a small model on a single GPU before scaling training out to multiple GPUs; when the need to scale does arrive, this tutorial should help.
