Distributed deep learning is a hot topic because it can dramatically reduce training time compared to a single GPU, depending on the problem and the data you are dealing with. However, modifying your code so that a single-GPU program runs on multiple GPUs is not always straightforward!
In this post, I'm going to talk about training deep learning models on multiple GPUs: why it is challenging, which factors matter when making deep learning distributed, and finally which libraries/frameworks make this process easier.
This post is based on the content of the course "Fundamentals of Deep Learning for Multi-GPUs", offered by NVIDIA, which I recently took.
The GPU is the platform that makes deep learning accessible. Multiple GPUs are used to speed up computation further, or when a single GPU's memory is not sufficient.
The first question is: do you need multiple GPUs, or is it enough to train your deep learning models on a single GPU? It depends on your problem, the size of the data you train on, and how quickly you need the models to finish training.
In a recent report, NVIDIA claims that using multiple GPUs one can train on the whole of ImageNet in just 15 minutes, a new record []. For reference, training ImageNet on a single GPU was previously reported to take 14 days:
> Finishing 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 GPU takes 14 days. [team of Yang You, Zhao Zhang, James Demmel, Kurt Keutzer, Cho-Jui Hsieh from UC Berkeley]
However, training on multiple GPUs is not straightforward, as one has to deal with several problems related to the optimization process, the components of the pipeline (e.g., data loading), networking, etc.
In the first part of the course, we look at the differences between Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
In Gradient Descent, the whole dataset is used to compute the cost function and its gradient at every update step.
SGD, on the other hand, uses only a portion of the dataset (a mini-batch) to estimate the cost function and its gradient. This introduces a level of noise into the optimization trajectory. That noise is desirable, since it tends to lead to minima with significantly different mathematical properties than those found by Gradient Descent.
When training on multiple GPUs, we increase the effective batch size (the global batch is the sum of the batches scheduled to each GPU). As the batch size grows, we also remove some of this beneficial noise, which affects the algorithm's behavior and the training performance.
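To make the difference concrete, here is a toy sketch (my own illustration, not from the course) comparing the full-dataset gradient used by GD with the noisy mini-batch estimate used by SGD, for a simple least-squares problem:

```python
# Toy illustration: full-batch vs. mini-batch gradients for least-squares
# linear regression on synthetic data. The data and batch size are made up.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)

def gradient(X_part, y_part, w):
    # Gradient of 0.5 * mean((X w - y)^2) with respect to w.
    return X_part.T @ (X_part @ w - y_part) / len(y_part)

# Gradient Descent: one step uses the entire dataset.
full_grad = gradient(X, y, w)

# SGD: one step uses a random mini-batch, giving a noisy estimate.
batch = rng.choice(len(y), size=64, replace=False)
mini_grad = gradient(X[batch], y[batch], w)

print("noise magnitude:", np.linalg.norm(mini_grad - full_grad))
```

Increasing the mini-batch size (for example, by summing batches across many GPUs) shrinks that noise term, which is exactly the effect described above.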
In the second part, we run AlexNet on an image-processing task and see how each component of the pipeline, along with other factors such as networking, affects the overall performance. Through the experiments, we see how each step of the process affects end-to-end system performance. The idea is that the more time the GPU spends processing a single image, the fewer images per second the data pipeline needs to deliver.
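A quick back-of-the-envelope check makes this trade-off explicit (the numbers below are hypothetical, not measurements from the course):

```python
# Hypothetical numbers: the slower the model is per image, the lower the
# image rate the data-loading pipeline has to sustain (and vice versa).
gpu_time_per_image_ms = 2.0   # assumed forward + backward time per image
num_gpus = 4

images_per_second_per_gpu = 1000.0 / gpu_time_per_image_ms
required_pipeline_rate = images_per_second_per_gpu * num_gpus
print(f"data pipeline must deliver ~{required_pipeline_rate:.0f} images/s")
```

With a lightweight model like AlexNet the required rate is high, so data loading and networking quickly become the bottleneck.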
After trying a naive multi-GPU implementation and discussing its pros, cons, and problems, we turned to the Horovod library and the MPI framework (commonly used in High Performance Computing and distributed workloads) for running deep learning models on multiple GPUs.
Horovod, developed at Uber, is a distributed training framework that can be used with several libraries, including TensorFlow, Keras, and PyTorch. Horovod hides much of the complexity of writing efficient distributed software, making the migration from a single-GPU model to multiple GPUs more straightforward.
The key changes in your program when integrating Horovod are as follows (a minimal end-to-end sketch follows the list):
- hvd.init() initializes Horovod.
- config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns one GPU to each TensorFlow process (this line needs to be slightly adjusted if you want to mix data and model parallelism).
- opt = hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with a Horovod optimizer that takes care of averaging gradients using ring-allreduce.
- hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes. It can be used together with MonitoredTrainingSession, or, if that is not used (as in this example), the hvd.broadcast_global_variables(0) operation can be called directly instead.
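Putting these pieces together, here is a minimal TensorFlow 1.x-style sketch of what an integrated script might look like. This is my own illustration, not the course code: the model is a single dense layer, the data is synthetic, and the hyperparameters are placeholders.

```python
# Minimal Horovod + TensorFlow 1.x sketch (illustrative, not the course code).
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

# 1. Initialize Horovod.
hvd.init()

# 2. Pin one GPU per process, based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model: a single dense layer on synthetic data (assumption).
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

# 3. Wrap the optimizer; scaling the learning rate by hvd.size() is the
#    usual practice when the global batch grows with the number of workers.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# 4. Broadcast the initial variables from rank 0 to every other process.
bcast = hvd.broadcast_global_variables(0)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast)
    for step in range(100):
        # Synthetic batch; each worker would normally read its own data shard.
        batch_x = np.random.rand(64, 784).astype(np.float32)
        batch_y = np.random.randint(0, 10, size=64)
        _, loss_val = sess.run([train_op, loss], feed_dict={x: batch_x, y: batch_y})
        if step % 20 == 0 and hvd.rank() == 0:
            print("step", step, "loss", loss_val)
```

You would launch one process per GPU, for example with `horovodrun -np 4 python train.py` (or an equivalent `mpirun` command).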
The course ends with a final assignment in which you modify single-GPU code to run on 4 GPUs using the Horovod library. If your implementation is correct and achieves high training accuracy on 4 GPUs, you receive a certificate of competency.