Skip to main content

Deep learning with Multi-GPUs

Distributed deep learning is a hot topic as it increases the training time in compared to single GPU depending on the problems and the data that you deal with. However, modifying code to make your single-GPU program become multi-GPUs is not always straightforward!

In this post, I'm going to talk about the topic of training deep learning models using multi-GPUs: why it is challenging, which factors are important when making deep learning distributed, and finally which libraries/frameworks to use to make this process easier.
This post is based on the content of the course "Fundamentals of Deep Learning for Multi-GPUs", offered by NVIDIA, which I have taken recently.

GPU is the platform for deep learning, which makes deep learning accessible. Multi-GPUs are used to further speed up the computations, in cases single GPU memory is not efficient.

The first question is: Do you need to use multi-GPUs or is it enough to train your deep learning models using single-GPU? It depends on your problems, the size of data to be trained on and the time that you expect the models to finish running.

In a recent report by NVIDIA, it is said that using multi-GPU, one can train the whole ImageNet in just 15 minutes in a new record []. For reference, previously it was reported to take 14 days for training ImageNet in single GPU:
Finishing 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 GPU takes 14 days [team of Yang You, Zhao Zhang, James Demmel, Kurt Keutzer, Cho-Jui Hsieh from UC Berkeley]

However, training using multi-GPUs is not straightforward as one will have to deal with several problems related to the optimization process, the component of the pipeline (data loading), networking, etc.

In the first part of the course, we look at the differences between Gradient Descent and Stochastic Gradient Descent.
In Gradient Descent, the whole dataset is used for training and calculating the cost function.

SGD on the other hand only uses a portion of the dataset (a batch) for computing the shape of the cost function. This introduces a level of noise in our trajectory. These noises are desirable since they generates minima with significantly different mathematical properties than Gradient Descent.

When training using multi-GPUs, we increase the batch size (by summing up all batches scheduling to each GPU). When the batch size increases, we also remove the beneficial noises. This affects our algorithm behavior and the training performance.

In the second part, we run the code on AlexNet for image processing, and see how each part of the components in our pipeline and other factors such as networking affect the overall performance. Through the experiments, we see how each step of the process effects the end2end system performance. The idea is that the more time the GPU needs to spend processing a single image the less we need to deliver through our pipeline.

After trying a naive implementation of multi-GPUs and discussing the pros and cons, the problems with the naive approach, we talked about Horovod library and MPI framework (commonly used in High Performance Computing and distributed workloads) for implementing deep learning models in multi-GPUs mode.

Horovod developed at Uber is a distributed training framework that can be used in several libraries including Tensorflow, Keras and Pytorch. Using Horovod, we can simplify the complexity of writing efficient distributed software, hence making migrating from single-GPU model to multi-GPUs more straightforward.

The key changes in your program when integrating Horovod are as follows:

  • hvd.init() initializes Horovod.
  • config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns a GPU to each of the TensorFlow processes (this code needs to be slightly adjusted if you want to mix the data and model parallel implementation).
  • opt=hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with Horovod optimizer which takes care of averaging gradients using ring-all reduce.
  • hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes. This can be used together with the MonitoredTrainingSession or if it is not used (like in this example) it can be called directly via the hvd.broadcast_global_variables(0) operations instead.


In the course, you will have a final assignment to modify a code on single GPU to run on 4 GPUs using Horovod library. With the correct implementation achieving high accuracy training on 4GPUs, you will receive a certificate of competency.








Comments

Popular posts from this blog

Spam and Bayes' theorem

I divide my email into three categories: A1 = spam. A2 = low priority, A3 = high priority. I find that: P(A1) = .7 P(A2) = .2 P(A3) = .1 Let B be the event that an email contains the word "free". P(B|A1) = .9 P(B|A2) = .01 P(B|A3) = .01 I receive an email with the word "free". What is the probability that it is spam?

Python Tkinter: Changing background images using key press

Let's write a simple Python application that changes its background image everytime you click on it. Here is a short code that helps you do that: import os, sys import Tkinter import Image, ImageTk def key(event): print "pressed", repr(event.char) event.widget.quit() root = Tkinter.Tk() root.bind_all(' ', key) root.geometry('+%d+%d' % (100,100)) dirlist = os.listdir('.') old_label_image = None for f in dirlist: try: image1 = Image.open(f) root.geometry('%dx%d' % (image1.size[0],image1.size[1])) tkpi = ImageTk.PhotoImage(image1) label_image = Tkinter.Label(root, image=tkpi) label_image.place(x=0,y=0,width=image1.size[0],height=image1.size[1]) root.title(f) if old_label_image is not None: old_label_image.destroy() old_label_image = label_image root.mainloop() # wait until user clicks the window except Exception, e: # Skip a...

Skip-gram model and negative sampling

In the previous post , we have seen the 3 word2vec models: skip-gram, CBOW and GloVe. Now let's have a look at negative sampling and what it is used to make training skip-gram faster. The idea is originated from this paper: " Distributed Representations of Words and Phrases and their Compositionality ” (Mikolov et al. 2013) In the previous example , we have seen that if we have a vocabulary of size 10K, and we want to train word vectors of size 300. Then the number of parameters we have to estimate in each layer is 10Kx300. This number is big and makes training prone to over-fitting and gives too much focus on words that appear often, and less focus on rare words. Subsampling of frequent words So the idea of subsampling is that: we try to maximize the probability that "real outside word" appears, and minimize the probability that "random words" appear around center word. Real outside words are words that characterize the meaning of the center word, wh...