The rise of deep learning has led to an ever-increasing need to speed up computations when processing large amounts of data. If you are a data scientist or a machine learning engineer, you probably know that NVIDIA’s hardware plays a major role in this AI revolution, as all the fancy deep learning frameworks out there (TensorFlow, PyTorch, etc.) depend on NVIDIA’s CUDA platform to run their computations in a highly parallelised fashion on GPUs.
As a result, the GPU Technology Conference (GTC for short) hosted by the company has grown into a massive event in the tech calendar, attracting many thousands of attendees. They gather to learn about the latest applications of deep learning in fields like healthcare, autonomous vehicles, e-commerce and finance, but also about the latest advancements on the hardware side, promising to speed up the training of machine learning algorithms even further.
This year’s conference was planned to take place in San Jose during the last week of March but, as one can imagine, the decision was made that it could no longer take place (at least physically) due to concerns around coronavirus. On the bright side, the good people of NVIDIA decided to run as much of the conference as possible online, rebranding it as GTC Digital. What’s even better is that access to this new online version of GTC is free for all (with the exception of some hands-on workshops). So I sat on my couch, grabbed my laptop and dived into the world of accelerated machine learning.
Usually, it is Jensen Huang, NVIDIA’s founder and CEO, who opens GTC with a keynote speech, providing news on the company’s latest cool projects, new GPUs and the roadmap for the years to come. This time, it was Will Ramey, Senior Director and Global Head of Developer Programs, who opened the event, delivering a talk aimed at demystifying deep learning, giving an overview of the expanding universe of modern AI and providing examples of how it has revolutionised tech over the years. He reminded us of the days when e-mail spam filters were using simplistic bag-of-words approaches as their core feature, contrasting them with today’s vast collections of data from smartphones and other devices, which have enabled more sophisticated deep learning approaches to the problem.
The conference offered a plethora of live talks like this, alongside on-demand talks, poster sessions, podcasts and live hands-on workshops. The selection of speakers did not disappoint either, with many of the leading companies in the fields of data science and AI, like Microsoft and Uber, represented at the conference. Additionally, I was pleasantly surprised to hear of all the cutting-edge applications of deep learning that non-tech companies, like Walmart and Domino’s, are developing (no, they haven’t built a neural net to optimise the flavour of pizza, yet).
As a last note before we take a closer look at some of the talks, it is worth mentioning that this digital version of the conference, along with all its recorded material, will stay available online for the foreseeable future, with a few more on-demand talks to be added within the next couple of weeks. So, if you find yourself having a bit of extra free time these days (who doesn’t?), I recommend you give it a go.
The presentation delivered by Alessandro Magnani, data scientist at Walmart Labs, turned out to be a goldmine, and another piece of evidence of how deep learning is revolutionising e-commerce. The first topic he covered was product type classification. The challenge here is that every product listed on Walmart’s website has to be assigned to one of the few thousand product types in Walmart’s taxonomy, based on its title and image. That’s a tedious and extremely time-consuming task for a human, which makes it well worth automating.
Building a neural network that can infer the type of a product based on its title proved to be very useful, but the team faced challenges, as the product titles sometimes misled the model. Their final solution, which is published here, is based on the idea of multi-modal learning, where a fusion of text and image data is used by the model to extract useful features and boost classification accuracy.
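To make the fusion idea concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration (the feature dimensions, weights and product types are hypothetical, and Walmart’s actual model is a deep network, not a hand-weighted linear classifier): a text encoder and an image encoder each produce a feature vector, the two are concatenated, and a classifier head scores each product type, so that the image can correct a misleading title.

```python
# Toy sketch of multi-modal fusion by concatenation (hypothetical shapes
# and weights; not Walmart's published model).

def fuse(text_features, image_features):
    """Late fusion: concatenate the two modality-specific feature vectors."""
    return text_features + image_features  # list concatenation

def linear_scores(features, weights, bias):
    """Score each product type k as weights[k] . features + bias[k]."""
    return [
        sum(w * f for w, f in zip(weights[k], features)) + bias[k]
        for k in range(len(weights))
    ]

# Hypothetical 3-dim text and 2-dim image embeddings for one product.
text_vec = [0.2, -0.1, 0.7]
image_vec = [0.9, 0.4]
fused = fuse(text_vec, image_vec)  # 5-dim fused representation

# Two hypothetical product types, each with a 5-dim weight vector.
W = [[0.1, 0.0, 0.5, 0.3, 0.2],
     [0.4, -0.2, 0.1, 0.0, 0.6]]
b = [0.0, 0.1]

scores = linear_scores(fused, W, b)
predicted_type = max(range(len(scores)), key=scores.__getitem__)
print(predicted_type)  # → 0
```

In a real model the fused vector would feed further dense layers rather than a single linear head, but the core design choice (letting both modalities contribute features before classification) is the same.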
Another interesting challenge that Walmart Labs has to deal with is building an accurate visual search platform, as Walmart customers have the option of searching for products similar to an image they provide, rather than typing text. For us humans, comparing images is a relatively simple task – for example, everyone can tell that a blue chair and a green chair are two similar products, just by looking at them. For a machine, however, this task is not simple at all. To deal with this challenge, they create embeddings for each product image, taken from one of the intermediate layers of a trained neural network (which I assume is an autoencoder, even though it’s not mentioned). The network has been trained on a combination of original product images and synthetic ones, where noise or other objects have been superimposed on the original to simulate challenges frequently present in user images. With the embeddings in hand, it’s then easy to use a measure like the Euclidean distance to find images from the catalogue that are close to the one the user has provided.
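The lookup step at the end can be sketched in a few lines of plain Python. The three-dimensional embeddings and product names below are invented for illustration; real embeddings would have hundreds of dimensions and come out of the trained network:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query, catalogue, k=2):
    """Return the k catalogue items whose embeddings are closest to the query."""
    ranked = sorted(catalogue, key=lambda item: euclidean(query, catalogue[item]))
    return ranked[:k]

# Hypothetical 3-dim embeddings for three catalogue products.
catalogue = {
    "blue_chair":  [0.9, 0.1, 0.2],
    "green_chair": [0.8, 0.2, 0.3],
    "red_lamp":    [0.1, 0.9, 0.7],
}
query = [0.88, 0.12, 0.22]  # embedding of the user's uploaded image

print(most_similar(query, catalogue))  # → ['blue_chair', 'green_chair']
```

At catalogue scale a brute-force sort would be replaced by an approximate nearest-neighbour index, but the principle is exactly this: similar products end up close together in embedding space.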
Next on the menu was Domino’s. What captured my attention in Zach Fargoso’s presentation was the way they allocate computing resources amongst their team of data scientists. Before we get into details though, let me tell you that the data science team at Domino’s has grown considerably over the last couple of years: from a small team of 5, they now work across multiple parts of the business, using machine learning to improve operations in marketing, the company’s supply chain and store siting, amongst others.
So, the team at Domino’s made the decision to move away from the practice of purchasing GPU-enhanced machines for each individual data scientist. Instead, they have invested in 2 DGX-1 servers, each of which comes with 8 Tesla V100 GPUs carrying 32 gigabytes of memory apiece. Shared hardware, however, raises its own questions: who gets to use each GPU, and for how long? What if someone only needs a fraction of a V100, or wants to use all 16 of them at once?
I’m sure anyone who has fought over GPU resources in the past, either at work or while studying at uni, will greatly appreciate this approach (and I know there are many of us out there). Domino’s uses Kubernetes to orchestrate the entire process. Kubernetes dynamically launches a Docker container every time it gets a request from a data scientist, and allocates GPU resources based on what has been requested and on the overall availability on the servers. It can even allocate just a fraction of a GPU to any given container, enabling data scientists to run multiple experiments simultaneously. This dockerised approach not only leads to better resource usage, but also ensures that every member of the team works in a standardised environment, where library versions and other key pieces of software are exactly the same.
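For the curious, a request like this ultimately boils down to a resource section in the pod specification. The sketch below shows one as a Python dict, purely for illustration: `nvidia.com/gpu` is the resource name exposed by NVIDIA’s Kubernetes device plugin, which only counts whole GPUs, so the fractional sharing described in the talk requires additional tooling on top that the talk did not detail.

```python
# Illustrative resource section of a Kubernetes pod spec, written as a
# Python dict. "nvidia.com/gpu" is the real resource name used by NVIDIA's
# device plugin; the memory/cpu figures are hypothetical.
pod_resources = {
    "limits": {
        "nvidia.com/gpu": 2,   # two whole V100s for this experiment
        "memory": "64Gi",      # hypothetical host-memory limit
        "cpu": "8",
    }
}

print(pod_resources["limits"]["nvidia.com/gpu"])  # → 2
```

The scheduler then simply refuses to place the container until a node with two free GPUs exists, which is what makes the fair-sharing behaviour described in the talk possible.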
Last, but not least, let’s have a quick look at Uber’s presentation, where Travis Addair, a software engineer on Uber’s Machine Learning Platform team, presented Horovod – a state-of-the-art distributed framework for training deep neural networks.
The need for distributed training at Uber arose from the huge amounts of data used to train models for their autonomous driving projects. The datasets in these projects are so large that a model sometimes needs to be trained on hundreds of GPUs, making GPU usage efficiency of the utmost importance.
So, one would reasonably ask at this point: “Don’t well-established deep learning frameworks, like TensorFlow, natively support training on multiple GPUs?”. The answer is “yes”: TensorFlow supports multi-GPU training using a parameter-server approach. In this topology, multiple workers each train an instance of the model on a separate GPU, using different batches of the data. The resulting gradients are sent from the workers to a parameter server, which aggregates them and sends the updated parameters back to the workers. The team at Uber noticed that this approach has serious scalability issues, due to latency in the network communication between the server and the workers. An example they provided shows that training ResNet-101 on 128 GPUs using distributed TensorFlow achieves only 40% efficiency.
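A toy simulation of one parameter-server step makes the topology, and its bottleneck, easy to see. Everything below is illustrative pure Python, not TensorFlow’s actual implementation; the point is that every worker’s gradients funnel through one central node:

```python
def parameter_server_step(params, worker_grads, lr=0.1):
    """One SGD step through a parameter server (toy simulation).

    Every worker sends its gradient vector to this single central
    function, which averages them, applies one update and broadcasts
    the new parameters back. All traffic passes through one node,
    which is exactly where the scalability problems come from.
    """
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(params))]
    return [p - lr * a for p, a in zip(params, avg)]

params = [1.0, 2.0]                   # current model parameters
grads = [[0.2, 0.4], [0.6, 0.0]]      # one gradient vector per worker
new_params = parameter_server_step(params, grads)
print(new_params)  # ≈ [0.96, 1.98]
```

With hundreds of workers, the server’s network link saturates while it receives all those gradients and re-broadcasts the parameters, so adding GPUs stops adding speed.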
Horovod was built specifically to tackle this challenge. It uses an allreduce approach in which there is no parameter server; instead, the workers are arranged in a ring (or similar) topology and talk directly to each other, exchanging gradient updates. As a matter of fact, a research team at Baidu had already shown that ring-allreduce is the bandwidth-optimal way of aggregating the gradients for this problem. Indeed, benchmarks provided by Uber show that GPU efficiency can even be doubled using this library. Horovod is now hosted by the LF AI Foundation, which is part of the Linux Foundation.
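To see why no central server is needed, here is a toy pure-Python simulation of ring-allreduce (illustrative only; Horovod wraps highly optimised NCCL/MPI implementations rather than anything like this). Each worker’s gradient vector is split into as many chunks as there are workers; chunks travel around the ring once while being summed (the scatter-reduce phase), then once more to distribute the completed sums (the allgather phase):

```python
def ring_allreduce(chunks):
    """Toy simulation of ring-allreduce (not Horovod's actual code).

    chunks[w][c] holds worker w's local value for gradient chunk c,
    with as many chunks as workers. After the scatter-reduce and
    allgather phases, every worker holds the sum of every chunk,
    without any central parameter server being involved.
    """
    n = len(chunks)
    # Scatter-reduce: each worker passes one chunk to its ring
    # neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        snapshot = [row[:] for row in chunks]  # all sends use pre-step state
        for w in range(n):
            c = (w - step) % n
            chunks[(w + 1) % n][c] += snapshot[w][c]
    # Allgather: the completed chunks travel around the ring, overwriting
    # each neighbour's stale copy.
    for step in range(n - 1):
        snapshot = [row[:] for row in chunks]
        for w in range(n):
            c = (w + 1 - step) % n
            chunks[(w + 1) % n][c] = snapshot[w][c]
    return chunks

# Three workers, each holding a 3-chunk gradient vector.
grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
result = ring_allreduce(grads)
print(result)  # every worker ends up with [12.0, 15.0, 18.0]
```

Each worker only ever talks to its neighbours, and each link carries roughly the same amount of data regardless of how many workers there are, which is the intuition behind the bandwidth-optimality result.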
Overall, NVIDIA’s GTC Digital did not disappoint at all. It fulfilled its purpose of communicating all the amazing work happening out there in the tech world in the fields of accelerated computing and deep learning, and it did so for free. If there is one thing I missed from this online conference, it’s the spontaneous and fruitful conversations with other participants, as well as the meaningful connections and collaborations that sometimes follow. I’m very much looking forward to attending next year’s event in person.