Data Parallelism and Distributed Deep Learning at production scale

Categories: AI and MLBig Data & AnalyticsTechnology

Deep Learning

Deep Learning has been the cool kid on the block in the Data Science world for the better part of a decade now. Despite this, a lot of companies still do not know how to create Deep Learning models that are commercially viable. In this three-part blog series, I will share our GlobalLogic secrets about how we approach data-parallel modelling at an awesome scale and how we orchestrate Deep Learning pipelines using Amazon SageMaker.

The examples in this blog series are real ongoing projects productionised using our in-house bespoke Deep Learning Accelerator service offering. If your company require Deep Learning models for life (not just for Christmas) then get in touch and we’ll be happy to help!

This first blog will introduce Deep Learning and focus primarily on scaling data processing using Amazon Web Services (AWS). The second blog will focus on distributed training using SageMaker’s Tensorflow container and Uber’s Horovod. The third and final blog will focus on model deployment, custom inference, and share some extra tips.

Deep Learning

Figure 1   SageMaker Training Pipeline

Note: This blog only contains snippets of SageMaker Pipelines code, not the processing scripts. The purpose of this blog series is to provide guidance and links to helpful resources.

 

Our processing high-level design

By the end of this blog, I will have created a SageMaker Pipeline to handle distributed data pre-processing in two pipeline steps:

  1. Create manifest files
  2. Shard these manifest files to multi-instance processing jobs as shown in the following architecture

Deep Learning

Figure 2  High-Level Design for Processing steps

 

What is Deep Learning?

Before we get into the crux of this blog it’s important to take a minute to explain what Deep Learning is, where it fits in line with Machine Learning (ML) and why it is so popular.

Broadly speaking, Deep Learning is a subset of ML and Artificial Intelligence (AI) as shown in the graphic below. AI, on a practical level, is approaching a problem using a perpetual Inductive Learning school-of-thought. This differs to traditional software programming that uses deductive reasoning to translate pre-defined business logic into conditional statements (with certainty).

Deep Learning

ML is simply recursive algorithms that repeatedly update a set of model parameters to minimise or maximise a business objective (e.g. maximise model accuracy of correctly identifying a dog).

Deep Learning is algorithms that update a lot of model parameters; conversely, shallow learning are algorithms that update a small number of parameters – however, there is no formal definition of what is ‘a lot’. That’s it!

Deep Learning is typically associated to neural networks that scale well for complex problems requiring a huge number of parameters. I will be demonstrating a Tensorflow Deep Neural Network model in the second part of this blog series.

 

Data Processing at scale

Okay, so we know Deep Learning uses a lot of model parameters but what does this have to do with the amount of data? Well, usually you will want the number of model parameters vs. quantity of data to have the same scaling factor. I won’t get into specifics here, but essentially each model parameter is akin to our model learning, a feature (a common characteristic) of the input data and when we learn more model parameters, these features become extremely specific to the dataset we are learning from, and our model starts to memorise patterns (overfit) on the input data. Increasing the amount of data our model is learning from avoids this memorisation issue; to generalize these observed patterns.

Deep Learning

Figure 3  Effect of increasing amount of data compared to number of model parameters

 

Importance of data lineage in MLOPs

After you have collected, curated, and beautifully organised your data in Amazon S3. What now? Well, we could just go right ahead and train our model on all of this, but when we’re talking about gigabytes, possibly terabytes of data, it’s crucial we know exactly what data is being used where and when. This is particularly important when needing to track data bias, data quality and quantity over time.

So, the first step is tracking data lineage. There are many tools available to help us do this (such as Amazon Feature Store), but for simplicity we will be using S3 manifest files. And you’ll understand why this simple approach works extremely well in the next few steps.

 

Step 1: Create manifest files for all data channels

In any ML pipeline, we need to define our training, validation, and testing datasets. It’s extremely important we do not mix up our datasets! Otherwise, we could unknowingly report false model performance.

I create these three manifest files in a SageMaker Processing step. At a high level, this is the process:

  1. Queries S3 to extract class labels from S3 prefixes (using boto3)
  2. Create a dataframe of all S3 prefixes with two columns: ‘prefix’ and ‘class_label’
  3. Apply any encoding to the class_label column (e.g. Label encoding)
  4. Create a new column ‘channel’ and assign each row to one of our three datasets: train, validation, test
  5. For all rows in each unique ‘channel’, create a manifest file for this channel (make sure there are enough samples per class label for each channel)
  6. Save the dataframe and each manifest file

Each manifest file should follow this schema template:

Deep Learning

Figure 4   An example of what a manifest file looks like

Here is an example of a processing step in our pipeline to create these manifest files:

Deep Learning

Figure 5   SageMaker Pipeline Create Manifest files step

 

Step 2: ShardedByS3Key automatic data distribution

Now we have created our manifest files, the power of AWS processing can be exploited! Besides having created a flat-file database for helping track data lineage, we can also use S3’s built-in data distribution strategy: ‘ShardedByS3Key’ to automatically subsample and distribute data to each of our processing instances.

This means we can scale our processing job to an (almost) unlimited number of processing instances. Each processing instance will be given a copy of our pre-processing code and each instance will also be given their own unique subset of each data channel.

Beware the urge to spread your processing workload to hundreds of processors as you will reach a point of diminishing returns. Each processor will output their own processed dataset channels (train, validation, and test). A useful tip is to scale your processing jobs to the sum of GPUs being used in your training job/s, as this makes the data flow much easier to orchestrate (and avoids having to track multiple processed data locations per GPU).

Our SageMaker Pipeline code for this step will look like the following:

Deep Learning

Figure 6  SageMaker Pipeline processing step

And that’s it! We can test our pipeline by running the following code:

Deep Learning

Figure 7   Execute SageMaker Pipeline

You can test this is working by looking at the logs in Amazon Cloudwatch. Each processing instance has their own log stream (‘algo-{instance_number}’).

Deep Learning

Figure 8   Checking multi-instance processing is working

 

Summary

In two very simple steps we have a scalable SageMaker Pipeline for data processing. This processing technique can be applied to most ML projects and best-of-all, as Amazon S3 and SageMaker Processing are managed services we do not need to spend resources maintaining the data distribution codebase.

Data scientists can now spend more of their time focusing on translating business requirements to data science methodology and optimising ML solutions.

We’ve started using this framework for developing Computer Vision models for the Tech4Pets Charity, which has allowed us to produce production-grade models at record-low development time and cost! Check out my colleague Roger’s article about  how ML is helping save lockdown pets.

 

Closing tip

In this blog series I talk almost exclusively about horizontal scaling (increasing the number of compute instances). However, you should always try scaling vertically first (increasing the compute resources per machine/s) as increased processing power alone may solve your scaling problems. Horizontal scaling is much more challenging to get right and brings with it a host of potential problems to address, such as handling rank-specific code blocks, multi-instance networking issues and processes deadlock.

Keep your eyes peeled for part two of my three part blog series!

 

About the author

Jonathan Hill, Senior Data Scientist at GlobalLogic. I’m passionate about data science and more recently focused helping productionise Machine Learning code and artefacts to MLOPs pipelines using AWS SageMaker best practices.

Author

Deep Learning

Author

Jonathan Hill

Senior Data Scientist

View all Articles

Top Authors

Arti Gupta

Arti Gupta

Sr. Manager, Engineering

Siddhi Thakkar

Siddhi Thakkar

Manager, Engineering

Neven Dimač

Neven Dimač

Software Engineer

Krishna Singh

Krishna Singh

Principal Architect, Technology

Geetesh Garg

Geetesh Garg

Consultant, Business Solutions & Consulting Group

Top Insights Categories

  • URL copied!