Distributed Training


Distributed machine learning refers to multi-node machine learning algorithms and systems that are designed to improve performance, increase accuracy, and scale to larger input data sizes.

Overview

A Gentle Introduction to Multi GPU/Node Distributed Training
High-level overview of the different types of training regimes that you'll encounter as you move from single GPU to multi GPU to multi node distributed ...
distributed-training tutorial article
An Overview of Distributed Training of Deep Learning Models
An overview of the techniques used by contemporary distributed DL systems, with a discussion of their influence and implications on the training ...
distributed-training training overview arxiv:2007.03970
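
The data-parallel regime that these overviews cover can be illustrated with a small sketch. It is not taken from either article: the toy model `y = w * x`, the helper `gradient`, and the two-worker split are all assumptions made for illustration. The point it shows is that when each worker computes a gradient on an equal-sized shard, averaging those local gradients gives the same result as the gradient over the full batch.

```python
# Minimal sketch of synchronous data parallelism, using only the standard library.
# The model (y = w * x), the data, and the helper name are hypothetical.

def gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0

num_workers = 2
shard_size = len(xs) // num_workers  # assumes the batch divides evenly

# Each "worker" computes a gradient on its own shard of the batch.
local_grads = [
    gradient(
        w,
        xs[i * shard_size:(i + 1) * shard_size],
        ys[i * shard_size:(i + 1) * shard_size],
    )
    for i in range(num_workers)
]

# The averaging step plays the role of an all-reduce across workers.
averaged_grad = sum(local_grads) / num_workers

# Reference: the gradient a single worker would compute on the whole batch.
full_grad = gradient(w, xs, ys)
```

With unequal shard sizes the plain average would no longer match, and a weighted average would be needed instead.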

Tutorials

Distributed model training in PyTorch
Distributing training jobs allows you to push past the single-GPU bottleneck, developing larger and more powerful models leveraging many GPUs ...
distributed-training pytorch deep-learning article
Distributed Training With TensorFlow
tf.distribute.Strategy can be used with a high-level API like Keras, and can also be used to distribute custom training loops.
distributed-training tensorflow article keras
AWS vs Paperspace vs FloydHub: Choosing your cloud GPU partner
A look at various features of the top three cloud GPU service providers.
distributed-training gpu aws floydhub
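
One step that the framework tutorials above rely on is splitting the dataset so that each worker sees a different slice. The sketch below is an assumption-laden illustration, not code from any of the linked articles: `shard_indices` is a hypothetical helper, and the strided split is only similar in spirit to what framework samplers do (real samplers typically also shuffle and pad so every worker gets the same number of samples).

```python
# Minimal sketch of per-worker data sharding, using only the standard library.
# `shard_indices` is a hypothetical helper, not a framework API.

def shard_indices(num_samples, world_size, rank):
    """Return the sample indices assigned to one worker using a strided split."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))

num_samples = 10
world_size = 4

# One list of indices per worker; together they cover the dataset exactly once.
shards = [shard_indices(num_samples, world_size, rank) for rank in range(world_size)]
```

Because no padding is applied here, some workers receive one sample fewer than others when the dataset size is not a multiple of the worker count.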

Libraries

General
Ray
Ray is a fast and simple framework for building and running distributed applications.
hyperparameter-optimization reinforcement-learning scalable-reinforcement-learning hyperparameter-tuning
Horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
distribution horovod uber article
DeepSpeed: Extreme-scale model training for everyone
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
distributed-training pytorch gpu deepspeed
Mixed Precision Training
An investigation into using 16-bit and 32-bit floating-point types together in a model during training.
mixed-precision training machine-learning tutorial
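
To make the mixed precision entry concrete, here is a small sketch of one reason 16-bit and 32-bit types are combined: small weight updates can be rounded away when accumulated in 16-bit, while a 32-bit copy of the weight retains them. It is not taken from the linked tutorial. The helpers `to_fp16` and `to_fp32`, the update size, and the step count are assumptions, and the rounding is simulated with the standard library's `struct` module rather than a deep learning framework.

```python
import struct

# Hypothetical helpers that round a Python float to a narrower IEEE 754 format.
def to_fp16(x):
    """Round x to the nearest half-precision (16-bit) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round x to the nearest single-precision (32-bit) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

update = 1e-4  # assumed small per-step weight update
steps = 100

# Accumulating in 16-bit only: near 1.0 the spacing between representable
# values is about 1e-3, so each update of 1e-4 is rounded away.
w_fp16_only = to_fp16(1.0)
for _ in range(steps):
    w_fp16_only = to_fp16(w_fp16_only + update)

# Mixed precision: accumulate in a 32-bit "master" copy, then cast to
# 16-bit only when the weight is needed for computation.
w_master = to_fp32(1.0)
for _ in range(steps):
    w_master = to_fp32(w_master + update)
w_compute = to_fp16(w_master)
```

After the loops, `w_fp16_only` has not moved from its starting value, while `w_master` has accumulated all of the updates and `w_compute` reflects them at 16-bit resolution.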