Distributed Training


Distributed machine learning refers to multi-node machine learning algorithms and systems that are designed to improve performance, increase accuracy, and scale to larger input data sizes.

Overview

A Gentle Introduction to Multi GPU/Node Distributed Training
High-level overview of the different types of training regimes you'll encounter as you move from single-GPU to multi-GPU to multi-node distributed training.
distributed-training tutorial article

Tutorials

Distributed model training in PyTorch
Distributing training jobs allows you to push past the single-GPU bottleneck, developing larger and more powerful models that leverage many GPUs.
distributed-training pytorch deep-learning article
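
As a concrete starting point, here is a minimal sketch of PyTorch's DistributedDataParallel (DDP) pattern; it is not code from the article, and it assumes a `torchrun` launch, with a toy placeholder model and data.

```python
# Minimal DDP sketch. Launch with: torchrun --nproc_per_node=N train.py
# torchrun sets the environment variables that init_process_group reads.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)  # toy placeholder model
    model = DDP(model, device_ids=[local_rank])      # gradients sync across ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    inputs = torch.randn(32, 10).cuda(local_rank)    # placeholder data
    targets = torch.randn(32, 1).cuda(local_rank)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()                                  # all-reduce happens here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```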
Distributed Training With TensorFlow
The tf.distribute.Strategy API can be used with a high-level API like Keras, and can also be used to distribute custom training loops.
distributed-training tensorflow tutorial article
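
A minimal sketch of the Keras path described above, using tf.distribute.MirroredStrategy for single-machine, multi-GPU data parallelism; the model and synthetic data are placeholders.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # replicates the model across local GPUs
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored on every replica
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder data; the strategy splits each batch across the replicas.
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))
model.fit(x, y, batch_size=64, epochs=1)
```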
Training Neural Nets on Larger Batches
💥 Practical Tips for 1-GPU, Multi-GPU & Distributed setups
distributed-training training gpu convolutional-neural-networks
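
The article collects several such tips; below is a hedged sketch of one of them, gradient accumulation, which simulates a larger batch on a single GPU by summing gradients over several small batches before each optimizer step. The model, stand-in data, and accumulation step count are illustrative placeholders.

```python
import torch

accumulation_steps = 4  # effective batch size = loader batch size * 4

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]  # stand-in data

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    (loss / accumulation_steps).backward()  # scale so gradients average, not sum
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # update once per accumulated "large batch"
        optimizer.zero_grad()
```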
AWS vs Paperspace vs FloydHub: Choosing your cloud GPU partner
A look at various features of the top three cloud GPU service providers.
distributed-training gpu aws floydhub

Libraries

General
Ray
Ray is a fast and simple framework for building and running distributed applications.
hyperparameter-optimization reinforcement-learning scalable-reinforcement-learning hyperparameter-tuning
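
A minimal sketch of Ray's core task API, on which its distributed libraries are built: @ray.remote turns a function into a task that runs in parallel across a cluster's workers. Here ray.init() starts a local cluster, and the square function is a placeholder workload.

```python
import ray

ray.init()  # starts a local cluster; pass an address to join an existing one

@ray.remote
def square(x):
    return x * x

# Tasks are scheduled in parallel; ray.get blocks until the results are ready.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```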
Horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
distribution horovod uber article
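
A minimal sketch of Horovod's usage pattern with PyTorch, one of its supported frameworks: each process trains on its own GPU, and the optimizer wrapper averages gradients across all workers. It assumes a launch such as `horovodrun -np 4 python train.py`; the toy model, data, and learning rate are placeholders.

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each process to one GPU

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale lr by worker count

# Broadcast initial state so all workers start identically, then wrap the
# optimizer so step() performs an allreduce of the gradients.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

inputs = torch.randn(32, 10).cuda()   # placeholder data
targets = torch.randn(32, 1).cuda()
loss = torch.nn.functional.mse_loss(model(inputs), targets)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```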
Mixed Precision Training
An investigation into using 16-bit and 32-bit floating-point types in a model during training.
mixed-precision training machine-learning tutorial
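
As one common implementation of this idea (not necessarily the one used in the linked investigation), here is a minimal sketch of mixed precision training with PyTorch's torch.cuda.amp; the toy model and data are placeholders, and a CUDA GPU is required.

```python
import torch

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

inputs = torch.randn(32, 10).cuda()   # placeholder data
targets = torch.randn(32, 1).cuda()

with torch.cuda.amp.autocast():       # run eligible ops in float16
    loss = torch.nn.functional.mse_loss(model(inputs), targets)

scaler.scale(loss).backward()         # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)                # unscale gradients, then update fp32 weights
scaler.update()
```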