OneFlow Made Training GPT-3 Easier (Part 1)

OneFlow is a performance-centered and open-source deep learning framework.
  • Fundamental parallelism techniques of training GPT
  • The correct level of abstraction for distributed deep learning frameworks
  • How to redesign a distributed deep learning framework like OneFlow
  • An introduction to frameworks that are capable of training GPT-3
  • Fundamental parallelism techniques of training GPT
  • Pipeline Parallelism
  • Gradient Accumulation
  • Backward Re-computation
  • 1F1B Strategy

Frameworks That Are Capable of Training GPT-3

Data Parallelism Performance Comparison
Model Parallelism Performance Comparison

Core Parallelism Techniques of Training GPT

  • Data Parallelism
  • Tensor Model Parallelism
  • Pipeline Model Parallelism
Topology of GPT-3 cluster

Pipeline Parallelism

  1. The convergence of SSP has not been fully proven mathematically, and the GeePS paper points out that the convergence quality with SSP is not as good as with BSP;
  2. With SSP, multiple versions of each variable (weight) coexist on the GPU at the same time; for a model as large as GPT-3, the memory cost of keeping multiple versions of the parameters is unaffordable;
  3. BSP can overcome the drawbacks of pipeline parallelism when training GPT-3 through Gradient Accumulation and Checkpointing;
  4. In addition, the analysis in NVIDIA’s paper assumes BSP. Given the parameter update scheme of the optimizer, the proportion of pipeline bubbles is Bubble time fraction = (p − 1) / m, where p is the number of pipeline stages and m is the number of micro-batches in gradient accumulation. If SSP were adopted, the theoretical basis of the paper would no longer hold.
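The bubble-time formula above is easy to evaluate directly. A minimal sketch (the function name is illustrative, not part of any framework API) shows how increasing the number of micro-batches shrinks the pipeline bubble:

```python
def bubble_time_fraction(p: int, m: int) -> float:
    """Fraction of pipeline time spent idle (the 'bubble'),
    where p is the number of pipeline stages and m is the
    number of micro-batches per gradient accumulation step."""
    return (p - 1) / m

# With 4 stages, going from 8 to 64 micro-batches cuts the
# bubble fraction from 3/8 down to 3/64:
print(bubble_time_fraction(4, 8))   # 0.375
print(bubble_time_fraction(4, 64))  # 0.046875
```

This is why gradient accumulation (a larger m) is essential to making pipeline parallelism efficient.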

Gradient Accumulation

  • On a single device, gradient accumulation divides a large mini-batch into multiple micro-batches, reducing the memory needed for each micro-batch.
  • Under data parallelism, gradient accumulation mitigates the overhead of synchronizing gradients (as the number of devices grows, the AllReduce synchronization overhead of the gradients grows with it). Data parallelism and gradient accumulation can be combined; when combined, the speed-up ratio of data parallelism improves because gradient synchronization is performed less frequently.
  • Under pipeline parallelism, gradient accumulation allows different stages to execute different micro-batches in parallel, so that no stage’s computation is blocked and the pipeline is kept busy.
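The single-device case above can be sketched numerically. The toy model and function names below are illustrative (plain Python, not OneFlow API); the point is that accumulating micro-batch gradients before one optimizer update yields exactly the same update as processing the full mini-batch at once:

```python
# Toy scalar model y = w * x with squared-error loss.
def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2 * (w * x - y) * x

def step_full_batch(w, xs, ys, lr):
    # One SGD step over the whole mini-batch.
    g = sum(grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)
    return w - lr * g

def step_accumulated(w, xs, ys, lr, micro_batch_size):
    # Accumulate micro-batch gradients, then apply a single update.
    acc = 0.0
    for i in range(0, len(xs), micro_batch_size):
        mb_x = xs[i:i + micro_batch_size]
        mb_y = ys[i:i + micro_batch_size]
        acc += sum(grad(w, x, y) for x, y in zip(mb_x, mb_y))
    return w - lr * acc / len(xs)

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
w0, lr = 0.5, 0.01
# The accumulated update equals the full mini-batch update,
# but each micro-batch needs only a fraction of the activation memory.
assert abs(step_full_batch(w0, xs, ys, lr)
           - step_accumulated(w0, xs, ys, lr, 2)) < 1e-12
```

The same equivalence is what lets pipeline stages work on different micro-batches concurrently without changing the optimization result.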

Backward Re-computation (Checkpointing)

1F1B Strategy



