OneFlow Made Training GPT-3 Easier (Part 1)

OneFlow is a performance-centered and open-source deep learning framework. This article covers the following topics:
  • The correct level of abstraction for distributed deep learning frameworks
  • How to redesign a distributed deep learning framework like OneFlow
  • Fundamental parallelism techniques of training GPT
  • Pipeline Parallelism
  • Gradient Accumulation
  • Backward Re-computation
  • 1F1B Strategy

Frameworks That are Capable of Training GPT-3

The currently popular open-source libraries for training GPT are Megatron-LM, released by NVIDIA, and DeepSpeed, released by Microsoft, both of which are built on PyTorch.

Data Parallelism Performance Comparison
Model Parallelism Performance Comparison

Core Parallelism Techniques of Training GPT

In the influential paper recently published by NVIDIA, Efficient Large-Scale Language Model Training on GPU Clusters, 3072 80 GB A100 GPUs were used to train GPT (the hardware alone costs more than 80 million USD), with the model parameter count reaching up to 1 trillion (5 times that of the original GPT-3).

  • Tensor Model Parallelism
  • Pipeline Model Parallelism
Topology of GPT-3 cluster

Pipeline Parallelism

Given the cluster topology described above, it is evident that pipeline parallelism is the key to training GPT on 3072 A100s. If only data parallelism or model parallelism were used, the communication cost between fully interconnected devices would soon become unaffordable as the number of devices grows.

NVIDIA's pipeline schedule is bulk synchronous parallel (BSP) rather than stale synchronous parallel (SSP), for the following reasons:

  1. Under SSP, multiple versions of each variable (weight) must be kept on the GPU at the same time. For a network as large as GPT-3, the memory cost of holding multiple versions of the model is unaffordable;
  2. BSP can overcome the drawbacks of pipeline parallelism when training GPT-3 through Gradient Accumulation and Checkpointing;
  3. In addition, the analysis in NVIDIA's paper assumes BSP. Given the way the optimizer updates parameters, the proportion of pipeline bubbles is Bubble time fraction = (p - 1) / m, where p is the number of pipeline stages and m is the number of micro-batches in gradient accumulation. If SSP were adopted, the theoretical basis of the paper would change.
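For intuition, the bubble fraction from the formula above can be computed directly. A small sketch (the function name is ours, not from the paper):

```python
def bubble_fraction(p: int, m: int) -> float:
    """Fraction of time pipeline stages sit idle under a BSP schedule.

    p: number of pipeline stages
    m: number of micro-batches per big batch (gradient accumulation steps)
    Formula from the NVIDIA paper: (p - 1) / m.
    """
    return (p - 1) / m

# Increasing the number of micro-batches shrinks the bubble:
print(bubble_fraction(p=8, m=8))   # 0.875
print(bubble_fraction(p=8, m=64))  # 0.109375
```

This is why a large gradient accumulation count m is essential: with m much larger than p, the pipeline bubbles become a small fraction of total time.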

Gradient Accumulation

The principle of gradient accumulation is to divide a big mini-batch into multiple micro-batches, accumulate the gradient of each micro-batch after its forward and backward computation, and update the model only after the gradient of the last micro-batch has been accumulated.

  • Under data parallelism, gradient accumulation alleviates the overhead of synchronizing gradients (as the number of devices grows, so does the AllReduce overhead of gradient synchronization). Data parallelism can be combined with gradient accumulation; the combination improves the speed-up ratio of data parallelism because fewer gradient synchronization operations are performed.
  • Under pipeline parallelism, gradient accumulation allows different stages to execute different micro-batches in parallel, so that the calculation of each stage is not blocked and the purpose of pipeline is achieved.
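The key property of gradient accumulation is that the final update equals the update a single full-batch step would have produced. A framework-free toy sketch (a single scalar weight with a hypothetical squared loss; the names are illustrative, not OneFlow or PyTorch API):

```python
# Toy model: scalar weight w, loss = 0.5 * (w*x - y)**2,
# so dloss/dw = (w*x - y) * x.

def micro_batch_grad(w, xs, ys):
    """Average gradient of the loss over one micro-batch."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train_step_with_accumulation(w, mini_batch, num_micro, lr=0.1):
    """Split one mini-batch into micro-batches, accumulate gradients,
    and apply a single optimizer update at the end."""
    xs, ys = mini_batch
    size = len(xs) // num_micro
    acc = 0.0
    for i in range(num_micro):
        s = slice(i * size, (i + 1) * size)
        acc += micro_batch_grad(w, xs[s], ys[s]) * size  # accumulate the sum
    acc /= len(xs)       # same averaging as one full-batch step
    return w - lr * acc  # single model update

def train_step_full_batch(w, mini_batch, lr=0.1):
    xs, ys = mini_batch
    return w - lr * micro_batch_grad(w, xs, ys)

batch = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(train_step_with_accumulation(0.0, batch, num_micro=2))  # 1.5
print(train_step_full_batch(0.0, batch))                      # 1.5 (identical)
```

Because the two steps are mathematically equivalent, the micro-batch count can be tuned freely for pipeline utilization or synchronization cost without changing the training trajectory.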

Backward Re-computation (Checkpointing)

Checkpointing is a concept proposed in Tianqi Chen's 2016 paper Training Deep Nets with Sublinear Memory Cost, and is also known as sublinear memory cost. Checkpointing and CPU offload are two methods of achieving sublinear memory cost. The idea of checkpointing is to keep only a subset of the activations produced in the forward pass and to recompute the discarded ones from the nearest kept activation during the backward pass, trading extra computation for memory.
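Checkpointing trades compute for memory: only activations at segment boundaries are stored, and the rest are re-derived during the backward pass. A minimal, framework-free sketch with toy scalar "layers" (real frameworks apply the same idea to activation tensors, e.g. torch.utils.checkpoint in PyTorch):

```python
# Four toy "layers"; in practice these are network segments.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

def forward_with_checkpoints(x, layers, every):
    """Run the forward pass, storing only every `every`-th activation."""
    kept = {0: x}
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % every == 0:
            kept[i + 1] = x  # segment-boundary checkpoint
    return x, kept

def recompute_segment(kept, layers, start, end):
    """Rebuild the discarded activations for layers[start:end] by re-running
    the forward pass from the stored checkpoint at `start`."""
    acts = [kept[start]]
    for f in layers[start:end]:
        acts.append(f(acts[-1]))
    return acts

out, kept = forward_with_checkpoints(2, layers, every=2)
print(sorted(kept))                         # [0, 2, 4] -- ~n/every stored, not n
print(recompute_segment(kept, layers, 0, 2))  # [2, 3, 6] rebuilt on demand
```

Only the checkpoints stay resident between forward and backward; each segment's intermediate activations exist only briefly while that segment's backward pass runs, which is what makes the memory cost sublinear in depth.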

1F1B Strategy

Besides re-computation, another memory-related question in GPipe must be settled: the number of micro-batches in one big batch, since the number of buffered activations depends directly on the number of micro-batches (i.e., the number of gradient accumulation steps). The 1F1B (one forward, one backward) schedule addresses this: each stage alternates one forward pass with one backward pass, so a micro-batch's activations are released as soon as its backward pass finishes, and the number of in-flight micro-batches per stage is bounded by the number of pipeline stages rather than by the accumulation count.
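A back-of-the-envelope sketch of that dependence, comparing a GPipe-style schedule with 1F1B (the per-stage bounds follow the standard analysis of these schedules; the helper function is illustrative):

```python
def max_buffered_activations(p, m, schedule):
    """Peak number of micro-batch activations each stage must keep alive.

    p: number of pipeline stages, m: micro-batches per big batch.
    Under GPipe, every stage buffers all m micro-batches' activations until
    the backward passes begin. Under 1F1B, stage s (0-indexed) starts its
    backward passes early and holds at most p - s micro-batches in flight.
    """
    if schedule == "gpipe":
        return [m] * p
    if schedule == "1f1b":
        return [min(m, p - s) for s in range(p)]
    raise ValueError(schedule)

# With 4 stages and 16 micro-batches:
print(max_buffered_activations(4, 16, "gpipe"))  # [16, 16, 16, 16]
print(max_buffered_activations(4, 16, "1f1b"))   # [4, 3, 2, 1]
```

Under 1F1B the peak activation memory no longer grows with the accumulation count m, so m can be made large to shrink pipeline bubbles without running out of device memory.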




