OneFlow Made Training GPT-3 Easier（Part 1）
Even if an excellent data scientist who knows and can solve all algorithm problems related to Transformer, he may still be puzzled by other knowledge about distributed training, and fails to run a project.
In this 3-part series of blog articles, you will learn how easy data scientists train GPT with OneFlow.
Written by Cheng Cheng, translated by Dong Wenwen
The GPT-3 released by OpenAI is a breakthrough in the field of AI in 2020. Its 175B parameters and its outstanding performance that surpasses humans on multiple NLP tasks have convinced everyone that big model may be the future.
Together with the emergence of GPT-3, a subsequent problem arrives: the computation capacity and the storage of a single machine can no longer handle such a huge model (though BERT can still be trained with a super server such as DGX-1/2).
As estimated by an article published by NVIDIA, Efficient Large-Scale Language Model Training on GPU Clusters, even if a 175B GPT-3 can be stored in a single device, the time required to train using 8 V100s (the configuration of a DGX-1) is expected to be 36 years, 7 months using 512 V100s, and 1 month using 1024 80GB A100s.
This means that training large models can only be achieved by distributed method.
The demand for computation capacity is still a relatively easy problem to solve, once you have sufficient money to buy more GPUs. After all, OpenAI is not the only organization that has large clusters. However, utilizing thousands of GPUs collectively and efficiently is the key problem left to be tackled.
Even if an excellent data scientist who knows and can solve all algorithm problems related to Transformer, he may still be puzzled by other knowledge about distributed training, such as communication, topology, model parallelism and pipeline parallelism among hundreds of servers, and fails to run a project. This also explains why within the year of GTP-3’s release, only large companies such as NVIDIA and Microsoft are capable of reproducing GPT-3.
The development of distributed training is a trend, but it is unfamiliar to most algorithm engineers. Thus, the R&D team of OneFlow has prepared a series of articles to explain the core techniques required for distributed training of GPT and how OneFlow can reproduce GPT-3. This series involves three articles, which mainly cover the following topics:
- Fundamental parallelism techniques of training GPT
- The correct level of abstraction for distributed deep learning frameworks
- How to redesign a distributed deep learning framework like OneFlow
Being the first of the series, this article will include:
- An introduction of frameworks that are capable of training GPT-3
- Fundamental parallelism techniques of training GPT
- Pipeline Parallelism
- Gradient Accumulation
- Backward Re-computation
- 1F1B Strategy
Frameworks That are Capable of Training GPT-3
DeepSpeed can run larger models with fewer devices than Megatron-LM with data parallelism (ZeRO, ZeRO-Offload). However, DeepSpeed does not support tensor model parallelism. A comparison between OneFlow and Megatron-LM is present below.
From the figures above, it is clear that the performance of Oneflow is better than Megatron-LM in terms of data parallelism and model parallelism.
For a more detailed comparison, please refer to the public evaluation report https://github.com/Oneflow-Inc/DLPerf/blob/master/reports/GPT/dlperf_gpt_test_report_210512.md.
The core techniques of training GPT will be introduced below, and the implementation details will be introduced in other articles.
Core Parallelism Techniques of Training GPT
In the influential paper recently published by NVIDIA, Efficient Large-Scale Language Model Training on GPU Clusters, 3072 80GB A100s were used to train GPT (which cost more than 80 million USD), with model parameters reaching up to 1 trillion (5 times that of the original GPT-3).
With the goal of distributedly training a super large model, three core parallelism techniques are introduced in this paper:
- Data Parallelism
- Tensor Model Parallelism
- Pipeline Model Parallelism
Among the above parallelisms, data parallelism is the most common one. Model parallelism is to divide a large model tensor into multiple sub-tensors, so that the large model can be computed in parallel. Pipeline parallelism (which is called pipeline model parallelism in NVIDIA’s paper) is to partition the entire network into stages, where each device runs a certain amount of stages, thus enabling the large model to be executed in a way of relay.
To train a large model with a scale of 1T, NVIDIA used 384 machines DGX-A100 (each equipped with 8 80GB A100s and 8 200Gbps InfiniBand(IB)), with the GPUs inside the machine being interconnected with Ultra-high-speed NVLink and NVSwitch. The configuration is definitely to be the top.
And how do these devices work collaboratively? The GPT network consists of multiple Transformer Layers, each of which layer includes a sub-graph composed of multi-layer MLP and attention mechanisms. The GPT network mentioned above (with a scale of 1T) is composed of 128 transformer layers. This large GPT network is partitioned into 64 stages, with each stage running on 6 DGX-A100s. The workload among the 6 machines is trained with data parallelism, while the workload among GPUs inside each machine is trained with model parallelism. The 3072 A100s in the entire cluster are divided into a matrix of (6 x 8 x 64), and then train the 1T model using data parallelism, model parallelism and pipeline parallelism simultaneously.
In the above figure, the data parallelism part is relatively easy to understand. Thus, we will mainly focus on explaining Pipeline Parallelism in the following paragraphs.
Given the cluster topology aforementioned, it is evident that pipeline parallelism is the key to train GPT using 3072 A100s. If only data parallelism or model parallelism is chosen, the communication cost between fully interconnected devices will soon be unaffordable with the increase of the number of devices.
In contrast, pipeline parallelism not only reduces the memory burden of a single device, but it also reduces communication cost: there is only one adjacent tensor data that needs to be transmitted between two stages, and communication cost is independent of the scale of network and the number of devices.
At the same time, to make multiple stages execute in parallel, pipeline parallelism also relies on two crucial characteristics, namely Gradient Accumulation and Sublinear Memory Cost. Before introducing them, two constraints of deep learning training and model updating will be firstly explained, namely BSP (Bulk Synchronous Parallel) and SSP (Stale Synchronous Parallel).
As required by BSP, forward computation of the latter batch can only be started after the backward computation and model update of the previous batch completes, so that the forward computation of every batch uses the latest model.
If pipeline parallelism is applied with BSP, then the works of stages are serial, which means that when one device is working, all the other devices are just idling, and the advantage of distributed training can not be brought into play.
The figure above is the timeline of a network using pipeline parallelism with BSP but without gradient accumulation. Assuming that the entire network is equally divided into 4 stages, and each stage is run on one device, the time of backward computation is one time larger than that of the forward computation, and two times larger if re-computation of Checkpointing is counted. Because the 4 devices are executed serially, there are a lot of “bubbles” in the timeline, which results in nearly 70% of the time being idle.
In contrast, SSP is an asynchronous model updating rule, and the forward computation is allowed to use the outdated models, which means the batch 1 does not need to wait the completion of model update of batch 0 in stage 1. However, SSP was not adopted by NVIDIA while training GPT-3. Here are the main reasons:
- The convergence of SSP has not been fully proved mathematically, and the paper GeePS has pointed out that the convergence quality with SSP is not as good as with BSP;
- There will be multiple versions for each variable (weight) on the GPU at the same time in SSP, as a large model network, the cost of memory consumption in GPT-3 caused by multiple versions of model is unaffordable;
- BSP can conquer the drawbacks of pipeline parallelism in training GPT-3 through Gradient Accumulation and Checkpointing;
- In addition, the analysis premise of NVIDIA’s paper suits to BSP. According to the updating method of the parameter optimizer, the proportion of pipeline parallelism bubbles is
Bubble time fraction = (p-1) / m, where p is the number of stages and m is the number of micro-batch in gradient accumulation. If SSP is adopted, the theoretical basis of the paper will be changed.
The principle of gradient accumulation is to divide a big mini-batch into multiple micro-batches, accumulate the gradient of each micro-batch after its forward and backward computation, and update the model after the accumulation of the last micro-batch is finished.
Micro-batch is highly similar to data parallelism: data parallelism is a spatial expansion, in which the data are divided into multiple parts and assigned to multiple devices for parallel computation, and the model is updated after the gradients are accumulated with reduce collective communication; while the micro-batch can be treated as a kind of temporal expansion, in which the data are divided into multiple parts and are computed sequentially on the same device, and the model is updated after the gradients of multiple micro-batches are accumulated.
When the total mini-batch size of the two methods are the same, and the effect of data parallelism is equal to that of accumulation of multiple micro-batches. Data parallelism and gradient accumulation are mathematically equivalent.
By accumulating the gradients of multiple micro-batches, gradient accumulation makes it unnecessary for the forward computation of the next micro-batch to depend on the backward computation of the previous micro-batch (such dependency is still required for the last micro-bath of a mini-batch).
Gradient Accumulation has solved multiple problems:
- In a single device, a large mini-batch size is divided into multiple micro-batches by gradient accumulation. In this way, the memory need of each micro-batch is reduced.
- Under data parallelism, gradient accumulation solves the problem of excessive overhead of synchronizing the gradient (with the increase of devices, the AllReduce synchronization overhead of the gradient also increases). Data parallelism and gradient accumulation can be combined. When combined, the speed-up ratio of data parallelism can be improved for performing a smaller number of gradient synchronization operations.
- Under pipeline parallelism, gradient accumulation allows different stages to execute different micro-batches in parallel, so that the calculation of each stage is not blocked and the purpose of pipeline is achieved.
The pipeline parallelism mentioned in the paper of GPipe (2018) can be easily implemented through the micro-batch method in OneFlow (the details of the OneFlow process will be discussed in the upcoming article).
The timeline of GPipe is shown in the figure below: (there are 4 stages, 8 micro-batches in total, and each mini-batch updates the model after 8 iterations of calculating and accumulating gradients)
In the example of Gpipe’s pipeline parallelism, micro-batches can be conducted in different stages simultaneously. As shown in the figure, the label in each square indicates the serial number of a micro-batch. The same micro-batch passes through all stages in a serial way. The idle time of a device is only about 25% in this case.
But this type of pipeline parallelism may result in a large amount of memory consumption. Remember that the intermediate activation of the forward computation of each micro-batch is consumed by the backward computation, 8 buffers will be needed for complete forward activation for stage 1. This problem leads to another crucial technique: Re-computation.
Backward Re-computation (Checkpointing)
Checkpointing is a concept mentioned in the paper of Chen Tianqi published in 2016 Training Deep Nets with Sublinear Memory Cost, which is also called sublinear memory cost. Checkpointing and CPU offload are two methods of achieving sublinear memory cost.
Generally, tensors calculated in the forward direction will be used in backward pass. Consequently, for specific layers, the forward tensors need to be retained in memory until the end of their corresponding backward computation. This tensor retaining process occupies a significant amount of memory, especially for the layers that are early in the order. Fortunately, the checkpointing method can significantly reduce the time of retaining these forward tensors, thus saving the required memory.
The core idea of Checkpointing is to label a small number of tensors (which are checkpointed) in the forward network. Then, these labelled tensors will be retained in the forward computation while all the other intermediate tensors are discarded immediately if they are no longer required by the forward computation. During the backpropagation, the discarded tensors other than the labelled tensors will be obtained by temporary re-computations, which take the checkpointed tensors as inputs.
Most of the tensors holding forward activations do not need to be kept until the backward computation. This optimization has highly reduced the lifetime of the large, number of tensors and has greatly improved the memory reuse efficiency.
The difference between the computation graph of the two-layer Transformer before and after Checkpointing is shown in the figure above. The main difference lies in the lines linking the forward and the backward, which are reduced from four to two.
The methods of implementing Checkpointing varies in frameworks: Megatron-LM implements its own checkpointed_forward by overloading the torch.nn.Module, which means that Megatron-LM customizes the forward and backward execution logic of the Transformer Layer. On the other hand, the implementation of OneFlow’s Checkpointing is shown in the figure above, in which the subgraph of re-computation is added, and the process of backward consuming forward is replaced by backward consuming subgraphs of forward re-computation.
Re-computation was mainly used in single-device or data parallelism previously. Indeed, it was not specifically designed for pipeline parallelism originally.
But re-computation is critical in pipeline parallelism, because it is unnecessary to cache all forward activations, but only to cache a small number of tensors that are checkpointed. Therefore, the memory cost can be greatly reduced in pipeline parallelism.
CPU offload is a method similar to virtual memory in the computer operating system (swapping the infrequently used memory to disk temporarily thus increasing the capacity of memory). As we know, GPU memory is expensive, with high speed and small capacity. By contrast, host (CPU) memory is cheap, with low speed and large capacity. The process of Swapping out some activations that are not used in the forward computation temporarily to the CPU host memory, and then swapping them into the GPU device memory when needed in the backward computation can save memory.
These two methods support memory optimization in different ways: Checkpointing is a trade off between computation overhead and memory, and CPU offload is a trade off between transmission overhead and memory.
Besides the re-computation, another problem in GPipe related to memory usage should be determined: the number of micro-batches in one big batch, since the number of buffers of activations relies directly on the number of micro-batches (the times of gradient accumulations).
Usually, the number of accumulation execution is large (the number is generally more than twice the number of stages in order to saturate pipeline as much as possible). Therefore, even if the number of buffers is small, a large amount of memory is required.
An improvement strategy called 1F1B (One Forward Pass Followed by One Backward Pass）was thereby proposed in another paper PipeDream, in which the number of buffers of activations only relies on the number of stages. This optimizing step can save memory so that larger models can be trained.
The principle of 1F1B is concise: since the activation in the forward computation can only be released after the completion of corresponding backward computation (regardless of whether the technique of Checkpointing is used or not), in pipeline parallelism, to reduce the number of buffers of activations as much as possible, the storage time of each activation must be shortened as much as possible. That is, freeing the buffers as early as possible.
So, to complete the backward computation of each micro-batch as early as possible, the priority of backward computation needs to be increased: the micro-batch with a smaller serial number conducts backward computation before the forward computation of micro-batch with larger serial number.
Therefore, if the backward computation of one micro-batch in the last stage starts immediately after its forward computation finished, other stages will start backward computation as early as possible, which exactly follows the principle of 1F1B strategy. The timeline of pipeline in 1F1B strategy is shown below:
From the comparison of the pipeline of 1F1B and GPipe, we can find that 8 pieces of activations need to be cached for backward computation using the method of GPipe, while only 4 activations are needed using 1F1B.
Although the proportion of idle time of the two is the same, more layers can be executed and a larger size of micro-batch can be supported with more saved memory, thereby improving the performance.
The techniques (GPipe, Gradient Accumulation, Recomputation, and 1F1B) introduced above are the core ideas for training GPT.
No matter Megatron-LM, OneFlow, DeepSpeed, PaddlePaddle, or MindSpore, the above techniques are supported through different implementations.
In the next article, we will discuss data/model parallelism and the difficulty of distributed training in mainstream frameworks. You will know the reason of developing another framework from scratch like OneFlow given that Megatron-LM’s performance is already excellent.