Written by Cheng Cheng, translated by Dong Wenwen
Part 1 of this three-article series explained why GPT has to be trained in a distributed fashion and introduced the relevant parallelism techniques. Part 2 compared the usability of these techniques in OneFlow and PyTorch (Megatron-LM) from a user's perspective.
This article takes a closer look at the internal mechanisms from the perspective of framework designers and developers, covering the characteristics of OneFlow's actor-based runtime and its Boxing mechanism.
It turns out that the distributed-parallelism machinery other frameworks implement by hand, as well as users' advanced customizations, are in fact particular use cases of OneFlow's general design.
How does OneFlow Realize Pipeline Parallelism?
OneFlow's runtime system is based on the actor model. For more details of its design, please refer to OneFlow's design documentation.
OneFlow's runtime solves several classical difficulties in distributed deep learning in a general and elegant way. Two specific examples are native support for pipelining and decentralized scheduling.
Native pipelining: each actor solves the flow-control problem through its internal state machine, the fixed number of Regsts it owns, and the Regst message protocol shared with its upstream and downstream actors.
The state machine of an actor is shown in the figure below:
Decentralized scheduling: an actor only needs the messages from its immediate upstream and downstream neighbors to decide whether it can act. Pipeline parallelism therefore requires no complex central scheduling logic: all actors form a completely decentralized, collaborative distributed network.
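The decision rule of a single actor can be sketched in a few lines of Python. This is a toy model for illustration only, not OneFlow's C++ runtime: an actor fires only when it holds a readable Regst from upstream and a free Regst of its own, and Regsts travel back and forth purely as messages.

```python
from collections import deque

class Actor:
    """Toy actor (a sketch, not OneFlow's C++ runtime). An actor may fire only
    when it holds a readable Regst from upstream AND a free Regst of its own
    to write into; Regsts are handed back and forth purely via messages."""

    def __init__(self, name, regst_num):
        self.name = name
        self.free_regsts = deque(range(regst_num))  # fixed, pre-allocated quota
        self.ready_inputs = deque()                 # (producer, regst_id) messages
        self.consumer = None                        # single downstream actor here

    def can_act(self):
        # The whole "scheduler": look only at the local message queues.
        return bool(self.free_regsts) and bool(self.ready_inputs)

    def act(self):
        producer, in_regst = self.ready_inputs.popleft()
        out_regst = self.free_regsts.popleft()
        # ... launch the kernel that reads in_regst and writes out_regst ...
        producer.free_regsts.append(in_regst)       # msg upstream: Regst consumed
        if self.consumer is not None:
            self.consumer.ready_inputs.append((self, out_regst))  # msg downstream
        else:
            self.free_regsts.append(out_regst)      # a sink frees its slot at once
```

Because the Regst quota is fixed, a fast actor whose Regsts are all in flight simply fails can_act() and waits; no central scheduler is involved.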
Let's illustrate these two characteristics with a concrete scenario: the figure below shows a common deep learning job composed of four stages: data loading, preprocessing, copyH2D, and training.
In this example, each stage is carried out by an actor. Without flow control, a fast stage (such as data loading in the figure above) would quickly produce far more batches than a slow stage (such as training) can consume. The extra batches pile up without being consumed in time, wasting system resources or even crashing the job.
In OneFlow, however, the memory quota of each stage is fixed in advance. For example, when the RegstNum of each of the four actors equals 2, the following pipeline timeline is obtained (assuming training takes the longest):
After a few batches, the pace of all four stages is completely governed by the slowest one. The batch production rate of every stage stays coordinated, and no resources are wasted.
This is how OneFlow uses Back Pressure to solve flow control problems.
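The back-pressure effect can be reproduced with a small timing model. This is a sketch only: the stage costs below are made-up numbers, and the recurrence is a simplification of the Regst return protocol. A stage may start batch b only when its upstream has produced batch b, its own previous batch is finished, and the Regst it used for batch b - RegstNum has been returned by its downstream consumer.

```python
# A sketch of back-pressure timing, not OneFlow code. Stage costs are made-up
# numbers; each stage owns regst_num = 2 output Regsts. A stage may start
# batch b only when (1) upstream has produced batch b, (2) its own previous
# batch is done, and (3) the Regst it used for batch b - regst_num has been
# returned by the downstream consumer -- the back-pressure rule.
stage_time = {"load": 1, "preprocess": 2, "copyH2D": 1, "train": 6}
stages = list(stage_time)
regst_num, num_batches = 2, 8

finish = {}                                   # (stage, batch) -> finish time
for b in range(num_batches):
    for i, s in enumerate(stages):
        upstream = finish[(stages[i - 1], b)] if i > 0 else 0.0
        prev     = finish[(s, b - 1)] if b > 0 else 0.0
        regst    = (finish[(stages[i + 1], b - regst_num)]
                    if i + 1 < len(stages) and b >= regst_num else 0.0)
        finish[(s, b)] = max(upstream, prev, regst) + stage_time[s]

for s in stages:
    print(f"{s:>10}:", [finish[(s, b)] for b in range(num_batches)])
```

Running it shows that, after a short warm-up, every stage advances roughly once per 6 time units, i.e. at the rate of the slowest stage.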
In OneFlow, the key to pipeline parallelism is therefore the number of Regsts. In practice, OneFlow implements Megatron-LM's pipeline parallelism with a more general algorithm: inserting Buffer Ops.
In the logical computation graph, a Buffer Op is inserted on each edge that links a forward producer to its backward consumer.
The number of Buffer Regsts depends on the stage in which the actor resides.
After the backward-to-forward consumption has been optimized by Checkpointing (that is, sublinear memory planning, or saving memory by recomputation), only very few such consumption edges remain within each Placement Group.
The overall algorithm implementation can be explained by the following diagram:
Assuming the entire network is partitioned into 4 stages with 8 Transformer Layers in total, we need to insert Buffer Ops into the forward and backward computation graphs of the first 3 stages.
The last stage needs no Buffer Op, because it performs the backward computation of each micro-batch immediately after its forward pass completes.
Compared with Megatron-LM's complex hand-written scheduler and hand-written communication primitives, OneFlow only needs to insert Buffer Ops to support pipeline parallelism.
Since Checkpointing is applied to each Transformer Layer, each layer has only one forward-to-backward edge, so only one Buffer Op per layer is needed.
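The pass can be sketched as follows. Everything below is illustrative: the helper name is hypothetical, and the buffer size num_stages - stage is an assumption based on the 1F1B-style schedule rather than OneFlow's exact sizing.

```python
# A sketch of the Buffer-Op insertion idea on a toy logical graph -- not the
# real OneFlow graph pass. The buffer size (num_stages - stage) is an
# assumption based on the 1F1B-style schedule, not OneFlow's exact sizing.
def insert_buffer_ops(fwd_bwd_edges, stage_of, num_stages):
    """fwd_bwd_edges: the forward->backward consumption edges that survive
    Checkpointing (one per Transformer Layer). stage_of: op name -> stage."""
    new_edges = []
    for fwd, bwd in fwd_bwd_edges:
        stage = stage_of[fwd]
        if stage == num_stages - 1:
            new_edges.append((fwd, bwd))          # last stage: no buffer needed
            continue
        buf = f"buffer(after={fwd}, regst_num={num_stages - stage})"
        new_edges += [(fwd, buf), (buf, bwd)]     # route the edge through the buffer
    return new_edges

# 8 Transformer Layers over 4 stages, one surviving edge per layer:
stage_of = {f"layer{i}_fwd": i // 2 for i in range(8)}
edges = [(f"layer{i}_fwd", f"layer{i}_bwd") for i in range(8)]
for edge in insert_buffer_ops(edges, stage_of, num_stages=4):
    print(edge)
```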
Boxing: Transforming Physical Tensor on Demand
Let's take the data + model parallelism of a Linear Layer as an example to show that every combination of data parallelism and model parallelism is essentially just a configuration described by SBP.
Which communication operations are needed, where in the network they are inserted, and which devices each device talks to are all derived automatically from the SBP formulation, with mathematical consistency guaranteed.
With the help of OneFlow, algorithm engineers can say goodbye to the communication primitives in distributed parallelism.
Not only that, developers of the OneFlow framework itself do not need to care about communication primitives in the distributed system either, since the SBP abstraction layer decouples operators and the network definition from distributed communication.
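As a concrete illustration, here is a minimal sketch of what this looks like with OneFlow's global-tensor API (flow.placement, flow.sbp, tensors created with a placement and an sbp). It assumes a OneFlow installation with two GPUs and a script launched with two ranks (for example via OneFlow's distributed launcher).

```python
# A minimal sketch, assuming a OneFlow install with 2 GPUs and the
# global-tensor API; launch with two ranks.
import oneflow as flow

placement = flow.placement("cuda", ranks=[0, 1])

# Data tensor: logically (4, 8), physically split along dim 0 -> (2, 8) per GPU.
x = flow.randn(4, 8, placement=placement, sbp=flow.sbp.split(0))

# Model tensor: logically (8, 6), replicated on both GPUs (broadcast).
w = flow.randn(8, 6, placement=placement, sbp=flow.sbp.broadcast)

out = flow.matmul(x, w)       # logical shape (4, 6); SBP inferred as split(0)
print(out.sbp, out.shape)     # the physical piece on each GPU is (2, 6)
```

The user only declares how each tensor is distributed; whatever communication downstream consumers require is inserted by the framework (the Boxing mechanism described below).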
Let’s take 1-D SBP as an example, and then expand to 2-D SBP.
Under 1-D SBP, the core of a Linear Layer is the MatMul (matrix multiplication) computation. From a logical perspective, assume the matrix multiplication is (m, k) x (k, n) = (m, n), where m is the total number of samples, and k and n are the numbers of neurons in the hidden layer and the output layer, respectively.
The logical computation graph -> physical computation graph mapping relationship of data parallelism is shown in the following figure:
Under data parallelism, each device holds the whole model (Tensor W, shape = (k, n)), which corresponds to a Broadcast configuration. Suppose there are two devices: GPU 0 holds the first half of the data (Tensor X, shape = (m/2, k)) and GPU 1 holds the second half, so Tensor X's SBP Parallel = Split(0).
At the same time, we can see that the output of the matrix multiplication, Tensor out, is also split along the 0th dimension, i.e. Split(0).
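This mapping can be checked with a few lines of NumPy simulating the two devices (a sketch of the math only, not of OneFlow's execution):

```python
import numpy as np

m, k, n = 4, 8, 6
x = np.random.randn(m, k)              # logical data tensor X
w = np.random.randn(k, n)              # logical model tensor W

# Data parallelism: X is Split(0) across two "devices", W is Broadcast.
x0, x1 = np.split(x, 2, axis=0)        # each device holds (m/2, k)
out0, out1 = x0 @ w, x1 @ w            # each device computes (m/2, n)

# Concatenating along dim 0 recovers the logical output: out is itself
# Split(0), so the forward pass needs no communication at all.
assert np.allclose(np.concatenate([out0, out1], axis=0), x @ w)
```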
For a Linear Layer, there are two ways to split the model Tensor: along the 0th dimension (row split, corresponding to RowParallelLinear in Megatron-LM) or along the 1st dimension (column split, corresponding to ColumnParallelLinear in Megatron-LM).
The logical computation graph -> physical computation graph mapping relationship of the row split (RowParallelLinear) model parallelism is shown in the following figure:
The original model tensor is split into sub-tensors that are distributed across the GPUs: GPU 0 holds the first half and GPU 1 holds the second half, so the size of Tensor W on each device is shape = (k/2, n) (and the data Tensor X is correspondingly Split(1), with each device holding a (m, k/2) slice). Each device's Tensor out has the same size as the logical output, namely shape = (m, n), but it is only a partial result. Adding the two Tensor outs element-wise yields the final logical out tensor, i.e. the output's SBP Parallel = PartialSum.
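Again, the math can be verified with a small NumPy sketch simulating the two devices (illustration only):

```python
import numpy as np

m, k, n = 4, 8, 6
x = np.random.randn(m, k)
w = np.random.randn(k, n)

# Row split (RowParallelLinear): W is Split(0) into (k/2, n) pieces and the
# matching halves of X are Split(1), i.e. (m, k/2) per device.
x0, x1 = np.split(x, 2, axis=1)
w0, w1 = np.split(w, 2, axis=0)
out0, out1 = x0 @ w0, x1 @ w1          # each is (m, n) but only a partial result

# PartialSum: adding the pieces element-wise recovers the logical output.
assert np.allclose(out0 + out1, x @ w)
```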
Another way of splitting the big model tensor (ColumnParallelLinear) is demonstrated below:
In this example, both Tensor W and Tensor out are split following Split(1), while every device needs the full data tensor, i.e. Tensor X is Broadcast.
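And the column-split case in NumPy (illustration only):

```python
import numpy as np

m, k, n = 4, 8, 6
x = np.random.randn(m, k)
w = np.random.randn(k, n)

# Column split (ColumnParallelLinear): every device keeps the full X
# (Broadcast) and a (k, n/2) slice of W (Split(1)).
w0, w1 = np.split(w, 2, axis=1)
out0, out1 = x @ w0, x @ w1            # each is (m, n/2), so out is Split(1)

# Concatenating along dim 1 recovers the logical output.
assert np.allclose(np.concatenate([out0, out1], axis=1), x @ w)
```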
In an actual GPT network, RowParallelLinear and ColumnParallelLinear are used repeatedly and alternately, so the output's SBP Parallel falls into two categories: PartialSum and Split(1). Since the next model-parallel operation may require the full data (Broadcast), AllReduce and AllGather are needed for communication among devices. In OneFlow, this process is called Boxing: whenever two logical operators view the same tensor under different SBP Parallel types, OneFlow automatically inserts the necessary communication and data splitting/transmission/concatenation tasks, so that the downstream operator always receives the SBP type it expects.
The following figures illustrate two types of Boxing. The first shows PartialSum -> Broadcast: to obtain the complete data, the PartialSum pieces have to be summed, so OneFlow automatically executes an AllReduce here (the same AllReduce that is written out by hand inside Megatron-LM).
The second figure shows Split(1) -> Broadcast. In this case, the pieces obtained by splitting dimension 1 of the original big tensor have to be concatenated back together, so OneFlow automatically carries out an AllGather:
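Conceptually, Boxing maps the pair (SBP produced upstream, SBP required downstream) to the communication that must be inserted. The table below is a hypothetical, heavily simplified sketch of that idea; OneFlow's real boxing logic covers many more cases, including slicing and point-to-point copies:

```python
# A hypothetical, heavily simplified boxing table -- not OneFlow's actual code.
# Key: (SBP produced upstream, SBP required downstream) -> collective to insert.
BOXING_RULES = {
    ("partial_sum", "broadcast"): "all_reduce",      # sum the partial results
    ("split(1)",    "broadcast"): "all_gather",      # concatenate the slices
    ("split(0)",    "broadcast"): "all_gather",
    ("partial_sum", "split(0)"):  "reduce_scatter",
    ("broadcast",   "split(0)"):  "local_slice",     # no network traffic needed
}

def boxing_op(produced_sbp, required_sbp):
    if produced_sbp == required_sbp:
        return None                                  # SBPs already match
    return BOXING_RULES.get((produced_sbp, required_sbp), "general_boxing")

print(boxing_op("partial_sum", "broadcast"))         # all_reduce
print(boxing_op("split(1)", "broadcast"))            # all_gather
```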
To conclude, OneFlow's SBP mechanism decides when and how parallel devices communicate with each other. Through Boxing, OneFlow enables users to run various kinds of data and model parallelism with significantly less work.
2-D SBP is obtained by combining two 1-D SBPs, one for each dimension of the device array. For MatMul (a x b = out), the SBP signature of data parallelism is {a: Split(0), b: Broadcast, out: Split(0)}, that of model parallelism (RowParallelLinear) is {a: Split(1), b: Split(0), out: PartialSum}, and that of hybrid parallelism is {a: [Split(0), Split(1)], b: [Broadcast, Split(0)], out: [Split(0), PartialSum]}.
When a later operation consumes this out tensor as [Split(0), Broadcast], OneFlow automatically inserts an AllReduce operator between the two operations. Specifically, this AllReduce Op is itself split following Split(0): it runs independently within each Split(0) group.
For instance, with 4 machines of 8 devices each, the 8 devices inside each machine execute AllReduce with one another, but no communication happens across machines.
Through this formulation, OneFlow achieves model parallelism inside each machine and data parallelism across machines simultaneously.
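The grouped AllReduce of the hybrid signature can also be checked in NumPy on a toy 2 x 2 device mesh standing in for the 4 x 8 one above (a sketch of the math only; the sizes are made up):

```python
import numpy as np

m, k, n = 4, 8, 6
x = np.random.randn(m, k)
w = np.random.randn(k, n)

# Toy 2 x 2 device mesh: mesh dim 0 = data-parallel groups (Split(0) of x),
# mesh dim 1 = model-parallel ranks in a group (Split(1) of x, Split(0) of w).
x_groups = np.split(x, 2, axis=0)                    # Split(0) over mesh dim 0

out_groups = []
for xg in x_groups:
    xg_parts = np.split(xg, 2, axis=1)               # Split(1) inside the group
    w_parts = np.split(w, 2, axis=0)                 # Split(0) inside the group
    partial = [xp @ wp for xp, wp in zip(xg_parts, w_parts)]  # PartialSum pieces
    out_groups.append(sum(partial))                  # grouped AllReduce: in-group only

# Stacking per-group results recovers the logical output; no cross-group traffic.
assert np.allclose(np.concatenate(out_groups, axis=0), x @ w)
```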
For more complex parallel networks, most users would find Megatron-LM extremely inconvenient to customize. With OneFlow's Boxing mechanism, however, the added complexity is not much of a problem, since any such parallelism scheme is just another variation of 2-D SBP.
Of course, 2-D SBP is one of the simplest examples.
Related articles:
OneFlow Made Training GPT-3 Easier(Part 1)
Correct Level of Abstraction for Distributed Deep Learning Frameworks(Part 2)
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.