# How to Go beyond Data Parallelism and Model Parallelism: Starting from GShard

*Written by Jinhui Yuan; Translated by Kaiyan Wang, Xiaozhen Liu*

This article lists papers on GShard, presents background information and inspiration from the papers, and finally evaluates what else can be done to improve GShard from similar work that has been done in OneFlow.

GShard papers, first placed on the arXiv on June 30, 2020, include “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (*https://arxiv.org/pdf/2006.16668.pdf*)", and a more systematic paper “GSPMD: General and Scalable Parallelization for ML Computation Graphs (*https://arxiv.org/pdf/2105.04663.pdf*)".

The paper contains two main parts of work, one on parallel APIs and one on Mixture of experts. The former part is more interesting and I will only discuss this part. The contribution on parallel APIs is outlined clearly in the abstract of the paper:

GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler.

I am not going to go into too much detail which is available in the original paper. This article will only present some background information and an evaluation of what else could be improved in GShard from similar work that has been done in OneFlow. Only by putting GShard in context can we see more clearly how good and bad it is.

# Inspiration from work similar to GShard

Starting with data parallelism and model parallelism, let’s list the related work I know of before GShard.

**1. One weird trick for parallelizing convolutional neural networks**

This is perhaps the first paper exploring model parallelism, published by Alex Krizhevsky on the arXiv in 2014 (*https://arxiv.org/pdf/1404.5997.pdf*).

The most important insight of this paper is the discovery that different layers are suitable for different types of parallelism. Specifically, convolutional layers with larger data than parameters are suitable for data parallelism, while fully connected layers with larger parameters than data are suitable for model parallelism.

This was first implemented on cuda-convnet, an early deep learning framework that fewer people know about now.

**2. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks**

This paper was posted on ICML by Zhihao Jia in 2018 (link: https://cs.stanford.edu/~zhihao/papers/icml18.pdf). He has published two influential papers in this field.

Alex’s paper intuitively suggests that there are levels suitable for data parallelism and levels suitable for model parallelism. So given a certain neural network, is there an automatic way to find the optimal parallel approach? This paper by Zhihao Jia is trying to address this question.

First, this paper goes further in abstraction and finds that data parallelism and model parallelism are just different ways of tensor slicing. Some are data slicing, some are model slicing, and for multidimensional tensor, the effect is different in different dimensions, such as sample, channel, width, length, etc.

Second, different slicing methods are different configurations that will lead to different effects. So finding the optimal parallelism is actually a search for the optimal configuration in the configuration space. Then, the problem becomes a search problem.

Finally, it introduces a cost model to measure the pros and cons of each configuration and proposes a series of methods for pruning the search space, and implements a prototype system.

This paper outlines the basic framework for automatic parallelism, and many automatic parallelism solutions are such a process.

**3. Beyond Data and Model Parallelism for Deep Neural Networks**

This paper, published by FlexFlow on SysML in 2019 (*https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf*), mainly proposes the execution simulator to refine the cost model. But on the abstraction of the search space, I think it is rather a bit of a step backward. For example, while the previous paper worked on generic tensor slicing, this one tries to name by partitioned dimensions, especially SOAP (sample, operator, attribute, parameter).

Firstly, I think naming partitioned dimensions is a regression in abstraction and generality, which is where I feel Mesh-Tensorflow below falls short and where GShard excels in abstraction. Secondly, it is not perfect in conceptual simplicity and completeness, i.e., there is duplicate semantics inside SOAP and it is not complete enough.

**4. Supporting Very Large Models Using Automatic Dataflow Graph Partitioning**

This is the Tofu (*https://arxiv.org/pdf/1807.08887.pdf*) published by Minjie Wang et al. on EuroSys in 2019. The problem this paper is trying to solve is the same as that of the two proposed by Zhihao Jia, and is a nearly simultaneous and parallel exploration.

Tofu proposes a set of DSLs to facilitate developers to describe the tensor partitioning strategy, using a poly-like integer interval analysis to describe the parallel strategy. Similarly, a lot of very distinctive work has been done on the search algorithm for the parallel strategy. Here, however, I am mainly concerned with how these works abstract the search space.

The difference between Tofu and other work is that it focuses on the partition of operators, while other work focuses on the partition of tensors, which are of course equivalent. However, I think it is better to focus on the partition of the tensor, which does not require the user to modify the implementation of the operator. Tofu requires the user to describe the implementation of the operator in the DSL.

This difference is also reflected at the API level. For example, Mindspore and OneFlow, as one of the few systems in the general framework that implement complete data parallelism and model parallelism, differ in their Python APIs. As you can see in the sample code of Mindspore training the PanGu Model, its partition interface is placed on operator. In contrast, OneFlow’s SBP system is to put the partition interface on the tensor, with a single card on the operator API exactly the same as distributed training (OneFlow’s paper: *https://arxiv.org/pdf/2110.15032.pdf*).

This paper discusses the scope of this solution and what problems cannot be solved, which is rare in similar work. In particular, it limits the partition-n-reduce model, which is actually the same model for GShard. `Partition and reduce`

also have some drawbacks. In fact, after partition, it is not necessary to `unreduce`

immediately. It is possible to have intermediate partial results continue to participate in the computation in the system, which may be more efficient. This is why OneFlow's SBP annotation system introduced P (Partial).

**5. Mesh-TensorFlow: Deep Learning for Supercomputers**

Mesh-TensorFlow is a paper presented by Google Brain on NIPS 2018 (*https://arxiv.org/pdf/1811.02084.pdf*). It should be noted that the authors of Mesh-TensorFlow and GShard are almost the same, so Mesh-TensorFlow can even be seen as the predecessor of GShard.

The core idea of Mesh-TensorFlow is also beyond batch splitting, where data parallelism is batch splitting and model parallelism is tensor splitting in other dimensions. This paper abstracts the cluster’s accelerator card into a mesh structure and proposes a way to split and map the tensor to this mesh structure.

In my opinion, the main shortcoming of Mesh-TensorFlow is the need to name the dimensions of the tensor, which is anti-abstraction and loses generality. Of course GShard overcomes this problem as a follow-up work, and this is where GShard has improved.

So much for the literature review, and let’s summarize:

**The purpose of all this work is to provide an annotation system similar to the “type system” of the programming language. This system needs to be minimal and complete, and it defines the “automatically parallelled” search space.****Any configuration in the search space, i.e., any parallel strategy, is mathematically correct; the only difference is the execution efficiency, and our goal is to find the most efficient parallel strategy.****The framework needs such a capability that, given any kind of configuration, it can be translated and converted into a physical network (execution plan) that ensures that this parallel strategy can be executed successfully, even if it is not very efficient.****The framework should ideally be able to automatically search for the most efficient one.**

By examining all the work in these dimensions, we can evaluate each of them, and of course GShard.

# GShard Needs to Be Improved

GShard provides three types of annotations:

I think GShard has some shortcomings compared to OneFlow’s SBP.

First, this definition is redundant to some extent. split and shard are actually the same thing, but split is done in one dimension, while shard can split in multiple dimensions. broadcast in OneFlow corresponds exactly to GShard replicate, and split in OneFlow corresponds to split and shard in GShard. However, OneFlow extends split to multi-dimensional: 1D split is equivalent to GShard split, and ND split is equivalent to GShard shard.

Second, this definition is not complete. The concept of Partial in OneFlow SBP is missing in GShard. Thus, when there are local computation results in the system, they needs to be processed immediately by a reduction operation to get the complete result (If the inputs are partitioned along contracting dimensions, the local result is partial and we need to use an AllReduce to combine them and produce the final result).

However, in practice, some local computation results are still legal to participate in downstream computations, and the reduction operation can be delayed until necessary.

To illustrate these issues, I will give a brief overview of OneFlow’s SBP here.

The same logical tensor can be mapped in the above ways on two devices.

For matrix multiplication, when the SBP Signatures of the two inputs are given, the SBP Signature of the output is also determined. The above table lists all legal implementations of matrix multiplication. As we can see from rows 5 and 6, the local results can be used as inputs for matrix multiplication, but the output is also a local result. If there is a sequence of matrix multiplications Y = U * V * W, the previous local computation results are not required to participate in the downstream computations after the reduction. The local computation results can keep flowing through the system without reduction until the full results are needed (As a side note, it is more precise to use “unreduced” for “partial” above).

Finally, GShard is not concise enough for the concept of multidimensional division. It uses different definitions for one-dimension and multi-dimension: split and shard. While OneFlow uses `split`

uniformly, only distinguishing between 1D and ND, which is more general.

The following figure shows an example of a 2-dimensional split. The whole device is divided into 2 groups, and each group contains 2 devices. A matrix can be first divided into two groups by splitting S(0) on the 0-axis, and then divided into 2 devices by splitting S(1) on the 1-axis within each group.

Multidimensional SBP can be very powerful. For example, it is very convenient to implement the 2D matrix multiplication based on SBP in the following paper: An Efficient 2D Method for Training Super-Large Deep Learning Models (*https://arxiv.org/pdf/2104.05343.pdf*).

*Originally published on Zhihu:*

I hope this article will help you in your deep learning projects😊. If you want to experience the functions of OneFlow, you can follow the method described in this article. If you have any questions or comments💡 about use, please feel free to leave a comment in the comments section below. Please do the same if you have any comments, remarks or suggestions for improvement. In future articles, we’ll introduce more details of OneFlow.

*Related articles：*

*Welcome to visit OneFlow on** **GitHub** **and follow us on** **Twitter** **and** **LinkedIn**.*

Also, welcome to join our *Discord group*** **to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.