Combating Software System Complexity: Appropriate Abstraction Layer

OneFlow
Sep 27, 2021 · 10 min read


Written by Yuan Jinhui; Translated by Wang Kaiyan, Dong Wenwen

In previous articles, we have explored methods such as “entities should not be multiplied unnecessarily” and “conceptual integrity” to combat the complexity of software systems, emphasizing the need for a more “restrained” use of abstraction, i.e., to try to solve the largest number of problems with the fewest concepts.

Of course, abstraction is a necessary means of overcoming complexity. Whether in modularity or virtualization, the purpose of abstraction is to “isolate” and “hide” information: stripping away the false to keep the true, and discarding the crude to keep the essential. In this article, we take the design of the network data transmission module in a distributed DL framework as an example to discuss how to layer abstractions appropriately.

Benefits of Layered Abstraction

Computer pioneer David Wheeler has a famous saying:

All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection.

Unless more layers of abstraction are introduced than are actually needed, layered abstraction brings plenty of benefits: a new layer of abstraction hides the details that are irrelevant to the upper-level application and selectively exposes some functionality for upper-level users to use. The underlying details thus become transparent to the user, and the upper-level application keeps its full functionality while enjoying the benefits of “separation of concerns” and greater ease of use.

There is no better-known example of layered abstraction for network interconnection than the seven-layer Open Systems Interconnection (OSI) model and the four-layer TCP/IP protocol suite. Each layer of abstraction solves one problem, making the layer above it simpler to implement and freeing it from the implementation details of the layer below.

Figure 1. The logical correspondence between the OSI model and the TCP/IP protocol suite

This article discusses network transport abstractions for DL frameworks; it does not cover the implementation of OSI or TCP/IP itself, focusing only on the layers from the transport layer up to the application layer.

Control Plane and Data Plane

Distributed DL frameworks need to handle various network transmission requirements at both the system design and implementation levels, and functionally, there are two basic types of requirements:

The control plane handles the protocol-level requirements common to distributed systems, including handshaking between machines at system startup, graceful exit at the end of a run, delivery of computation graphs and execution plans, heartbeat protocols, and so on. This kind of communication involves small payloads and low transmission frequency, so ease of programming matters more than transmission efficiency.

The data plane handles the transmission required by the computation itself in distributed deep learning, including raw data, intermediate activations or the error signals of backward computation, weights and gradients, and so on. This kind of communication involves large payloads and very high transmission frequency, so transmission efficiency matters more than ease of programming.

Overview of Network Transmission Requirements for Distributed Deep Learning Frameworks

We can classify the modules related to network communication in the deep learning framework according to the hierarchy shown in the following figure.

Figure 2. Abstraction layers for network transmission in distributed deep learning frameworks

The lowest-level communication protocol can be the ubiquitous TCP/IP, programmed against the socket API. Deep learning workloads place high demands on transmission bandwidth and latency, so deep learning clusters generally also support the RDMA protocol, programmed against IB verbs. Regardless of how many layers of encapsulation the upper layers use, all DL frameworks ultimately rely on sockets or IB verbs for their communication.

Theoretically, any transmission requirement can be implemented directly on top of the underlying protocol, which yields maximum efficiency. However, programming directly against the underlying APIs is too complex, and the programming interfaces of sockets and IB verbs are inconsistent. To reduce the difficulty of network programming, the underlying network interfaces generally need to be encapsulated.
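
To make that complexity concrete, here is a minimal sketch of what raw socket programming looks like for sending a single buffer over TCP. The length-prefix framing, addresses, and helper names are illustrative assumptions, not code from any framework; a real data plane would also need connection pooling, error handling, and zero-copy paths.

```python
# A minimal sketch of raw TCP socket transfer (illustrative only; the framing
# convention, addresses, and helper names are assumptions, not framework code).
import socket
import struct

def _recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return partial data."""
    data = b""
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        data += chunk
    return data

def send_buffer(host: str, port: int, payload: bytes) -> None:
    # The sender must frame the message itself, e.g., with a length prefix.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack("!Q", len(payload)) + payload)

def recv_buffer(listen_port: int) -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("0.0.0.0", listen_port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            (length,) = struct.unpack("!Q", _recv_exact(conn, 8))
            return _recv_exact(conn, length)
```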

First, a layer of point-to-point (P2P) transfer abstraction is introduced, such as ZeroMQ, which hides the complexity of socket programming and exposes only message-based and pipe-based abstractions upwards. A P2P abstraction can also hide the difference between sockets and IB verbs, exposing a consistent programming interface to the upper layers. NCCL, PyTorch’s TensorPipe, and OneFlow’s CommNet module are all designed for this purpose.
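
As an illustration of what such a message-based P2P layer buys you, here is a minimal sketch using ZeroMQ’s Python binding (pyzmq); the endpoint address is an assumption. Compare it with the raw socket sketch above: there is no manual framing or connection bookkeeping.

```python
# A minimal sketch of message-based point-to-point transfer with ZeroMQ (pyzmq).
# The endpoint address is an illustrative assumption.
import zmq

ENDPOINT = "tcp://127.0.0.1:5555"

def sender() -> None:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUSH)
    sock.connect(ENDPOINT)
    sock.send(b"tensor-bytes-go-here")  # one message; no explicit framing needed

def receiver() -> None:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    sock.bind(ENDPOINT)
    msg = sock.recv()                   # blocks until a whole message arrives
    print(f"received {len(msg)} bytes")
```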

The threshold for using the P2P programming interface can be lowered further by encapsulating the P2P pipe into a remote procedure call (RPC). RPC follows a client-server architecture that allows a program on one machine to use services on another machine as if it were calling a local function. The control plane values ease of use over transfer efficiency, and without exception, all DL frameworks rely on the RPC abstraction for the control plane.
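
To make the “call a remote function as if it were local” idea concrete, here is a minimal sketch using PyTorch’s torch.distributed.rpc (one of the implementations discussed below); the worker names, ranks, and rendezvous settings are illustrative assumptions.

```python
# A minimal sketch of a control-plane-style RPC call with torch.distributed.rpc.
# Worker names, ranks, and the MASTER_ADDR/MASTER_PORT rendezvous settings are
# illustrative assumptions; run one process per rank.
import os
import torch.distributed.rpc as rpc

def report_status(worker_name: str) -> str:
    # A toy "control plane" service: small payload, low frequency.
    return f"{worker_name} is alive"

def run(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # Invoke a function on worker1 as if it were a local call.
        print(rpc.rpc_sync("worker1", report_status, args=("worker1",)))
    rpc.shutdown()  # blocks until outstanding RPCs complete
```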

There are various implementations of RPC, such as Google’s gRPC, Baidu’s bRPC, and PyTorch’s RPC library built on its own backend TensorPipe. However, the RPC abstraction introduces additional overhead to data transfer, such as serialization and deserialization, which hurts transfer efficiency. In particular, most RPC implementations (e.g., gRPC, bRPC) do not support RDMA and therefore cannot enjoy the benefits of high-performance networking (PyTorch’s TensorPipe-based RPC, which does support RDMA, is an exception).

As a result, it is not recommended to build data plane transfers on top of RPC. TensorFlow’s data plane point-to-point transfers are implemented on top of RPC, which was once a performance bottleneck for TensorFlow. The community later developed the TensorFlow Networking plugin, which lets TensorFlow use RDMA for point-to-point transfers.

Communication patterns in distributed computing can be divided into “regular” and “irregular”. The regular patterns are the MPI-style requirements familiar from high-performance computing, namely collective communication primitives such as all-reduce, all-gather, and reduce-scatter. Everything other than the collective communication primitives can be called an irregular pattern.

Examples on the regular side include NVIDIA’s collective communication library NCCL for GPU clusters, as well as the collective communication features built by the DL frameworks themselves, including PyTorch’s Gloo backend, OneFlow’s boxing, and the parameter server ps-lite (the ByteDance version, which supports RDMA, is used as the example here). In general, however, the collective communication implementations developed by the frameworks cannot surpass NCCL, so all DL frameworks integrate and invoke NCCL by default.
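
As a concrete example of a regular pattern, here is a minimal sketch of a gradient all-reduce through torch.distributed with the NCCL backend; the ranks, rendezvous settings, and tensor sizes are illustrative assumptions, and the sketch assumes one GPU per process.

```python
# A minimal sketch of a data-plane collective (all-reduce) via torch.distributed
# with the NCCL backend. Ranks, rendezvous settings, and tensor sizes are
# illustrative assumptions; one GPU per process.
import os
import torch
import torch.distributed as dist

def run(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    grad = torch.full((1024,), float(rank), device="cuda")  # stand-in for a local gradient
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)             # summed across all ranks
    print(f"rank {rank}: grad[0] = {grad[0].item()}")
    dist.destroy_process_group()
```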

Ray is taken here as the example of handling irregular patterns: it offers the highest degree of flexibility and can implement any distributed functionality, including of course the regular patterns. Its disadvantages are, on the one hand, that Ray is implemented on top of gRPC and does not support the RDMA protocol of high-performance networks, and on the other hand, that deep learning is dominated by regular communication patterns, and collectives built on a general-purpose distributed system like Ray cannot match the efficiency of a collective communication library developed specifically for regular patterns. Therefore, I think there is no place for Ray in the deep learning market.
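
For contrast, here is a minimal sketch of the kind of irregular, data-dependent communication pattern that a general-purpose system like Ray expresses easily; the task bodies and fan-out width are illustrative assumptions.

```python
# A minimal sketch of an "irregular" communication pattern with Ray: a dynamic,
# data-dependent fan-out of tasks followed by a gather. The task bodies and the
# fan-out width are illustrative assumptions.
import ray

@ray.remote
def preprocess(shard_id: int) -> list:
    return list(range(shard_id * 3, shard_id * 3 + 3))

@ray.remote
def aggregate(chunks: list) -> int:
    return sum(sum(c) for c in chunks)

if __name__ == "__main__":
    ray.init()
    futures = [preprocess.remote(i) for i in range(4)]   # fan out
    total = ray.get(aggregate.remote(ray.get(futures)))  # gather and reduce
    print(total)
    ray.shutdown()
```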

Both regular and irregular transmission can be implemented on top of point-to-point transfer. However, when a feature becomes important enough, it is worth designing and implementing a dedicated set of code for it.

To compensate for the fact that earlier versions of DL frameworks like TensorFlow and PyTorch had poor support for even the simplest data parallelism, plugins like Horovod and BytePS were developed specifically to speed up data parallelism. Once the DL frameworks themselves solved these problems, however, the need for such plugins became much weaker.

To compensate for the inability of existing DL frameworks to support the model parallelism and pipeline parallelism needed to train large models, custom libraries such as DeepSpeed and Megatron-LM have emerged. However, these features ought to be supported by general-purpose DL frameworks; once a framework has them natively, the need for such custom libraries likewise becomes much weaker.

For general-purpose distributed computing frameworks like Ray, DL frameworks like TensorFlow, PyTorch, and OneFlow, plugins like Horovod and BytePS, and custom libraries like DeepSpeed and Megatron-LM, more consideration needs to be given to what programming interfaces are exposed to upper-level users. These issues are not discussed in this article and will be covered in a dedicated article later.

Best Practices

Through the above analysis, we can distill some best practices (a small routing sketch follows the list):

  1. The underlying transport protocol should support both sockets and IB verbs.
  2. The control plane is best implemented through RPC abstraction.
  3. Point-to-point transfer in the data plane can be implemented directly on top of the P2P encapsulation, without going through an RPC layer.
  4. Regular communication patterns in the data plane should use NCCL; irregular patterns can be implemented with the help of point-to-point transfer.
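
The hypothetical sketch below summarizes how a framework might route traffic according to these practices; every class and function name here is an illustrative assumption rather than the API of any existing framework.

```python
# A hypothetical routing sketch of the best practices above: RPC for the control
# plane, a point-to-point library for irregular data-plane transfers, and a
# collective library such as NCCL for regular ones. All names are illustrative.
from enum import Enum, auto

class Traffic(Enum):
    CONTROL = auto()          # handshakes, execution plans, heartbeats
    DATA_P2P = auto()         # e.g., activations between pipeline stages
    DATA_COLLECTIVE = auto()  # e.g., all-reduce of gradients

def choose_transport(kind: Traffic) -> str:
    if kind is Traffic.CONTROL:
        return "rpc"          # ease of use over raw efficiency
    if kind is Traffic.DATA_P2P:
        return "p2p"          # CommNet/TensorPipe-style, sockets or IB verbs underneath
    return "nccl"             # highly optimized collectives for regular patterns

print(choose_transport(Traffic.DATA_COLLECTIVE))  # prints: nccl
```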

As can be seen, the abstractions of RPC, point-to-point transfer, and collective communication primitives are all essential. We should use the right abstraction in the right scenario; in particular, implementing point-to-point transfer or collective communication on top of RPC costs more than it gains. Nevertheless, there are counterexamples in the existing mainstream frameworks.

The original design of TensorFlow used gRPC for both the control plane and the data plane, which was clearly not well thought out and caused serious performance problems. Once this problem was exposed, several fixes were attempted:

  1. extending gRPC to transfer data over RDMA, a project that was later abandoned;
  2. deprecating the gRPC-based parameter server and using NCCL for collective communication, which worked well and is now the consensus across frameworks;
  3. developing TensorFlow Networking so that TensorFlow point-to-point transfers can use RDMA, although this feature remains a plugin and the default path still goes through gRPC.

TensorFlow also implements a set of collective communication features on top of gRPC’s P2P semantics, which is likewise redundant and impractical.

To summarize:

(1) Instead of building a P2P transfer component on top of the underlying transport protocols, TensorFlow skips that layer and implements point-to-point transfer directly on gRPC. This certainly reduces the implementation difficulty, but it forfeits the opportunity to use RDMA and introduces gRPC’s own overhead.

(2) TensorFlow not only meets its P2P transfer needs with gRPC, but also builds its collective communication on top of gRPC. From the standpoint of completeness this is not a problem, but from the standpoint of practicality it is very poor. Collective communication is a frequently used, heavily relied-upon module in deep learning and fully deserves highly optimized, dedicated development (which is exactly what NCCL is), whereas the gRPC-based collective communication is rarely used.

PyTorch implements TensorPipe specifically for point-to-point transfer and builds an RPC library on top of it. For point-to-point transfer, however, PyTorch insists on going through the RPC-wrapped TensorPipe instead of calling TensorPipe directly. Personally, I think this layer of RPC encapsulation is superfluous and a waste of TensorPipe, which is well designed and implemented.

Taking OneFlow as an example: if a distributed DL framework is implemented according to the above best practices, the plugins that may only be transitional are removed, and data parallelism, model parallelism, and pipeline parallelism are native features of the framework, then an appropriate abstraction layering might look like this:

Figure 3. An appropriate abstraction layer of OneFlow

A notable feature of the diagram above is that collective communication is built directly on the point-to-point communication library, skipping RPC. Of course, this is still not the most ideal design; I think a better state would be:

  1. Implement a simple RPC library on top of CommNet to remove the dependency on the bloated gRPC. OneFlow only needs the simplest RPC functionality, and implementing it on top of CommNet is not complicated.
  2. Boxing already supports generic collective communication, so it can be used on accelerators other than NVIDIA GPGPUs. However, its performance on GPU clusters cannot match NCCL’s. If it could exceed NCCL, the dependency on NCCL could be removed entirely, but building a collective communication library stronger than NCCL is very difficult.

The interface for collective communication does not need to be designed from scratch; aligning with MPI is sufficient. Building a collective communication library that can challenge NCCL is a very interesting problem for the future, and making full use of pipelining to maximize the overlap of transfers and computation is the top priority. I hope OneFlow can achieve this goal in the future.
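
As a reference point for what “aligning with MPI” means for the API surface, here is a minimal all-reduce sketch with mpi4py; the buffer sizes are illustrative assumptions, and the script would be launched under mpirun.

```python
# A minimal sketch of the MPI-style collective interface using mpi4py; the same
# all-reduce verb is what NCCL exposes as ncclAllReduce. Buffer sizes are
# illustrative assumptions; launch with mpirun/mpiexec.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

send = np.full(4, rank, dtype=np.float32)  # each rank contributes its own values
recv = np.empty_like(send)
comm.Allreduce(send, recv, op=MPI.SUM)     # element-wise sum across all ranks
print(f"rank {rank}: {recv}")
```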

Conclusion

Layered abstraction enables information hiding and separation of concerns, making the details of the underlying implementation transparent to upper-level users and enabling interface-oriented rather than implementation-oriented programming for upper-level applications.

The network transmission requirements in a distributed DL framework can be divided into a control plane and a data plane, both of which have different requirements.

The control plane values ease of use, so it is generally built on an RPC abstraction. The data plane values efficiency, so it should avoid intermediate abstraction layers and support at least the two protocols TCP/IP and RDMA; the data plane further divides into point-to-point transfer and collective communication, both of which deserve highly optimized implementations.

Ray and other general-purpose distributed computing frameworks are good at handling irregular communication patterns, but regular patterns dominate deep learning scenarios, so the scope for Ray there is limited.

This article focuses on the design principles inside the distributed DL frameworks. In the following articles, we will discuss how to design the interfaces of distributed DL frameworks for upper-level developers.

Related articles:

1. Combating Software System Complexity: Conceptual Integrity and Uniform Metaphor

2. The Limitations of Existing Deep Learning Frameworks: Dynamic Scheduling

Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.

Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.

Written by OneFlow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient. https://github.com/Oneflow-Inc/oneflow