What an Optimal Point-to-Point Communication Library Should Be? (Part 1)
Written by Jinhui Yuan; Translated by Kaiyan Wang, Xiaozhen Liu
In the previous article “Combating Software System Complexity:Appropriate Abstraction Layer”, we divided several levels of abstraction when discussing the network transmission requirements of distributed deep learning frameworks, and one of the lower levels was point-to-point (P2P).
In this series, we will discuss why need to build a new point-to-point communication library, what an ideal point-to-point communication library should be, and how to work together to construct an open-source project for it.
As the first half of this series, this article will start from the general characteristics of the ideal point-to-point communication library. And in the next part, we will dive into the details.
1. What is Point-to-Point Communication?
First, let’s take a look at the definition from Wikipedia: In telecommunications, a point-to-point connection refers to a communications connection between two communication endpoints or nodes. In short, it is a one-to-one transmission, with only one transmitter and one receiver.
Why is point-to-point transmission important? Because point-to-point transmission is the basic unit to build any complex transmission mode in the upper layers.
For example, ring all-reduce or tree all-reduce, which are commonly used in distributed deep learning training, are assembled based on the most basic point-to-point transmission feature. Point-to-point communication libraries can also be encapsulated into more user-friendly interfaces. For example, the various libraries of remote procedure call (RPC) are also based on point-to-point transmission.
First, you should know that this article only introduces CPU to CPU transmission. While in real projects, there are more GPU to GPU transmissions, which is more complicated. Among which, the simplest is GPUDirect RDMA, the same as RDMA programming on CPU, but only supports data center level GPU, otherwise, it should be a mode like GPU-CPU-net-CPU-GPU.
2. What is Point-to-Point Communication Library?
In fact, the network programming APIs provided at the OS level are point-to-point, such as sockets. The RDMA library itself is a point-to-point API.
Then, why do we still need a library? The main purpose is to make programming easier and universal without reducing the performance. By hiding the various underlying programming interfaces and exposing consistent APIs to upper-layer applications, we can create a consistent programming experience for users. For example, both TCP/IP sockets and RDMA networks have different programming interfaces. But we want the programs written by upper-layer applications to invoke the underlying transmission service through a consistent interface.
ZeroMQ (https://zeromq.org/) is a point-to-point communication library with a wide range of applications (it also supports some multi-party communication features), making Socket programming more simple with higher performance and making it easier to write high-performance web applications.
3. Why Build a New Point-to-Point Communication Library?
The answer is simple: the existing point-to-point communication library has various problems. For example, ZeroMQ does not support RDMA so it is not suitable for deep learning scenarios.
OneFlow has a CommNet module, which supports both socket and RDMA. Both its interface and implementation are satisfactory, but it is not independent enough because it is deeply coupled with the OneFlow system, making it not convenient to be used by other projects. Facebook has built TensorPipe for the PyTorch project, which supports sockets and RDMA. Both its interface definition and implementation are very close to what I thought an ideal communication library would be, but it still has shortcomings. After reading the whole article, I believe you will understand this.
4. What are the Characteristics of an Ideal Point-to-Point Communication Library?
Here, I list three characteristics of an ideal point-to-point communication library based on the underlying transmission mechanism, upper-layer application requirements, and existing point-to-point communication libraries:
- Easy to program and able to meet the requirements of a variety of upper-layer applications. It can be encapsulated into RPC for use in deep learning frameworks like OneFlow, and even for use in HPC and common cluster communication primitives in deep learning (all-reduce, broadcast, etc.).
- High performance demonstrated by zero-copy, low latency, and high transmission efficiency.
- Underlying layers support for TCP/IP sockets and RDMA transfers.
To meet the above requirements, the communication library should be technically designed to implement the following four points:
- Message-oriented programming model
- Non-blocking interface
- Zero-copy
- Friendly to both small and large messages
Then let’s discuss in more detail why these points are critical.
5. Message-Oriented Programming Model
Both Socket and ZeroMQ abstract the channel of point-to-point communication as a pipe. The transmitter writes data to the pipe with the following send
function, and the receiver reads data from the pipe with the following recv
function (we deliberately omit the endpoint addresses of the transmitter and receiver in the function input parameters, such as the file descriptor of the Socket).
The communication library does not care about the specific content of the transfer, so it regards the content as a sequence of bytes (which means that serialization and deserialization are the responsibility of the upper layer application). The communication library only cares about the size of the transfer (byte size). Assuming that the size of the transfer is known in advance by both sides. The transmitter will preallocate in_buf
with the size of the transfer and put the content to be sent into in_buf
and call the send
function; the receiver will also preallocate out_buf
with the size of the transfer and call the recv
function to receive the data. It should be noted that we assume the buffers in_buf and out_buf in the input parameters are user-managed.
To simplify the user programming process, the interface should be oriented to complete messages, instead of byte stream. That is, no matter how much data is transferred, the send
and recv
functions should complete the task once they return. In this way, users only need to call the function once when there is a transfer need, and do not have to care whether the underlying data is divided into multiple segments.
ZeroMQ and the blocking mode in Socket programming are both consistent with the above semantics. In blocking mode Socket programming, the function does not return until the data has been transferred. However, non-blocking Socket programming does not conform to the semantics. When the operating system cannot accommodate the data transfer at once, it will first complete a part of the transfer and return the actual number of bytes transferred, and the user may need to call send
and recv
functions again later for the transfer.
6. Non-blocking Call Mode
Let’s take Socket programming as an example. In blocking mode, the function will return only when the data actually completes the transfer. However, the transfer time is determined by the transfer volume and the transfer bandwidth, and it may take a long time to wait. During this period of time waiting for the transfer to complete, the threads calling send
and recv
can only sleep and do nothing else. Many threads may have to be started and managed in order to increase the system transfer efficiency.
In non-blocking mode, when send
and recv
are called, if the system cannot complete the transfer at once, it will copy a piece of data from user space to kernel space (although this copy takes very little time to execute, we still need to be aware of it) and return the amount of data transferred, prompting the user that "the transfer is not complete, please call again at the appropriate time to continue the transfer".
Neither of these two modes is user-friendly enough. The best way is to treat the communication library as a service for the needs of the upper layer. The upper layer application sends the task to the communication library and returns immediately. Notify the upper layer application when the transfer is complete. For this purpose, the API can be adapted as follows:
That is, each function has an additional input callback function, so send
and recv
will return immediately. The user-defined callback function done
is executed when the data transfer is complete.
Of course, with this non-blocking programming interface, it is very easy to do a little work to get the blocking mode supported.
7. Zero-Copy
In the above discussion, we assumed that the in_buf and out_buf memory is managed by the upper layer application. For example, if in_buf is allocated before send
is called, it can be freed after the send
function returns. However, please note that in non-blocking mode, even if send
returns, the data may not have been sent yet. Therefore, the library must request a section of memory inside the send
function and copy the data of in_buf to the section of memory managed by the library. In this way, the library can keep using this section of memory managed by itself until the data is actually transferred and then released.
However, the above solution has some disadvantages. For example, each time data is transferred, the communication library has to allocate additional memory of the same size as the buffers passed in by the user. The allocation takes time and copying data from the application buffers to the buffers managed by the communication library also takes time, and uses more memory.
Therefore, it is more desirable that although in_buf is allocated by the upper layer application, the ownership of the memory in that buffer is transferred to the communication library when the send
function is called. In this case, the in_buf cannot be released immediately after the send
function returns, because the in_buf is used directly during the send
process. The in_buf memory cannot be released in the callback function done
until when the sending actually completes.
Similarly, even if out_buf is also allocated by the communication library. When the callback function done
in recv
is executed, the ownership of out_buf is transferred to the upper layer application, instead of copying out_buf to an application-managed buffer again.
In the next article, we will go into more details and talk about the CommNet design in OneFlow.
I hope this article will help you in your deep learning projects😊. If you want to experience the functions of OneFlow, you can follow the method described in this article. If you have any questions or comments💡 about use, please feel free to leave a comment in the comments section below. Please do the same if you have any comments, remarks or suggestions for improvement. In future articles, we’ll introduce more details of OneFlow.
Related articles:
- How to Go beyond Data Parallelism and Model Parallelism: Starting from GShard
- How to Implement an Efficient Softmax CUDA kernel? — OneFlow Performance Optimization Sharing
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.