What an Optimal Point-to-Point Communication Library Should Be? (Part 2)

OneFlow
9 min read · Dec 17, 2021

Written by Jinhui Yuan; Translated by Kaiyan Wang, Xiaozhen Liu

In the last article “What an Optimal Point-to-Point Communication Library Should Be? (Part 1)”, we introduced what a point-to-point communication library is and discussed some general characteristics of an optimal P2P communication library. Continuing that topic, in this article we’ll dive into the details of how to design an optimal P2P library and introduce the design of CommNet in OneFlow.

1. How Do the Transmitter and Receiver Negotiate the Amount of Data to Be Transferred?

Previously, we assumed that both the transmitter and the receiver already knew the amount of data to be transferred, that is, the value of the parameter size. This assumption is not entirely realistic, but it is not far off the mark.

First of all, although the transfer size differs from one request to another, the transmitter always knows it, while the receiver does not necessarily know it. Second, before each transfer, the transmitter can send the value of size to the receiver, so that the receiver knows the amount of data to be transferred and can allocate memory in advance.

It should be noted that before the actual transfer, both sides need to communicate the amount of data to be transferred, i.e. the value of size. This value is itself transferred via send/recv, but since its length is fixed and known in advance, it does not need to be negotiated by the two sides. This makes the procedure somewhat like a bootstrapping process.

Suppose the data needs to be sent from A to B. In this case, the send/recv pair needs to be called at least three times, as shown in the following figure:

In the first communication, from A to B, send and recv are called respectively: A transfers the value of size to B, and B allocates memory (alloc) for out_buf based on the received value. After B has allocated the memory, the second communication, from B to A, starts: B sends a "please start" signal to A. This signal is short and fixed in length, so A and B do not need to negotiate how much memory to allocate for it. When A receives the "please start" signal, the third communication begins, with the actual data transferred from A to B.
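To make the three communications concrete, here is a minimal C++ sketch. It assumes hypothetical blocking send/recv primitives that take a peer id, a buffer, and a size; it only illustrates the handshake described above, not OneFlow's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical blocking primitives; not a real library API.
void send(int peer, const void* buf, size_t size);
void recv(int peer, void* buf, size_t size);

// Transmitter A: owns `data` of length `size`.
void TransmitLongData(int B, const char* data, uint64_t size) {
  send(B, &size, sizeof(size));    // 1st communication: tell B how much data is coming
  char start = 0;
  recv(B, &start, sizeof(start));  // 2nd communication: wait for B's "please start" signal
  send(B, data, size);             // 3rd communication: transfer the payload itself
}

// Receiver B
void ReceiveLongData(int A) {
  uint64_t size = 0;
  recv(A, &size, sizeof(size));    // 1st communication: learn the size
  char* out_buf = new char[size];  // alloc: reserve out_buf before the payload arrives
  char start = 1;
  send(A, &start, sizeof(start));  // 2nd communication: send the "please start" signal
  recv(A, out_buf, size);          // 3rd communication: receive the payload into out_buf
  // ... hand out_buf over to the application, which frees it later ...
}
```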

What’s wrong with the above solution?

First of all, send and recv need to be called three times for each communication. Even when the amount of data to be transferred is small, the transfer has to endure the latency of three communications.

Second, send and recv must be used in pairs. That is to say, the transmitter and the receiver must proceed at the same pace. For example, if the transmitter calls send but the receiver does not call recv, the communication will not succeed. Likewise, if the transmitter calls send twice but the receiver calls recv only once, the second communication will fail. What's more, it is up to the transmitter to decide when a transfer is needed; the receiver is passive and does not know when recv needs to be called. So the above solution does not work well.

How to solve the above problems?

For the first problem, two transfer modes can be designed: one for short messages and one for long data. Data whose length is below a certain threshold can be sent directly without any negotiation between the two sides. The transmitter can assume that the receiver will be able to receive it successfully, and that the receiver has already called recv in advance to pair with the send. Long data, on the other hand, must be transferred with the three calls described above.

For the second problem, the communication library can always be ready for short-message requests sent from anywhere; that is, it prepares a fixed number of recv calls in advance.

Those who are familiar with gRPC asynchronous programming or RDMA programming might recognize this pattern. Each communication process posts a number of receive requests (PostRecvRequest) in advance at startup; each time one of them pairs with a send from elsewhere, a RecvRequest is consumed and promptly replenished with a new one.
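Below is a hedged C++ sketch of this pre-post-and-replenish pattern. The class and method names are illustrative, not OneFlow's or gRPC's real types; a real backend would hand actual buffers to the transport (e.g. the RDMA receive queue) where this sketch only keeps a counter.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative receive-request pool for short messages.
class ShortMsgReceiver {
 public:
  using MsgHandler = std::function<void(const std::vector<char>&)>;

  ShortMsgReceiver(size_t num_prepost, MsgHandler handler)
      : handler_(std::move(handler)) {
    // Post a fixed number of receive requests up front, so any incoming
    // short message already has a request (and buffer) waiting for it.
    for (size_t i = 0; i < num_prepost; ++i) { PostRecvRequest(); }
  }

  // Called by the transport layer when a pre-posted request pairs with a send.
  void OnRecvCompleted(const std::vector<char>& msg) {
    --outstanding_;     // one request has been consumed
    handler_(msg);      // dispatch to the user-registered callback
    PostRecvRequest();  // replenish, keeping the number of outstanding requests fixed
  }

 private:
  void PostRecvRequest() {
    // A real implementation would pass a fixed-size buffer to the transport
    // here (e.g. post a receive work request for RDMA).
    ++outstanding_;
  }

  size_t outstanding_ = 0;
  MsgHandler handler_;
};
```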

Finally, some readers may wonder why the receiver needs to know the value of the transfer size in advance when transferring long data. This is mainly so that memory can be allocated in advance, which ensures that the data can be transferred successfully, that no memory needs to be allocated during the transfer process, and that zero-copy can be achieved.

Otherwise, if the memory is not allocated in advance, memory has to be allocated on the fly according to the demands of the transfer. If an allocation fails, the transfer is interrupted because of insufficient memory. And, of course, zero-copy cannot be achieved in this case.

2. The Design of API

From the above discussion, it looks as if a send/recv interface alone can meet all the requirements for transferring short messages and long data.

However, beyond the API itself, the transmitter and receiver also need to handle some complex logic: the receiver always needs to prepare some RecvRequests in advance, and when transferring long data, the two sides need to negotiate back and forth several times. From the point of view of designing an underlying library, we want to simplify users' work as much as possible and hide the details that are not related to their needs. Seen this way, a bare send/recv interface is not enough.

For short messages, we want the transmitter to be able to send them directly, and the communication library to guarantee that a recv call is ready on the receiver side. This recv does not need to be called explicitly; in other words, in the short-message scenario the recv API is not necessary. The user only needs to provide the communication library with a callback function to be invoked after a short message is received. Whenever the receiver receives a short message, the corresponding callback function is called to process it.

If the business logic requires multiple types of short messages, the short messages can be categorized, and a corresponding callback function can be registered for each message type.

For the second and third communications of a long data transfer, the receiver needs to call send once and recv once, and the transmitter needs to call send once, but the details of these calls should be transparent to users; all of them can be handled by the underlying communication library. On the user-facing side, they can be combined into a single-sided read operation called by the receiver, while the application on the transmitter side does not need to do anything. The user-specified callback function is still called after the data transfer finishes, to process the received data.

That is to say, the minimum API for a point-to-point communication library can be of the following form:
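The snippet below is a hypothetical C++ sketch of such a minimal API; the names and signatures are illustrative rather than CommNet's actual declarations.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Short messages: sent eagerly, with no negotiation. The receiver never calls
// recv explicitly; it only registers a callback that is invoked for each
// incoming message.
void SendMsg(int64_t dst_machine_id, const void* msg, size_t size);
void RegisterMsgCallback(std::function<void(const void* msg, size_t size)> cb);

// Long data: a single-sided read called by the receiver. The transmitter's
// application does nothing; `done` fires once the data has landed in dst_addr.
void Read(int64_t src_machine_id, void* dst_addr, size_t size,
          std::function<void()> done);
```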

It should be noted that in practice the read interface requires a token to mark the location of the data on the transmitter side, through which the correct data can be read remotely.

3. The Design of OneFlow CommNet

CommNet provides the two most important abstractions required by OneFlow: Eager Message and RMA Read. In the current implementation, Message is used to transfer ActorMsg, and RMA Read is used to transfer the actual content of a regst.

Eager Message has the following semantics (a usage sketch follows the list):

  • Point-to-point: each message corresponds to one transmitter and one receiver
  • The transmitter sends a message and the receiver receives the corresponding message later
  • The transmitter sends messages directly to the receiver without negotiation
  • The receiver accepts the message unconditionally
  • The transmitter can assume that the transfer will be successful and that the receiver will receive the message
  • The receiver processes messages by polling or registering callbacks
  • The abstraction can be connected or connectionless. In the connectionless case, the transmitter uses the receiver’s identifier as the sending parameter, while in the connected case, the transmitter needs to establish a connection with the receiver beforehand and uses the connection identifier as the sending parameter
  • Different messages sent by the same thread to the same receiver or connection must be received by the receiver in the same order in which they were sent
  • A message is a fixed-size or variable-size block of data; the library does not need to care about the upper-layer protocol
  • Generally designed for processing small blocks of data
  • Key metrics are generally latency and transfer efficiency
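As a hedged illustration of the connectionless and connected variants, and of the callback-based receiving described above, here is a C++ sketch with hypothetical declarations; OneFlow's real ActorMsg transport may look different.

```cpp
#include <cstdint>
#include <functional>

// Illustrative message type; the real ActorMsg carries more fields.
struct ActorMsg {
  int64_t dst_actor_id;
  // ... payload ...
};

// Connectionless variant: the receiver's identifier is the sending parameter.
void SendMsg(int64_t dst_machine_id, const ActorMsg& msg);

// Connected variant: a connection is established beforehand and its handle is
// the sending parameter; messages sent by one thread over one connection are
// delivered in the order they were sent.
struct ConnId { int64_t id; };
ConnId Connect(int64_t dst_machine_id);
void SendMsg(const ConnId& conn, const ActorMsg& msg);

// The receiver only registers callbacks (here, one per message kind) and
// accepts incoming messages unconditionally.
void RegisterMsgCallback(int32_t msg_kind,
                         std::function<void(const ActorMsg&)> cb);
```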

Remote Memory Access (RMA) Read has the following semantics (a sketch of the token flow follows the list):

  • Point-to-point: each operation corresponds to one local and one remote side
  • The local side initiates the operation, and the result of the operation is to read a section of data from the remote address into the local memory
  • The remote side needs to generate an access token in advance, and the local side must use this token to access data in the address range that the remote side registered when the token was generated. The two sides need to exchange the access token by some other means before the operation is initiated
  • With one access token, the local side can read any data in the corresponding address range, and the data at the same address can be read any number of times
  • The remote side does not need to be involved in the reading process
  • The local side handles completed transfers by polling or by registering callbacks
  • The local side assumes that remote memory is always available
  • Generally designed for processing large chunks of data
  • Key metrics are generally bandwidth and transfer efficiency
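To tie the two abstractions together, here is a hedged C++ sketch of a typical RMA Read flow, with hypothetical names and types: the remote side registers memory and generates a token, the token is shipped to the local side by some other means (for instance inside an Eager Message), and the local side then pulls the data with a single-sided read.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>

// Illustrative token: identifies a registered address range on the remote side.
struct RmaToken {
  uint64_t remote_addr;
  uint64_t length;
  uint32_t access_key;
};

// Remote side: register a memory region in advance and obtain a token that
// grants read access to that address range.
RmaToken RegisterMemory(void* addr, size_t size);
void UnregisterMemory(const RmaToken& token);

// The token must reach the local side before any Read is issued, e.g. carried
// inside an Eager Message.
void SendToken(int64_t dst_machine_id, const RmaToken& token);

// Local side: single-sided read; the remote CPU is not involved. Any sub-range
// of the token's address range can be read, any number of times. `done` fires
// once `size` bytes have landed in local_dst.
void Read(int64_t remote_machine_id, const RmaToken& token, uint64_t offset,
          void* local_dst, size_t size, std::function<void()> done);
```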

4. Discussion

Why abstract the second and third communications into a single-sided read operation instead of letting the transmitter explicitly call send or write? Because those calls are unnecessary: their timing should be determined by the receiver and they should happen automatically, so there is no need to expose them in the higher-level application interface.

Actually, readers who are familiar with RDMA programming will know that, besides the two-sided send/recv operations, RDMA provides single-sided operations such as write and read. Our discussion above shows that a point-to-point communication library requires only one single-sided operation: read. The soundness of our proposed programming API can thus be further supported by referring to the design of the RDMA programming interface.

In fact, similar interface designs have been proposed by scholars studying MPI. For example, to address the shortcomings of existing MPI interfaces, a group of scholars working on next-generation MPI described a similar design in “Towards millions of communicating threads”. In that paper, the transfer mode for short messages is called the “eager protocol”, while the transfer of long data, which requires negotiation between the two sides, is called the “rendezvous protocol” (a concept that also exists in the distributed design of TensorFlow).

The above discussion is all about designing APIs from the perspective of upper-layer application requirements. Of course, the design of APIs also needs to consider the underlying implementation. For example, we need to consider the difference between the Socket-oriented epoll programming model and the RDMA programming model. Our communication library needs to support these different transfer mechanisms, and the design of the API should also take into account the difficulty of programming when using different transfer mechanisms.

We know that RDMA already provides send and the single-sided read operation, so it should be convenient to use RDMA to support the API proposed in this article. However, when we go into the details, we can still find some complications. For example, RDMA transfers require pinned memory, and for variable-length data transfers the overhead of allocating and registering pinned memory on the fly each time is relatively high; solving this problem is not simple. On the other hand, epoll has no direct counterpart to these operations, so implementing the communication library on top of it may require additional work.
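As one possible illustration of the pinned-memory complication (and not CommNet's actual solution), the sketch below pre-registers a pool of fixed-size pinned buffers at startup and recycles them, so that no registration happens on the transfer path; all names are hypothetical.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative pool of pre-registered (pinned) buffers.
class PinnedBufferPool {
 public:
  PinnedBufferPool(size_t num_buffers, size_t buffer_size) {
    for (size_t i = 0; i < num_buffers; ++i) {
      free_list_.push_back(AllocateAndRegister(buffer_size));  // pin + register once, at startup
    }
  }

  // Borrow a registered buffer for one transfer.
  void* Acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (free_list_.empty()) { return nullptr; }  // caller must wait or fall back
    void* buf = free_list_.back();
    free_list_.pop_back();
    return buf;
  }

  // Return the buffer to the pool once the transfer has completed.
  void Release(void* buf) {
    std::lock_guard<std::mutex> lock(mu_);
    free_list_.push_back(buf);
  }

 private:
  static void* AllocateAndRegister(size_t size) {
    // A real RDMA backend would also register the buffer with the NIC here;
    // this sketch only allocates it.
    return ::operator new(size);
  }

  std::mutex mu_;
  std::vector<void*> free_list_;
};
```

The obvious cost of such a pool is that payloads which do not already live in registered memory must be copied into the pooled buffers, which conflicts with the zero-copy goal discussed earlier; this is exactly why the problem is not simple to solve.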

In the following articles, we will further discuss how to implement this communication library using RDMA and epoll.

I hope this article will help you in your deep learning projects😊. If you want to experience the functions of OneFlow, you can follow the method described in this article. If you have any questions, comments💡, remarks, or suggestions for improvement, please feel free to leave a comment in the comments section below. In future articles, we’ll introduce more features of OneFlow.

Related articles:

  1. What an Optimal Point-to-Point Communication Library Should Be? (Part 1)
  2. How to Go beyond Data Parallelism and Model Parallelism: Starting from GShard

Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.

Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.
