Written by Jing Qiao, Chi, Yao; Translated by Xiaozhen Liu, Chenyang Xu, Yakun Zhou
In the previous article, we introduced the basic concept of flow control and reviewed the history of credit-based flow control. In this article, we will focus on its present, analyze the core principle of credit-based flow control, and take a look at its application in other fields nowadays.
1. The Core Principle of Credit-based Flow Control
First, we can explore the core principle of credit-based flow control by analogy. You might find that this idea is particularly simple but ingenious, and it conforms to the “less is more” design aesthetics principle.
In networks for data transmission, buffers are very common, like the send buffer and receive buffer between the sender node and the receiver node. So, what role do these buffers play?
We can make an analogy between the buffers and the dams in series:
What are the benefits of building a dam? Let’s think about it:
The first benefit is the “buffer” effect: the dam helps make water more accessible to nearby residents, who would have to wait for the distant water to meet their urgent needs if there were no dam and a sudden increase in water consumption over a short time.
Second, the dam can help control the flood by timely feedback. For example, when a downstream dam fills, its upstream dams are notified to intercept the flow. Then, when a flood occurs, each upstream dam can help relieve flooding at downstream congestion points, and the capacity of each dam is used efficiently.
Just like a dam can regulate the fluctuation of water consumption, a buffer in networks for data transmission also solves the problem of device performance fluctuation, which leads to inefficiency between devices.
Then, can we simulate the principle of dam flood control to provide a feedback mechanism for buffers in the network to regulate the data flow dynamically?
Of course we can. That’s how credit-based flow control works.
In the scenario of dam flood control, whenever a dam fills, its upstream dams are notified to intercept the flow. That is to say, the downstream dams feed their storage capacity to the upstream dams.
Likely, in credit-based flow control scheme, the receiver (downstream node) can feed the occupancy of its buffer to the sender (upstream node) to regulate the amount of data sent by the sender. Here, we call this feedback information credit, as shown in the figure below:
The selection of credit value and the behavior of the sender after receiving the credit can be determined based on specific scenarios. For example, in the proposal for ATM network credit-based flow control submitted by Professor H.T. Kung in 1993, three ways to select credit values are given corresponding to different scenarios.
We won’t go into detail about this, just make it clear that the value of credit represents the occupancy of the sender buffer, and the sender determines the amount of data for sending based on the value of the credit.
It is clear that in credit-based flow control, the sending rate, or the flow amount, is oriented by the receiver and the sender is forced to adjust to being in sync with the receiver.
Therefore, credit-based flow control is also known as a flow control based on backpressure mechanism, which reflects the principle that the receiver controls the sender.
So what are the benefits of credit-based flow control?
- The receiver buffer overflow will never occur theoretically.
- The flow control is adaptive and can keep the receiver very busy all the time.
In practice, credit-based flow control also has some specific issues that need to be considered, such as:
- What if the credits are lost?
- Can a specific system accept the overhead for transferring the credits? In 1994 when TM Forum voted for the flow control scheme, it announced:
Credit-based flow control was considered not scalable to a large number of virtual circuits (VCs). Given that some large switches will support millions of VCs (in credit-based flow control, each VC has a credit and in ATM networks, there can be multiple virtual circuits on a single physical channel), this would cause considerable complexity in the switches. — Congestion control and traffic management in ATM networks: Recent advances and a survey
2. Credit-based Flow Control Thrives in Other Areas
As new technologies emerge in more and more areas, we can see a trend towards “multi-connection and distribution” for data transmission. As a result, many problems have been abstracted into network problems, and flow control has become an important part of these technical solutions.
The idea of Credit-based Flow Control is simple and clear, and it is applicable to any flow control scheme. In fact, many hardware or software systems have similar mechanisms, but under different names.
For example, from the hardwares like the new bus technology PCI Express and Intel QPI, to the new network protocol RDMA, and then to the stream computing system Flink, we can always find the idea of Credit-based Flow Control.
Next, let’s take a look at how Credit-based Flow Control works in these scenarios.
PCIe
PCIe is the abbreviation of PCI-Express (Peripheral Component Interconnect Express), which is a high-speed serial computer expansion bus standard. It was proposed by Intel in 2001 and has now replaced the older PCI, PCI-X, and AGP bus standards, becoming the most popular bus standard.
The data connection in PCIe is called lane. The actual data channel can be combined by multiple lanes to provide higher bandwidth, which is to some extent similar to the virtual channels in ATM networks. For flow control in such multi-channel communication scenario, it is necessary to deal with the problem of packet loss and resending due to insufficient receiver buffer. Credit-based Flow Control comes in handy here.
In order to know how many buffers are available for the receiver, the receiver needs to report the available buffers at any time through DLLP (Data Link Layer Packet). The DLLP here is the specific implementation of Credit in PCIe.
Intel QPI
Intel’s QuickPath Interconnect (QPI) technology is used to achieve direct interconnection between chips, replacing the traditional FSB (Front Side Bus). In the traditional FSB (Front Side Bus), all data communication is done through a single bus, which is called the single-shared bus approach, as shown in the figure below.
One of the highlights of QPI is that it supports multiple system bus connections, as shown in the figure below. The system bus will be divided into multiple connections and the frequency will no longer be fixed. The speed of each system bus connection can vary depending on the data throughput requirements of the various subsystems.
This transition from a single bus to multiple buses is a major direction for CPUs to improve data transmission performance in recent years.
This looks like a small network topology. Interestingly, its data communication protocol is also designed like a layered model of computer networking.
Among them, the data packet transmitted by the link layer is called Flit, which stands for Flow Control Unit. Each Flit contains a flow, whose function is to feed back the buffer space of Rx (Receiver) to Tx (Transmitter). Here is an example of using a Credit-based scheme to implement flow control to prevent the receiver’s buffer from overflowing.
RDMA
RDMA (Remote Direct Memory Access) is generated to solve the delay of server-side data processing in network transmission. It is very suitable for large-scale parallel computer clusters. In recent years, it has become popular in data centers.
At present, there are roughly three types of RDMA networks, namely Infiniband, RoCE (RoCE and Roce v2), and iWARP. Among them, Infiniband is a network specially designed for RDMA, which ensures reliable transmission from the hardware level, while RoCE and iWARP are both Ethernet-based RDMA technologies.
A closer analysis reveals that all three types of flow control schemes for RDMA networks are based on the idea of Credit-based Flow Control.
InfiniBand
In “InfiniBand Credit-based Link-Layer Flow-Control”, it is mentioned that Credit-based Flow Control is implemented at the link layer in InfiniBand. The Flow Control Packet of the link layer is the specific implementation of Credit in InfiniBand.
RoCE
In RoCE, the flow control algorithm is PFC (Priority-based Flow Control). The article “Revisiting network support for RDMA” describes the algorithm of PFC is that when the queue exceeds a certain configured threshold, the switch sends a pause frame to the upstream entity.
The downstream nodes (the receiver) controls the upstream (the sender). This is the idea of Credit-based Flow Control.
iWARP
Flow control is performed directly through TCP, and TCP is the idea of credit-based flow control.
Flink
In the era of internet big data, the requirements for real-time data are getting higher, and more and more distributed computing engines have begun to adopt streaming abstraction.
Apache Flink is the fourth generation of computing engine after Hadoop, Pig/Hive, Spark/Storm in the era of Big Data, and is designed to perform stateful computation on data streams.
As for the article “How Apache Flink™ handles backpressure” that introduces Flink, the backpressure in the title should be credit-based flow control.
Specifically, in the schematic diagram drawn in the article, the black bars of Task1 and Task2 represent the buffers.
The article mentioned: “If no buffer is available, reading from the TCP connection is interrupted” which is obviously the idea of credit-based flow control.
3. Review OneFlow’s Flow Control Design
At first, let’s review the design features of OneFlow. OneFlow automatically compiles the user’s logical computation graph into a physical computation graph on the distributed system, which is executed by the OneFlow runtime.
The execution unit of the physical calculation graph is actor, and the input data flows like water through each execution unit to complete the computation.
The computation graph composed of the actor in OneFlow has two unique features compared to the computation graph composed of OP in other frameworks:
- Having no central scheduling node.
- Each actor executing asynchronously. In other words, if the actor finds that there is data to read upstream and space to write downstream, it can work without knowing the other actors’ state.
The benefits of these two unique designs are:
- Decentralized design, which easily solves the single-point performance bottleneck of the central scheduling approach in mega-scale distributed training scenarios.
- Asynchronous execution design, which enables pipeline parallelism and makes the fullest use of hardware resources theoretically.
Now, have a think about how to do flow control for such a streaming distributed system?
After reading the whole article, will building a dam (regst) and using backpressure to control flow be your first choice?
4. Bonus Question
With the idea of using backpressure flow control, the next question is how to choose the dam’s capacity (size of Regst) during the specific construction.
For example, we set Regst = 2 at the beginning of the article. Is this the optimal choice? Looking forward to your thoughts and comments💡.
Related articles:
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.