The Limitations of Existing Deep Learning Frameworks: Resource Dependency
Why redesign a distributed deep learning framework like OneFlow?
An obvious starting point is that the mainstream deep learning frameworks have inherent design deficiencies, especially at the abstraction and API levels, which cause great inconvenience to developers. Although some of these deficiencies are being addressed, several important issues are still being overlooked.
Therefore, we will launch a series of articles to discuss in detail the major shortcomings of the runtime system of the mainstream DL frameworks. This article, the first in the series, will introduce the importance of resource dependency and the limitations of original frameworks in dealing with this challenge.
Finally, this article will describe how the OneFlow framework was designed to solve this challenge simply and elegantly from the beginning.
Written by Yuan Jinhui; Translated by Wang Kaiyan, Dong Wenwen
Three Kinds of Dependencies between Ops
The dependencies between ops can be described by dataflow graphs, whether they are executed dynamically or statically. Generally speaking, there are three reasons for the dependency between ops:
Data dependency: the consumer depends on the producer. For example, in the graph A → B → C, B consumes the data produced by A, and C consumes the data produced by B. C transitively depends on A, but there is no need for an explicit edge between A and C to describe this dependency.
Control dependency: two ops A and B that have no producer-consumer relationship are expected to occur in a particular order for some reason. This can be enforced by introducing a control edge A → B between them, ensuring that B always occurs after A.
Dependency caused by shared resources: two ops A and B that have no producer-consumer relationship share a mutually exclusive resource, so they cannot occur simultaneously: either A occurs before B, or B occurs before A.
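The distinction can be sketched in a few lines of Python (all names here are illustrative, not any framework's API): data and control dependencies are visible to the scheduler as edges, while a shared-resource dependency appears in neither edge set.

```python
# Data and control dependencies are explicit edges; the shared-resource
# dependency between two ops is *not* an edge at all -- the scheduler
# never sees it.

data_edges = {"A": ["B"], "B": ["C"]}   # data: A -> B -> C
control_edges = {"A": ["B"]}            # control: force B after A

def explicit_predecessors(op):
    """Ops that must finish before `op`, as far as the scheduler knows."""
    preds = set()
    for edges in (data_edges, control_edges):
        for src, dsts in edges.items():
            if op in dsts:
                preds.add(src)
    return preds

print(explicit_predecessors("B"))  # {'A'}
# A shared-resource conflict (e.g. two ops both needing the same
# exclusive buffer) appears nowhere in either edge set.
```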
It is worth noting that the schedulers of mainstream DL frameworks explicitly describe and handle the first two kinds of dependencies, but they all ignore the third.
This article uses a simple example to demonstrate:
- Ignoring shared resource dependencies is a fatal shortcoming in the design of DL frameworks, one that reduces the security and stability of the system;
- Solving this problem within the architecture of an existing framework is not impossible, but it is difficult and undermines the elegance of the system's abstractions;
- OneFlow provides an easy solution to this problem based on the actor mechanism.
An Experiment on Resource Dependency
In mainstream DL frameworks, both data and control dependencies are represented as edges in the execution graph. Upon completion of each op, the scheduler updates the dependencies of the remaining ops and identifies the ops that are ready to run (i.e., those whose dependencies have all been resolved).
Distributed DL can be more complicated in many scenarios due to the complexity of dependencies and resource management (e.g., to avoid out-of-memory errors or deadlocks, to minimize the stalls, etc.).
Figure 1 shows a simple example: M₁ and M₂ are two movement ops that serve two computing ops O₁ and O₂ on the same device respectively: O₁ consumes the data that M₁ carried to the device and O₂ consumes the data that M₂ carried to the device.
M₁ and M₂ do not depend on each other. M₁ needs one unit of device memory to store the execution results, and M₂ needs two units of device memory.
O₁ and O₂ do not depend on each other. O₁ needs three units of device memory to store the execution results and O₂ needs two units of device memory.
At some point, after both M₁ and M₂ have occupied their memory, the free memory can only satisfy the demand of O₂ but not that of O₁, although both O₁ and O₂ are in the scheduler's ready set (as in TensorFlow) since their input data are already in place. This is how the scheduler works in mainstream DL frameworks such as TensorFlow.
Consider two cases, depending on whether O₁ or O₂ is scheduled first:
- If O₂ is scheduled first, it can successfully execute and consume the output of M₂ in device memory. Once O₂ is done, its input and output can be released (state 4), making room for O₁, which can then be executed.
- If O₁ is scheduled first, the memory is insufficient for O₁. But at the moment it dispatches O₁, the scheduler neither knows nor cares whether the current free memory can satisfy O₁'s demand.
If O₁ is dispatched on the OS thread on which the scheduler is running (i.e., inline execution in the TensorFlow executor, where the kernel's Compute function runs on the scheduler's thread), the Compute function will find that system resources are insufficient when it allocates memory for O₁'s output during execution. It may report an OOM (out of memory) error immediately, or retry in a loop until the allocation succeeds, but in the latter case the OS thread will be blocked.
If there is only one OS thread running the scheduler, then O₂ will never be scheduled, even though the device memory is sufficient for it.
If more than one OS thread is running the scheduler, then even if one scheduling thread is blocked by O₁, there is another operating system thread to execute O₂.
If O₁ is not allowed to execute on the scheduler's thread, that is, O₁ must be sent to another OS thread for execution, then the thread executing O₁ will be blocked. But since the scheduler itself is not blocked, O₂ will be dispatched and executed successfully. Once O₂ is done, O₁ can then complete on its thread.
Even though O₂ can be successfully executed in the cases above, the framework still has to handle tricky issues: determining an appropriate number of threads for a dynamic thread pool, and implementing a customized memory allocator that supports a complicated retry mechanism. If the size of the thread pool is fixed, deadlock may occur when the degree of parallelism exceeds the number of threads in the pool.
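The two scheduling orders can be played out in a toy simulation. This is a hypothetical model, assuming a device with 5 total units of memory so that the numbers of Figure 1 work out: once M₁ (1 unit) and M₂ (2 units) are resident, 2 units remain free, enough for O₂ but not for O₁.

```python
# Hypothetical model of Figure 1: 5 total units of device memory; after
# M1 (1 unit) and M2 (2 units) are resident, only 2 units remain free --
# enough for O2's output (2 units) but not O1's (3 units).

TOTAL = 5
resident = {"M1": 1, "M2": 2}              # outputs already on the device

def free_mem():
    return TOTAL - sum(resident.values())

def try_run(op, need, inputs):
    """Greedy scheduler step: run `op` only if its output fits right now."""
    if free_mem() < need:
        return False                       # would OOM or block a thread
    resident[op] = need                    # allocate the output buffer
    for name in inputs:
        resident.pop(name, None)           # op done: inputs are released
    return True

# Case 1: O2 first -- it fits, and finishing it frees room for O1.
assert try_run("O2", 2, inputs=["M2"])
resident.pop("O2")                         # downstream consumed O2's output
assert try_run("O1", 3, inputs=["M1"])

# Case 2 (reset state): O1 first -- 2 free units < 3 needed, so O1 stalls.
resident = {"M1": 1, "M2": 2}
assert not try_run("O1", 3, inputs=["M1"])
```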
To avoid the deadlock above, we need to perform static resource planning and specify an appropriate execution order in advance (e.g., adding control dependencies between O₂ and O₁ in TensorFlow to make O₂ be executed before O₁ ).
However, none of the existing frameworks has carried out static resource planning to avoid such deadlock risks.
The case becomes even trickier when pipelining is in use: within a pipeline consisting of several consecutive ops, each op may have multiple instances at runtime, each corresponding to a different micro-batch.
In state 2 of Figure 1, besides O₂, an additional M₁ or M₂ for the next micro-batch is ready to run. If the scheduler starts M₁ or M₂ before O₂, they will consume part or all of the free memory. Then neither O₁ nor O₂ can execute successfully, and the outputs of the data movement ops can never be consumed and released, resulting in deadlock.
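A toy model of this micro-batch deadlock, again assuming a hypothetical 5-unit device, where state 2 means M₁ (1 unit) and M₂ (2 units) of micro-batch 0 already hold their outputs:

```python
# Hypothetical model: a 5-unit device in state 2 of Figure 1, with the
# movement outputs of micro-batch 0 resident and 2 units free.

TOTAL = 5
resident = {("M1", 0): 1, ("M2", 0): 2}

def free_units():
    return TOTAL - sum(resident.values())

# Greedy mistake: M2 of micro-batch 1 is also "ready", and its 2-unit
# output fits, so the scheduler launches it before O2.
resident[("M2", 1)] = 2
assert free_units() == 0

# Now neither computing op fits (O2 needs 2 units, O1 needs 3), yet the
# movement outputs can only be released by running the computing ops
# that consume them -- the pipeline is deadlocked.
assert free_units() < 2 and free_units() < 3
```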
Further, if there are downstream ops of O₁ and O₂, they may be ready to run after executing O₂, which makes scheduling more difficult.
It’s Not Easy to Specify the Execution Order with Control Dependency
Specifying the execution order of ops by adding control dependencies presupposes that we can find an execution order that completes successfully under the resource constraints, which is itself very difficult.
Figure 2 illustrates a pipeline for overlapping data-loading, pre-processing, copyh2d (i.e., copying data from host memory to device memory), and computation. The pipeline is implemented with the standard double buffering technique. Each op allocates 2 units of memory, where the blue and gray blocks indicate that the memory is occupied by data, while the white blocks represent idle memory.
As illustrated in Figure 2, within the same batch, each downstream op depends on its upstream op (e.g., in batch 6, pre-processing depends on data-loading, copyh2d depends on pre-processing, and computation depends on copyh2d). However, once the pipeline enters the steady phase, back-pressure forces the other ops to execute at the speed of the slowest op (i.e., the computation).
This leads to some counter-intuitive dependencies in which an upstream op working on a new batch depends on a downstream op working on an old batch. For example, batch 4 of copyh2d depends on batch 2 of computation. Such a pipelining effect is hard to accomplish by specifying the execution order with control edges.
Indeed, it is difficult (if not impossible) for the runtime and scheduling mechanisms of existing frameworks to support this pipelining elegantly. This is why, in addition to the graph executor, extra modules such as NVIDIA DALI and the dataset API in TensorFlow are needed.
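The back-pressure described above can be modeled with a small sketch (stage and function names are illustrative): each stage owns two buffers, so it has both a forward data dependency on its upstream stage and a backward resource dependency on its downstream stage two batches earlier.

```python
# Stages of the Figure 2 pipeline; each stage owns 2 output buffers.
# done[s] is the last batch that stage s has finished (-1 if none).
STAGES = ["load", "preprocess", "copyh2d", "compute"]

def can_start(done, stage, batch):
    # Forward (data) dependency: the upstream stage produced this batch.
    upstream_ok = stage == 0 or done[stage - 1] >= batch
    # Backward (resource) dependency: with only 2 buffers, the buffer
    # written for batch-2 must already be consumed downstream.
    last = len(STAGES) - 1
    buffer_ok = batch < 2 or stage == last or done[stage + 1] >= batch - 2
    return upstream_ok and buffer_ok

# The counter-intuitive edge from the article: copyh2d (stage 2) of
# batch 4 needs compute (stage 3) to have finished batch 2.
done = [4, 4, 3, 1]                       # compute has only finished batch 1
assert not can_start(done, stage=2, batch=4)
done[3] = 2                               # compute finishes batch 2 ...
assert can_start(done, stage=2, batch=4)  # ... now copyh2d batch 4 may start
```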
Implementing Double Buffering Pipeline by A Customized Memory Allocator
You may wonder whether it is possible to implement double buffering with the core mechanisms of the existing frameworks. Figure 3 illustrates a potential solution in TensorFlow, where a customized memory allocator is added that limits each op to at most two buffers.
- When an op uses the allocator to allocate memory, the allocator queries the counter to verify whether the op has a free buffer. If so, the allocator allocates a buffer so that execution can continue; once the computation completes, the buffer is released. If both of the op's buffers are already occupied, the allocator puts step 2 (do compute) and step 3 (release) into a waiting list.
- When an op releases its memory back to the allocator, the allocator updates the free-buffer counter for that op and checks whether any op in the waiting list is requesting this buffer. If so, the allocator pops the now-ready op from the waiting list and conducts steps 2 and 3.
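The two bullet points above can be sketched roughly as follows (assumed names and structure, not real TensorFlow code): an op's computation is parked in a waiting list when both of its buffers are busy, and resumed when a consumer releases one.

```python
from collections import deque

class DoubleBufferAllocator:
    """Each op may hold at most two live output buffers at a time."""

    def __init__(self, ops, buffers_per_op=2):
        self.free = {op: buffers_per_op for op in ops}
        self.waiting = deque()              # deferred (op, compute_fn)

    def allocate(self, op, compute_fn):
        # Step 1: check the free-buffer counter for this op.
        if self.free[op] > 0:
            self.free[op] -= 1
            compute_fn()                    # step 2: do compute
        else:
            # Both buffers busy: park the computation until release().
            self.waiting.append((op, compute_fn))

    def release(self, op):
        # Called when a downstream consumer is done with op's buffer.
        self.free[op] += 1
        for i, (w_op, fn) in enumerate(self.waiting):
            if self.free[w_op] > 0:         # a parked op can now run
                del self.waiting[i]
                self.allocate(w_op, fn)
                break

alloc = DoubleBufferAllocator(["copyh2d"])
ran = []
for b in range(3):
    alloc.allocate("copyh2d", lambda b=b: ran.append(b))
assert ran == [0, 1]            # batch 2 is parked: both buffers in use
alloc.release("copyh2d")        # a consumer frees one buffer ...
assert ran == [0, 1, 2]         # ... and the parked batch runs
```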
The double buffering pipeline can be supported with this customized memory allocator. However, the approach is impractical from a software engineering perspective, because part of the scheduler's responsibility is shifted to the customized allocator, and the internals of the scheduler module have to be exposed to it.
How to Resolve Resource Dependency Elegantly
The root of the problem revealed by the above thought experiment is that the existing frameworks neither take into account the implicit dependencies caused by shared resources nor express such dependencies explicitly. Instead, the success of the execution depends entirely on luck, that is, on the dynamic state of the runtime.
In their engineering implementations, the schedulers of existing frameworks launch an op as soon as they find that its input data is ready, and memory for the op's output is allocated only while the op is executing. The logic of allocating memory and the logic of op computation are not separated.
The distinguishing feature of OneFlow is that its compiler determines the optimal memory quota for each op, and this quota is fixed at runtime. When deciding whether an op can be launched, the scheduler not only checks whether the op's input data is available but also verifies whether the op has a free memory quota.
Practice has shown that such a small change greatly simplifies the system and solves the tricky problems mentioned above. For more details, please refer to OneFlow's actor mechanism.
Figure 4. The finite state machine of an op, where the trigger action can execute only if the register out and in are both available.
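A minimal sketch of this launch condition (illustrative Python, not OneFlow's actual C++ implementation): the op's state machine fires only when its in register has data available *and* its out register has a free slot within the fixed memory quota.

```python
class ActorOp:
    """Toy version of the finite state machine in Figure 4."""

    def __init__(self, quota):
        self.in_ready = 0       # input registers with data available
        self.out_free = quota   # free output registers (fixed quota)

    def can_trigger(self):
        # Both conditions are checked *before* launch, so a shared-
        # resource conflict can never be discovered mid-execution.
        return self.in_ready > 0 and self.out_free > 0

    def trigger(self):
        assert self.can_trigger()
        self.in_ready -= 1      # consume one input register
        self.out_free -= 1      # claim one output register

op = ActorOp(quota=1)
op.in_ready = 1
assert op.can_trigger()
op.trigger()
assert not op.can_trigger()     # quota exhausted until output is consumed
```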
Up to now, few developers of deep learning frameworks have been aware of this problem: in almost all frameworks, the logic of allocating memory and the logic of op computation are not separated.
But there are exceptions. Recently, Huawei’s MindSpore transformed its underlying graph executor from the old design into the same actor mechanism as OneFlow.
In MindSpore’s new design, although the condition for judging whether an op can execute still depends only on whether its input data is available, it is an improvement over TensorFlow and other frameworks: the logic of allocating memory is decoupled from the logic of op computation by a dedicated actor responsible for memory allocation. After allocating, this actor sends a message to the op that requested the memory so that the op can execute unimpeded.
This design may be better than the customized allocator discussed above, but it is still not as simple as the mechanism of OneFlow.
Google is also developing a new generation of runtime system TensorFlow Runtime to completely replace the bloated runtime system of TensorFlow in the future. In the design of the new version, they have a clear understanding of the issues discussed in this article. Here are some descriptions in their design documents:
Existing TF kernels encapsulate shape computation and memory allocation within the kernel implementation, making some graph compiler optimizations challenging or infeasible, such as reusing buffers across kernels. In TFRT kernels, shape computation and memory allocation will be hoisted out of the opaque C++ kernel implementations.
A core design principle of TFRT is that kernel executions are never allowed to block, as this allows us to have fine-grained control over the number of compute threads and the thread switching behavior, which is important for achieving high CPU utilization.
Interested readers can study the implementation of the new TensorFlow Runtime, which sets “elegance” as one of its design goals. By contrast, the problems discussed in this article were solved in OneFlow from the very beginning of its design, and its solution is probably the simplest one so far.