The Limitations of Existing Deep Learning Frameworks: Resource Dependency


Three Kinds of Dependencies between Ops

  • Ignoring shared resource dependencies is a fatal shortcoming in the design of DL frameworks, which will reduce the security and stability of the system;
  • Solving this problem within the architecture of existing frameworks is possible but difficult, and it compromises the elegance of the system's abstractions;
  • OneFlow provides an easy solution to this problem based on the actor mechanism.

An Experiment on Resource Dependency

Figure 1. An example that may cause deadlock with the scheduler in existing frameworks
  1. If O₂ is scheduled first, it executes successfully and consumes the output of M₂ in device memory. Once O₂ is done, its input and output can be released (state 4), which leaves enough memory to accommodate and then execute O₁.
  2. If O₁ is scheduled first, there is not enough memory for O₁. The scheduler, however, neither knows nor checks whether the current memory can satisfy O₁'s demand at the moment it dispatches O₁. If O₁ then blocks waiting for memory while O₂ is queued behind it, the memory is never freed and neither op can make progress, which is the deadlock the figure illustrates.
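The two cases above can be reproduced with a toy model. The following is a minimal sketch, not the scheduler of any real framework: the `Op` class, the `run_in_order` helper, and all memory sizes are hypothetical values chosen only so that O₁ does not fit until O₂ has released its buffers.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    needs: int  # memory units the op must allocate before it can run (hypothetical)
    frees: int  # memory units released once the op finishes (its input and output)


def run_in_order(ops, free_memory):
    """Dispatch ops strictly in the given order, without checking memory up front.

    Returns the names of the ops that completed. The loop stops at the first
    op whose allocation cannot be satisfied, because every op queued behind
    it is stuck as well.
    """
    completed = []
    for op in ops:
        if op.needs > free_memory:
            break  # the op blocks forever: nothing behind it can free memory
        free_memory -= op.needs
        free_memory += op.frees
        completed.append(op.name)
    return completed


# Assumed sizes: 2 units are free, O2 needs 1 unit and releases 3 when done
# (its own output plus the output of M2), O1 needs 3 units.
o1 = Op("O1", needs=3, frees=3)
o2 = Op("O2", needs=1, frees=3)

o2_first = run_in_order([o2, o1], free_memory=2)
o1_first = run_in_order([o1, o2], free_memory=2)
```

With these assumed sizes, `o2_first` contains both ops while `o1_first` is empty: the same graph either completes or stalls depending only on dispatch order, which a scheduler that ignores memory cannot control.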

It’s Not Easy to Specify the Execution Order with Control Dependency

Figure 2. Pipelining: the horizontal axis represents time, and the vertical axis shows the four stages (data loading, preprocessing, copyh2d, and computation), of which computation is the bottleneck

Implementing a Double Buffering Pipeline with a Customized Memory Allocator

Figure 3. Implementing a double buffering pipeline by a customized memory allocator.
  • When the op uses the allocator to allocate memory, the allocator will query the counter to verify whether the op has a free buffer to use. If so, the allocator will allocate a buffer for the op to make execution continue. Once the computation is completed, the buffer will be released. If both buffers of this op are already occupied, the allocator will put step 2 (do compute) and step 3 (release) into a waiting list.
  • When the op releases its memory to the allocator, the allocator updates the op's free-buffer counter and checks whether an op in the waiting list is requesting this buffer. If so, the allocator pops that op, which is now ready to run, from the waiting list and carries out step 2 and step 3 for it.
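The allocate and release behavior described above can be sketched in a few lines. This is an illustrative, single-threaded model under stated assumptions, not the allocator of any real framework: the class `DoubleBufferAllocator`, its methods, and the op name `"copyh2d"` used in the demo are hypothetical. The callback passed to `allocate` stands in for launching step 2 (compute); step 3 (release) is simulated by calling `release` explicitly when the compute would have finished.

```python
from collections import deque


class DoubleBufferAllocator:
    """Toy allocator that gives each op a fixed number of buffers (two by default)."""

    def __init__(self, buffers_per_op=2):
        self.buffers_per_op = buffers_per_op
        self.free_count = {}    # op name -> number of currently free buffers
        self.waiting = deque()  # (op name, callback) pairs blocked on a buffer

    def allocate(self, op, on_ready):
        """Step 1: request a buffer; run on_ready now if one is free, else park it."""
        free = self.free_count.setdefault(op, self.buffers_per_op)
        if free > 0:
            self.free_count[op] = free - 1
            on_ready()
            return True
        self.waiting.append((op, on_ready))
        return False

    def release(self, op):
        """Step 3: return a buffer, then wake the oldest waiter for the same op, if any."""
        self.free_count[op] += 1
        idx = next((i for i, (w, _) in enumerate(self.waiting) if w == op), None)
        if idx is not None:
            _, on_ready = self.waiting[idx]
            del self.waiting[idx]
            self.free_count[op] -= 1
            on_ready()


# Demo: three batches compete for the two buffers of one op.
started = []
alloc = DoubleBufferAllocator()
for batch in range(3):
    alloc.allocate("copyh2d", lambda b=batch: started.append(b))
started_before_release = list(started)  # the third batch is parked in the waiting list
alloc.release("copyh2d")                # a buffer comes back, so the parked batch starts
```

The point of the design is that back pressure is enforced entirely inside the allocator: the op never checks memory itself, and a request that cannot be served is deferred rather than blocking a thread.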

How to Resolve Resource Dependency Elegantly


OneFlow is a performance-centered and open-source deep learning framework.
