The Limitations of Existing Deep Learning Frameworks: Dynamic Scheduling


The Problem of Thread Pool Allocation

Scheduling Mechanism of the Computation Graph

Figure 1: Data flow and directed acyclic graph

Pedigree of Dynamic and Static Scheduling

Abstraction of Hardware Resources

Figure 2: Relationship between ops, task queues and hardware resources

How Many Streams Should a GPU Create?

  1. If two hardware resources that can run independently share a stream, they can only execute sequentially, never in parallel. For example, the DMA engines of an Nvidia GPGPU (copy host to device, copy device to host, etc.) and the compute cores are two different hardware resources. If transfers and computation share the same stream, it is impossible to overlap data copies with computation on this GPU. Therefore, different underlying hardware resources should use different streams (see the sketch after this list).
  2. If two ops that have no dependency and could run in parallel are dispatched to the same stream, they will execute sequentially simply because they share the stream. For example, suppose there are 1024 cores on the GPU. If each op needs less than half of the cores and the two ops are dispatched to different streams, the two ops can execute in parallel on different cores, so it seems that the number of streams cannot be too small. But if each op occupies all 1024 cores while executing, the two ops can only execute sequentially even if they are dispatched to two different streams.
  3. Suppose two ops with a dependency are dispatched to two different streams. Then the producer op in one stream must finish executing before the consumer op can be dispatched to the second stream. This means that synchronization (or waiting) is required between different streams, and coordinating multiple streams is usually complicated. If both ops are dispatched to the same stream, the consumer op can be dispatched without waiting for the producer op to finish, because the FIFO property of the stream guarantees that the second op starts executing only after the first op finishes. Using one stream simplifies stream management.
  • Different hardware resources should use different streams;
  • Ops with dependencies are best dispatched to the same stream, and ops without dependencies are best dispatched to different streams;
  • Creating multiple streams on the same hardware resource and creating only one stream on it each have their own pros and cons. So what should be done?
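To make the first point concrete, here is a minimal CUDA sketch of the overlap case. It is an illustration, not code from any particular framework: the `scale` kernel, the buffer sizes, and the stream names are assumptions made up for this example. The host-to-device copy and an independent kernel are issued on two different streams, so the DMA engine and the compute cores can work in parallel:

```cuda
#include <cuda_runtime.h>

// A trivial illustrative kernel; any compute-bound kernel works here.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_a, *d_a, *d_b;
    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost(&h_a, n * sizeof(float));
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));  // contents don't matter here

    // One stream for the DMA engine, one for the compute cores.
    cudaStream_t copy_stream, compute_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&compute_stream);

    // The copy on copy_stream and the kernel on compute_stream touch
    // different buffers, so the hardware can execute them in parallel.
    cudaMemcpyAsync(d_a, h_a, n * sizeof(float),
                    cudaMemcpyHostToDevice, copy_stream);
    scale<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_b, n);

    cudaStreamSynchronize(copy_stream);
    cudaStreamSynchronize(compute_stream);

    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(compute_stream);
    cudaFree(d_a); cudaFree(d_b); cudaFreeHost(h_a);
    return 0;
}
```

Had both operations been issued to the same stream, the FIFO ordering described in point 3 would force the copy to complete before the kernel starts, even though the two use unrelated hardware resources.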

Threading Model: How Many Threads Does A Stream Need?

Figure 3: Four threading models
  1. Using a single thread to perform both scheduling and launching. The thread accesses and modifies the state of the computation graph, but the ops may be dispatched to different devices. It should be noted that when dispatching tasks to a device, the device context needs to be switched accordingly, which brings a certain amount of overhead.
  2. Dividing the work of scheduling and launching. A dedicated thread manages the computation graph, and this thread doesn't launch CUDA kernels directly. Instead, launching is dispatched to worker threads, each of which serves a particular device. In this case, a worker thread does not pay the overhead of device context switching. Moreover, the computation graph is accessed only by the scheduling thread, so there is no need to consider concurrent reading and writing by multiple threads. However, every update of an operator's state must go through the scheduling thread as an intermediate proxy. For example, as shown in case 2, the producer op and the consumer op are on different worker threads. After the producer op is executed, the scheduling thread is notified to update the computation graph, and only then is the worker thread where the consumer op is located informed. If the producer op and the consumer op could talk directly, the overhead of going through the scheduling thread as an intermediate proxy could be avoided.
  3. To eliminate the proxy overhead in case 2, each thread is allowed to perform both scheduling and launching of the ops on a particular device, which still avoids device context switching. Each thread can access and modify the state of the computation graph, and ops on different threads with upstream-downstream (producer-consumer) relations can talk to each other directly. This eliminates the extra overhead of using the scheduling thread as an intermediate proxy between the producer and consumer ops. However, the global state must now be protected with locks against concurrent access from multiple worker threads, and a coarse-grained lock can severely hurt the overall performance.
  4. To minimize the contention overhead in case 3, observe that the states of different ops are independent of one another, so fine-grained locks can be used (e.g., creating a dedicated lock to protect the state of each op; see the sketch below). However, optimizing a program by reducing the scope of a lock is tricky and error-prone in general concurrent systems.
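The sketch below combines the structure of case 2 with the fine-grained synchronization of case 4. All names and types are hypothetical (this is not OneFlow's actual implementation): each device gets one queue and one worker thread, so kernels are always launched without device context switching, and each op carries its own atomic dependency counter, so a finished producer can notify its consumers directly without a global graph lock:

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical op node. Per-op state is decoupled, so a per-op atomic
// counter replaces one coarse lock over the whole graph (case 4).
struct OpNode {
    std::function<void()> launch;        // e.g., a CUDA kernel launch
    std::atomic<int> pending_inputs{0};  // producers not yet finished
    std::vector<OpNode*> consumers;
    int device = 0;
};

// A blocking queue feeding one worker thread per device (case 2).
class TaskQueue {
    std::queue<OpNode*> q_;
    std::mutex mu_;
    std::condition_variable cv_;
public:
    void push(OpNode* op) {
        { std::lock_guard<std::mutex> g(mu_); q_.push(op); }
        cv_.notify_one();
    }
    OpNode* pop() {  // blocks until an op (or a nullptr stop signal) arrives
        std::unique_lock<std::mutex> g(mu_);
        cv_.wait(g, [this] { return !q_.empty(); });
        OpNode* op = q_.front();
        q_.pop();
        return op;
    }
};

// Each worker owns one device: it never switches device context, and it
// both launches ops and updates their consumers' states directly.
void worker_loop(std::vector<TaskQueue*>& queues, int device_id) {
    while (OpNode* op = queues[device_id]->pop()) {  // nullptr ends the loop
        op->launch();
        // Direct producer-to-consumer notification: decrement each
        // consumer's own atomic counter instead of going through a
        // scheduler proxy (case 2) or taking a global lock (case 3).
        for (OpNode* c : op->consumers) {
            if (c->pending_inputs.fetch_sub(1) == 1) {
                queues[c->device]->push(c);  // all inputs ready: dispatch
            }
        }
    }
}
```

With this layout, a scheduling thread is only needed to seed the queues with the source ops of the graph; afterwards, execution is driven entirely by producers decrementing their consumers' counters, which is exactly the direct producer-consumer communication that cases 3 and 4 aim for.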

Optimal Number of Threads: How Many Operators Does A Thread Serve?

Asynchronous Execution, State Saving and Recovery

Summary

  1. The Limitations of Existing Deep Learning Frameworks: Resource Dependency
  2. The Limitations of Existing Deep Learning Frameworks: Data Movement
