Written by Yuan Jinhui; Translated by Wang Kaiyan, Dong Wenwen
In the previous article, we discussed the Occam's Razor guideline for combating the complexity of software systems, emphasizing streamlined concepts and reflecting the minimalist aesthetic of "less is more". This article emphasizes the consistency, unity, and universality of concepts: capturing the essence of things, and covering and explaining the widest range of phenomena with the fewest concepts.
Conceptual Integrity
Fred Brooks, the great master of software engineering, introduced Conceptual Integrity, which he considered to be the most central and important principle in system design:
I will contend that conceptual integrity is the most important consideration in system design. It is better to have a system…to reflect one set of design ideas, than to have one that contains many good but independent and uncoordinated ideas.
In short, consistent ideas are preferable to a group of excellent but separate or unrelated ideas.
Echoing Fred Brooks, Smalltalk designer Daniel H. H. Ingalls describes what good design means when discussing the design principles behind Smalltalk:
A system should be built with a minimum set of unchangeable parts; those parts should be as general as possible; and all parts of the system should be held in a uniform framework.
This definition is very comprehensive, where “minimum” is the subject of our previous article; “unchangeable” and “general” are the subject of this article.
Further, Daniel H. H. Ingalls argues that one way to achieve good design is to use a "uniform metaphor":
A language should be designed around a powerful metaphor that can be uniformly applied in all areas.
Following conceptual integrity or a uniform metaphor helps programmers who understand one part of the system easily understand another part. It gives the whole system a unified soul, so that the system stays integrated and does not fall apart or spin out of control during iterative evolution.
What positive examples can we think of? (Material comes from Design Principles Behind Smalltalk (virginia.edu))
Unix: Everything is a file. Directories, devices, file systems, named pipes, sockets, etc. are all based on the concept of “file”.
Smalltalk: Everything is an object, including how objects interact with each other in terms of messaging.
SQL: Everything is in a table, including keys, relational constraints, etc.
Lisp: Everything is a list.
In this article, we take the example of unifying engine-dependent programming models in standalone and distributed deep learning to see how the “concept of global consistency” can simplify system complexity.
Shared Memory Model vs. Message Passing
As shown above, in the previous article we derived a minimalist distributed dependency engine based on the principle of minimal abstraction. However, this engine mixes two concurrent programming models: the shared memory model and the message passing model. The overall computation graph is partitioned into subgraphs according to the machine on which each op resides; each subgraph manages concurrent execution internally with the shared memory model, while coordination between subgraphs is done by message passing.
The shared memory model and message passing are two classic models of explicit concurrent programming. They are functionally equivalent, and each has its own practical advantages and disadvantages, as illustrated by the following example from Stack Overflow (What's the difference between the message passing and shared memory concurrency models? — Stack Overflow).
Shared memory model: multiple concurrent modules communicate by asynchronously modifying data in a shared address space. This inevitably introduces race conditions, so locks are used to coordinate the concurrent threads. The shared memory model is more common in multi-threaded programming, and it is harder to write efficient and correct code with it.
An analogy: a group of people works collaboratively around a table with a pile of paper (the data) that everyone can read and write. They can only communicate by modifying the contents of the paper on the table, and conflicts caused by multiple people modifying the same sheet at the same time must be avoided.
Message passing: multiple concurrent modules maintain their own private state or data and communicate by passing messages synchronously or asynchronously. No lock is needed to modify a module's private state or data; only the message queue needs to be locked, and message passing code is usually easier to understand and maintain. This pattern is more common in distributed scenarios and is natively supported in some modern programming languages, such as Go.
An analogy: a group of people works together around a table, but each person has a pile of paper in front of them and can only modify that pile. When someone else needs to modify the content of a sheet, its current holder must pass it to that person as a "message", which avoids the trouble of multiple people modifying the same sheet at the same time.
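To make the contrast concrete, here is a minimal C++ sketch (not taken from any framework; all names are illustrative): two workers either mutate a shared counter under a lock, or keep a private count and report it through a locked mailbox.

```cpp
#include <mutex>
#include <queue>
#include <thread>

// Shared memory model: both threads mutate the same counter,
// so a lock is required to avoid a race condition.
int shared_counter = 0;
std::mutex counter_mu;

void SharedMemoryWorker() {
  for (int i = 0; i < 1000; ++i) {
    std::lock_guard<std::mutex> lock(counter_mu);  // coordinate via the lock
    ++shared_counter;
  }
}

// Message passing: each worker owns its private count; only the
// mailbox (message queue) is locked, and the result is sent as a message.
std::queue<int> mailbox;
std::mutex mailbox_mu;

void MessagePassingWorker() {
  int local_count = 0;  // private state, no lock needed
  for (int i = 0; i < 1000; ++i) { ++local_count; }
  std::lock_guard<std::mutex> lock(mailbox_mu);
  mailbox.push(local_count);  // communicate the result as a message
}

int main() {
  std::thread t1(SharedMemoryWorker), t2(SharedMemoryWorker);
  std::thread t3(MessagePassingWorker), t4(MessagePassingWorker);
  t1.join(); t2.join(); t3.join(); t4.join();
  return 0;
}
```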
How to Unify the Programming Model?
In distributed deep learning frameworks, it would reduce the complexity of the system and the cognitive burden on users if the dependency engine mechanisms within and across machines could be unified.
Different system designs are obtained by enumerating all cases according to whether shared memory or message passing is used within and across machines:
a. Use shared memory model within and across machines.
b. Use message passing within and across machines.
c. Use shared memory model within the machine and message passing across machines.
d. Use message passing within the machine and shared memory model across machines.
Which one is the most appropriate? These two factors may need to be considered:
- When abstracting dependencies within a machine, the shared memory model is more direct, since it builds directly on the multi-threading and locking mechanisms provided by the system; when abstracting dependencies across machines, message passing is the natural choice. This is also what the vast majority of DL frameworks do today, i.e., scheme c; the opposite, scheme d, is out of the game from the start.
- When abstracting cross-machine dependencies, message passing is more natural than the shared memory model. If you want to use the shared memory model, you need to implement the semantics of distributed shared memory (somewhat similar to a parameter server or key-value store). Although scheme a also unifies intra-machine and cross-machine mechanisms, it is not natural when abstracting cross-machine dependencies.
So it comes down to a contest between scheme b and scheme c. The key question is: should the dependency engine inside a machine use the shared memory model or message passing?
Why Is Message Passing Favored?
The main benefit of message passing is modularity: it forces programmers to think clearly about the ownership of state and resources. Each module can only modify the data and state it owns, not the data owned by other modules, so it does not need to pay attention to anything unrelated to itself. This is a good example of the separation of concerns principle and greatly lowers the barrier to concurrent programming.
Many people have heard a quote from Rob Pike, the designer of the Go language:
Do not communicate by sharing memory; instead, share memory by communicating.
He clearly states his view that message passing is superior to the shared memory model in concurrent programming. Modern programming languages like Go, in order to make concurrent programming less difficult, provide developers with a message passing programming model rather than a shared memory one.
The C++ language does not provide a built-in message passing programming model, but it is easy to support one on top of message queues; for example, OneFlow implements its multi-threaded message passing mechanism with just a locked-queue Channel class.
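As a rough illustration of what such a locked-queue channel might look like (a simplified sketch, not OneFlow's actual Channel class; the names and details are illustrative), one possible C++ implementation is:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

// A minimal locked-queue channel: Send/Receive coordinate threads by
// passing messages, and only the internal queue is protected by a lock.
template <typename T>
class Channel {
 public:
  void Send(T msg) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(std::move(msg));
    }
    cv_.notify_one();
  }

  // Blocks until a message arrives or the channel is closed and drained.
  std::optional<T> Receive() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !queue_.empty() || closed_; });
    if (queue_.empty()) { return std::nullopt; }  // closed, nothing left
    T msg = std::move(queue_.front());
    queue_.pop();
    return msg;
  }

  void Close() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      closed_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::queue<T> queue_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool closed_ = false;
};
```

In such a design, a producer thread calls Send, a consumer thread blocks in Receive, and Close lets the consumer drain the remaining messages and then exit its loop.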
Message passing is not only natural in distributed computing scenarios; it also brings clean abstraction, simplicity, and ease of getting things right in single-machine multi-threaded scenarios. By unifying the dependency engine mechanism within and across machines, the concepts of the whole system become very easy to understand. If the implementation is clever enough, there is no need to distinguish between single-machine and distributed at the conceptual level: the whole distributed system is integrated, and its principles can be explained with the "shortest textual description".
Actor-based or Channel-based?
There are two ways of thinking about sharing by communicating: one is object-centric, i.e., the actor model; the other is channel-centric, based on communicating sequential processes (CSP) theory. Both are widely used in industry.
The channel-centric abstraction provides the concept of a channel: the two communicating parties use send and recv operations to put messages into and take messages out of the channel, and the same channel can be used by different objects. A typical representative is the Go language, which provides a built-in channel type.
In the object-centric abstraction, each actor has its own mailbox, and if an object wants to send a message to another, it must know the recipient's address. This is typically represented by the Erlang language, which provides the actor programming model, and by newer languages such as Swift, which also supports the actor model natively.
Specifically in the area of distributed DL frameworks, which way should we choose?
My understanding is that if the purpose is to provide infrastructure for upper layers to invoke, the channel abstraction is generally used, because the underlying infrastructure does not know how it will be used in the future and cannot model the communicating objects in advance. Examples include the TCP/IP socket model, the ZeroMQ programming library, and Go channels. Recently, PyTorch has also developed the TensorPipe library, likewise based on the channel abstraction, to better support distributed scenarios.
When developing a concrete software solution, the actor abstraction may be more appropriate if the communicating objects are clearly understood. In such a problem, besides handling message passing, it is more important to be modular: to distill what the communicating objects are and what state and behavior each object has. A channel cannot express this information; it has to be supported by an actor model.
Of course, the above principles are not absolute. An underlying library can also use actors, for example by providing a base class from which future upper-level applications can derive. Conversely, when developing a concrete solution, you can still use the channel abstraction even if you have a clear understanding of the communicating objects.
In DL frameworks, each op or subgraph has its own state, resources, and behavior. We need to abstract these properties, and the actor model is a perfect fit.
The actor model and object-oriented programming (OOP) share the same lineage. Turing Award winner Alan Kay, a pioneer of OOP, believes that messaging is more essential and important than objects.
I’m sorry that I long ago coined the term “objects” for this topic because it gets many people to focus on the lesser idea. The big idea is “messaging”.
Joe Armstrong invented Erlang (Erlang (programming language) — Wikipedia), a concurrency-oriented programming language. Erlang natively supports the actor mechanism, and the core elements considered to belong to a concurrency-oriented programming language are (list abridged):
- Everything is a process (a process here is not the operating-system concept; it can be understood as an actor)
- Message passing is the only way for processes to interact
- Each process has a unique name (address)
- If you know the name of the process, you can send messages to it
- Processes do not share resources with each other
Swift has had a built-in actor mechanism since version 5.5 (swift-evolution/0306-actors.md at main · apple/swift-evolution (github.com)).
OneFlow has used the actor model from the beginning of its design and is the industry's first DL framework designed around this concept. We are often asked why OneFlow uses the actor model and what the benefits of actors are. Our experience is that actors, as user-level threads, bring not only efficiency benefits but, more importantly, simplicity and consistency in abstraction.
Actor in OneFlow
OneFlow's actor mechanism has been explained before (Runtime of OneFlow Based On Boxing and Actor Model (Part 3)), so I will only briefly introduce its basic principles here.
Each op (which can also be a subgraph of ops) is managed by an actor, which can be understood as a thin wrapper around the op: the actor manages the op's state (its dependency conditions), and the op is the actor's action. When an op finishes executing, the actor managing it sends a req message to the actors that consume the op's output data (the message contains only a pointer to the data; the real data is not put in the message body, so the message is very lightweight), informing the consumers that new data is ready. When a consumer receives the message, it updates the counters (state) of its dependency conditions; if the trigger conditions of the op it manages are met, it launches the op, which reads the producer's data during execution. When the reading is finished, the consumer sends an ack message back to the producer, informing it that the memory of that data can be reclaimed.
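The following C++ skeleton sketches the bookkeeping just described; the message fields, counters, and trigger condition are simplified illustrations, not OneFlow's actual data structures.

```cpp
#include <cstdint>

// Illustrative actor skeleton for the req/ack protocol described above.
enum class MsgType { kReq, kAck };

struct ActorMsg {
  MsgType type;
  int64_t src_actor_id;
  const void* regst;  // pointer to the produced data; the data itself is not copied
};

class Actor {
 public:
  // Called by the runtime whenever a message arrives in this actor's mailbox.
  void ProcessMsg(const ActorMsg& msg) {
    if (msg.type == MsgType::kReq) {
      ++ready_inputs_;         // an upstream output has become ready
    } else {                   // kAck: a consumer finished reading our output
      ++free_output_buffers_;  // that output buffer can now be reused
    }
    if (IsReadyToAct()) { Act(); }
  }

 private:
  // Trigger condition: all inputs ready and a free output buffer available.
  bool IsReadyToAct() const {
    return ready_inputs_ == num_inputs_ && free_output_buffers_ > 0;
  }

  void Act() {
    // 1. Launch the wrapped op on the ready inputs.
    // 2. Send kReq messages to the consumers of the newly produced output.
    // 3. Send kAck messages back to the producers of the consumed inputs.
    ready_inputs_ = 0;
    --free_output_buffers_;
  }

  int num_inputs_ = 1;
  int ready_inputs_ = 0;
  int free_output_buffers_ = 1;
};
```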
In the example shown above, the data produced by actor1 is consumed by actor2, and the data produced by actor2 is consumed by actor3. It is worth noting that this messaging protocol applies both to actors on the same machine and to actors on different machines. Whether the producer and consumer are on the same machine or not, they operate through the logic described above; there is no difference between the two cases at the actor abstraction level.
The figure above shows an actor system running on two machines. Actors are similar to user-level threads, and actors with the same device context are bound to the same OS thread.
As shown by arrow 1 in the figure above, messages sent between actors on the same OS thread can go through the local message queue (LMQ), for example, between actor_a and actor_b.
As shown by arrows 2 and 3 in the figure above, messages sent between actors on different threads on the same machine need to be routed through the actor message bus, for example, between actor_a and actor_c.
As shown by arrow 4 in the figure above, when actor_a sends messages to actor_d (which manages the recv op responsible for data movement), the message abstractly follows the dotted line of arrow 4. The actual cross-machine delivery is implemented by arrows 5, 6, and 7, but these paths are transparent to the actor system, so cross-machine message passing is no different from intra-machine message passing at the abstraction level.
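A sketch of that dispatch decision might look like the following C++; the function names, the locality checks, and the stubbed transports are hypothetical, and the real OneFlow routing differs in detail.

```cpp
#include <cstdint>
#include <iostream>

struct ActorMsg { int64_t dst_actor_id = 0; };

// Stubs standing in for the runtime's knowledge of where actors live.
bool OnSameThread(int64_t /*dst*/)  { return false; }
bool OnSameMachine(int64_t /*dst*/) { return true; }

// Stubs standing in for the three transports in the figure.
void PushToLocalMsgQueue(const ActorMsg&) { std::cout << "local message queue\n"; }
void PushToActorMsgBus(const ActorMsg&)   { std::cout << "actor message bus\n"; }
void PushToNetworkActor(const ActorMsg&)  { std::cout << "network (arrows 5-7)\n"; }

// The sender's code is identical regardless of where the destination lives;
// only the transport chosen below differs.
void SendMsg(const ActorMsg& msg) {
  if (OnSameThread(msg.dst_actor_id)) {
    PushToLocalMsgQueue(msg);       // arrow 1: same OS thread
  } else if (OnSameMachine(msg.dst_actor_id)) {
    PushToActorMsgBus(msg);         // arrows 2 and 3: across threads
  } else {
    PushToNetworkActor(msg);        // arrow 4 abstractly; physically 5, 6, 7
  }
}

int main() { SendMsg(ActorMsg{42}); }
```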
Minimal Abstraction for Distributed Deep Learning Frameworks
In addition to the unification of standalone and distributed dependency engines with message passing discussed in this article, the principle of conceptual integrity is practically ubiquitous in OneFlow designs, for example:
- Integrity of movement and computation: based on the idea that data movement is a first-class citizen, data movement is also abstracted into ops, and the scheduling mechanisms for data movement and computation are completely consistent.
- Integrity of the single-machine and distributed framework: as discussed above, the single-machine and distributed dependency engines are unified through the message passing mechanism, rather than having two coexisting sets of programming patterns as in other frameworks.
- Integrity of scheduling granularity: actors, as user-level threads, support full-link asynchrony and facilitate static planning of thread resources.
- Integrity of message passing protocols: the working protocol between actors has clear rules: the producer pushes req messages to the consumer, and the consumer pulls the producer's data and sends ack messages back to the producer, etc.
- Integrity of compile-time and run-time actors: the execution plan generated by the compiler from the logical graph consists of a series of tasks, which are the compile-time counterparts of actors. OneFlow guarantees a one-to-one correspondence between compile-time tasks and run-time actors, so all runtime information can be traced by analyzing the execution plan.
Based on the above design ideas, OneFlow is probably the distributed deep learning framework with the simplest abstraction, and the easiest architecture to understand. We believe it is also the most hardware-vendor-friendly abstraction, which greatly reduces the difficulty of implementing distributed deep learning: hardware vendors only need to develop single-device compute libraries and communication libraries to get the most powerful distributed deep learning features with OneFlow.
Other Applications of Actor in the Field of Deep Learning
In a recent interview, legendary chip architect Jim Keller described the software stack his team at Tenstorrent implemented for its AI chips, which sounds very similar to the message passing mechanism.
we have a really nice abstraction in the hardware that says (i) do compute, (ii) when you need to send data to put the data in the pipe, (iii) when you push it in the pipe, the other end pulls it out of the pipe. It’s very clean. That has resulted in a fairly small software team, and software you can actually understand. (this is message passing; the pipe is similar to a channel)
Then when we hook up multiple chips, the communication on a single chip compared to the communication from chip to chip is not any different at the software layer. So while the physical transport of data is different on-chip with a NOC versus off-chip with Ethernet, the hardware is built so that it looks like it is just sending data from one engine to another engine — the software doesn’t care. The only thing that does is the compiler, which knows the latency and bandwidth is a little different between the two. (the consistent abstraction of intra-chip and inter-chip communication is similar to the idea in this article)
It is worth noting that other frameworks, such as Huawei's MindSpore, are starting to follow suit and implement the actor model.
PyTorch has recently introduced a number of new designs to support distributed deep learning, and one of the big changes is the introduction of the underlying communication library TensorPipe. This abstraction is closer to a channel, but it is used only for underlying communication (as the mechanism beneath PyTorch's RPC) rather than as an abstraction at the computation graph level. I believe this incomplete use is limited by PyTorch's dynamic graph design.
The actor mechanism is only suitable for the static graph model: only in static graphs do producers and consumers have fixed addresses, so a producer knows which consumers are downstream of it and can communicate with them via messages. In the dynamic graph model, each operation only knows its history and cannot see its downstream consumers, so this mechanism cannot be used.
Dynamic graphs are an essential tool for improving the user experience, but once you get past the "write and debug the model" stage, static graphs are more appropriate from the perspective of deployment, training, and AI chips. PyTorch is also enhancing its static graphs and the mechanism for converting between dynamic and static graphs. We believe that PyTorch will evolve toward the actor mechanism sooner or later.
Conclusion
OneFlow pursues the ultimate in efficiency but does not sacrifice abstraction for efficiency; rather, the pursuit of concise abstraction transcends efficiency. High efficiency is just a natural consequence of a minimalist system, while concise abstraction can be considered an essential advantage of OneFlow.
In this series of articles, we have seen how OneFlow minimizes the abstraction of distributed deep learning systems with actors: unifying computation and data movement, single-machine and distributed execution, resource scheduling granularity, message passing protocols, and the basic units of compile time and runtime.
We prefer the simple design, just as we believe that the essence of complexity is simplicity, and that there must be a principle of simplicity, elegance and integrity governing the operation of everything in this complicated world.
Programmers are the "makers" of software systems. If you build a complex system carelessly, you may be defeated by your own work and lose control of it; but if you build it with careful consideration and a concise, consistent worldview, the complex system will work well, stay under control, and remain open to unknown future needs.
For distributed deep learning frameworks in particular, the greater advantage of a minimalist system is flexibility and generality. A universal set of mechanisms makes it more efficient to support new models with similar requirements and eliminates the pain of customizing for each new requirement.
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.