OneFlow’s Optimization of CUDA Elementwise Template Library: Practical, Efficient, and Extensible
Written by Zekang Zheng, Chi Yao, Ran Guo, Juncheng Liu; Translated by Xiaozhen Liu, Hengrui Zhang
Elementwise operation refers to applying a function transformation to every element of a tensor. In deep learning, many operators can be regarded as elementwise operators, such as common activation functions (like ReLU and GELU) and ScalarMultiply (multiplying each element of a tensor by a scalar).
For this elementwise operation, OneFlow abstracts a CUDA template.
In this article, we will introduce the design thoughts and optimization techniques of CUDA template.
OneFlow’s CUDA template helps developers get a CUDA Elementwise operator only by encapsulating the computational logic into a structure. Take ReLU as an example:
Such an Elementwise template is easy to use. It not only improves development efficiency but also ensures computing performance. We used Nsight Compute on an NVIDIA A100 40GB GPU to compare against PyTorch's Cast operator; the test casts a tensor from float type to half type. The runtime and bandwidth show that for every data shape tested, OneFlow outperforms PyTorch by 80-90%, and its bandwidth comes very close to the theoretical limit.
Next, we will introduce the design thoughts and optimization techniques of this template.
Deciding the Optimal Block Size and Grid Size
For how to choose the number of blocks and threads, see How to Decide the Optimal grid_size and block_size in CUDA Kernel, but the rules here are slightly different. The official CUDA documentation on Compute Capabilities mentions that:
- For mainstream architectures, the maximum number of 32-bit registers per thread block is 64 K
- The maximum number of 32-bit registers per thread is 255
On the premise of using the maximum number of registers per thread, each block can launch at most
64 * 1024 / 255 ≈ 257 threads, rounded down to a multiple of 2: 256. Therefore, we set the constant
constexpr int kBlockSize = 256;.
For the Grid Size, the rules are implemented in the GetNumBlocks function:
- The minimum number of thread blocks is 1
- The maximum number of thread blocks is the smaller of “the number of thread blocks needed to cover all elements” and “the number of waves * the number of SMs the GPU can schedule at one time * the maximum number of blocks per SM”, where the number of waves is fixed at 32
When the amount of data is small, this avoids launching too many thread blocks; when the amount of data is large, the number of thread blocks is a sufficiently large integer multiple of a wave, which keeps the actual GPU utilization high.
Using Vectorizing Operations
The computational logic of most Elementwise operators is simple; the real bottleneck is bandwidth utilization. NVIDIA's blog CUDA Pro Tip: Increase Performance with Vectorized Memory Access explains that vectorized memory operations can improve read/write bandwidth utilization. CUDA kernels can use a series of built-in vector types, such as float2 and float4, which treat 2 or 4 float values as a whole. High-performance training and inference libraries such as LightSeq use the float4 type extensively:
In practice, operators need to support different data types (such as int and half). If we used CUDA's built-in vector types directly, we would have to write a separate version of each operator for every data type, which increases the development burden. Therefore, we designed a Pack data structure to flexibly support vectorization of different data types.
We first define a PackType to represent the vectorized data. The size of the data it represents (after vectorization) is sizeof(T) * pack_size.
Then we introduce a union type Pack, which internally defines a PackType<T, pack_size> storage; member to occupy the space. The array T elem[pack_size]; shares that space with storage, which facilitates the later Elementwise operations: in subsequent calculations, we apply the functor to each element of the elem array to obtain the output.
CUDA supports packs of at most 128 bits, while the smallest floating-point type, half, is 16 bits, so at most 128 / 16 = 8 half values can be packed together. Therefore, we define two constants: kMaxPackBytes, the maximum number of bytes in a pack, and kMaxPackSize, the maximum number of elements in a pack.
When tracing the implementation in oneflow/core/cuda/elementwise.cuh, you will find that the template provides interfaces named Unary, Binary, and Ternary, for elementwise operators with one, two, and three inputs respectively.
The ReLU operator at the beginning of the article uses the Unary interface. Following the code further, we find that ApplyGeneric is eventually invoked after several layers of calls. The basic calling relationship is as follows:
The main work done in the CUDA kernel is:
- Create a functor based on the parameters
- Enter the loop and call the ApplyPack function on the packed data, processing one batch of packed data per iteration
- When the number of elements is not divisible by pack_size, the remaining tail elements are handled by designated threads
The code is as follows:
The ApplyPack function is defined as follows. It loops over the elements of a pack, calls the functor on each element of the elem array, collects the outputs, and returns the packed result:
Elementwise operator calling process is as follows:
Optimizing half2 Data Type
If the half data type is operated on directly, the resulting operator bandwidth is only on par with that of float. For this reason, CUDA provides a series of special instructions for half2. For example, __hadd2 adds two half2 values at once, thereby improving throughput.
In this situation, OneFlow specializes the ApplyPack function to use the special half2 instructions by calling the functor's apply2 function. The interface is as follows:
Taking the Cast operator as an example, we call the __float22half2_rn instruction inside the CastFunctor to convert a float2 value into a half2 value.
Expanding Multiple Operations
As mentioned above, the OneFlow template divides Elementwise operators into unary, binary, and ternary operations. Through the factory method pattern, all of these operators ultimately call ApplyGeneric uniformly.
This design is easy to extend: to support operations with more inputs, developers only need to write the corresponding factory method.
This article has introduced OneFlow's design and optimization of a high-performance CUDA Elementwise template. To conclude, let's summarize the advantages of this template:
- High performance. Operators written with this Elementwise template can saturate the machine's bandwidth and run fast.
- High development efficiency. Developers don’t need to pay too much attention to CUDA logic and related optimization methods, but only need to write calculation logic.
- Strong scalability. Currently, this template supports unary, binary, and ternary operations. Developers only need to write the corresponding factory method if there is a need to expand to support more inputs.
We hope this article helps you in your deep learning projects😊. If you want to try out OneFlow, you can follow the method described in this article. If you have any questions, comments💡, or suggestions for improvement, please feel free to leave them in the comments section below. In future articles, we'll introduce more details of OneFlow.