Written by Zekang Zheng, Chi Yao, Ran Guo, Juncheng Liu; Translated by Xiaozhen Liu, Hengrui Zhang
An elementwise operation applies a function to every element of a tensor. In deep learning, many operators can be regarded as elementwise operators, such as common activation functions (like ReLU and GELU) and ScalarMultiply (multiplying each element of a tensor by a scalar).
For such elementwise operations, OneFlow abstracts a reusable CUDA template.
In this article, we will introduce the design ideas and optimization techniques behind this template.
OneFlow’s CUDA template lets developers obtain a CUDA elementwise operator simply by encapsulating the computation logic in a struct. Take ReLU as an example:
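A simplified sketch of what this looks like (the functor and launch call in the OneFlow code base use a few extra macros, e.g. for error checking, and full namespaces, but the structure is the same):

```cpp
template<typename T>
struct ReluFunctor {
  // The only thing the developer writes: the per-element computation.
  __device__ T operator()(T x) const {
    const T zero = static_cast<T>(0);
    return x > zero ? x : zero;
  }
};

// Launch: apply the functor to elem_cnt elements, reading from `in`
// and writing to `out` on the given CUDA stream (namespaces abbreviated):
// cudaError_t err =
//     cuda::elementwise::Unary(ReluFunctor<float>(), elem_cnt, out, in, stream);
```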
Such an Elementwise template is easy to use. It not only improves development efficiency but also ensures computing performance. Using Nsight Compute on an NVIDIA A100 40GB GPU, we compared it with PyTorch’s Cast operator; the test case casts a tensor from float32 to half. The runtime and bandwidth results show that, for every data shape tested, OneFlow outperforms PyTorch by 80-90%, with bandwidth very close to the theoretical limit.
Next, we will walk through the design ideas and optimization techniques of this template.
Deciding the Optimal Block Size and Grid Size
For how to choose the number of blocks and threads, see How to Decide the Optimal grid_size and block_size in CUDA Kernel, but the rules here are slightly different. The official CUDA documentation on Compute Capabilities mentions that:
- For mainstream architectures, the maximum number of 32-bit registers per thread block is 64 K (i.e., 65,536 registers)
- The maximum number of 32-bit registers per thread is 255
On the premise that each thread uses the maximum number of registers, each block can launch at most 64 * 1024 / 255 ≈ 256 threads (rounded to a multiple of 2). Therefore, we define a constant constexpr int kBlockSize = 256;.
For the grid size, the rules are implemented in the function GetNumBlocks:
- The minimum number of thread blocks is 1
- The maximum number of thread blocks is the smaller of “the number of thread blocks needed to process all elements” and “the number of waves * the number of SMs on the GPU * the maximum number of blocks that can reside on one SM”. Here, the number of waves is fixed at 32
When the amount of data is small, this avoids launching too many thread blocks; when the amount of data is large, we set the number of thread blocks to a sufficient integer multiple of waves to keep the actual GPU utilization high enough.
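A sketch of GetNumBlocks that follows these rules (closely modeled on elementwise.cuh; cudaDeviceGetAttribute queries the SM count and the maximum resident threads per SM):

```cpp
#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;
constexpr int kNumWaves = 32;

inline cudaError_t GetNumBlocks(int64_t n, int* num_blocks) {
  int dev;
  {
    cudaError_t err = cudaGetDevice(&dev);
    if (err != cudaSuccess) { return err; }
  }
  int sm_count;
  {
    cudaError_t err = cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
    if (err != cudaSuccess) { return err; }
  }
  int tpm;  // maximum resident threads per SM
  {
    cudaError_t err = cudaDeviceGetAttribute(&tpm, cudaDevAttrMaxThreadsPerMultiProcessor, dev);
    if (err != cudaSuccess) { return err; }
  }
  // At least 1 block; at most the smaller of "blocks needed to cover n elements"
  // and "kNumWaves * SM count * resident blocks per SM".
  *num_blocks = std::max<int>(1, std::min<int64_t>((n + kBlockSize - 1) / kBlockSize,
                                                   sm_count * tpm / kBlockSize * kNumWaves));
  return cudaSuccess;
}
```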
Using Vectorized Operations
The computational logic of most elementwise operators is simple; the real bottleneck is memory bandwidth utilization. NVIDIA’s blog CUDA Pro Tip: Increase Performance with Vectorized Memory Access points out that vectorized memory accesses can improve read/write bandwidth utilization. CUDA provides a series of built-in vector types for this, such as float2 and float4, which treat 2 or 4 float values as a single unit. High-performance training and inference libraries such as LightSeq use the float4 type extensively:
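For illustration (a generic example of the pattern, not code taken from LightSeq), a ScalarMultiply-style kernel written with float4 loads and stores looks like this:

```cpp
__global__ void ScaleFloat4(const float4* in, float4* out, float alpha, int n4) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n4) {
    // One 128-bit load, four multiplies, one 128-bit store.
    float4 v = in[i];
    v.x *= alpha;
    v.y *= alpha;
    v.z *= alpha;
    v.w *= alpha;
    out[i] = v;
  }
}
```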
In practice, operators need to support different data types (such as int and half). If we relied only on CUDA’s built-in vector types, we would obviously need to write multiple versions of each operator, which increases the development burden. Therefore, we designed a Pack data structure to flexibly support vectorization of different data types.
We first define a PackType to represent vectorized data. The size of the data it represents (after vectorization) is sizeof(T) * pack_size.
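A sketch of the definition, closely following elementwise.cuh, where std::aligned_storage gives the vectorized type the size and alignment of pack_size elements of T:

```cpp
#include <type_traits>

template<typename T, int pack_size>
struct GetPackType {
  // An opaque chunk with the size and alignment of pack_size elements of T,
  // so loads/stores of it compile to a single wide memory instruction.
  using type = typename std::aligned_storage<pack_size * sizeof(T), pack_size * sizeof(T)>::type;
};

template<typename T, int pack_size>
using PackType = typename GetPackType<T, pack_size>::type;
```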
Then we introduce a union type Pack, which internally defines a member PackType<T, pack_size> storage; to occupy the space:
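A sketch of the union (simplified from the source; the default constructor is intentionally empty because the members are written through storage or elem):

```cpp
template<typename T, int pack_size>
union Pack {
  static_assert(sizeof(PackType<T, pack_size>) == sizeof(T) * pack_size, "");
  __device__ Pack() {
    // do nothing
  }
  PackType<T, pack_size> storage;  // vectorized view, used for wide loads/stores
  T elem[pack_size];               // per-element view, used by the functor
};
```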
T elem[pack_size]; shares that space with storage, which facilitates the later elementwise operation: in subsequent calculations, we apply the functor to each element of the elem array to obtain the output result.
CUDA supports vectorized accesses of at most 128 bits, and the smallest floating-point type, half, is 16 bits, so at most 128 / 16 = 8 half values can be packed together. Therefore, we define two constants: kMaxPackBytes, the maximum number of bytes in a pack, and kMaxPackSize, the maximum number of elements in a pack.
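In code, this corresponds to the following (128 bits = 16 bytes; PackSize here is a simplified, single-type sketch of how the pack size is chosen):

```cpp
constexpr int kMaxPackBytes = 128 / 8;  // widest vectorized access: 128 bits = 16 bytes
constexpr int kMaxPackSize = 8;         // e.g. 8 half (16-bit) elements fill 128 bits

// Pack size for a type T: as many elements as fit into 128 bits, capped at kMaxPackSize.
template<typename T>
constexpr int PackSize() {
  return static_cast<int>(kMaxPackBytes / sizeof(T)) < kMaxPackSize
             ? static_cast<int>(kMaxPackBytes / sizeof(T))
             : kMaxPackSize;
}
```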
Calling Chain
Looking at the implementation in oneflow/core/cuda/elementwise.cuh, you will find that this template provides interfaces named Unary, Binary, and Ternary, for elementwise operators with one, two, and three inputs respectively.
The ReLU operator at the beginning of the article uses the Unary interface. Tracing further, we find that ApplyGeneric is eventually invoked after several layers of calls. The basic calling relationship is as follows:
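Roughly, with the intermediate helper names taken from the source (slightly simplified):

```cpp
// Unary / Binary / Ternary                       -> user-facing entry points
//   UnaryWithFactory / BinaryWithFactory / ...   -> wrap the functor in a SimpleFactory
//     GenericLauncher<...>::Launch               -> pick the pack size and grid size
//       ApplyGeneric<<<num_blocks, kBlockSize, 0, stream>>>  -> the CUDA kernel
```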
The main work done in the CUDA kernel ApplyGeneric is:
- Create a functor based on the factory parameter
- Enter a loop and call the ApplyPack function on the packed data, processing one batch of packed data per ApplyPack call
- When the number of elements is not divisible by pack_size, let the threads process the remaining tail elements one by one

The code is as follows:
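The following is a simplified, single-input (unary) sketch of the kernel; it reuses kBlockSize, PackType, Pack, and ApplyPack from the sketches in this article, while the actual kernel in elementwise.cuh is variadic over the number of inputs:

```cpp
template<int pack_size, typename FactoryT, typename R, typename A>
__global__ void __launch_bounds__(kBlockSize)
ApplyGeneric(FactoryT factory, int64_t n_pack, Pack<R, pack_size>* pack_r,
             const Pack<A, pack_size>* pack_a, int64_t n_tail, R* tail_r, const A* tail_a) {
  // Build the functor on the device from the factory.
  auto functor = factory();
  const int global_tid = blockIdx.x * kBlockSize + threadIdx.x;
  // Grid-stride loop: each ApplyPack call processes pack_size elements.
  for (int64_t i = global_tid; i < n_pack; i += blockDim.x * gridDim.x) {
    pack_r[i].storage = ApplyPack<pack_size, decltype(functor), R, A>(functor, pack_a[i]);
  }
  // Tail: elements left over when n is not divisible by pack_size are
  // handled one at a time.
  if (global_tid < n_tail) { tail_r[global_tid] = functor(tail_a[global_tid]); }
}
```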
The ApplyPack function is defined as follows. It loops over the elements in a pack, calls the functor on each element of the elem array, collects the output results, and returns them as a pack:
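A simplified unary sketch (the real code additionally uses std::enable_if with a HasApply2 trait so that pairs of elements can be routed to the half2 specialization described later):

```cpp
template<int pack_size, typename FunctorT, typename R, typename A>
__device__ PackType<R, pack_size> ApplyPack(const FunctorT& functor,
                                            const Pack<A, pack_size>& in) {
  Pack<R, pack_size> ret;
#pragma unroll
  for (int j = 0; j < pack_size; ++j) {
    // Apply the elementwise functor to each element of the pack.
    ret.elem[j] = functor(in.elem[j]);
  }
  return ret.storage;
}
```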
The entire Elementwise
operator calling process is as follows:
Optimizing half2 Data Type
For the half data type, if we operate on it directly element by element, the operator’s bandwidth is no better than that of the float32 version.
CUDA provides a series of special instructions for the half2 type. For example, __hadd2 adds two half2 values with a single instruction, thereby improving throughput.
For this situation, OneFlow specializes the ApplyPack function so that it calls these half2-related instructions through the functor’s apply2 method. The interface is as follows:
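A sketch of that specialization, reusing the Pack/PackType definitions above; the two-element method is spelled Apply2 here, and HasApply2 is a small detection trait that drives the overload selection:

```cpp
#include <type_traits>

// Detection trait: value is true when FunctorT declares a member Apply2.
template<typename FunctorT>
struct HasApply2 {
  typedef char one;
  typedef long two;
  template<typename C> static one test(decltype(&C::Apply2));
  template<typename C> static two test(...);
  enum { value = sizeof(test<FunctorT>(0)) == sizeof(one) };
};

// Overload chosen when the functor provides Apply2 and pack_size is even;
// the generic ApplyPack is disabled by the complementary condition.
template<int pack_size, typename FunctorT, typename R, typename A>
__device__ typename std::enable_if<HasApply2<FunctorT>::value && pack_size % 2 == 0,
                                   PackType<R, pack_size>>::type
ApplyPack(const FunctorT& functor, const Pack<A, pack_size>& in) {
  Pack<R, pack_size> ret;
#pragma unroll
  for (int j = 0; j < pack_size; j += 2) {
    // Each call handles a pair of elements, e.g. with a single half2 instruction.
    functor.Apply2(ret.elem + j, in.elem + j);
  }
  return ret.storage;
}
```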
Taking the Cast operator as an example, inside the CastFunctor we call the __float22half2_rn instruction to convert a float2 value into a half2 value.
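A minimal sketch of such a functor for a float-to-half cast (Float32ToHalfFunctor is a hypothetical name; OneFlow’s actual CastFunctor is templated over the source and destination types):

```cpp
#include <cuda_fp16.h>

struct Float32ToHalfFunctor {
  // Scalar path: cast a single float to half.
  __device__ half operator()(float x) const { return __float2half(x); }

  // Pair path: convert two floats to one half2 with a single instruction.
  __device__ void Apply2(half* to, const float* from) const {
    const float2 f2 = make_float2(from[0], from[1]);
    *reinterpret_cast<half2*>(to) = __float22half2_rn(f2);
  }
};
```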
Expanding Multiple Operations
As mentioned above, the OneFlow template divides Elementwise operators into unary, binary, and ternary operations. Through the factory method pattern, all of them end up calling ApplyGeneric uniformly.
This design is easy to extend: to support operations with more inputs, developers only need to write the corresponding factory method.
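For reference, here is a sketch of the trivial factory that the unary path uses to wrap a plain functor; this is what ApplyGeneric receives and invokes on the device to construct the functor:

```cpp
template<typename FunctorT>
struct SimpleFactory {
  explicit SimpleFactory(FunctorT functor) : tpl(functor) {}
  // Called inside the kernel to construct the functor each thread uses.
  __device__ FunctorT operator()() const { return tpl; }

 private:
  FunctorT tpl;
};
```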
This article has introduced the design and optimization methods behind OneFlow’s high-performance CUDA Elementwise template. Finally, let’s summarize the advantages of this template:
- High performance. Operators using this Elementwise template can saturate the machine’s bandwidth and run fast.
- High development efficiency. Developers don’t need to pay much attention to CUDA logic and related optimization methods; they only need to write the computation logic.
- Strong scalability. Currently, this template supports unary, binary, and ternary operations. Developers only need to write the corresponding factory method if there is a need to expand to support more inputs.
I hope this article helps you in your deep learning projects 😊. If you want to try out OneFlow, you can follow the method described in this article. If you have any questions, comments 💡, or suggestions for improvement, please feel free to leave a comment below. In future articles, we’ll introduce more details of OneFlow.
Related articles:
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.