How to Choose the Grid Size and Block Size for a CUDA Kernel?

Written by Juncheng Liu; Translated by Xiaozhen Liu, Chenyang Xu

In this article, a OneFlow engineer shares how to set grid_size and block_size when writing CUDA kernels.

In general, a CUDA kernel launch looks like the following:

cuda_kernel<<<grid_size, block_size, 0, stream>>>(...)

cuda_kernel is the identifier of the global function, and the (...) holds the arguments passed to cuda_kernel. Both follow ordinary C++ syntax. The <<<grid_size, block_size, 0, stream>>> part is a CUDA extension to C++, known as the execution configuration. The CUDA C++ Programming Guide (hereinafter called the Guide) describes it as follows:

The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:

Dg is of type dim3 (see dim3) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched;

Db is of type dim3 (see dim3) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;

Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is an optional argument which defaults to 0;

S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.

Dg represents the dimensions of the grid and Db represents the dimensions of the block. Both are of type dim3. For a one-dimensional launch, the y and z dimensions are both 1 and only x varies, so Dg and Db can each be replaced by a plain integer giving the x dimension, as shown at the beginning of this article. For more specific descriptions of grid dim and block dim, refer to Programming Model.

We will discuss what values should be taken for Dg and Db next.

grid_size and block_size represent the number of blocks and the number of threads in each block, respectively, for launching the kernel. Therefore, both values should be greater than 0.

Section K.1, Features and Technical Specifications, of the Guide lists both Maximum number of threads per block and Maximum x- or y-dimension of a block as 1024. Thus, block_size can be at most 1024.

In a block, a warp is made up of 32 consecutive threads. These 32 threads execute the same instruction at a time, a model known as SIMT (Single Instruction, Multiple Threads). The last warp of a block occupies the same hardware resources even if fewer than 32 of its threads are active. This is not a hard requirement for a launch to succeed, but to avoid wasting those resources, block_size should be an integer multiple of 32.
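The cost of a block_size that is not a multiple of 32 can be made concrete with a little arithmetic. The following is a minimal host-side C++ sketch, not CUDA code; the helper names warps_per_block and idle_lanes_in_last_warp are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Number of threads in a warp, as stated in the text.
constexpr int kWarpSize = 32;

// Warps occupied by one block: a partially filled last warp still counts as a whole warp.
int warps_per_block(int block_size) {
  assert(block_size > 0);
  return (block_size + kWarpSize - 1) / kWarpSize;
}

// Thread slots in the last warp that are occupied but do no work.
int idle_lanes_in_last_warp(int block_size) {
  return warps_per_block(block_size) * kWarpSize - block_size;
}
```

For instance, a block of 100 threads occupies 4 warps just like a block of 128 threads, but leaves 28 lanes of its last warp idle.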

A block is also known as a Cooperative Thread Array (CTA), as described in the following passage:

The Parallel Thread Execution (PTX) programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array, or CTA, is an array of threads that execute a kernel concurrently or in parallel.

Threads within a CTA can communicate with each other. To coordinate the communication of the threads within the CTA, one can specify synchronization points where threads wait until all threads in the CTA have arrived.

The hardware unit that executes a block is the SM (Streaming Multiprocessor). The SM provides the hardware resources that threads in the same block need for communication and synchronization. Communication across SMs is not supported, so all threads of a block execute on one SM. Also, because threads may synchronize with each other, once a block starts executing on an SM, all of its threads are resident on that SM at the same time (concurrently, not necessarily in parallel). In other words, scheduling a block onto an SM is atomic. An SM can execute more than one block concurrently: if an SM has enough idle resources to run a block, the block can be scheduled onto it immediately. These resources generally include registers, shared memory, and various scheduling-related resources.

The scheduling-related resources have two specific limits: Maximum number of resident blocks per SM and Maximum number of resident threads per SM, that is, the maximum number of blocks and threads that can be resident on an SM at the same time. GPUs are characterized by high throughput and high latency. A GPU is like an escalator that can carry 60 people to the next floor in one minute but cannot carry one person there in one second. To move enough people in a given time, the escalator must have enough people on it at once. For a GPU, this means keeping enough instructions in the pipeline at the same time.

There are many ways to achieve this goal. One simple way is to let as many threads as possible execute on the SM at the same time. The ratio of the number of threads concurrently resident on the SM to the maximum number of threads the SM supports is called "occupancy". The higher the occupancy, the higher the potential performance.
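As a rough illustration, the occupancy implied by the two scheduling limits alone can be computed as follows. This is a minimal host-side C++ sketch; the function name theoretical_occupancy is hypothetical, and the sketch deliberately ignores register and shared memory limits, which can lower the real figure.

```cpp
#include <algorithm>
#include <cassert>

// Occupancy implied by the per-SM thread and block limits only.
// Assumption: registers and shared memory are not the limiting resource.
double theoretical_occupancy(int block_size, int max_threads_per_sm,
                             int max_blocks_per_sm) {
  assert(block_size > 0 && max_threads_per_sm > 0 && max_blocks_per_sm > 0);
  // Blocks are scheduled whole, so only an integer number of blocks fits.
  const int resident_blocks =
      std::min(max_blocks_per_sm, max_threads_per_sm / block_size);
  return static_cast<double>(resident_blocks) * block_size / max_threads_per_sm;
}
```

With an SM that allows 2048 resident threads and 32 resident blocks, a block_size of 32 reaches only 50% occupancy because the block limit is hit first, while a block_size of 128 reaches 100%.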

The block_size of a kernel should be no less than the ratio of the maximum number of threads to the maximum number of blocks on the SM. Otherwise, the block limit is reached before the thread limit and 100% occupancy cannot be achieved. This ratio differs across architectures. For V100, A100, and GTX 1080 Ti, it is 2048 / 32 = 64. For RTX 3090, it is 1536 / 16 = 96. Therefore, to suit mainstream architectures, a fixed block_size should not be less than 96. Because block scheduling is atomic, block_size should also be a divisor of the maximum number of threads per SM. Otherwise, some thread slots are always left unfilled and 100% occupancy cannot be achieved. The greatest common divisor of the maximum number of threads per SM across mainstream GPU architectures (2048 and 1536) is 512, and its divisors that are no less than 96 are 128, 256, and 512. So far, then, only three candidate values of block_size remain: 128, 256, and 512.
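The elimination above can be reproduced by brute force. The following host-side C++ sketch enumerates every multiple of 32 up to 1024 and keeps the values that can fill an SM completely on each architecture passed in. The struct SmLimits and the function full_occupancy_block_sizes are hypothetical names used only for illustration, and register and shared memory limits are not considered.

```cpp
#include <cassert>
#include <vector>

// Per-SM scheduling limits of one architecture (values supplied by the caller).
struct SmLimits {
  int max_threads;  // Maximum number of resident threads per SM
  int max_blocks;   // Maximum number of resident blocks per SM
};

// Block sizes (multiples of 32, at most 1024) that can reach 100% occupancy
// on every architecture in `archs`, considering scheduling limits only.
std::vector<int> full_occupancy_block_sizes(const std::vector<SmLimits>& archs) {
  std::vector<int> result;
  for (int block_size = 32; block_size <= 1024; block_size += 32) {
    bool ok = true;
    for (const SmLimits& a : archs) {
      assert(a.max_threads > 0 && a.max_blocks > 0);
      // Whole blocks must fill the SM exactly...
      const bool divides = (a.max_threads % block_size == 0);
      // ...without needing more blocks than the SM can hold.
      const bool within_block_limit = (a.max_threads / block_size <= a.max_blocks);
      ok = ok && divides && within_block_limit;
    }
    if (ok) result.push_back(block_size);
  }
  return result;
}
```

Called with the two limit pairs quoted in the text, (2048, 32) and (1536, 16), the function returns 128, 256, and 512.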

Still, because scheduling a block onto an SM is atomic, the SM has to provide all the resources required by at least one block, including shared memory and registers. Shared memory is generally controlled explicitly by the developer. If the number of threads in a block times the number of registers required per thread is greater than the maximum number of registers per block supported by the SM, the kernel will fail to launch.

On current mainstream architectures, the maximum number of registers per block supported by an SM is 32K or 64K 32-bit registers. Each thread can use at most 255 32-bit registers, and the compiler will not allocate more than that to a thread. So as far as registers are concerned, each SM can support at least 128 or 256 threads. A block_size of 128 therefore rules out launch failures caused by register count. Few kernels actually use that many registers, though, and a kernel with only 128 or 256 threads resident on an SM is likely to have performance problems anyway. Moreover, a block_size of 128 loses nothing compared with 256 or 512. So 128 is a very suitable general-purpose value for block_size.
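The 128 and 256 figures follow from dividing the register budget by the worst-case per-thread usage and rounding down to whole warps. Below is a minimal host-side C++ sketch of that arithmetic; the function name max_block_size_for_registers is hypothetical, and the sketch ignores the register allocation granularity of real hardware, so treat its result as an estimate.

```cpp
#include <algorithm>
#include <cassert>

// Largest block size (a multiple of 32, at most 1024) whose total register
// demand fits in the per-block register budget.
// Assumption: registers are allocated exactly, with no rounding by the hardware.
int max_block_size_for_registers(int regs_per_block, int regs_per_thread) {
  assert(regs_per_block > 0 && regs_per_thread > 0);
  const int threads = regs_per_block / regs_per_thread;
  const int whole_warps = threads / 32 * 32;  // round down to a multiple of 32
  return std::min(whole_warps, 1024);         // hard limit on threads per block
}
```

With 255 registers per thread, a 32K budget gives 128 threads and a 64K budget gives 256 threads, which matches the figures in the text.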

So far we have determined the value of block_size. Next, let's discuss grid_size, which, for a fixed block_size, determines the total number of threads. For a general elementwise kernel, the total number of threads should be no greater than the total number of elements, i.e., one thread should process at least one element. At the same time, grid_size is bounded above by Maximum x-dimension of a grid of thread blocks, which is currently 2^31 - 1 on mainstream architectures, a large enough value for most cases.
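For the simplest policy of one thread per element, grid_size is the element count divided by block_size, rounded up and clamped to the grid limit. The following is a minimal host-side C++ sketch; the names kMaxGridDimX and grid_size_for_elements are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Maximum x-dimension of a grid of thread blocks quoted in the text: 2^31 - 1.
constexpr std::int64_t kMaxGridDimX = 2147483647;

// Number of blocks needed so that every element gets its own thread,
// clamped to the grid limit.
std::int64_t grid_size_for_elements(std::int64_t num_elements,
                                    std::int64_t block_size) {
  assert(num_elements > 0 && block_size > 0);
  const std::int64_t blocks = (num_elements + block_size - 1) / block_size;
  return std::min(blocks, kMaxGridDimX);
}
```

Because the division rounds up, the last block may contain threads with no element to process, so a kernel launched this way needs a bounds check. If the clamp ever takes effect, each thread has to process more than one element.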

Sometimes it is feasible to create a thread for each element, because thread creation has very low overhead on the GPU. However, if every thread performs some common operation, more threads mean more total overhead. For example, consider a kernel in which every thread evaluates sqrt(*v) on a shared scalar v before processing its own elements of x and y.

In this kernel, the processing of v is a common operation. If we reduce the number of threads and let each thread loop over several elements of y and x, the overhead of sqrt(*v) is reduced accordingly. However, grid_size should not be lower than the number of SMs on the GPU, otherwise some SMs will sit idle.
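The trade-off can be illustrated without a GPU. The following host-side C++ sketch emulates a one-dimensional launch by running each "thread" in sequence: every thread computes the common value once and then strides over the elements. It is an emulation of the index arithmetic only, not CUDA code, and both the function name emulate_scale_kernel and the operation y[i] = x[i] / sqrt(v) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Emulates grid_size * block_size threads sequentially on the host.
// Returns how many times the common operation sqrt(v) was evaluated.
std::size_t emulate_scale_kernel(const std::vector<float>& x, float v,
                                 std::vector<float>& y,
                                 std::size_t grid_size, std::size_t block_size) {
  assert(grid_size > 0 && block_size > 0 && x.size() == y.size());
  const std::size_t total_threads = grid_size * block_size;
  std::size_t sqrt_calls = 0;
  for (std::size_t tid = 0; tid < total_threads; ++tid) {
    // Common operation: done once per thread, regardless of element count.
    const float sqrt_v = std::sqrt(v);
    ++sqrt_calls;
    // Each thread handles elements tid, tid + total_threads, ...
    for (std::size_t i = tid; i < x.size(); i += total_threads) {
      y[i] = x[i] / sqrt_v;
    }
  }
  return sqrt_calls;
}
```

For 1024 elements, launching 8 blocks of 128 threads evaluates the square root 1024 times, while launching 2 blocks of 128 threads evaluates it 256 times and produces the same y.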

The GPU can schedule (number of SMs × maximum number of blocks per SM) blocks at a time. Because each block does the same amount of work, all SMs finish these blocks at almost the same time and then move on to the next batch. Each batch is called a wave. Imagine that grid_size is exactly one block more than a wave. The next kernel on the stream cannot start until the current kernel has finished, so after the first wave completes, only one block is executing on the GPU and actual utilization is very low. This situation is called the tail effect.
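The size of the tail can be estimated from the grid size and the wave size. Below is a minimal host-side C++ sketch; the struct WaveInfo and the function analyze_waves are hypothetical names, and the sketch assumes that every block takes the same time and that the kernel has the whole GPU to itself.

```cpp
#include <cassert>
#include <cstdint>

struct WaveInfo {
  std::int64_t blocks_per_wave;  // blocks the whole GPU can hold at once
  std::int64_t full_waves;       // waves in which every slot is used
  std::int64_t tail_blocks;      // blocks left over for a final partial wave
};

// Splits a grid into full waves plus a tail.
// Assumption: resident_blocks_per_sm is the number of blocks of this kernel
// that one SM can hold concurrently.
WaveInfo analyze_waves(std::int64_t grid_size, std::int64_t num_sms,
                       std::int64_t resident_blocks_per_sm) {
  assert(grid_size > 0 && num_sms > 0 && resident_blocks_per_sm > 0);
  WaveInfo info;
  info.blocks_per_wave = num_sms * resident_blocks_per_sm;
  info.full_waves = grid_size / info.blocks_per_wave;
  info.tail_blocks = grid_size % info.blocks_per_wave;
  return info;
}
```

On a hypothetical GPU with 80 SMs holding 16 blocks each, a wave is 1280 blocks. A grid of 1281 blocks then runs one full wave followed by a tail of a single block, during which the other 1279 block slots are idle.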

We should try to avoid this situation. Setting grid_size to exactly one wave might not avoid the tail effect, because the GPU may not be used exclusively by the current stream; for example, NCCL kernels may occupy some SMs. Therefore, we usually set grid_size to a sufficiently large integer number of waves to achieve better results. If the number of waves is large enough, a non-integer number of waves has little effect.
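One way to turn this advice into a rule is to launch as many blocks as the data needs, but never more than a fixed number of waves. The following host-side C++ sketch shows the idea; the function name grid_size_with_waves is hypothetical, and the wave count is a tuning parameter chosen by the caller rather than a value taken from this article.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Grid size for an elementwise kernel: enough blocks to cover the data,
// capped at `num_waves` waves and never less than one block.
// Assumption: when the cap applies, each thread loops over several elements.
std::int64_t grid_size_with_waves(std::int64_t num_elements,
                                  std::int64_t block_size,
                                  std::int64_t num_sms,
                                  std::int64_t resident_blocks_per_sm,
                                  std::int64_t num_waves) {
  assert(num_elements > 0 && block_size > 0);
  assert(num_sms > 0 && resident_blocks_per_sm > 0 && num_waves > 0);
  const std::int64_t needed = (num_elements + block_size - 1) / block_size;
  const std::int64_t cap = num_sms * resident_blocks_per_sm * num_waves;
  return std::max<std::int64_t>(1, std::min(needed, cap));
}
```

For small inputs the result is simply the number of blocks needed to cover the data. For large inputs the grid is an exact multiple of the wave size, and the kernel must then iterate over the remaining elements inside each thread.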

To sum up, for an ordinary elementwise kernel or similar cases, a block_size of 128 and a grid_size of enough waves give satisfactory results. More complex cases need problem-specific analysis. For example, if an SM can execute only a few blocks at the same time because of shared memory limits, increasing block_size might improve performance. If threads in the kernel synchronize with each other, an excessively large block_size lowers the actual utilization of the SMs. We can discuss these cases separately later.

I hope this article will help you in your deep learning projects😊. If you want to experience the functions of OneFlow, you can follow the method described in this article. If you have any questions or comments💡 about use, please feel free to leave a comment in the comments section below. Please do the same if you have any comments, remarks or suggestions for improvement. In future articles, we’ll introduce more details of OneFlow.

