Written by Zekang Zheng, Juncheng Liu, Chi Yao, Ran Guo; Translated by Kaiyan Wang, Wenwen Dong
Introduction
The Transpose/Permute op appears frequently in Transformer, which dominates NLP, and in Vision Transformer, a rising star in CV. In Multi-Head Attention in particular, this op is needed to rearrange the data dimensions.
As such a heavily used op, the CUDA implementation of Transpose/Permute affects the training speed of real networks. This article introduces the techniques used to optimize the Permute kernel in OneFlow and compares it experimentally with PyTorch’s Permute and the native Copy operation. The results show that the deeply optimized Permute is considerably faster and more bandwidth-efficient than PyTorch’s, and that its bandwidth utilization approaches that of the native Copy operation.
Naive Permute Implementation
The function of Permute is to reorder the dimensions of a tensor; for example, calling permute with dims = (1, 0) on a tensor of shape (2, 3) produces a tensor of shape (3, 2).
The implementation principle is easy to understand: the i-th dimension of the output tensor corresponds to the dims[i]-th dimension of the input tensor. For the example above, the permute can be expressed with pseudo-code along the following lines:
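```
// Illustrative pseudo-code (C-style) for permute with dims = (1, 0) on a
// (2, 3) tensor x, producing a (3, 2) tensor y: output index (i, j) reads
// input index (j, i), i.e. output dimension d takes its index from input
// dimension dims[d].
for (int i = 0; i < 3; ++i) {    // y.shape[0] == x.shape[dims[0]] == x.shape[1]
  for (int j = 0; j < 2; ++j) {  // y.shape[1] == x.shape[dims[1]] == x.shape[0]
    y[i][j] = x[j][i];
  }
}
```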
In practice, however, things differ from the pseudo-code above: the shape of a tensor is a logical concept that does not physically exist on the device. In OneFlow, tensor data is stored in a contiguous block of memory, and the following figure illustrates how a tensor of shape (2, 3) is stored, seen from both the top-level (logical) view and the bottom-level (physical) view:
OneFlow’s Permute implementation works as follows:
- The high-dimensional output index is computed from the one-dimensional offset of the current output element (offset).
- The output index is rearranged according to the parameter dims to obtain the input index.
- The input index is converted into an input offset.
- Finally, the data is moved. The whole process is illustrated as follows:
After completing Permute, the output is shown below:
The entire permute calculation requires repeated conversions between one-dimensional offsets and higher-dimensional indexes. To avoid writing these conversions by hand, OneFlow provides a tool class, NdIndexOffsetHelper, to facilitate them.
NdIndexOffsetHelper
The methods of the NdIndexOffsetHelper are as follows:
- NdIndexToOffset converts a high-dimensional index to a one-dimensional offset.
- OffsetToNdIndex converts a one-dimensional offset to a high-dimensional index.
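As a rough illustration of what such a helper does for a row-major layout, here is a simplified sketch of ours (not the actual OneFlow class, which is more general and feature-complete):

```
// Simplified sketch of an index/offset helper for a row-major layout.
// (Illustration only; OneFlow's NdIndexOffsetHelper is the real implementation.)
template<typename IndexType, int N>
struct SimpleIndexHelper {
  __host__ __device__ explicit SimpleIndexHelper(const IndexType* dims) {
    stride_[N - 1] = 1;
    for (int i = N - 2; i >= 0; --i) { stride_[i] = stride_[i + 1] * dims[i + 1]; }
  }
  // (d0, d1, ..., d{N-1}) -> offset
  __host__ __device__ IndexType NdIndexToOffset(const IndexType* index) const {
    IndexType offset = 0;
    for (int i = 0; i < N; ++i) { offset += index[i] * stride_[i]; }
    return offset;
  }
  // offset -> (d0, d1, ..., d{N-1})
  __host__ __device__ void OffsetToNdIndex(IndexType offset, IndexType* index) const {
    for (int i = 0; i < N; ++i) {
      index[i] = offset / stride_[i];
      offset -= index[i] * stride_[i];
    }
  }
  IndexType stride_[N];
};
```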
With such a tool class, we can easily write a naive version of the Permute kernel.
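The following is a simplified sketch of such a kernel, written against the SimpleIndexHelper above (our own illustration; the actual OneFlow kernel differs in details such as launch helpers and parameter packing):

```
#include <cstddef>
#include <type_traits>

template<size_t num_dims, typename IndexType>
struct PermuteKernelParams {
  SimpleIndexHelper<IndexType, num_dims> src_index_helper;
  SimpleIndexHelper<IndexType, num_dims> dst_index_helper;
  int permutation[num_dims];  // output dim d corresponds to input dim permutation[d]
  IndexType count;            // total number of elements
  const void* src;
  void* dst;
};

template<size_t num_dims, size_t movement_size, typename IndexType>
__global__ void PermuteKernel(PermuteKernelParams<num_dims, IndexType> params) {
  // Treat every element as an opaque chunk of movement_size bytes.
  using T = typename std::aligned_storage<movement_size, movement_size>::type;
  const T* src = reinterpret_cast<const T*>(params.src);
  T* dst = reinterpret_cast<T*>(params.dst);

  IndexType src_index[num_dims];
  IndexType dst_index[num_dims];
  for (IndexType i = blockIdx.x * blockDim.x + threadIdx.x; i < params.count;
       i += gridDim.x * blockDim.x) {
    // 1) one-dimensional output offset -> high-dimensional output index
    params.dst_index_helper.OffsetToNdIndex(i, dst_index);
    // 2) rearrange: output dim d takes its index from input dim permutation[d]
    for (size_t d = 0; d < num_dims; ++d) { src_index[params.permutation[d]] = dst_index[d]; }
    // 3) high-dimensional input index -> input offset, then move the element
    const IndexType src_offset = params.src_index_helper.NdIndexToOffset(src_index);
    dst[i] = src[src_offset];
  }
}
```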
- PermuteKernelParams is a structure containing the initialized NdIndexOffsetHelpers (one for src and one for dst), the total number of elements count, and the permuted dimension order permutation.
- First we obtain the high-dimensional index dst_index of the output element currently being processed, and rearrange it into the input index src_index according to permutation.
- Then we convert the input index into a one-dimensional offset src_offset, read the input element, and assign it to the corresponding output position.
Optimization of General Cases
The cost of this naive Permute kernel comes from two sources: the index arithmetic of the coordinate conversions and the memory access overhead of the data movement. The following optimizations target these two aspects.
1. Static Dispatch of IndexType
As deep learning models get larger, the number of elements involved in an operation may exceed the range representable by int32_t. Moreover, in the coordinate conversions, integer division has different costs for different integer types. Therefore, we add a template parameter IndexType to the kernel function to specify the data type of the index, and choose int32_t or int64_t for IndexType depending on the number of elements involved in the Permute.
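A minimal sketch of this dispatch is shown below (our own illustration; LaunchPermute is a hypothetical helper that would build PermuteKernelParams with the chosen index type and launch PermuteKernel):

```
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper, declaration only, for illustration.
template<size_t num_dims, size_t movement_size, typename IndexType>
void LaunchPermute(int64_t count /*, shapes, permutation, src, dst, stream */);

template<size_t num_dims, size_t movement_size>
void DispatchIndexType(int64_t count) {
  if (count < std::numeric_limits<int32_t>::max()) {
    // All offsets fit in int32_t, whose integer division is cheaper on the GPU.
    LaunchPermute<num_dims, movement_size, int32_t>(count);
  } else {
    LaunchPermute<num_dims, movement_size, int64_t>(count);
  }
}
```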
2. Merging Redundant Dimensions
In some special cases, Permute dimensions can be merged, with the following rules:
- Dimensions of size 1 can be removed directly.
- Dimensions that stay adjacent and in the same relative order after the Permute can be merged into one dimension.
For the second rule, consider a Permute with permutation (2, 3, 0, 1). Nominally this is a four-dimensional Permute, but dimensions 2, 3 and dimensions 0, 1 move together, so it can be treated as a two-dimensional Permute with permutation (1, 0):
When converting offsets to indexes with the NdIndexOffsetHelper, the unmerged case requires computing 4-dimensional indexes, while the merged case only requires 2-dimensional indexes. This reduces the number of divisions and multiplications and thus increases speed.
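A rough sketch of how the two merging rules can be applied is shown below (our own simplified illustration, not the OneFlow implementation; SimplifyPermutation is a hypothetical name):

```
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

void SimplifyPermutation(const std::vector<int64_t>& dims, const std::vector<int>& perm,
                         std::vector<int64_t>* merged_dims, std::vector<int>* merged_perm) {
  // Rule 1: drop dimensions of size 1 and remap the permutation.
  std::vector<int64_t> d;
  std::vector<int> remap(dims.size(), -1);
  for (size_t i = 0; i < dims.size(); ++i) {
    if (dims[i] != 1) {
      remap[i] = static_cast<int>(d.size());
      d.push_back(dims[i]);
    }
  }
  std::vector<int> p;
  for (int axis : perm) {
    if (remap[axis] != -1) { p.push_back(remap[axis]); }
  }
  // Rule 2: input axes that appear consecutively and in order in the permutation
  // can be fused. Collect such runs as (first input axis, fused size).
  std::vector<std::pair<int, int64_t>> runs;
  for (size_t i = 0; i < p.size(); ++i) {
    if (i > 0 && p[i] == p[i - 1] + 1) {
      runs.back().second *= d[p[i]];    // extend the current run
    } else {
      runs.push_back({p[i], d[p[i]]});  // start a new run
    }
  }
  // The merged shape lists the runs in input-axis order; the merged permutation
  // records where each run (taken in permutation order) lands.
  std::vector<int> order(runs.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return runs[a].first < runs[b].first; });
  std::vector<int> rank(runs.size());
  for (size_t i = 0; i < order.size(); ++i) { rank[order[i]] = static_cast<int>(i); }
  merged_dims->assign(runs.size(), 0);
  merged_perm->assign(runs.size(), 0);
  for (size_t i = 0; i < runs.size(); ++i) {
    (*merged_dims)[rank[i]] = runs[i].second;
    (*merged_perm)[i] = rank[i];
  }
}
```

For the example above, an input of shape (a, b, c, d) with permutation (2, 3, 0, 1) would be simplified by this sketch to shape (a*b, c*d) with permutation (1, 0).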
3. Using Greater Access Granularity
You may have noticed the template parameter size_t movement_size in the kernel function, which indicates the access granularity, in bytes, of the elements to be moved.
The NVIDIA performance-optimization blog post Increase Performance with Vectorized Memory Access explains that CUDA kernel performance can be improved by vectorizing memory operations, which reduces the instruction count and improves bandwidth utilization.
We set the rules for access granularity as follows:
- CUDA supports access granularities of 1B, 2B, 4B, 8B, and 16B; the larger the granularity, the better the performance.
- The last dimension is moved as a whole, i.e. permutation[n-1] == n-1, and its size is a multiple of the new access granularity.
- The data pointers satisfy the alignment requirements of the new access granularity.
The second rule corresponds, for example, to a Permute such as (0, 1, 2) -> (1, 0, 2): the last dimension stays in place and only the first two dimensions are swapped, so we can read the data at a larger access granularity and then perform the Permute. In the code, the GetMovementSize function determines the access granularity to use.
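A rough sketch of how such a helper might choose the granularity (our own illustration; the actual GetMovementSize in OneFlow considers more cases):

```
#include <cstddef>
#include <cstdint>

size_t GetMovementSize(size_t elem_size, size_t num_dims, const int64_t* src_dims,
                       const void* src, const int* permutation, const void* dst) {
  constexpr size_t kMaxMovementSize = 16;
  // A larger granularity is only valid when the last dimension stays in place.
  if (permutation[num_dims - 1] == static_cast<int>(num_dims - 1)) {
    const int64_t last_dim_size = src_dims[num_dims - 1];
    for (size_t size = kMaxMovementSize; size > elem_size; size /= 2) {
      const int64_t elems_per_pack = static_cast<int64_t>(size / elem_size);
      if (last_dim_size % elems_per_pack == 0                 // last dim divisible
          && reinterpret_cast<uintptr_t>(src) % size == 0     // src aligned
          && reinterpret_cast<uintptr_t>(dst) % size == 0) {  // dst aligned
        return size;
      }
    }
  }
  return elem_size;
}
```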
We used Nsight Compute to compare the runtime and bandwidth with PyTorch’s Permute and the native Copy operation, with the following results:
The test environment is an NVIDIA A100 40GB and the scenario is (0, 1, 2) -> (1, 0, 2), with the horizontal axis indicating the data shape and data type. The test data covers sizes from 16MB to 128MB, and the data types include both fp32 and half.
As the two graphs above show, in most cases OneFlow approaches or even slightly exceeds the bandwidth of the Copy operation. In terms of run time compared with PyTorch, OneFlow is at least 1.24 times faster and up to 1.4 times faster.
Permute’s bandwidth can be slightly higher than that of native Copy because the Copy kernel does not use loop unrolling for instruction-level parallelism, whereas the Permute kernel applies this optimization internally. This comparison is for reference only.
With the optimization techniques above, OneFlow easily outperforms PyTorch. The general Permute path applies to a wide range of situations, so there may still be cases where memory accesses are not coalesced. In some special cases we can coalesce the accesses to improve bandwidth utilization and speed, which brings us to the following discussion of BatchTranspose optimization.
BatchTranspose Optimization
The BatchTranspose operation, i.e. matrix transpose, exchanges only the last two dimensions of the tensor. Cases such as (0, 1) -> (1, 0) and (0, 1, 2) -> (0, 2, 1) meet the definition of BatchTranspose, where the contents of the parentheses indicate the dimension order.
The general Permute optimizations above already handle the case where the last dimension is moved as a whole. Practical scenarios, however, often include matrix transposes, where the third optimization (greater access granularity) cannot be applied and the accesses are not coalesced, resulting in poor performance.
Taking PyTorch as an example, when BatchTranspose is performed on 128MB of data, the amount of data actually read is much greater than the amount written (by 7–8 times) because of the uncoalesced memory accesses.
In the NVIDIA performance-optimization blog post An Efficient Matrix Transpose in CUDA C/C++, the approach is to allocate a tile in Shared Memory, read rows of data from Global Memory into it, and then traverse the tile in column order when writing the elements back to Global Memory. Because Shared Memory has a much smaller access granularity (the minimum is 32B for Global Memory versus 4B for Shared Memory), the non-contiguous accesses happen in Shared Memory rather than Global Memory.
Shared Memory has 15 times higher bandwidth and 20–40 times lower latency than Global Memory, so the additional read and write overhead introduced is negligible.
In addition, we pad each row of the Shared Memory tile by one element, so that the elements accessed in column order are spread evenly across the 32 banks, avoiding bank conflicts. This is illustrated below (the gray part represents the padding elements):
Based on the points above, we implemented a BatchTranspose kernel.
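The following is a simplified sketch of such a kernel (our own reconstruction for illustration; the actual OneFlow implementation is templated over movement size and tile size and handles more cases):

```
constexpr int kTileSize = 32;   // tile is kTileSize x kTileSize elements
constexpr int kBlockRows = 8;   // each thread handles kTileSize / kBlockRows rows

// Each (H, W) matrix in the batch becomes a (W, H) matrix.
template<typename T>
__global__ void BatchTransposeKernel(const T* src, T* dst, int H, int W) {
  // +1 padding so that column accesses fall into different banks.
  __shared__ T tile[kTileSize][kTileSize + 1];

  const int batch = blockIdx.z;
  const T* batch_src = src + static_cast<size_t>(batch) * H * W;
  T* batch_dst = dst + static_cast<size_t>(batch) * H * W;

  // Coalesced read of a tile from Global Memory into Shared Memory.
  int col = blockIdx.x * kTileSize + threadIdx.x;  // column in src
  int row = blockIdx.y * kTileSize + threadIdx.y;  // row in src
  #pragma unroll
  for (int i = 0; i < kTileSize; i += kBlockRows) {
    if (col < W && row + i < H) {
      tile[threadIdx.y + i][threadIdx.x] = batch_src[(row + i) * W + col];
    }
  }
  __syncthreads();

  // Coalesced write: the transposed tile is read column-wise from Shared Memory.
  col = blockIdx.y * kTileSize + threadIdx.x;  // column in dst (a row block of src)
  row = blockIdx.x * kTileSize + threadIdx.y;  // row in dst (a column block of src)
  #pragma unroll
  for (int i = 0; i < kTileSize; i += kBlockRows) {
    if (col < H && row + i < W) {
      batch_dst[(row + i) * H + col] = tile[threadIdx.x][threadIdx.y + i];
    }
  }
}
```

A launch of this sketch would use a thread block of dim3(kTileSize, kBlockRows) and a grid of (ceil(W / kTileSize), ceil(H / kTileSize), num_batches).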
The optimization of BatchTranspose involves the following two points:
4. Explicitly Unrolling Loops
In previous versions, the bounds checks were folded into the for-loop condition, roughly as in the following sketch (a reconstruction using the names from the kernel sketch above):
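```
#pragma unroll
// The loop bound depends on runtime values (row, H, col, W), so the compiler
// cannot fully unroll it.
for (int i = 0; i < kTileSize && col < W && row + i < H; i += kBlockRows) {
  tile[threadIdx.y + i][threadIdx.x] = batch_src[(row + i) * W + col];
}
```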
Even with the #pragma unroll directive, we see only two relevant instructions in the assembly code in Nsight Compute, which means that this loop is not actually unrolled.
The loop condition can be simplified so that the loop bound is a compile-time constant, with the runtime checks moved into the loop body, as in the following sketch:
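```
#pragma unroll
// The trip count is now a compile-time constant, so the loop can be fully
// unrolled; the runtime checks only guard the body.
for (int i = 0; i < kTileSize; i += kBlockRows) {
  if (col < W && row + i < H) {
    tile[threadIdx.y + i][threadIdx.x] = batch_src[(row + i) * W + col];
  }
}
```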
The corresponding assembly code shows that this part of the loop is now unrolled, with a 24% increase in bandwidth utilization and speed.
5. Optimization for half2 Version
In particular, for the half data type, when the two transposed dimensions are both divisible by 2, we can go further and pack pairs of half values into half2. Since the width of a Shared Memory bank is 4B, one bank can hold two half values, as shown in the figure below:
When loading into Shared Memory, we can therefore merge two half values into a half2 and load them together. When fetching the column elements, however, the two halves of each output pair sit on two different banks, so they cannot be read directly as a single half2. Instead, we construct a temporary half2 object, store the half element from each of the two banks into it, and then write it back to Global Memory.
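A simplified sketch of this half2 variant is shown below (our own illustration; the actual OneFlow kernel manages the Shared Memory buffer through a union type and additional template parameters):

```
#include <cuda_fp16.h>

constexpr int kHalfTileSize = 32;   // tile is kHalfTileSize x kHalfTileSize half values
constexpr int kHalfBlockRows = 8;

// Assumes the last two dimensions H and W are both even.
__global__ void BatchTransposeHalf2Kernel(const half* src, half* dst, int H, int W) {
  // Two halves (= one 4B bank) of padding so column accesses avoid bank conflicts.
  __shared__ half tile[kHalfTileSize][kHalfTileSize + 2];

  const int batch = blockIdx.z;
  const half2* batch_src =
      reinterpret_cast<const half2*>(src + static_cast<size_t>(batch) * H * W);
  half2* batch_dst = reinterpret_cast<half2*>(dst + static_cast<size_t>(batch) * H * W);
  const int W2 = W / 2;  // width of src in half2 units
  const int H2 = H / 2;  // width of dst in half2 units

  // Load: each thread reads one half2 (two adjacent src columns) per iteration.
  int col2 = blockIdx.x * (kHalfTileSize / 2) + threadIdx.x;  // half2 column in src
  int row = blockIdx.y * kHalfTileSize + threadIdx.y;
  #pragma unroll
  for (int i = 0; i < kHalfTileSize; i += kHalfBlockRows) {
    if (col2 < W2 && row + i < H) {
      const half2 v = batch_src[(row + i) * W2 + col2];
      tile[threadIdx.y + i][2 * threadIdx.x] = v.x;
      tile[threadIdx.y + i][2 * threadIdx.x + 1] = v.y;
    }
  }
  __syncthreads();

  // Store: the two halves of an output half2 sit in two different tile rows
  // (i.e. on two different banks), so build a temporary half2 before writing.
  col2 = blockIdx.y * (kHalfTileSize / 2) + threadIdx.x;  // half2 column in dst
  row = blockIdx.x * kHalfTileSize + threadIdx.y;         // row in dst
  #pragma unroll
  for (int i = 0; i < kHalfTileSize; i += kHalfBlockRows) {
    if (col2 < H2 && row + i < W) {
      half2 v;
      v.x = tile[2 * threadIdx.x][threadIdx.y + i];
      v.y = tile[2 * threadIdx.x + 1][threadIdx.y + i];
      batch_dst[(row + i) * H2 + col2] = v;
    }
  }
}
```

This sketch would be launched with a thread block of dim3(kHalfTileSize / 2, kHalfBlockRows).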
Under the same test conditions, we set the test scenario to (0, 1, 2) -> (0, 2, 1), and the test results are as follows:
As we can see from the graph, OneFlow can approach native Copy operations in most cases, in terms of both computation time and bandwidth utilization. Compared with PyTorch in terms of operating time, OneFlow is at least 3 times faster in the case of fp32, and can reach 3.2 times at most. OneFlow is even faster in the case of half data type, up to 6.3 times at most.
Future Direction
Tests have shown that the overhead of integer division during coordinate conversion is relatively high. High-quality computing libraries such as Eigen and lemire/fast_division provide fast division for int32 and int64 types. According to their official benchmarks, fast division can be 1–3 times faster than standard division. In the future we will explore suitable fast-division methods for the coordinate conversions to further improve speed.
Outlook
As shown in this article and our previous articles on CUDA optimization, some common methods recur in kernel optimization, such as merging redundant dimensions to reduce the amount of computation and adjusting access granularity to improve memory efficiency.
These common optimizations could be refined into components of deep learning compilers to partially replace manual tuning.
However, determining the boundaries of such automatic optimizations, and how to automate them, poses higher requirements than manual tuning. Interested readers are welcome to raise issues for discussion in OneFlow’s repository: https://github.com/Oneflow-Inc/oneflow.
References
CUDA Pro Tip: Increase Performance with Vectorized Memory Access
An Efficient Matrix Transpose in CUDA C/C++
VOLTA Architecture and performance optimization
I hope this article will help you in your deep learning projects😊. If you want to experience the functions of OneFlow, you can follow the method described in this article. If you have any questions, comments💡, or suggestions for improvement, please feel free to leave them in the comments section below. In future articles, we’ll introduce more functions of OneFlow.
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.