How to Implement an Efficient Softmax CUDA Kernel: OneFlow Performance Optimization

GPU Basics and CUDA Performance Optimization Principles:

How to Assess Whether a CUDA Kernel Is Making Full Use of the Video Memory Bandwidth Resources?

Naive Softmax Implementation:

Comparison of OneFlow and cuDNN

OneFlow’s Approach for Deep Optimization of Softmax CUDA Kernel

Implementation 1: A Warp Processes One Row of Elements

Implementation 2: A Block Processes One Row of Elements

Implementation 3: A Block Processes One Row of Elements, Does Not Use Shared Memory, and Reads the Input x Repeatedly

General Optimization Techniques

References:

OneFlow is a performance-centered and open-source deep learning framework.
