Welcome to OneFlow v0.6.0! We would love to hear your feedback.
This release mainly updates three parts: the framework, models, and OneFlow-ONNX. Highlights include:
- Performance optimization in static graphs, dynamic graphs, operators, memory occupation, etc
- A larger number of common operators
- Improvements in static graphs and ConsistentTensor
- Serving functionality as Nvidia Triton’s backend
- Richer visual pre-training models similar to torchvision and timm
- Better OneFlow-ONNX conversion functionality
The following are the detailed release notes.
Framework
1. Performance Optimization of nn.Graph
Compared to v0.5.0, nn.Graph in v0.6.0 delivers a 10% training speedup on models such as ResNet AMP and WDL
- Optimized nn.Graph’s performance in high-frequency iterative training scenarios
- Redesigned nn.Graph’s scheduling instructions and refactored the interaction logic between the Actor Graph and the Eager VM, so that the Graph’s runtime execution is asynchronous and overlaps with Python-side input/output Tensor handling as much as possible
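As a refresher on how a model is wrapped into a static graph (a minimal sketch; the Linear model, shapes, and values below are made up for illustration):

```python
import oneflow as flow
import oneflow.nn as nn

class LinearGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model  # reuse an Eager nn.Module inside the static graph

    def build(self, x):
        # build() describes the computation; it is compiled once and then
        # executed as a static graph on subsequent calls
        return self.model(x)

model = nn.Linear(8, 4)
graph = LinearGraph(model)
y = graph(flow.randn(2, 8))  # the first call triggers compilation; later calls reuse it
```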
2. Performance Optimization of Eager
Compared to v0.5.0, OneFlow Eager in v0.6.0 trains dramatically faster in small-batch scenarios
- Optimized the scheduling logic for virtual machines
- Optimized get/set item
- Optimized tensor.numel()
- Optimized oneflow.Size()
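A quick sketch of the kind of small Eager calls these optimizations target (shapes and values are arbitrary):

```python
import oneflow as flow

x = flow.randn(4, 5)
n = x.numel()      # optimized tensor.numel()
s = x.size()       # returns an oneflow.Size object
x[0, 1] = 1.0      # optimized set item
v = x[0, 1]        # optimized get item
```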
3. Performance Optimization of Operators
Optimized operators that affect the performance of new models, significantly improving the training speed of these models
- Added fused dropout operators
- Added CPU-version group deconv and optimized its performance
- Added inplace-version implementation for operators mul, hard_sigmoid, and sin
- Optimized the performance of linalg.vector_norm when ord=2.0, making it 4 times faster than before
- Deeply optimized the LayerNorm operator, making its performance significantly better than the PyTorch and Apex implementations. For more information, refer to How to Implement an Efficient LayerNorm CUDA Kernel — OneFlow Performance Optimization
- Realized automatic type promotion of operators. For more information, refer to Automatic Type Promotion of Operators in OneFlow
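For example, the ord=2.0 path of linalg.vector_norm and automatic type promotion look like this (a small sketch; values are arbitrary):

```python
import oneflow as flow

x = flow.randn(1024)
norm = flow.linalg.vector_norm(x, ord=2.0)  # the ord=2.0 path is the one optimized above

a = flow.tensor([1, 2, 3], dtype=flow.int32)
b = flow.tensor([0.5, 0.5, 0.5], dtype=flow.float32)
c = a + b  # automatic type promotion: the result dtype is float32
```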
4. Performance Optimization of Eager’s Memory Occupation
Optimized the memory occupation of some operators during network training, allowing the same computing device to run larger models or data
- Optimized the backward memory occupation of broadcast binary operators
- Optimized the backward memory occupation of Slice operator
- Optimized the memory occupation of LayerNorm operator
5. More Useful Features to Static Computation Graph (nn.Graph)
The newly added features improve the efficiency, debugging, completeness, and usability of static graphs
To help the debugging of static graphs, we added the following features:
- debug mode supports graph.debug(1), which prints more information about graph composition (see the sketch after this list)
- Provided the environment variable ONEFLOW_DEBUG_PASS to show the changes in the computed graph before and after compile-time optimization
- Added user-readable thread naming information to Nsight profiles to make it easier to locate target threads
- Added many static graph test cases and automatic nn.Graph tests that accompany the Eager tests
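A minimal sketch of enabling these debugging aids together (the toy graph is made up, and the value assigned to ONEFLOW_DEBUG_PASS is an assumption; check the documentation for the accepted settings):

```python
import os
# Show the computation graph before and after compile-time optimization passes
os.environ["ONEFLOW_DEBUG_PASS"] = "1"  # assumed value for illustration

import oneflow as flow
import oneflow.nn as nn

class MyGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def build(self, x):
        return self.linear(x)

graph = MyGraph()
graph.debug(1)               # verbose level 1: print more graph composition information
y = graph(flow.randn(2, 8))  # compilation happens here, with the extra debug output
```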
Provided graph.save() and load() interfaces to support the deployment of models (Serving) using nn.Graph
To enable AMP acceleration on GPUs with TensorCore, the environment variable ONEFLOW_ENABLE_NHWC is provided to make CNN-related operators compute in channels-last (NHWC) format
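A minimal sketch of turning on the channels-last path (the value "1" for the switch is an assumption; the conv shapes are arbitrary):

```python
import os
os.environ["ONEFLOW_ENABLE_NHWC"] = "1"  # assumed switch value; tells CNN-related ops to use NHWC

import oneflow as flow
import oneflow.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3).to("cuda")
x = flow.randn(8, 3, 224, 224, device="cuda")
y = conv(x)  # with the flag set, the convolution takes the channels-last code path
```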
Enabled nn.Graph to support more usage scenarios:
- Supported the Sparse Update Optimizer for sparse parameter updates in WDL scenarios
- Supported using the following nn.Module containers with nn.Graph: Sequential, ModuleList, ModuleDict, ParameterList, and ParameterDict
- Supported creating the Optimizer in the __init__ function of nn.Graph (see the sketch after this list)
- Supported multiple parameters sharing the same Tensor in nn.Graph
- Supported scenarios where the actual number of processes is greater than the number of GPU devices
- Supported more inplace execution for Consistent SBP inference under nn.Graph
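For example, creating and registering the Optimizer inside nn.Graph's __init__ looks roughly like this (a minimal training-graph sketch; the model, loss, and hyperparameters are made up):

```python
import oneflow as flow
import oneflow.nn as nn

class TrainGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model
        # the Optimizer is created in __init__ and registered with the graph
        optimizer = flow.optim.SGD(model.parameters(), lr=0.1)
        self.add_optimizer(optimizer)

    def build(self, x, y):
        loss = ((self.model(x) - y) ** 2).mean()
        loss.backward()  # backward and the optimizer step are driven by the compiled graph
        return loss

model = nn.Linear(8, 1)
graph = TrainGraph(model)
loss = graph(flow.randn(4, 8), flow.randn(4, 1))
```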
6. A Larger Number of Operators
- Newly added operators: cumsum, meshgrid, linspace, diagonal, movedim, roialign, nms, arccos, and roll
- Newly added operators: masked_fill, floordiv, glu, pool1d, pool2d, and pool3d
- Newly added unfold and fold operators: Adding Unfold and Fold Ops into OneFlow
- Achieved automatic data type promotion of operators: Automatic Type Promotion of Operators in OneFlow
- Added expand and repeat operators: Added Expand and Repeat Operators into OneFlow
- Supported one-click switching to run models from the current torchvision library via the command
import oneflow as torch
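The switch works because torchvision-style model code only touches the torch namespace, so pointing that name at OneFlow is enough. A sketch with a made-up tiny model file:

```python
# my_model.py: a torchvision-style model definition; the only change needed
# to run it on OneFlow is the import alias below
import oneflow as torch
import oneflow.nn as nn  # in the original code this would be `import torch.nn as nn`

class TinyNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

net = TinyNet()
out = net(torch.randn(2, 3, 32, 32))  # runs on OneFlow tensors
```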
7. User-Defined autograd.Function
Users can customize autograd.Function just like in PyTorch.
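A minimal sketch of a custom Function, assuming the interface mirrors PyTorch's staticmethod forward/backward pattern as the note suggests:

```python
import oneflow as flow

class MySquare(flow.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # stash the input for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output  # d(x^2)/dx = 2x

x = flow.ones(3, requires_grad=True)
y = MySquare.apply(x).sum()
y.backward()
print(x.grad)  # expected: [2., 2., 2.]
```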
8. Added Basic Serving Functionality
Model serving functionality is provided by OneFlow as an Nvidia Triton backend.
9. Added Some Functionalities of Tensor (ConsistentTensor)
- Supported Tensor using 2-D SBP to represent arbitrary hybrid parallelism (e.g., a Linear operation that runs data parallelism along the row direction of the device matrix and model parallelism along the column direction)
- Supported Tensor’s conversion from arbitrary 1-D SBP to 2-D SBP (the network consists of a mixture of 1-D parallel and 2-D parallel)
- Supported constructing ConsistentTensor from numpy
- oneflow.from_numpy()
- oneflow.numel()
- tensor.expand_as()
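A quick sketch of the new Tensor interfaces (shapes and values are arbitrary):

```python
import numpy as np
import oneflow as flow

x = flow.from_numpy(np.arange(6, dtype=np.float32).reshape(2, 3))  # build a tensor from a numpy array
print(flow.numel(x))  # 6, via the new oneflow.numel()

target = flow.zeros(4, 2, 3)
y = x.expand_as(target)  # broadcast x to the shape of `target`
```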
Model
1. Richer Visual Pre-training Models
Image Classification
- CNN series: ResNet, DenseNet, VGG, ResNext, EfficientNet, etc.
- Vision Transformer series: ViT, PVT, Swin-Transformer, etc.
- Vision MLP series: Mlp-Mixer, Res-MLP, g-MLP, etc.
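The classification models are created through flowvision.models, much like torchvision (a sketch; the pretrained flag is assumed to follow torchvision's convention):

```python
import oneflow as flow
from flowvision import models

model = models.resnet50(pretrained=True)  # assumption: mirrors torchvision's pretrained flag
model.eval()

x = flow.randn(1, 3, 224, 224)
with flow.no_grad():
    logits = model(x)
print(logits.shape)  # oneflow.Size([1, 1000])
```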
Object Detection
- SSD, SSDLite
- Faster R-CNN
- RetinaNet
Image Segmentation
- FCN
- DeepLabV3
Style Transfer
- StyleNet: supports the styles sketch, candy, mosaic, rain_princess, and udnie
2. Implemented Data Augmentation Operations Similar to torchvision
For data augmentation operations like CenterCrop and ColorJitter that are similar to torchvision's, developers can run import flowvision as torchvision, which works in most scenarios.
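A minimal sketch of building a preprocessing pipeline with flowvision's transforms in place of torchvision's:

```python
from flowvision import transforms

# CenterCrop and ColorJitter behave like their torchvision counterparts
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4),
    transforms.ToTensor(),
])
# `preprocess` can then be passed to a dataset exactly as with torchvision
```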
3. Implemented Advanced Data Augmentation Operations Similar to timm
Advanced data augmentation operations implemented in flowvision.data:
- Mixup
- CutMix
- Random-Erasing
- AutoAugment
- RandAugment
- AugMix
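A sketch of one of these, assuming flowvision.data.Mixup keeps timm's constructor arguments (the argument names below follow timm and are an assumption):

```python
import oneflow as flow
from flowvision.data import Mixup

# assumed to mirror timm's Mixup signature (mixup_alpha, cutmix_alpha, num_classes, ...)
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=1000)

images = flow.randn(8, 3, 224, 224)
labels = flow.tensor([0, 1, 2, 3, 4, 5, 6, 7])
mixed_images, soft_targets = mixup_fn(images, labels)  # labels become soft (mixed one-hot) targets
```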
4. Separated the Layers Module and Provided Plug-and-play Blocks for Model Building
flowvision.layers.attention
- Implemented plug-and-play attention modules like Non-Local, SELayer, CBAM, BAM, ECA, etc.
flowvision.layers.blocks
- Provided modules that might be used for model building like PatchEmb, Pooler, ConvBnAct, etc.
flowvision.layers.regularization
- Provided regularization modules such as drop-path, drop-block, and stochastic depth to improve model generalization ability
- Provided separate activation and weight_init files to improve components like activation functions and initialization methods
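For example, one of the attention blocks can be dropped into an existing model (a sketch; the import path and constructor argument are assumed from the module names above):

```python
import oneflow as flow
from flowvision.layers.attention import SELayer  # assumed path based on the module layout above

se = SELayer(64)  # assumed constructor: number of input channels
features = flow.randn(2, 64, 56, 56)
recalibrated = se(features)  # channel-wise re-weighting of the feature map
print(recalibrated.shape)    # same shape as the input
```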
OneFlow-ONNX Conversion
Updated the OneFlow-to-ONNX toolkit:
- Supported converting OneFlow models to ONNX models in CPU or GPU mode
- Added test cases for operators and models to align all classification models in the OneFlowVision library
- Fixed onnx-runtime bugs during PReLU conversion
- Compatible with the onnx-runtime library v1.9.0 or later versions
- Released the v0.5.4 oneflow-onnx package; developers can run pip install oneflow-onnx to try it out
Full changelog link: https://github.com/Oneflow-Inc/oneflow
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.