Welcome to OneFlow v0.6.0! We would love to hear your feedback.
This release mainly updates three parts: the framework, models, and OneFlow-ONNX. Highlights include:
- Performance optimization in static graphs, dynamic graphs, operators, memory occupation, etc
- A larger number of common operators
- Improvements in static graphs and ConsistentTensor
- Serving functionality as Nvidia Triton’s backend
- Richer visual pre-training models similar to torchvision and timm
- Better OneFlow-ONNX conversion functionality
The following are the detailed release notes.
Framework
1. Performance Optimization of nn.Graph
Compared to v0.5.0, nn.Graph in v0.6.0 delivers a 10% training speedup on models such as ResNet AMP and WDL
- Optimized nn.Graph’s performance in high-frequency iterative training scenarios
- Redesigned nn.Graph’s scheduling instructions and refactored the interaction logic between the Actor Graph and the Eager VM, so that the Graph’s runtime execution is asynchronous and overlaps with Python-side input/output Tensor handling as much as possible
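As a refresher on how a model is wrapped into a static graph (a minimal sketch; the Linear model, shapes, and values below are made up for illustration):

```python
import oneflow as flow
import oneflow.nn as nn

class LinearGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model  # reuse an Eager nn.Module inside the static graph

    def build(self, x):
        # build() describes the computation; it is compiled once and then
        # executed as a static graph on subsequent calls
        return self.model(x)

model = nn.Linear(8, 4)
graph = LinearGraph(model)
y = graph(flow.randn(2, 8))  # the first call triggers compilation; later calls reuse it
```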
2. Performance Optimization of Eager
Compared to v0.5.0, OneFlow Eager in v0.6.0 trains dramatically faster in small-batch scenarios
- Optimized the scheduling logic for virtual machines
- Optimized get/set item
- Optimized tensor.numel()
- Optimized oneflow.Size()
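A quick sketch of the kind of small Eager calls these optimizations target (shapes and values are arbitrary):

```python
import oneflow as flow

x = flow.randn(4, 5)
n = x.numel()      # optimized tensor.numel()
s = x.size()       # returns an oneflow.Size object
x[0, 1] = 1.0      # optimized set item
v = x[0, 1]        # optimized get item
```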
3. Performance Optimization of Operators
Optimized operators that affect the performance of new models, significantly improving the training speed of these models
- Added fused dropout operators
- Added CPU-version group deconv and optimized its performance
- Added inplace-version implementation for operators mul, hard_sigmoid, and sin
- Optimized the performance of linalg.vector_norm when ord=2.0, making it 4 times faster than before
- Deeply optimized the LayerNorm operator, making its performance significantly better than the PyTorch and Apex implementations. For more information, refer to How to Implement an Efficient LayerNorm CUDA Kernel — OneFlow Performance Optimization
- Realized automatic type promotion of operators. For more information, refer to Automatic Type Promotion of Operators in OneFlow
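For example, the ord=2.0 path of linalg.vector_norm and automatic type promotion look like this (a small sketch; values are arbitrary):

```python
import oneflow as flow

x = flow.randn(1024)
norm = flow.linalg.vector_norm(x, ord=2.0)  # the ord=2.0 path is the one optimized above

a = flow.tensor([1, 2, 3], dtype=flow.int32)
b = flow.tensor([0.5, 0.5, 0.5], dtype=flow.float32)
c = a + b  # automatic type promotion: the result dtype is float32
```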
4. Performance Optimization of Eager’s Memory Occupation
Optimized the memory occupation of some operators during network training, allowing the same computing device to run larger models or data
- Optimized the backward memory occupation of broadcast binary operators
- Optimized the backward memory occupation of Slice operator
- Optimized the memory occupation of LayerNorm operator
5. More Useful Features to Static Computation Graph (nn.Graph)
The newly added features improve the efficiency, debugging, completeness, and usability of static graphs
To help the debugging of static graphs, we added the following features:
- debug mode supports graph.debug(1), which prints more information about graph composition (see the sketch after this list)
- Provided the environment variable ONEFLOW_DEBUG_PASS to show the changes in the computed graph before and after compile-time optimization
- Added user-readable thread naming information to Nsight profiles to make it easier to locate target threads
- Added many static graph test cases and automatic nn.Graph tests that accompany the Eager tests
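A minimal sketch of enabling these debugging aids together (the toy graph is made up, and the value assigned to ONEFLOW_DEBUG_PASS is an assumption; check the documentation for the accepted settings):

```python
import os
# Show the computation graph before and after compile-time optimization passes
os.environ["ONEFLOW_DEBUG_PASS"] = "1"  # assumed value for illustration

import oneflow as flow
import oneflow.nn as nn

class MyGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def build(self, x):
        return self.linear(x)

graph = MyGraph()
graph.debug(1)               # verbose level 1: print more graph composition information
y = graph(flow.randn(2, 8))  # compilation happens here, with the extra debug output
```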
Provided graph.save() and load() interfaces to support the deployment of models (Serving) using nn.Graph
To enable AMP acceleration on GPUs with TensorCore, the environment variable ONEFLOW_ENABLE_NHWC is provided to make CNN-related operators compute in channels-last (NHWC) format
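A minimal sketch of turning on the channels-last path (the value "1" for the switch is an assumption; the conv shapes are arbitrary):

```python
import os
os.environ["ONEFLOW_ENABLE_NHWC"] = "1"  # assumed switch value; tells CNN-related ops to use NHWC

import oneflow as flow
import oneflow.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3).to("cuda")
x = flow.randn(8, 3, 224, 224, device="cuda")
y = conv(x)  # with the flag set, the convolution takes the channels-last code path
```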
Enabled nn.Graph to support more usage scenarios:
- Supported the Sparse Update Optimizer for sparse parameter updates in WDL scenarios
- Supported using the following nn.Module containers with nn.Graph: Sequential, ModuleList, ModuleDict, ParameterList, and ParameterDict
- Supported creating the Optimizer in the __init__ function of nn.Graph (see the sketch after this list)
- Supported multiple parameters sharing the same Tensor in nn.Graph
- Supported scenarios where the actual number of processes is greater than the number of GPU devices
- Supported more inplace execution for Consistent SBP inference under nn.Graph
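For example, creating and registering the Optimizer inside nn.Graph's __init__ looks roughly like this (a minimal training-graph sketch; the model, loss, and hyperparameters are made up):

```python
import oneflow as flow
import oneflow.nn as nn

class TrainGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model
        # the Optimizer is created in __init__ and registered with the graph
        optimizer = flow.optim.SGD(model.parameters(), lr=0.1)
        self.add_optimizer(optimizer)

    def build(self, x, y):
        loss = ((self.model(x) - y) ** 2).mean()
        loss.backward()  # backward and the optimizer step are driven by the compiled graph
        return loss

model = nn.Linear(8, 1)
graph = TrainGraph(model)
loss = graph(flow.randn(4, 8), flow.randn(4, 1))
```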
6. A Larger Number of Operators
- Newly added operators: cumsum, meshgrid, linspace, diagonal, movedim, roialign, nms, arccos, and roll
- Newly added operators: masked_fill, floordiv, glu, pool1d, pool2d, and pool3d
- Newly added unfold and fold operators: Adding Unfold and Fold Ops into OneFlow
- Achieved automatic data type promotion of operators: Automatic Type Promotion of Operators in OneFlow
- Added expand and repeat operators: Added Expand and Repeat Operators into OneFlow
- Supported one-click switching to run models from the current torchvision library via the command
import oneflow as torch
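The switch works because torchvision-style model code only touches the torch namespace, so pointing that name at OneFlow is enough. A sketch with a made-up tiny model file:

```python
# my_model.py: a torchvision-style model definition; the only change needed
# to run it on OneFlow is the import alias below
import oneflow as torch
import oneflow.nn as nn  # in the original code this would be `import torch.nn as nn`

class TinyNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

net = TinyNet()
out = net(torch.randn(2, 3, 32, 32))  # runs on OneFlow tensors
```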
7. User-Defined autograd.Function
Users can customize autograd.Function just like in PyTorch.
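A minimal sketch of a custom Function, assuming the interface mirrors PyTorch's staticmethod forward/backward pattern as the note suggests:

```python
import oneflow as flow

class MySquare(flow.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # stash the input for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output  # d(x^2)/dx = 2x

x = flow.ones(3, requires_grad=True)
y = MySquare.apply(x).sum()
y.backward()
print(x.grad)  # expected: [2., 2., 2.]
```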
8. Added Basic Serving Functionality
Model serving functionality is provided by OneFlow as an Nvidia Triton backend.
9. Added Some Functionalities of Tensor (ConsistentTensor)
- Supported Tensor using 2-D SBP to represent arbitrary hybrid parallelism (e.g., a Linear operation that runs data parallelism along the row direction of the device matrix and model parallelism along the column direction)
- Supported Tensor’s conversion from arbitrary 1-D SBP to 2-D SBP (the network consists of a mixture of 1-D parallel and 2-D parallel)
- Supported constructing ConsistentTensor from numpy
- oneflow.from_numpy()
- oneflow.numel()
- tensor.expand_as()
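A quick sketch of the new Tensor interfaces (shapes and values are arbitrary):

```python
import numpy as np
import oneflow as flow

x = flow.from_numpy(np.arange(6, dtype=np.float32).reshape(2, 3))  # build a tensor from a numpy array
print(flow.numel(x))  # 6, via the new oneflow.numel()

target = flow.zeros(4, 2, 3)
y = x.expand_as(target)  # broadcast x to the shape of `target`
```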
Model
1. Richer Visual Pre-training Models
Image Classification
- CNN series: ResNet, DenseNet, VGG, ResNext, EfficientNet, etc.
- Vision Transformer series: ViT, PVT, Swin-Transformer, etc.
- Vision MLP series: Mlp-Mixer, Res-MLP, g-MLP, etc.
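The classification models are created through flowvision.models, much like torchvision (a sketch; the pretrained flag is assumed to follow torchvision's convention):

```python
import oneflow as flow
from flowvision import models

model = models.resnet50(pretrained=True)  # assumption: mirrors torchvision's pretrained flag
model.eval()

x = flow.randn(1, 3, 224, 224)
with flow.no_grad():
    logits = model(x)
print(logits.shape)  # oneflow.Size([1, 1000])
```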
Object Detection
- SSD, SSDLite
- Faster R-CNN
- RetinaNet
Image Segmentation
- FCN
- DeepLabV3
Style Transfer
- StyleNet: supports the styles sketch, candy, mosaic, rain_princess, and udnie
2. Implemented Data Augmentation Operations Similar to torchvision
For data augmentation operations like CenterCrop and ColorJitter that are similar to torchvision's, developers can run import flowvision as torchvision, which works in most scenarios.
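A minimal sketch of building a preprocessing pipeline with flowvision's transforms in place of torchvision's:

```python
from flowvision import transforms

# CenterCrop and ColorJitter behave like their torchvision counterparts
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ColorJitter(brightness=0.4, contrast=0.4),
    transforms.ToTensor(),
])
# `preprocess` can then be passed to a dataset exactly as with torchvision
```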
3. Implemented Advanced Data Augmentation Operations Similar to timm
Advanced data augmentation operations implemented in flowvision.data:
- Mixup
- CutMix
- Random-Erasing
- AutoAugment
- RandAugment
- AugMix
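A sketch of one of these, assuming flowvision.data.Mixup keeps timm's constructor arguments (the argument names below follow timm and are an assumption):

```python
import oneflow as flow
from flowvision.data import Mixup

# assumed to mirror timm's Mixup signature (mixup_alpha, cutmix_alpha, num_classes, ...)
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=1000)

images = flow.randn(8, 3, 224, 224)
labels = flow.tensor([0, 1, 2, 3, 4, 5, 6, 7])
mixed_images, soft_targets = mixup_fn(images, labels)  # labels become soft (mixed one-hot) targets
```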
4. Separated the Layers Module and Provided Plug-and-play Blocks for Model Building
flowvision.layers.attention
- Implemented plug-and-play attention modules like Non-Local, SELayer, CBAM, BAM, ECA, etc.
flowvision.layers.blocks
- Provided modules that might be used for model building like PatchEmb, Pooler, ConvBnAct, etc.
flowvision.layers.regularization
- Provided regularization modules such as drop-path, drop-block, and stochastic depth to improve model generalization ability
- Provided separate activation and weight_init files to improve components like activation functions and initialization methods
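For example, one of the attention blocks can be dropped into an existing model (a sketch; the import path and constructor argument are assumed from the module names above):

```python
import oneflow as flow
from flowvision.layers.attention import SELayer  # assumed path based on the module layout above

se = SELayer(64)  # assumed constructor: number of input channels
features = flow.randn(2, 64, 56, 56)
recalibrated = se(features)  # channel-wise re-weighting of the feature map
print(recalibrated.shape)    # same shape as the input
```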
OneFlow-ONNX Conversion
Updated the OneFlow-to-ONNX toolkit:
- Supported converting OneFlow models to ONNX models in CPU or GPU mode
- Added test cases for operators and models to align all classification models in the OneFlowVision library
- Fixed onnx-runtime bugs during PReLU conversion
- Compatible with the onnx-runtime library v1.9.0 or later versions
- Released the v0.5.4 oneflow-onnx package; developers can run pip install oneflow-onnx to try it out
Full changelog link: https://github.com/Oneflow-Inc/oneflow
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.