OneEmbedding Enables Efficient Training of Large Recommender Models with a Single GPU

OneFlow
12 min read · Aug 11, 2022


Written by Zheng Zekang, Guo Ran, Liu Juncheng, Yuan Jinhui; Translated by Hu Yanjun, Shen Jiali, Cheng Haoyuan, Jia Chuan

Personalized recommendations have become a main source of information for people today. Unlike in the past, when people could only find what they wanted by searching, today's media outlets can suggest relevant items to users after identifying their interests with recommendation algorithms.

For users, recommender systems improve the online experience. For businesses, quickly matching people with the information they need has turned users into customers and given rise to business empires with trillion-dollar market values.

Recommender systems power short video feeds, search ads, and online shopping, and deep learning models are the driving force behind them.

However, the accumulation of massive data and increasingly frequent iterations of user data pose serious challenges to the extensibility and training speed of these systems. It turns out that general deep learning frameworks are not enough to meet the needs of industrial recommender systems; we must further customize them or even develop specialized recommender systems.

To tackle these problems, the OneFlow team has recently released OneEmbedding, an efficient, extensible, and highly flexible recommender system component. It is as easy to use as general deep learning frameworks, but its performance is much better, even surpassing HugeCTR, a specialized recommendation framework developed by NVIDIA.

Specifically, OneEmbedding outperforms HugeCTR by a significant margin in FP32 and Automatic Mixed Precision (AMP) training on both the DCN and DeepFM models, and matches HugeCTR on the DLRM model, which HugeCTR has optimized to the extreme.

(All experiments are conducted in the same testing environment: CPU Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz * 2; CPU Memory 1920GB; GPU NVIDIA A100-SXM-80GB * 8; SSD Intel SSD D7P5510 Series 3.84TB * 4)

To train a recommender model with a TB-level Embedding table in OneFlow, users only need to configure the Embedding table with the following snippet:

# self.embedding = nn.Embedding(vocab_size, embedding_vec_size)
self.embedding = flow.one_embedding.MultiTableEmbedding(
    "sparse_embedding",
    embedding_dim=embedding_vec_size,
    dtype=flow.float,
    key_type=flow.int64,
    tables=tables,
    store_options=store_options,
)
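
The snippet assumes that `tables` and `store_options` have already been built. The sketch below shows one way they could be constructed; the helper names (make_table_options, make_normal_initializer, make_cached_ssd_store_options) follow OneFlow's one_embedding module, but exact arguments may differ by version, and num_sparse_fields and the SSD path are hypothetical placeholders:

import oneflow as flow

num_sparse_fields = 26                      # hypothetical: number of sparse feature fields
vocab_size = 1_000_000                      # hypothetical: total number of feature IDs

# one table option per sparse field, each with its own initializer
tables = [
    flow.one_embedding.make_table_options(
        flow.one_embedding.make_normal_initializer(mean=0.0, std=0.05)
    )
    for _ in range(num_sparse_fields)
]

# SSD-backed storage with a GPU cache (one of the presets described later)
store_options = flow.one_embedding.make_cached_ssd_store_options(
    cache_budget_mb=8192,                   # GPU cache budget in MB
    persistent_path="/ssd/one_embedding",   # hypothetical SSD path
    capacity=vocab_size,
)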

Here are some common cases of search ads recommender models that are constructed with OneEmbedding: https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems

Challenges facing large recommender systems

Generally speaking, building a recommender system involves sparse features such as gender, age, and user behaviors. Using the feature ID, the system looks up the corresponding Embedding vector in the Embedding table and passes the vector downstream for further computation.

The popular public dataset Criteo 1TB contains around one billion feature IDs. If embedding_dims is set to 128, the Embedding parameters alone require 512 GB of storage. Moreover, with the Adam optimizer, the required storage spikes to 1536 GB because the extra state variables m and v must also be stored. In real application scenarios, the data can be several orders of magnitude larger than Criteo, requiring correspondingly larger models.
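
As a quick sanity check on those numbers (a back-of-the-envelope sketch, not part of the original benchmark):

num_ids = 1_000_000_000      # ~1 billion feature IDs in Criteo 1TB
embedding_dim = 128
bytes_per_elem = 4           # FP32

params_gb = num_ids * embedding_dim * bytes_per_elem / 1e9
print(params_gb)             # 512.0 GB for the Embedding table alone

# Adam keeps two extra state tensors (m and v) of the same shape,
# so the total storage triples
print(params_gb * 3)         # 1536.0 GB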

A major challenge facing large recommender systems is how to support the lookup and update of huge Embedding tables more efficiently and economically. Different tradeoffs among scale, cost, and efficiency lead to three common solutions.

The earliest and most frequently adopted solution is to deploy the Embedding on CPUs, taking advantage of large and inexpensive CPU memory to hold more parameters. The upside is that there is almost no limit on model size. The drawback, however, is significant: the CPU is nowhere near the GPU in computation performance and bandwidth, so the Embedding becomes a major bottleneck. As a result, it can take dozens or even hundreds of CPU servers to support an industrial recommender system.

Since the GPU is a perfect tool for dense computation, some suggest using GPUs to train large Embedding models. But this exposes another problem: GPUs are expensive and have limited memory. To train 128-dimensional Embedding vectors on Criteo with NVIDIA A100s, we would need at least 13 of them, since each A100 has only 40 GB of memory. Distributing the Embedding across devices requires a technique called "model parallelism". Ideally, to train larger models, we only need more GPUs.

The fact is, GPUs cost far more than CPUs, while the main computation in recommender models is relatively small. Model parallelism only enables larger Embedding scales; it does not bring much training speed gain. On the contrary, it can slow training down because of the communication it introduces between devices, so it is usually only suitable for small-scale clusters.

In order to increase transmission bandwidth between GPUs, interconnect technologies such as NVSwitch and InfiniBand, which offer higher bandwidth than Ethernet, were developed. However, adopting these technologies means additional cost, and many users do not have infrastructure that meets the conditions for such an upgrade.

Is it possible to have it both ways?

Our answer is "yes". OneFlow designed OneEmbedding to solve the problems above. OneEmbedding enables single-GPU training of TB-level models through hierarchical storage and removes the limit on model size through horizontal scaling. Built on OneFlow's automatic pipelining, kernel optimization, and quantized compression of communication, OneEmbedding delivers excellent performance. As easy to use as PyTorch, it outperforms TorchRec by a factor of 3 on the DLRM model. What's more, with mixed precision enabled, which TorchRec does not support, OneEmbedding outperforms TorchRec by more than 7 times.

(TorchRec’s performance results on 8 A100 GPUs: https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm/#preliminary-training-results)

Core advantages of OneEmbedding

Hierarchical storage: support single-GPU training of TB-level models

By exploiting the spatial and temporal locality of data, a multi-level cache strikes a good balance between performance and cost. OneEmbedding implements multi-level caching based on this idea, which lets users train TB-level models even with a single GPU.

Users can place the Embedding in GPU memory, CPU memory, or even on SSD. This takes full advantage of the low cost of CPU memory and SSD: it not only expands the scale of the Embedding parameters but also uses GPU memory as a cache to achieve better performance.

OneEmbedding dynamically caches frequently accessed entries in GPU memory and evicts less frequently accessed ones to the underlying storage, such as CPU memory or SSD. As long as the data follows a power-law distribution, OneEmbedding can keep the GPU cache hit rate high with an effective cache management algorithm.

It is worth noting that OneEmbedding only uses CPU memory and SSD as storage devices, and all computations are performed on GPU. Currently, OneEmbedding provides three preset storage configs:

  • Use GPU memory to store all model parameters
  • Use CPU memory as Embedding parameter storage device and GPU as cache
  • Use SSD as Embedding parameter storage device and GPU as cache
# Use SSD as storage device and GPU as cache
store_options = flow.one_embedding.make_cached_ssd_store_options(
    cache_budget_mb=cache_memory_budget_mb,
    persistent_path=persistent_path,
    capacity=vocab_size,
)
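
The other two presets work the same way. The sketch below assumes the helper names make_device_mem_store_options and make_cached_host_mem_store_options from the one_embedding module; exact argument lists may vary between OneFlow versions:

# 1) Keep all Embedding parameters in GPU memory
store_options = flow.one_embedding.make_device_mem_store_options(
    persistent_path=persistent_path,
    capacity=vocab_size,
)

# 2) Use CPU (host) memory as the storage device and GPU memory as cache
store_options = flow.one_embedding.make_cached_host_mem_store_options(
    cache_budget_mb=cache_memory_budget_mb,
    persistent_path=persistent_path,
    capacity=vocab_size,
)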

Users can configure this with just a few lines of code according to their actual hardware, balancing scale, efficiency, and cost in one go.

To offset the latency of fetching data from CPU memory or SSD, OneEmbedding introduces techniques such as pipelining and data prefetching, so that training with CPU memory or SSD as the storage backend can be as efficient as pure GPU training.

We test these three storage configs separately. The test case is consistent with the DLRM model of MLPerf, and the parameter size is about 90GB. When using SSD and CPU memory as storage devices, the GPU cache size we configure is 12 GB per GPU. This means only a part of the parameters are stored in the GPU memory. Others are stored in the CPU memory or SSD and will be dynamically swapped into the GPU cache as the training goes on. The test results are shown as follows:

(Testing environment: CPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz * 2; CPU Memory 512GB; GPU NVIDIA A100-PCIE-40GB * 4; SSD Intel SSD D7P5510 Series 7.68TB * 4)

We can tell from the test results that:

(1) The full GPU memory solution delivers the optimal performance, but the largest model it can train is only 160 GB in theory because the GPU memory capacity is only 4x 40GB.

(2) Compared with the full GPU memory solution, the GPU cache plus CPU memory solution only suffers little performance loss, but it can scale the ceiling of parameter size to the CPU memory capacity, often hundreds of GB to several TB.

(3) If you can accept a larger performance loss, the GPU cache plus SSD solution raises the ceiling of parameter size to the SSD capacity, so the largest trainable model can reach tens of TB or more.

If we want to conduct a complete training of the DLRM model on a server with a single NVIDIA A30–24GB GPU, it’s obvious that 24GB memory can’t directly support training a 90GB model. Instead, with the help of hierarchical storage that specifies CPU memory as the storage device and GPU memory as the cache, it’s no longer a problem to train 90GB or larger models.

Scale-out: multi-GPU linear acceleration to break the ceiling of model training

Hierarchical storage raises the ceiling of the Embedding parameter size on a single GPU, making it possible to train even TB-level models given sufficient memory capacity. If the model size grows far beyond the CPU memory capacity, users can rely on OneFlow's parallelism to scale out to multiple nodes and GPUs, combined with hierarchical storage, to train even larger models.

In a recommender system, the dense model parameters are much smaller than the Embedding parameters. Therefore, we generally apply model parallelism to the Embedding and data parallelism to the rest of the model. In this way, the multi-node multi-GPU setup can further increase the Embedding size.

The detailed implementation goes as follows (a sketch of the deduplication step appears after this list):

1) Each rank stores a shard of the Embedding table, and feature IDs are fed into each rank. Since the incoming IDs may contain duplicates, they are first deduplicated (ID Shuffle).
2) Each rank queries its local shard with the unique IDs; the partial results from all ranks are then exchanged and merged so that each rank ends up with the complete Embedding data it needs (Embedding Shuffle).
3) Each rank runs the rest of the model training in data parallelism.
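
The NumPy sketch below illustrates the deduplication step referenced above; it is purely conceptual, whereas in OneEmbedding the ID Shuffle and Embedding Shuffle are fused GPU kernels plus collective communication:

import numpy as np

feature_ids = np.array([7, 3, 7, 9, 3, 3])   # IDs arriving at one rank, with duplicates

# ID Shuffle: each unique ID is looked up only once
unique_ids, inverse = np.unique(feature_ids, return_inverse=True)   # unique_ids = [3, 7, 9]

# stand-in for this rank's shard of the Embedding table
local_table = np.random.rand(16, 4).astype(np.float32)
unique_vecs = local_table[unique_ids]

# after the cross-rank exchange (Embedding Shuffle), restore the original
# order and duplication for downstream data-parallel computation
embeddings = unique_vecs[inverse]             # shape: (6, 4)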

The following figure displays the model's throughput on nodes with different numbers of GPUs when OneEmbedding trains the DLRM model with the pure GPU memory solution (the blue columns denote AMP compute, the orange columns FP32 compute).

(Testing environment: CPU Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz * 2; CPU Memory 1920GB; GPU NVIDIA A100-SXM-80GB * 8; SSD Intel SSD D7P5510 Series 3.84TB * 4)

As the number of GPUs increases, the model's throughput scales up significantly. In the AMP case, throughput reaches 6 million on a single GPU and climbs to almost 40 million on 8 GPUs.

Pipeline mechanism: automatically overlapping computation and data transfer

In the DLRM model, the Dense Features go into the Bottom MLP, while the Sparse Features fetch their corresponding vectors by querying the Embedding table. The two sets of features are then crossed in the Interaction layer, and the result finally enters the Top MLP.

Embedding-related operations include Embedding Lookup and Embedding Update. Since OneEmbedding uses tiered storage, some feature IDs may miss the GPU cache, and pulling them from lower tiers takes longer and can slow down training.

To hide this latency, OneEmbedding adds an Embedding Prefetch operation so that both Embedding Lookup and Embedding Update are performed on the GPU. Since the data prefetches of consecutive iterations have no dependency on each other, the Embedding data needed for the next iteration can be prefetched while the current iteration is being computed, allowing computation and prefetching to overlap.
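
Conceptually, the prefetching resembles a double-buffering loop in which the lookup for iteration i+1 runs in the background while iteration i is computed. The sketch below is only an illustration (lookup_from_storage, dataloader, and train_step are hypothetical placeholders); inside OneEmbedding this overlap happens on dedicated CUDA streams:

from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

batches = iter(dataloader)                                      # hypothetical data source
future = executor.submit(lookup_from_storage, next(batches))    # prefetch iteration 0

for batch in batches:
    embeddings = future.result()                                # rows for the current iteration
    future = executor.submit(lookup_from_storage, batch)        # prefetch the next iteration
    train_step(embeddings)                                      # compute overlaps with the prefetch

train_step(future.result())                                     # last prefetched batch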

When the Embedding data is being queried and exchanged, Dense Features that are not related to the Embedding operation can enter the Bottom MLP for computation, overlapping with the former two operations in time. The full timing of the overlapped execution is shown in the figure below.

Controlling such a complex data pipeline remains a major challenge for traditional deep learning frameworks. Besides, in the actual recommendation scenario, users’ data is changing constantly, which requires the pipeline mechanism to process dynamic data.

The Actor mechanism of OneFlow makes all of this straightforward. Each Actor coordinates distributed work via its internal state machine and message mechanism. By assigning multiple storage blocks to each Actor, different Actors can work at the same time, forming a pipeline in which their working phases overlap. We only need to dispatch the Embedding operations to their own stream, and the system builds the pipeline automatically.

Kernel optimization: approaching GPU’s optimal performance

OneFlow has not only deeply optimized general operators but also implemented multiple high-performance CUDA operators tailored to the characteristics of popular recommender models.

For the feature crosses in DLRM and DCN, OneFlow has implemented the FusedDotFeatureInteraction and FusedCrossFeatureInteraction operators respectively.

(FusedCrossFeatureInteraction operator, picture from “Deep & Cross Network for Ad Click Predictions”)

For multiple fully connected layers in the model, OneFlow has implemented the FusedMLP operator based on the cublasLt library.
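
A minimal usage sketch, assuming oneflow.nn.FusedMLP takes the input width, a list of hidden widths, and the output width (check the current API docs for the exact signature):

import oneflow as flow

# several Linear + ReLU layers fused via cuBLASLt epilogues
bottom_mlp = flow.nn.FusedMLP(
    in_features=13,                 # e.g. the 13 dense features in Criteo/DLRM
    hidden_features=[512, 256],     # hidden layer widths
    out_features=128,               # matches the embedding dimension
).to("cuda")

x = flow.randn(32, 13, device="cuda")
y = bottom_mlp(x)                   # shape: (32, 128)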

For fully connected layers followed by a Dropout operation, OneFlow has deeply customized a ReluDropout operation. Specifically, OneFlow stores the forward mask as a bitmask and fuses the backward operators by setting the alpha parameter of the cuBLASLt matrix multiplication to dropout_scale during backpropagation.

Embedding quantization: improving communication efficiency

In the communication stage of model training, much recent work has gone into quantizing and compressing data to reduce communication volume and improve communication efficiency. This feature is also available in OneEmbedding.

In parallel training, each rank needs to exchange its Embedding data with the others. OneEmbedding first quantizes the data from floating point to int8 before the exchange and dequantizes it back afterwards.
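
The quantize-transfer-dequantize idea can be illustrated with a simple symmetric int8 scheme (conceptual only; the actual codec used by OneEmbedding may differ):

import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                    # one symmetric scale per tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale                                     # int8 payload is 4x smaller than FP32

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

emb = np.random.randn(4, 128).astype(np.float32)
q, scale = quantize_int8(emb)                           # what would be sent over the network
recovered = dequantize_int8(q, scale)
print(np.abs(emb - recovered).max())                    # small quantization error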

The following figure shows the DLRM model’s throughputs before and after being quantized in FP32 and AMP respectively. Note that the DLRM model is trained on the machine adopting the full GPU memory solution.

Model precision comparison before and after quantization (AUC):

(Testing environment: CPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz * 2; CPU Memory 512GB; GPU NVIDIA A100-PCIE-40GB * 4; SSD Intel SSD D7P5510 Series 7.68TB * 4)

The test results show that, with no loss of model precision, throughput with quantized communication rises by 64% in the FP32 case and by 13% in the AMP case compared with the default communication mode.

Easy-to-use: build large-scale recommendation models as easily as in PyTorch

OneEmbedding is now built into OneFlow as an extension component. That means users can enjoy the flexibility of the general OneFlow framework to build their recommendation models while using OneEmbedding's advanced features.

class DLRMModule(nn.Module):
    def __init__(self, args):
        super(DLRMModule, self).__init__()
        self.bottom_mlp = FusedMLP(...)
        self.embedding = OneEmbedding(...)
        self.interaction = FusedDotInteraction(...)
        self.top_mlp = FusedMLP(...)

    def forward(self, sparse_feature, dense_feature):
        dense_fields = self.bottom_mlp(dense_feature)
        embedding = self.embedding(sparse_feature)
        features = self.interaction(dense_fields, embedding)
        return self.top_mlp(features)

Finally, it is worth mentioning that OneEmbedding encodes the feature IDs via a built-in encoding mechanism and supports the dynamic insertion of new data. There is no need to plan the Embedding capacity in advance or specially process the feature IDs in the dataset. This dynamic mechanism is designed to support various incremental training scenarios while reducing the usage burden.

At the moment, OneFlow's model library provides a collection of models built on OneEmbedding, such as DLRM, DeepFM, xDeepFM, DCN, PNN, and MMoE, and more recommender models will be added in the future (https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems).

Conclusion

OneEmbedding is a component designed to train large-scale recommender system models, and its features spanning flexible tiered storage, highly optimized data pipeline, and ease of scale-out enable users to easily train TB-level recommender models.

Currently, OneFlow provides model samples so that you can try out OneEmbedding with just one click. Later, OneFlow will also launch Flow-Recommender, a recommender system model library covering all mainstream models and supporting both distributed training and inference, so stay tuned.

Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.

Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.




Written by OneFlow

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient. https://github.com/Oneflow-Inc/oneflow