Written by Luyang Zhao; Translated by Xiaozhen Liu, Chenyang Xu
This article introduces the principles and workflow of the DataLoader in OneFlow/PyTorch, as well as how the multi-process dataloader works.
Introduction to DataLoader
In simple terms, DataLoader is a data loader that produces the batch of data and labels used in each iteration over a dataset. It is a necessity in deep learning.
As described in the PyTorch documentation, the DataLoader combines a dataset and a sampler, and provides an iterable over the given dataset. It supports single- or multi-process loading.
DataLoader Principles
Core Components
- DataLoader
- Dataset
- Sampler
- Fetcher
The following is a brief summary of the working principle of DataLoader:
- DataLoader is the core class responsible for data loading; DataLoaderIter is its basic execution unit. Each time the DataLoader enters __iter__, it delegates the concrete data loading work to a DataLoaderIter;
- Dataset is the base class for all datasets. Any custom dataset needs to inherit it and override the __getitem__ method to define how data is fetched;
- Sampler is the sampler responsible for producing indices. In each iteration, the Sampler generates the indices of the dataset entries to be sampled;
- Fetcher acts as a data collector. Based on the batch of indices generated by the Sampler, the Fetcher retrieves the corresponding data from the dataset, packages it into its final usable form through the corresponding collate_fn method, and returns it to the dataloader.
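Conceptually, one iteration of the DataLoader ties these pieces together roughly as follows. This is only a simplified sketch of the control flow; batch_sampler, dataset, and collate_fn stand in for the real internal objects rather than quoting the actual source:

```python
# Simplified picture of one DataLoader iteration over a map-style dataset
# (single-process case; names are placeholders, not the real internals).
for batch_indices in batch_sampler:                # the Sampler yields the indices for one batch
    samples = [dataset[i] for i in batch_indices]  # Dataset.__getitem__ fetches each sample
    batch = collate_fn(samples)                    # the Fetcher collates the samples into batched Tensors
    # the DataLoader then hands `batch` to the training loop
```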
Usage Examples
1. MNIST
Here we use the PyTorch example of training a classification network on the MNIST dataset to illustrate the usage of DataLoader:
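The relevant portion of that example, condensed here with some arguments simplified, looks roughly like this:

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# dataset1 is the training set, dataset2 is the test set
dataset1 = datasets.MNIST('../data', train=True, download=True, transform=transform)
dataset2 = datasets.MNIST('../data', train=False, transform=transform)

train_loader = torch.utils.data.DataLoader(dataset1, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset2, batch_size=1000)
```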
It can be seen that dataset1 and dataset2 are respectively the training set and the test set of the dataset. In PyTorch, they are defined by torchvision.datasets.MNIST.
MNIST inherits from VisionDataset, and VisionDataset inherits from torch.utils.data.Dataset.
MNIST implements the dataset's most important method, __getitem__, which fetches the corresponding data according to an index:
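A lightly simplified version of that method looks roughly like this (Image is PIL.Image, and self.data/self.targets hold the raw images and labels):

```python
def __getitem__(self, index):
    img, target = self.data[index], int(self.targets[index])
    # return a PIL Image so that it stays consistent with the other datasets
    img = Image.fromarray(img.numpy(), mode="L")
    if self.transform is not None:
        img = self.transform(img)
    if self.target_transform is not None:
        target = self.target_transform(target)
    return img, target
```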
In OneFlow, oneflow.utils.data corresponds to torch.utils.data, and flowvision corresponds to torchvision. They are used in almost the same way. For example, the MNIST dataset in OneFlow can be used directly through flow.utils.vision.datasets.MNIST.
After dataset1 and dataset2 are defined, they are passed in to construct the dataloaders used for training and testing (train_loader and test_loader).
After that, the train/test loop simply iterates over the dataloader to obtain the data and label for each iteration:
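For example, the training half of that loop (with model, device, and optimizer assumed to be defined elsewhere) is roughly:

```python
import torch.nn.functional as F

def train(model, device, train_loader, optimizer):
    model.train()
    for data, target in train_loader:
        # each iteration of train_loader yields one batch of images and labels
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)
        loss.backward()
        optimizer.step()
```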
2. ImageNet
Here we again take the ImageNet training script from the official PyTorch examples as an example:
We can see that the general process is similar to the MNIST example above:
1. Construct the dataset. Here the dataset is constructed through datasets.ImageFolder, which is used to read and process image datasets stored in folders (an end-to-end sketch follows this list).
It can be seen that it inherits from DatasetFolder, and the main parameters during initialization are:
- root: image folder path
- transform: transform process for the PIL images read by the loader, such as the above-mentioned Resize, CenterCrop, etc.
- loader: an image loader used to load images according to their paths. By default, images are loaded with PIL.
The dataset's most important method, __getitem__, is implemented in DatasetFolder; just as in the MNIST example, it determines how to get the corresponding data according to the index.
2. For multi-machine distributed training, the Sampler needs to be the DistributedSampler class specially designed for distributed training (otherwise no special settings are needed and the default sampler is used). One detail worth noting: the training set and the test set apply different transforms to the data. The training set uses RandomResizedCrop and RandomHorizontalFlip, while the test set uses Resize and CenterCrop; in both cases the result is finally converted into a Tensor through ToTensor.
3. Construct the dataloaders (train_loader, val_loader) for training and evaluation. They can then be used directly in the train/eval loop, as in the sketch below:
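Putting the three steps together, a condensed sketch of the data pipeline in that example looks roughly like this (traindir, valdir, and the distributed flag are assumed to be defined elsewhere, and batch sizes are simplified):

```python
import torch
from torchvision import datasets, transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Step 1: construct the datasets; training and validation use different transforms
train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))
val_dataset = datasets.ImageFolder(
    valdir,
    transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ]))

# Step 2: only distributed training needs a special sampler
train_sampler = (torch.utils.data.distributed.DistributedSampler(train_dataset)
                 if distributed else None)

# Step 3: construct the dataloaders and iterate them in the train/eval loops
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=256, shuffle=(train_sampler is None),
    num_workers=4, pin_memory=True, sampler=train_sampler)
val_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=256, shuffle=False, num_workers=4, pin_memory=True)

for images, target in train_loader:
    ...  # forward/backward pass
```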
Dataloader Workflow
The following is the main process:
Dataset
Any custom dataset must inherit the Dataset class and implement the __getitem__ method, which defines how data is fetched for a given index. Optionally, the custom dataset can also override the __len__ method to report the size of the dataset.
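For example, a minimal custom map-style dataset could be sketched like this (MyDataset and its fields are illustrative, not taken from the original code):

```python
from oneflow.utils.data import Dataset  # in PyTorch: from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, samples, labels):
        self.samples = samples   # e.g. a list or array of inputs
        self.labels = labels     # e.g. a list or array of targets

    def __getitem__(self, index):
        # defines how one sample is fetched for a given index
        return self.samples[index], self.labels[index]

    def __len__(self):
        # optional, but most samplers rely on it to know the dataset size
        return len(self.samples)
```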
DataLoader
The DataLoader is the core of the entire data loading process.
In each iteration of the dataloader, the most important thing is to get the data and label through its __iter__ method. Inside __iter__, the corresponding DataLoaderIter instance is obtained through the _get_iterator method.
- In the single-process case, the DataLoaderIter instance is _SingleProcessDataLoaderIter;
- In the multi-process case, the DataLoaderIter instance is _MultiProcessingDataLoaderIter.
Both inherit from _BaseDataLoaderIter.
DataLoaderIter
DataLoaderIter is responsible for the processing of specific transactions in each iteration of dataloader.
At each iteration, the __next__ method of _BaseDataLoaderIter is called, which in turn calls the _next_data method implemented by the concrete subclass to get the data. Take _SingleProcessDataLoaderIter as an example:
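Its _next_data implementation is roughly the following, lightly simplified from the PyTorch source (OneFlow's version is aligned with it):

```python
def _next_data(self):
    index = self._next_index()                 # ask the Sampler for this iteration's indices
    data = self._dataset_fetcher.fetch(index)  # let the Fetcher slice and collate the data
    if self._pin_memory:
        # optionally copy the batch into pinned (page-locked) memory for faster GPU transfer
        data = _utils.pin_memory.pin_memory(data)
    return data
```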
index = self._next_index() gets the indices of the dataset for this iteration via the Sampler; self._dataset_fetcher.fetch(index) then fetches the corresponding data based on those indices via the Fetcher.
Fetcher
As the data collector, the Fetcher slices and collects data from the dataset based on the indices generated by the Sampler, packs them into a complete, ready-to-use batch, and returns it to the dataloader.
Similar to DataLoaderIter (_BaseDataLoaderIter), the Fetcher has a base class, _BaseDatasetFetcher, with different subclasses for different dataset types. Here we take the common _MapDatasetFetcher subclass, used for map-style datasets, as an example to see the main work of the Fetcher.
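Its fetch method is roughly the following (comments added; the auto_collation branch is taken when a batch of indices, rather than a single index, is passed in):

```python
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            # batch mode: possibly_batched_index is a list of sample indices
            data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            # single-sample mode: possibly_batched_index is one index
            data = self.dataset[possibly_batched_index]
        return self.collate_fn(data)
```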
It can be seen that the two key lines are:
data = [self.dataset[idx] for idx in possibly_batched_index]
return self.collate_fn(data)
- Based on the batch's index list, the corresponding data is sliced out of the dataset, producing a list of the fetched batch data;
- Based on the passed-in or custom collate_fn method, the batch data is collected, processed, and packaged into a Tensor that can be used directly for training/testing.
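As an illustration, a hand-written collate_fn for (image, label) samples could look like the following sketch; the default collate_fn already covers this common case, and my_collate_fn here is purely hypothetical (it assumes __getitem__ already returns image Tensors):

```python
import oneflow as flow

def my_collate_fn(batch):
    # batch is a list of (image, label) tuples returned by Dataset.__getitem__
    images = flow.stack([image for image, _ in batch], dim=0)  # (N, C, H, W) Tensor
    labels = flow.tensor([label for _, label in batch])        # (N,) Tensor
    return images, labels

# passed to the DataLoader via the collate_fn argument, e.g.:
# loader = flow.utils.data.DataLoader(dataset, batch_size=32, collate_fn=my_collate_fn)
```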
Multiprocessing Dataloader Principles and Workflows
Principles
In the common single-process case, the dataloader processes data iteration by iteration, synchronously. Because Python threads cannot truly execute in parallel, a single-process dataloader is usually slow.
In the multi-process case, we can start multiple worker processes through Python's multiprocessing module. For example, with four worker processes started, in theory four iterations' worth of data can be processed per unit of time, accelerating data processing and loading.
In the single-process case, since the data is processed iteration by iteration, processing of the next iteration has to wait until the current iteration completes.
The main difference between multi-process dataloader and single process dataloader is that under multi-process, we can start multiple worker processes through Python’s multiprocessing module to accelerate the process.
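Enabling it only takes the num_workers argument; for example (dataset stands for any dataset constructed as above, and the argument is identical for torch.utils.data.DataLoader):

```python
import oneflow as flow

# spawn 4 worker processes to load and preprocess data in parallel
train_loader = flow.utils.data.DataLoader(dataset, batch_size=64,
                                          shuffle=True, num_workers=4)
```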
Here we take a dataloader with four worker processes as an example:
The main thread of dataloader first sends the current iteration’s task to worker1, then sends the next iteration’s task to worker2, then to worker3, and worker4. This step is mainly completed in L1024-L1026 of dataloader.py.
After the indices have been sent out one after another, these four workers work in parallel, each finishing its own iteration's task. They then put the processed results into a Queue, from which the dataloader's main thread takes the data.
The specific workflow of each worker is actually similar to that of a single-process dataloader. The following describes how the multi-process dataloader differs from the single-process one, and how multiple workers cooperate.
Workflows
_MultiProcessingDataLoaderIter
_next_data()
⬇️
_get_data() ➡️ _try_get_data()
⬇️
_process_data() ➡️ _try_put_index()
Here each worker works independently. The main code is in the _worker_loop() method of oneflow/python/oneflow/utils/data/_utils/worker.py.
Each worker runs in its own worker loop. A worker starts working once r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL) obtains an index entry from its index_queue.
The idx, index = r and data = fetcher.fetch(index) part is no different from the workflow of the single-process dataloader described earlier.
Once the processed data has been fetched, it is put into the data_queue of the dataloader's main thread via data_queue.put((idx, data)), where it waits for the main thread to take the result out of the queue.
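Putting these pieces together, each worker's behavior can be pictured with the following heavily simplified sketch; the real _worker_loop() additionally handles random seeding, error propagation, and shutdown events, and MP_STATUS_CHECK_INTERVAL is the polling-interval constant from the _utils package:

```python
def simplified_worker_loop(fetcher, index_queue, data_queue):
    # Heavily simplified sketch of one worker process; not the real implementation.
    while True:
        r = index_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)  # block until the main thread sends indices
        if r is None:                  # sentinel from the main thread: shut this worker down
            break
        idx, index = r                 # batch id and the list of sample indices
        data = fetcher.fetch(index)    # same fetch/collate path as the single-process case
        data_queue.put((idx, data))    # hand the finished batch back to the main thread
```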
The above is the main workflow of multi-process dataloader.
Conclusion
By reviewing the workflows of the dataloader/dataset, we hope you can understand the working principle of dataloader/dataset under OneFlow’s and PyTorch’s dynamic graph mode.
Aligning with PyTorch's dataloader/dataset is only the first step; efficiency bottlenecks remain. Even with the multi-process DataLoader, calling C++ operators from Python to perform transforms or to decode images can still be a performance problem in some cases, resulting in low GPU utilization while waiting for CPU-side data processing. Therefore, more efficient solutions (such as DALI) need to be considered in the future.
We hope this article will help you in your deep learning projects😊. If you want to try out OneFlow, you can follow the methods described in this article. If you have any questions, comments💡, or suggestions for improvement, please feel free to leave them in the comments section below. In future articles, we'll introduce more details of OneFlow.
Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.
Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.