Jianfeng Wang
Mar 26, 2021


[Deep learning pipeline] Dataset management

As there is a lot of engineering work involved in deep learning projects, I will present some tips I have learned in practice for data/job/model management.

First of all, let’s see how datasets are managed. Dataset management here only refers to how we store our training/validation/test data.

Organize the data in large files

At the very beginning, we may keep the data as separate small files. Take the widely used ImageNet dataset as an example: it contains 1.2 million images, so the simple approach is to have a folder containing 1.2 million image files. This is feasible, but a large number of small files can cause issues:

  • Reading the data can be slow, as each file requires a separate file-open operation, which becomes costly when we perform a large number of such operations.
  • Copying the data can also be slow, as accessing each file triggers many I/O interrupts.
  • The disk is more likely to fail under the load of so many disk I/O interrupts.

Thus, a recommended approach is to pack multiple small files into one large file. One solution is to store each small file as one line in TSV format. In a TSV file, the data are organized line by line, and each line contains multiple columns separated by the tab character. A tip here is that no column value should contain a tab character. For images, for example, the data stored in the TSV file are the encoded byte strings, and we encode them with base64.b64encode to make sure there is no tab character. In this way, the 1.2 million images are organized as one file.
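As a concrete illustration, here is a minimal sketch of packing images into a TSV file. The three-column layout (image name, label, base64-encoded bytes) and the function name are my own assumptions for the example, not a fixed requirement.

```python
import base64
import os

def write_tsv(image_paths, labels, tsv_path):
    """Pack many small image files into a single TSV file, one image per line."""
    with open(tsv_path, "w") as fp:
        for path, label in zip(image_paths, labels):
            with open(path, "rb") as img:
                # base64 output never contains a tab character,
                # so it is safe to use as a TSV column value.
                encoded = base64.b64encode(img.read()).decode("ascii")
            fp.write(f"{os.path.basename(path)}\t{label}\t{encoded}\n")
```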

Support random seek for this large file

A common training strategy is to scan the data one by one in random order. Thus, we need to support random seeks into this large file. For the TSV format, one solution is to have a separate index file that contains the line offsets. That is, it stores the byte offset of each line in the TSV file, so that to access the i-th line we first look up the i-th offset in the index file and then seek to and read the i-th row of the TSV file. The index file can simply be in text mode, or in binary mode for fast access. A nice property of the binary mode is that we do not need to load all the offsets: we can seek directly to the i-th offset by computing its position within the binary index file.
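Below is a minimal sketch of this idea, assuming each offset is stored as an 8-byte little-endian integer so that the i-th offset sits at byte 8 * i of the index file; the file layout follows the three-column example above and the function names are hypothetical.

```python
import base64
import struct

def build_index(tsv_path, index_path):
    """Write the byte offset of every TSV line as an 8-byte integer."""
    with open(tsv_path, "rb") as tsv, open(index_path, "wb") as idx:
        offset = 0
        for line in tsv:
            idx.write(struct.pack("<q", offset))
            offset += len(line)

def read_row(tsv_path, index_path, i):
    """Seek directly to the i-th row without loading the whole index."""
    with open(index_path, "rb") as idx:
        idx.seek(8 * i)  # position of the i-th offset
        (offset,) = struct.unpack("<q", idx.read(8))
    with open(tsv_path, "rb") as tsv:
        tsv.seek(offset)
        name, label, encoded = tsv.readline().rstrip(b"\n").split(b"\t")
    return name, label, base64.b64decode(encoded)
```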

Separate a huge dataset into multiple large files

When the dataset is huge, for example at billion scale or beyond, it is better not to pack everything into one single huge file. Such a file could be terabytes in size, which causes its own problems. For example, when we want to run a quick test on a dev machine, a small portion of the data is usually enough; if everything is in one huge file, we have to copy all of the bits locally, whereas if the data is split into multiple smaller (but still fairly large) files, we can copy just one split for testing. The next question is how small or how large each split should be. Typically, I would suggest 10GB~500GB per file, depending on how good the hardware is: the better the hardware, the larger each split can be.

Since one dataset now consists of multiple large files, we need a mechanism to record how many files there are. The goal is to always give the training code the view of a single file. That is, from the training code’s perspective, there is still a single dataset, even though at the physical layer the data is stored in multiple large files. One approach is to keep two extra files: one contains the list of data file names, and the other contains a row index across those files. The reason for the second file is to make it easy to shuffle the data or sample a subset of it. The index file contains multiple rows, and each row contains two integers: the index of the data file, and the row index within that file.
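A minimal sketch of this single-dataset view is shown below. It reuses the read_row helper from the earlier sketch and assumes two hypothetical bookkeeping files: train.files (one data file name per line) and train.rowidx (one "file index<TAB>row index" pair per line).

```python
import os

def read_composite_row(dataset_dir, i, split="train"):
    """Present multiple TSV splits to the training code as one dataset."""
    with open(os.path.join(dataset_dir, f"{split}.files")) as fp:
        data_files = [line.strip() for line in fp]
    with open(os.path.join(dataset_dir, f"{split}.rowidx")) as fp:
        # in practice this index would itself be binary or cached;
        # kept as a plain text scan here for simplicity
        file_idx, row_idx = map(int, fp.readlines()[i].split("\t"))
    name = data_files[file_idx]
    return read_row(os.path.join(dataset_dir, name + ".tsv"),
                    os.path.join(dataset_dir, name + ".idx"),
                    row_idx)
```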

At this point, one may suggest using a database format such as LMDB or LevelDB. This is also a good solution, but it is usually harder to control how the data is physically organized when storing a huge dataset, e.g. terabytes. For such a dataset, we may want it split into multiple large files and want to control which split we access. For example, we may want different machines to access different large files due to some infrastructure constraint. We could still use, e.g., the LMDB format with multiple databases, one per machine, but then the code has to know how many databases there are, which introduces coupling between data management and model training. Ideally, the training code only sees one dataset and scans the data according to some sampling strategy, so we should reduce the coupling between how the data is stored and how training is done as much as possible. Another approach is an in-memory DB solution, which might be too heavy for engineers/researchers to start with. Thus, the suggestion here is to manage the data in a simple way so that it is easy to adapt to the environment we have.

Name the files by a pre-defined convention

Different datasets usually have the same kinds of partitions: a training set, a validation set, and a test set. The suggestion here is to name them consistently. For the training split, for example, we can always name it train.tsv if we use the TSV format; do not name it Train.tsv or train_abc.tsv. For the validation split, we can name it val.tsv. The reason is that if the names are consistent, we can easily run batch preprocessing on any dataset or on all datasets; otherwise, different datasets have to be handled slightly differently. Each dataset itself can be given any name, and each dataset corresponds to one folder.
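For instance, with consistent split names a single loop can batch-preprocess every dataset folder. This is a sketch under the assumptions of the earlier examples (one folder per dataset, splits named train.tsv/val.tsv/test.tsv, and the hypothetical build_index helper):

```python
import os

def build_all_indexes(data_root):
    """Build offset indexes for every split of every dataset folder."""
    for dataset in os.listdir(data_root):
        folder = os.path.join(data_root, dataset)
        for split in ("train", "val", "test"):
            tsv_path = os.path.join(folder, split + ".tsv")
            if os.path.isfile(tsv_path):
                # consistent naming means no per-dataset special cases
                build_index(tsv_path, os.path.join(folder, split + ".idx"))
```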

Back up the data every night

This becomes quite important the moment you lose some data; when that happens, you will thank yourself. One suggested tool for this routine work is crontab.
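For example, a single crontab entry can run an incremental copy every night. This is only a sketch; the source and destination paths are hypothetical, and you may prefer a different sync tool or schedule:

```
# edit with: crontab -e
# run an incremental backup at 2:00 AM every night
0 2 * * * rsync -a /data/datasets/ /backup/datasets/
```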
