An Overview of the Data-Loader Landscape: Discussion

4 Jun 2024


(1) Iason Ofeidis, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};

(2) Diego Kiedanski, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven {Equal contribution};

(3) Leandros Tassiulas, Department of Electrical Engineering, and Yale Institute for Network Science, Yale University, New Haven;

(4) Levon Ghukasyan, Activeloop, Mountain View, CA, USA.


In this work, we used running time as the main metric to compare performance across libraries. Several caveats apply. Firstly, we noticed that running times are quite variable and depend on background processes that are hard to control. At the same time, access to multi-GPU resources is expensive, which limits the number of experiments that can be run. Ideally, we would have run more than three repetitions of each experiment with more parameters (more workers, more batch sizes), but we did not have the resources for it. Since we are making all our code open source, we invite readers to run the benchmarks on their own hardware and report the results. Moreover, libraries are updated fairly often, and a change in version can dramatically increase or decrease a library's performance.

In light of the above points, we encourage the reader to internalize the qualitative aspects of this paper but to be aware that the numbers obtained here are prone to change.
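To make the point about run-to-run variability concrete, the following is a minimal sketch of how repeated timings could be collected and summarized. It is not the benchmark code used in this work: the function names `time_epoch` and `benchmark` and the `make_loader` argument are hypothetical, and the only assumption about a loader is that it is a Python iterable of batches.

```python
import statistics
import time
from typing import Callable, Iterable


def time_epoch(make_loader: Callable[[], Iterable]) -> float:
    """Return the wall-clock seconds needed to iterate once over a fresh loader."""
    loader = make_loader()  # hypothetical factory; build a new loader per run
    start = time.perf_counter()
    for _batch in loader:
        pass  # a real benchmark would run the training step here
    return time.perf_counter() - start


def benchmark(make_loader: Callable[[], Iterable], repetitions: int = 3) -> dict:
    """Time several repetitions and report their mean and sample standard deviation."""
    runs = [time_epoch(make_loader) for _ in range(repetitions)]
    return {
        "runs": runs,
        "mean": statistics.mean(runs),
        # stdev needs at least two data points
        "stdev": statistics.stdev(runs) if len(runs) > 1 else 0.0,
    }
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two libraries exceeds the noise introduced by background processes.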

Secondly, an aspect that is harder to compare is the ease of use of the libraries considered in this project. Most of the libraries included in this benchmark do not have comprehensive documentation and rely primarily on concrete examples. Consequently, implementing a pipeline in these libraries is not trivial and is prone to inefficiencies. One of the advantages of making our code open source is that any developer can identify inefficiencies in our code and improve on it. This is particularly relevant as we expect that the benchmarks created in this project could be used as boilerplate code by the community.

We note that no single library seems to be better than all others. Instead, each one has its own strengths. Consider the example of FFCV: it seems to be the fastest in our experiments, but its lack of support for label transformations prevents it from being adopted in projects that require such features.

We hope to analyze the interplay between filtering and training across multiple GPUs in future work. It would also be interesting to explore how these libraries scale as the number of GPUs increases. Similarly, it would be of great interest to benchmark data loading libraries on the shuffling step of the DL training workflow, as shuffling can have a significant impact on the total training time and its implementation is a non-trivial problem for which several approaches exist.
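As an illustration of one family of shuffling approaches, the sketch below shows buffer-based approximate shuffling, which is often used when samples arrive as a stream and a full permutation is not practical. It is an assumption-laden example rather than the implementation of any library in this benchmark; the name `buffered_shuffle` and its parameters are hypothetical.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield the items of `stream` in an approximately shuffled order.

    Only `buffer_size` items are held in memory at once, so an item can move
    earlier in the order by fewer than `buffer_size` positions.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random buffered item and replace it with the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

The buffer size trades memory for randomness: a buffer as large as the dataset gives a uniform shuffle, while a buffer of size one leaves the order unchanged.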

Our study of libraries that load data from remote storage, and the fact that they show results comparable to the local storage experiments, motivated us to explore the idea of formulating and designing a caching policy for data streaming over a network. In that setting, reducing the number of times a data point (e.g., an image) needs to be transferred can significantly shorten the overall training time (and possibly reduce costs if network usage is paid). The idea of caching a network dataset while training is not new (Mohan et al., 2020). Still, it is often assumed that the whole dataset can be cached when discussing training and streaming data. Furthermore, it is assumed that all the samples will be used once per epoch (Kumar & Sivathanu, 2020), as is traditionally the case. We are interested in exploring what happens when the cache size is small, and also in removing the requirement of using every data point once per epoch. Such a formulation should borrow from active learning, data summarization, and curriculum learning.
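To illustrate why the small-cache regime is worth studying, the following sketch counts how many network transfers occur when every sample is read once per epoch in shuffled order. The least-recently-used (LRU) eviction policy is our assumption for illustration only and is not a policy proposed in this work; the function name `count_transfers` and its parameters are hypothetical.

```python
import random
from collections import OrderedDict


def count_transfers(num_samples: int, cache_size: int, epochs: int, seed: int = 0) -> int:
    """Count remote fetches under an LRU cache holding at most `cache_size` samples.

    Each epoch visits every sample exactly once, in a fresh random order.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # sample id -> placeholder for the cached payload
    transfers = 0
    for _ in range(epochs):
        order = list(range(num_samples))
        rng.shuffle(order)
        for sample_id in order:
            if sample_id in cache:
                cache.move_to_end(sample_id)  # hit: mark as most recently used
                continue
            transfers += 1  # miss: the sample must be fetched over the network
            if cache_size > 0:
                cache[sample_id] = None
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used sample
    return transfers
```

When the cache holds the whole dataset, each sample is transferred once in total; with no cache, each sample is transferred once per epoch. Small caches fall between these extremes, which is where the choice of policy and of sampling order starts to matter.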

This paper is available on arXiv under a CC 4.0 license.