I’m in the process of transferring datasets from local storage to an HPC system. Each dataset comprises a large number of small files, totaling over 10 million individual files. Despite having ample storage space available (in terms of GBs) within the allocated project space, I’m encountering a ‘STORAGE LIMIT EXCEEDED’ error.
My understanding is that this error may be due to the file system reaching its limit on the maximum number of files it can accommodate. To address this, I attempted to create a compressed tar archive of one of the uploaded datasets, but the compression process is exceptionally slow.
I’m seeking advice on more efficient and expeditious methods to accomplish this data transfer task.
If you haven’t yet tried Zstandard compression, check out the link below, from Meta, showing its performance and compression ratio compared to other compression methods. Python bindings are available as zstd · PyPI; on Ubuntu, install the command-line tool with: sudo apt-get install zstd
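A minimal sketch of the two-step workflow (the `dataset` directory and file names are placeholders for your real data; this assumes GNU tar and the `zstd` CLI are installed):

```shell
# Install first (Ubuntu/Debian):  sudo apt-get install zstd

# Stand-in for a real directory of many small files
mkdir -p dataset && echo "example" > dataset/file1.txt

# Step 1: collect the files into an uncompressed tar (fast, no CPU cost)
tar -cf dataset.tar dataset

# Step 2: compress as a separate step; -T0 uses all available cores
zstd -T0 -f dataset.tar        # produces dataset.tar.zst

# One-step alternative with GNU tar:
#   tar -I 'zstd -T0' -cf dataset.tar.zst dataset
```

Multithreaded zstd (`-T0`) is usually much faster than single-threaded gzip at a comparable compression ratio.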
Smaller and faster data compression with Zstandard - Engineering at Meta
You have likely identified your issue correctly: the large number of files is exhausting your inode quota before you reach the size quota.
There are many factors that affect collecting large numbers of small files into tar or zip archives. File system performance limits how fast you can gather all these small files into a larger archive. Since space does not appear to be an issue, I would avoid high compression levels when creating the archive. tar does not compress by default, so I would defer compression to a separate job step rather than compressing inline with --use-compress-program or a piped gzip. If you were creating zip files, you could tune the compression level at creation time.
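To illustrate deferring compression and tuning the level (file names are placeholders; the gzip/zip levels shown are just the standard -1 fastest to -9 smallest scale):

```shell
# Demo files standing in for a real dataset
mkdir -p data && echo "sample" > data/sample.txt

# tar alone just concatenates files; no compression happens here
tar -cf archive.tar data

# Compress afterwards as a separate step; -1 = fastest, -9 = smallest
gzip -1 archive.tar            # produces archive.tar.gz

# For zip, the level is chosen the same way at creation time, e.g.:
#   zip -r -1 archive.zip data
```

Keeping archive creation and compression as separate job steps lets each be scheduled and retried independently.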
If your files are spread across multiple directories, you could split the tar creation across those directories and later create a tar of tars. Gathering smaller numbers of files should go faster, although depending on the file system and how the files are spread across physical storage devices, several tar jobs running simultaneously could produce the same poor file system performance. Again, this all depends on the file system your system uses and whether your files and directories are spread across multiple storage targets.
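A sketch of the tar-of-tars idea (the `data/dir1`, `data/dir2` layout is a placeholder; the loop is sequential here, but each iteration could be its own job, subject to the caveat above about simultaneous jobs):

```shell
# Placeholder layout: files grouped under per-directory subtrees
mkdir -p data/dir1 data/dir2
echo a > data/dir1/a.txt
echo b > data/dir2/b.txt

# One tar per subdirectory; gathering fewer files per archive goes faster
for d in data/*/; do
    name=$(basename "$d")
    tar -cf "$name.tar" -C data "$name"
done

# Bundle the per-directory archives into a single tar of tars
tar -cf all.tar ./*.tar
```

The glob in the last command expands before `all.tar` is created, so the outer archive does not try to include itself.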
A better approach might be to look at the source of your data, possibly restructuring the application so that it is generating fewer files that are larger in size.
As Jeff has said, it’s the number of files that is the issue. File systems have a minimum allocation size for a single file, so administrators usually set a quota on the number of files (inodes) in addition to the maximum disk space used.
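You can usually confirm this yourself by checking inode usage rather than byte usage (the project path and the Lustre command are assumptions; check your site’s documentation for the exact quota tool):

```shell
# Inode (file-count) usage, not byte usage, for the file system holding a path
df -i .

# Many HPC centers run Lustre; there "lfs quota" also reports file-count limits:
#   lfs quota -u $USER /project/space    # path is a placeholder
```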
For creating the archive, it’s the same problem you would have if you listed all those files. The only real way to make this somewhat faster is to subdivide the large set into smaller sets and archive those. It may still be slow, so this is also a good opportunity to review your data, reduce redundancy, and perhaps convert the large number of small files into larger datasets that serve the same purpose.
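One way to subdivide mechanically, without relying on a convenient directory layout, is to split the full file list into fixed-size chunks and archive each chunk (the `data` path and the chunk size of 100000 are placeholders to tune for your system; assumes GNU tar’s `-T`):

```shell
# Small stand-in dataset
mkdir -p data && touch data/f1 data/f2 data/f3

# List every file once, then split the list into fixed-size chunks
find data -type f > filelist.txt
split -l 100000 filelist.txt chunk_

# Archive each chunk; -T reads the member names from a file
for c in chunk_*; do
    tar -cf "$c.tar" -T "$c"
done
```

Each chunk archive is then an independent unit of work that can be created, verified, and transferred separately.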
There may be some ideas in Handling large collections of files - Alliance Doc and compression - How to compress a bunch of files efficiently? - Super User