Is there a way to process archived files without fully extracting them to disk?



For a particular type of analysis, I have a large set of *.tar.gz files. While they are relatively modest in size (100MB-1GB), they are full of roughly 1kB files. In testing, the overhead of extracting hundreds of thousands of files to disk is far more expensive than the processing I need to do on the data itself.

Is there a way of directly processing the data inside the tar file, without having to extract it first?


One approach you could take would be to extract the files to a temporary location in memory. Given the sizes you are working with, the available space on the tmpfs RAM disk mounted by default at /dev/shm should be enough, as long as you make sure to clean these files up after each set is analyzed, before extracting the next.


@jkingsley /dev/shm seems like a system-specific detail. Is there a more generic name for this?


/dev/shm has been a standard feature of Linux installs for at least a decade (I don’t have an exact introduction date, but I have seen references as early as 2006). Unless it was specifically removed for some reason, I would expect it on any modern system.


@jkingsley I think this covers some of that … (Linux kernel 2.6) … most common Linux distros do have it on by default, but it is an optional config.


If you use Python (or look for an equivalent in your language of choice), there is a standard-library module called tarfile that can do wonders:

  1. read a tarfile into memory
  2. either edit members in place and write them back to memory (then update the file), or write them out to a new archive.

For example, I just wrote up this little snippet to read a .tar.gz into memory, check permissions, and change them if necessary:

import tarfile
import tempfile
import stat
import os

tar_file = "input.tar.gz"
tar =, "r:gz")
members = tar.getmembers()

file_permission = stat.S_IRUSR | stat.S_IWUSR
folder_permission = stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR

# Let's pretend we want to edit, and write to a new tar
if len(members) > 0:
    fd, tmp_tar = tempfile.mkstemp(prefix="%s.fixed." % tar_file)
    fixed_tar =, "w:gz")

    # Then process members
    for member in members:

        # add u+rwx for directories (extractfile returns None for
        # directories, so no file object is passed)
        if member.isdir() and not member.issym():
            member.mode = folder_permission | member.mode

        # add u+rw for plain files
        elif member.isfile() and not member.issym():
            member.mode = file_permission | member.mode
            extracted = tar.extractfile(member)
            fixed_tar.addfile(member, extracted)

    # Close both archives before replacing the original

    # Rename the fixed tar to be the old name
    os.rename(tmp_tar, tar_file)

That example is adapted from the original Singularity source code, and there are other examples too.

If you have a specific need or example I’d be happy to help! We can also try outside of Python.


Rather than extracting the files, you could consider doing your analysis in a language that supports directly manipulating tar files. For example, Python has a tarfile module with a streaming mode ("r|gz"). This lets you go through and process your files without them ever reaching a disk.