Ask.Cyberinfrastructure

How can I improve the performance of a job that performs many I/O operations on a very large text file?

scheduler
programming-for-hpc

#1

I am working on a large Linux cluster. All our data is stored in a project space. My input (text) file is very large, and the values that I need to read are located in 3 large groups. I need to read a single value from each group and perform some analysis, so my program performs many "seek"-type operations to read the data that needs to be processed. I benchmarked my application, and it looks like this seek/read pattern is the bottleneck in my code. How can I improve the performance of this job?
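For concreteness, here is a minimal Python sketch of the access pattern; the file name, byte offsets, and analyze() are hypothetical placeholders:

```python
# Hypothetical sketch of the current seek-heavy pattern.
def analyze(v1, v2, v3):
    pass  # stands in for the real analysis

# hypothetical byte offsets of the values inside each of the 3 groups
group1_offsets = [0, 100, 200]
group2_offsets = [10_000, 10_100, 10_200]
group3_offsets = [20_000, 20_100, 20_200]

with open("input.dat") as f:
    for off1, off2, off3 in zip(group1_offsets, group2_offsets, group3_offsets):
        f.seek(off1)            # jump back to group 1 ...
        v1 = f.readline()
        f.seek(off2)            # ... then forward to group 2 ...
        v2 = f.readline()
        f.seek(off3)            # ... then to group 3: three seeks per value
        v3 = f.readline()
        analyze(v1, v2, v3)
```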

CURATOR: Katia


#2

The best approach in this case is to open the file 3 times and keep 3 file handles. Values from the first group are read through the first handle, values from the second group through the second handle, and so on.
This way each handle advances sequentially through its own group, so you avoid the repeated seeks and all reads become sequential.
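A minimal Python sketch of this approach, assuming the three groups are contiguous in the file and values within a group are consumed in order (the file name, start offsets, count, and analyze() are placeholders):

```python
# Three handles into the same file, one per group.
def analyze(v1, v2, v3):
    pass  # stands in for the real analysis

GROUP1_START, GROUP2_START, GROUP3_START = 0, 10_000, 20_000  # hypothetical
N_VALUES = 100                                                # hypothetical

f1, f2, f3 = open("input.dat"), open("input.dat"), open("input.dat")
f1.seek(GROUP1_START)   # one seek per handle, up front
f2.seek(GROUP2_START)
f3.seek(GROUP3_START)

for _ in range(N_VALUES):
    v1 = f1.readline()  # each handle then reads sequentially through
    v2 = f2.readline()  # its own group, so OS read-ahead caching
    v3 = f3.readline()  # works in your favor
    analyze(v1, v2, v3)

for f in (f1, f2, f3):
    f.close()
```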
Depending on the size of your file, you can also try first copying the input file to the scratch directory local to the compute node where your job executes; this may improve the I/O speed, since node-local reads avoid the shared project filesystem.
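A sketch of staging the file, assuming the scheduler exposes node-local scratch via $TMPDIR (the exact variable name and paths are site-specific placeholders):

```python
# Stage the input to node-local scratch before reading it.
import os
import shutil

src = "/project/myproject/input.dat"         # hypothetical project-space path
scratch = os.environ.get("TMPDIR", "/tmp")   # node-local scratch, site-dependent
local_copy = os.path.join(scratch, "input.dat")

shutil.copy(src, local_copy)  # one large sequential copy over the network
# then open local_copy (e.g., with the three handles above) for fast local reads
```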