With MPI jobs using an Isilon storage system for scratch, are there I/O issues that can be avoided?



When running a storage-intensive MPI job that uses an Isilon storage system for scratch (working storage), are there any I/O bottlenecks that can be avoided?

Are there any clear ways of identifying them when they happen?

CURATOR: jpessin1


That depends on how the MPI job uses storage. If the job does parallel I/O to a single shared file, you are out of luck. As far as I know, Isilon’s NFS implementation has no support for parallel I/O (I believe it is stuck at NFS 4.0), so the job's I/O performance will be determined by whatever method the application falls back to.

If the MPI job has a serial step in which a single process collects the data and writes it out, then it is effectively the same as any other job where you want to optimize single-node I/O. Making sure the MPI job is the only job running on that node is probably enough to let the node saturate its connection to the Isilon cluster, which probably has a theoretical peak of about 1 GB/s.
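To see how close a single node gets to that ceiling, a rough sequential-write probe can help. The sketch below is only an illustration: the function name `measure_write_throughput`, the default sizes, and the choice to include an `fsync` in the timing are assumptions, not anything specific to Isilon.

```python
import os
import time


def measure_write_throughput(path, total_bytes=256 * 1024 * 1024,
                             block_size=4 * 1024 * 1024):
    """Write total_bytes of zeros to path in large blocks; return MB/s.

    Hypothetical helper: sizes are illustrative defaults, not tuned values.
    """
    block = bytes(block_size)  # a reusable buffer of zero bytes
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            # The last chunk may be shorter than block_size.
            chunk = block[: min(block_size, total_bytes - written)]
            f.write(chunk)
            written += len(chunk)
        # Flush Python's buffer, then ask the OS to push the data out,
        # so the timing is not just measuring the local page cache.
        f.flush()
        os.fsync(f.fileno())
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / elapsed / 1e6
```

Calling it with a path on the scratch mount (for example a file under /path/to/isilon) and comparing the result against the link speed gives a first indication of whether the node, the network, or the application is the limit.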

If all of the MPI processes read and write, but each process writes to its own file, then you will probably get decent load balancing for free, provided you have enough nodes in the Isilon cluster and are using SmartConnect.
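A minimal sketch of the file-per-process pattern follows. It assumes the rank can be read from an environment variable set by the launcher (`OMPI_COMM_WORLD_RANK` for Open MPI, `PMI_RANK` for MPICH-style launchers, `SLURM_PROCID` under Slurm); a real job would more likely take the rank from its MPI library. The helper names and the file naming scheme are hypothetical.

```python
import os

# Environment variables commonly set by MPI launchers and Slurm.
RANK_ENV_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID")


def detect_rank(environ=os.environ):
    """Return this process's rank from the environment, or 0 if none is set."""
    for name in RANK_ENV_VARS:
        value = environ.get(name)
        if value is not None:
            return int(value)
    return 0


def rank_output_path(scratch_dir, rank, prefix="out"):
    """Build a distinct file name per rank (naming scheme is an assumption)."""
    return os.path.join(scratch_dir, f"{prefix}.{rank:05d}.dat")


def write_rank_data(scratch_dir, rank, blocks):
    """Write an iterable of bytes objects to this rank's own file.

    The file is opened once with a large buffer so data reaches the
    filesystem in big writes rather than many small ones.
    """
    path = rank_output_path(scratch_dir, rank)
    with open(path, "wb", buffering=8 * 1024 * 1024) as f:
        for block in blocks:
            f.write(block)
    return path
```

Each rank would call `write_rank_data(scratch_dir, detect_rank(), blocks)`, so no two processes ever contend for the same file.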

To look for I/O bottlenecks, I typically check the node(s) doing the I/O and watch them with dstat, iostat or iotop. If throughput is lower than I would expect (compared to what a dd if=/dev/zero of=/path/to/isilon achieves), the next step is to strace the application, or otherwise inspect it, to see how it is writing. One really common mistake I see is a script with a loop that opens a file, writes a single line, and closes the file again on every iteration. Isilon's slow metadata operations really show up with that kind of access. In general, avoiding metadata operations and writing in large blocks seems to help on Isilon.
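The open/write/close mistake described above, and one way to avoid it, can be sketched as follows. Both function names and the 4 MiB buffer size are illustrative assumptions; the point is only that the second version opens the file once and lets a large buffer turn many small writes into a few big ones.

```python
def write_lines_slow(path, lines):
    """Anti-pattern: one open, one small write, and one close per line.

    Every iteration costs extra metadata operations on the filesystem.
    """
    for line in lines:
        with open(path, "a") as f:
            f.write(line + "\n")


def write_lines_batched(path, lines, buffer_bytes=4 * 1024 * 1024):
    """Open the file once and write through a large buffer.

    Note: this truncates any existing file, unlike the append-based
    version above.
    """
    with open(path, "w", buffering=buffer_bytes) as f:
        for line in lines:
            f.write(line + "\n")
```

Running either version under strace makes the difference visible: the first shows an open and close around every write call, while the second shows a single open followed by a small number of large writes.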