We recently had our entire BeeGFS file system become corrupted after a sudden blow-up of IOPS. The error logs included, for example:
[root@oss1 ~]# grep "Apr 5" /var/log/messages | grep "Too many open files" | wc -l
17466
[root@oss2 ~]# grep "Apr 5" /var/log/messages | grep "Too many open files" | wc -l
14838
[root@oss3 ~]# grep "Apr 5" /var/log/messages | grep "Too many open files" | wc -l
17826
On RHEL 9-based systems (we’re running Rocky Linux 9), the default is:
[root@oss1 ~]# sysctl -a | grep fs.file-max
fs.file-max = 9223372036854775807
(i.e., roughly 9.2 quintillion files)
This is confirmed here: What is the default value and the max value range for fs.file-max in Red Hat Enterprise Linux? - Red Hat Customer Portal (requires a Red Hat account to view)
This has led to internal discussions about what a good value is for parallel file systems. We’re almost finished repairing our BeeGFS file system, and we are thinking of setting fs.file-max to 1 million as a starting point, since we have a relatively small cluster, file system (~1 PB), and userbase. What have other HPC system administrators used for their parallel file systems?
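For reference, this is roughly how we’d apply and verify the limit persistently; the drop-in file name and the 1000000 value are just our own choices, not anything mandated by BeeGFS:

```shell
# Check current file-handle usage vs. the cap:
# first field = allocated handles, third field = current fs.file-max
cat /proc/sys/fs/file-nr

# Persist the new system-wide cap in a sysctl drop-in
# (file name is arbitrary; 1000000 is our proposed starting point)
echo "fs.file-max = 1000000" > /etc/sysctl.d/90-file-max.conf

# Reload all sysctl settings and confirm the new value
sysctl --system
sysctl fs.file-max
```

One thing we’re also double-checking: “Too many open files” can come from the per-process nofile limit (ulimit -n / fs.nr_open) rather than the system-wide fs.file-max, so the per-process limits on the OSS daemons are worth auditing alongside this.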