Can you share some advice on troubleshooting issues related to running complex simulations in a distributed environment like the ACCESS HPC systems?

For troubleshooting complex simulations on ACCESS HPC systems, I suggest:

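  1. Understand your error messages: The scheduler output, the job's stderr, and any application logs usually give a good starting point. In a multi-process job, the first error reported is often the real cause, and later ones tend to be follow-on failures.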
  2. Isolate and reproduce the problem: Try to simplify the simulation or create a minimal example that reproduces the error.
  3. Use debugging tools: gdb helps inspect crashes and core dumps, Valgrind detects memory leaks and invalid memory accesses, and a parallel debugger such as Linaro DDT (formerly Allinea DDT) can help with MPI and threading issues.
  4. Review HPC-specific documentation: Resources like the ACCESS user guide and the HPC Carpentry lessons may provide advice tailored to your specific issue.
  5. Seek assistance: Ask on forums like Stack Overflow or contact ACCESS support, and include your error logs, relevant code snippets, and what you’ve tried so far.
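
As a rough illustration of step 1, here is a minimal Python sketch that scans a job's output for a few common failure signatures and reports where each first appears. The pattern list and the `classify_log` helper are my own assumptions for illustration, loosely modeled on Slurm-style messages; the exact wording on your system may differ, so adjust the patterns to what your scheduler and MPI library actually print.

```python
import re
from collections import Counter

# Hypothetical signatures, not an official list: tune these to the
# messages your scheduler, MPI library, and application really emit.
SIGNATURES = {
    "out_of_memory": re.compile(r"oom[-_ ]kill|out of memory|bad_alloc", re.IGNORECASE),
    "segfault": re.compile(r"segmentation fault|SIGSEGV", re.IGNORECASE),
    "time_limit": re.compile(r"due to time limit", re.IGNORECASE),
    "mpi_abort": re.compile(r"MPI_ABORT", re.IGNORECASE),
}


def classify_log(lines):
    """Count matches per signature and record the first matching line.

    Returns (counts, first_hit), where first_hit maps a signature name
    to a (line_number, stripped_line) tuple for its earliest match.
    """
    counts = Counter()
    first_hit = {}
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
                # Keep only the earliest occurrence of each signature.
                first_hit.setdefault(name, (lineno, line.strip()))
    return counts, first_hit


if __name__ == "__main__":
    import sys

    # Usage: python classify_log.py <job output file>
    with open(sys.argv[1], errors="replace") as handle:
        counts, first_hit = classify_log(handle)
    # Print signatures in order of first appearance in the log.
    for name, (lineno, text) in sorted(first_hit.items(), key=lambda kv: kv[1][0]):
        print(f"{name}: {counts[name]} match(es), first at line {lineno}: {text}")
```

Sorting by first appearance is deliberate: in a distributed run, the earliest signature is usually the one worth investigating first, and the summary it prints is also handy to paste into a support request.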