What tools or methods would you recommend for measuring both inter-node GPU communication and intra-node GPU communication in a high-performance computing or distributed computing environment?
The OSU Micro-Benchmarks suite from Ohio State University is quite popular. It includes benchmarks for several communication libraries, such as MPI and NCCL.
NVIDIA also provides GPU operational and performance health monitoring through its DCGM (Data Center GPU Manager) tooling:
GPU Monitoring Tools to Maximize Application Performance and System Utilization - OSC Workshop Series with NVIDIA
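For a rough idea of what a bandwidth micro-benchmark measures, here is a minimal Python sketch of the usual pattern: a few warm-up calls, then a timed loop, then bytes moved divided by elapsed time. The name `measure_bandwidth` and its parameters are my own illustration and not part of the OSU suite or any library. The demo times a plain host memory copy as a stand-in; for real GPU measurements you would pass a callable that performs the MPI or NCCL transfer and waits for it to complete.

```python
import time


def measure_bandwidth(transfer, nbytes, iters=20, warmup=5):
    """Return the average bandwidth in GB/s of `transfer`.

    `transfer` is a hypothetical zero-argument callable that moves
    `nbytes` bytes per call and returns only once the data has arrived.
    """
    if iters <= 0:
        raise ValueError("iters must be positive")

    # Warm-up calls are excluded from timing (caches, connection setup).
    for _ in range(warmup):
        transfer()

    start = time.perf_counter()
    for _ in range(iters):
        transfer()
    elapsed = time.perf_counter() - start

    return (nbytes * iters) / elapsed / 1e9


if __name__ == "__main__":
    # Stand-in transfer: a host-to-host copy, NOT a GPU transfer.
    nbytes = 1 << 20
    src = bytearray(nbytes)
    dst = bytearray(nbytes)

    def host_copy():
        dst[:] = src

    print(f"{nbytes} bytes: {measure_bandwidth(host_copy, nbytes):.2f} GB/s")
```

Running the same loop over a range of message sizes, once between GPUs on one node and once between GPUs on different nodes, gives the intra-node versus inter-node comparison.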
Tools such as:
- AMD CodeXL: for AMD GPUs
- ComScribe
- NVIDIA Nsight Systems
Methods include network monitoring and operating-system performance counters.
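As one example of the operating-system counter approach, here is a small Python sketch that derives per-interface throughput from two snapshots of Linux's `/proc/net/dev`. The function names are hypothetical, and it assumes a Linux host. Note that RDMA traffic (e.g. over InfiniBand) bypasses the kernel network stack, so it would not appear in these counters.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev-style text into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    # The first two lines of the file are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput(before, after, seconds):
    """Return {iface: (rx_bytes_per_s, tx_bytes_per_s)} between snapshots."""
    rates = {}
    for iface, (rx_after, tx_after) in after.items():
        if iface not in before:
            continue
        rx_before, tx_before = before[iface]
        rates[iface] = (
            (rx_after - rx_before) / seconds,
            (tx_after - tx_before) / seconds,
        )
    return rates


def read_net_dev(path="/proc/net/dev"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_net_dev(f.read())
```

To use it, call `read_net_dev()` before and after the communication phase you care about, then pass both snapshots and the elapsed seconds to `throughput`.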