Inter-node and intra-node GPU communication

What tools or methods would you recommend for measuring both inter-node GPU communication and intra-node GPU communication in a high-performance computing or distributed computing environment?

The OSU Micro-Benchmarks, developed at Ohio State University, are a popular choice. They include benchmarks for several communication libraries and programming models, such as MPI and NCCL.
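To make the numbers from a point-to-point benchmark easier to interpret, here is a minimal sketch of the arithmetic behind a generic ping-pong test. It is not the OSU code; the function name `pingpong_stats` and its return keys are made up for illustration, and it assumes you have already collected round-trip times (in seconds) for one message size.

```python
from statistics import median


def pingpong_stats(nbytes, round_trip_seconds):
    """Summarize ping-pong timings for a single message size.

    nbytes: size of the message sent in each direction (hypothetical input).
    round_trip_seconds: list of measured round-trip times, in seconds.
    """
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    if not round_trip_seconds:
        raise ValueError("need at least one timing sample")

    # The median is less sensitive to outliers (e.g. warm-up iterations).
    rtt = median(round_trip_seconds)

    # A round trip covers the link twice, so one-way latency is half of it.
    one_way = rtt / 2.0

    return {
        "latency_us": one_way * 1e6,               # microseconds
        "bandwidth_GBps": nbytes / one_way / 1e9,  # decimal gigabytes per second
    }
```

Running the same measurement once between two GPUs on the same node and once between GPUs on different nodes gives you directly comparable intra-node and inter-node figures, e.g. `pingpong_stats(1 << 20, timings)` for 1 MiB messages.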

NVIDIA also provides GPU operational and performance health monitoring through its DCGM (Data Center GPU Manager) tooling:
GPU Monitoring Tools to Maximize Application Performance and System Utilization - OSC Workshop Series with NVIDIA


Tools such as:

  1. AMD CodeXL (for AMD GPUs)
  2. ComScribe
  3. NVIDIA Nsight Systems

Methods include network monitoring and operating system performance counters.
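As a rough illustration of the network monitoring approach, the sketch below reads the Linux per-interface byte counters from `/proc/net/dev` and turns two snapshots into a throughput estimate. It assumes a Linux node; the function names are made up for illustration. Note that traffic which bypasses the kernel network stack (RDMA, for example) will typically not show up in these counters, so treat this as a partial view of inter-node traffic only.

```python
import time


def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput(before, after, seconds):
    """Return {interface: (rx_bytes_per_s, tx_bytes_per_s)} between snapshots."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = ((rx1 - rx0) / seconds, (tx1 - tx0) / seconds)
    return rates


def sample(interval=1.0, path="/proc/net/dev"):
    """Take two snapshots `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = parse_proc_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_net_dev(f.read())
    return throughput(before, after, interval)
```

Calling `sample()` on each node while the job is running gives a per-interface view of how much data is crossing the network during communication phases.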
