Greetings to all,
I’ve created two virtual machines on VMware ESXi 8.0 with PVRDMA for RoCEv2 and successfully tested RoCE between them using ib_send_bw and ib_write_bw, achieving up to 100 Gbit/s. Now I’d like to test RoCE between a virtual machine and a physical Nvidia DGX1 machine with four ConnectX-4 cards.
On the DGX1, I plan to use the four single-port Mellanox HCAs as separate interfaces, each carrying the same untagged VLAN 1 for RoCE. Bonding will be configured only for TCP traffic on VLAN 500.
However, I’m unable to get RoCE testing to work. I can ping the DGX from the VM, but ib_write_bw fails and rping does not work either.
My questions are: Is it possible to run RoCE tests between a virtual machine with a PVRDMA-based adapter (rocep2s3f1) and a physical Nvidia DGX1 machine with mlx5_0-3 adapters? And how do I correctly set up PFC on the virtual machine and the DGX1?
Output of the ibdev2netdev command on the DGX1:
mlx5_0 port 1 ==> bond1 (Up)
mlx5_1 port 1 ==> bond0 (Up)
mlx5_2 port 1 ==> bond1 (Up)
mlx5_3 port 1 ==> bond0 (Up)
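Note that in this output all four mlx5 devices map to bond interfaces rather than to separate netdevs. In case it helps anyone check the same thing on their side, here is a minimal Python sketch that parses the output above and lists which RDMA devices sit on a bond. The function names are my own, and the assumption that bond interfaces are named bond<N> is only for illustration.

```python
import re

# Sample text copied from the ibdev2netdev output in this post.
SAMPLE = """\
mlx5_0 port 1 ==> bond1 (Up)
mlx5_1 port 1 ==> bond0 (Up)
mlx5_2 port 1 ==> bond1 (Up)
mlx5_3 port 1 ==> bond0 (Up)
"""

# Matches lines of the form: <rdma_dev> port <n> ==> <netdev> (<state>)
LINE_RE = re.compile(r"^(\S+) port (\d+) ==> (\S+) \((\w+)\)$")


def parse_ibdev2netdev(text):
    """Return one dict per recognised line; unrecognised lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            dev, port, netdev, state = match.groups()
            rows.append(
                {"dev": dev, "port": int(port), "netdev": netdev, "state": state}
            )
    return rows


def devices_on_bonds(rows):
    """List RDMA devices whose netdev name starts with 'bond' (assumed naming)."""
    return [row["dev"] for row in rows if row["netdev"].startswith("bond")]


if __name__ == "__main__":
    rows = parse_ibdev2netdev(SAMPLE)
    for row in rows:
        print(row["dev"], "port", row["port"], "->", row["netdev"], row["state"])
    print("RDMA devices attached to a bond:", devices_on_bonds(rows))
```

On a live system the same functions could be fed the real command output instead of the pasted sample.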
My setup consists of two Dell PowerEdge R6525 ESXi hosts with two single-port Mellanox ConnectX-6 NICs, connected to Cumulus OS-based Mellanox Spectrum SN3700 switches in an MLAG configuration with a peer link. RoCE is enabled in lossless mode, and the PFC priority is set to 3.
I appreciate any insights or suggestions that you can provide, and I look forward to hearing from the experts here. Thank you in advance for your help!
Best regards,
Shakhizat