Introduction to RDMA Networking

RDMA (Remote Direct Memory Access) was designed to eliminate the server-side data-processing latency in network transfers. RDMA moves data across the network directly into a machine's memory, transferring it from one system into remote system memory without involving either operating system, so it consumes very little host CPU. By eliminating the overhead of external memory copies and context switches, it frees memory bandwidth and CPU cycles to improve application performance.

Driver Download

https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed

To verify that the driver installed successfully:

# ibdev2netdev
mlx5_0 port 1 ==> ens2f0 (Up)    # a cable is plugged into this port
mlx5_1 port 1 ==> ens2f1 (Down)  # no cable is plugged into this port

Output like the above from ibdev2netdev indicates that the NIC driver installed successfully.

Installing and Removing the Open-Source Driver on CentOS 7

Working with RDMA in RedHat/CentOS 7.*

  • Install:
    yum groupinfo "Infiniband Support"
    yum groupinstall "Infiniband Support"
    yum --setopt=group_package_types=optional groupinstall "Infiniband Support"
  • Remove:
    yum -y groupremove "Infiniband Support"
  • Start and enable the RDMA service:
    systemctl start rdma
    systemctl enable rdma

Throughput Testing

Write Throughput

Installing the RDMA driver also installs a set of RDMA test tools; write throughput can be measured with ib_write_bw.

Server A (server):

ib_write_bw -a -d mlx5_0

Server B (client):

ib_write_bw -a -d mlx5_0 192.168.2.1    # 192.168.2.1 is server A's IP

Read Throughput

Read throughput is tested the same way as write throughput, only with ib_read_bw instead, as in the sketch below.
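
A minimal sketch, reusing the device name and server IP from the write test above:

# server A
ib_read_bw -a -d mlx5_0

# server B (client); 192.168.2.1 is server A's IP
ib_read_bw -a -d mlx5_0 192.168.2.1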

Latency Testing

Latency is likewise tested for both read and write, using ib_read_lat and ib_write_lat; usage mirrors the bandwidth tests (see the sketch after the link below).

  • Performance Tuning for Mellanox Adapters
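
A minimal write-latency sketch under the same assumptions (device name, server IP) as the bandwidth tests; ib_read_lat works the same way:

# server A
ib_write_lat -a -d mlx5_0

# server B (client)
ib_write_lat -a -d mlx5_0 192.168.2.1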

Bandwidth Statistics

When using RDMA, the application can collect its own send and receive statistics, so the amount of data the program transfers is known precisely.
To see the total bandwidth the RDMA NIC itself is currently sending and receiving, read the corresponding nodes under sysfs.

  • Bytes transmitted: /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
  • Bytes received: /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data

The values of port_xmit_data and port_rcv_data are 1/4 of the actual byte counts, so the real bandwidth is 4x what they report; the division by 4 is presumably there to stave off counter overflow. A sampling sketch follows the kernel excerpt below.

port_xmit_data: (RO) Total number of data octets, divided by 4 (lanes), transmitted on all VLs. This is 64 bit counter
port_rcv_data: (RO) Total number of data octets, divided by 4 (lanes), received on all VLs. This is 64 bit counter.

Source: Documentation/ABI/stable/sysfs-class-infiniband

pma_cnt_ext->port_xmit_data =
        cpu_to_be64(MLX5_SUM_CNT(out, transmitted_ib_unicast.octets,
                                 transmitted_ib_multicast.octets) >> 2);
pma_cnt_ext->port_rcv_data =
        cpu_to_be64(MLX5_SUM_CNT(out, received_ib_unicast.octets,
                                 received_ib_multicast.octets) >> 2);

file: drivers/infiniband/hw/mlx5/mad.c
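
As a rough sketch of turning these counters into a bandwidth figure: sample port_xmit_data twice, multiply the delta by 4 to undo the lane division, and divide by the interval. Device mlx5_0 and port 1 are assumptions carried over from the examples above.

#!/bin/bash
# Approximate TX bandwidth of mlx5_0 port 1 over a one-second window.
CNT=/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
a=$(cat "$CNT"); sleep 1; b=$(cat "$CNT")
# The counter stores bytes/4, so multiply the delta by 4; then convert to Mbit/s.
echo "TX: $(( (b - a) * 4 * 8 / 1000000 )) Mbit/s"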

Network Connectivity Testing

Since the NIC in question only supports Ethernet mode, connectivity can only be ping-tested with ibv_rc_pingpong.

  • https://community.mellanox.com/s/article/RoCE-Debug-Flow-for-Linux

Server

# ibdev2netdev
mlx4_0 port 1 ==> enp1s0 (Down)
mlx5_0 port 1 ==> ens2f0 (Up)
mlx5_1 port 1 ==> ens2f1 (Up)
# ibv_rc_pingpong -d mlx5_0 -g 0
local address: LID 0x0000, QPN 0x00011a, PSN 0xd775ee, GID fe80::e42:a1ff:fe41:2d36
remote address: LID 0x0000, QPN 0x0009df, PSN 0xa7f02f, GID fe80::1e34:daff:fe79:c0d
8192000 bytes in 0.01 seconds = 5126.01 Mbit/sec
1000 iters in 0.01 seconds = 12.78 usec/iter

Client

# ibdev2netdev
mlx5_0 port 1 ==> p5p1 (Down)
mlx5_1 port 1 ==> p5p2 (Up)
mlx5_2 port 1 ==> p4p1 (Down)
mlx5_3 port 1 ==> p4p2 (Down)
# ibv_rc_pingpong -d mlx5_1 -g 0 192.168.2.4
local address: LID 0x0000, QPN 0x0009df, PSN 0xa7f02f, GID fe80::1e34:daff:fe79:c0d
remote address: LID 0x0000, QPN 0x00011a, PSN 0xd775ee, GID fe80::e42:a1ff:fe41:2d36
8192000 bytes in 0.01 seconds = 5376.21 Mbit/sec
1000 iters in 0.01 seconds = 12.19 usec/iter

ibping

Tests network connectivity when the port runs in IB mode; a usage sketch follows.
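
A minimal usage sketch, assuming the ports are in IB mode; the LID below is hypothetical (read the real one from ibstat on the server):

# server: start the ibping responder
ibping -S

# client: ping the server's port by its LID (LID 3 here is hypothetical)
ibping 3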

mlx5 Counters and Status Parameters

These can be viewed under /sys/class/infiniband/ in the sysfs filesystem.

  • Understanding mlx5 Linux Counters and Status Parameters
  • InfiniBand Port Counters

Linux kernel documentation: https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#abi-file-stable-sysfs-class-infiniband

counters

# ls -lsh /sys/class/infiniband/mlx5_0/ports/1/counters/
total 0
0 -r--r--r-- 1 root root 4.0K May 24 15:28 excessive_buffer_overrun_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 link_downed
0 -r--r--r-- 1 root root 4.0K May 24 15:28 link_error_recovery
0 -r--r--r-- 1 root root 4.0K May 24 15:28 local_link_integrity_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 multicast_rcv_packets
0 -r--r--r-- 1 root root 4.0K May 24 15:28 multicast_xmit_packets
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_rcv_constraint_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_rcv_data
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_rcv_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_rcv_packets
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_rcv_remote_physical_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_rcv_switch_relay_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_xmit_constraint_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_xmit_data
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_xmit_discards
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_xmit_packets
0 -r--r--r-- 1 root root 4.0K May 24 15:28 port_xmit_wait
0 -r--r--r-- 1 root root 4.0K May 24 15:28 symbol_error
0 -r--r--r-- 1 root root 4.0K May 24 15:28 unicast_rcv_packets
0 -r--r--r-- 1 root root 4.0K May 24 15:28 unicast_xmit_packets
0 -r--r--r-- 1 root root 4.0K May 24 15:28 VL15_dropped

Counter Description:

  • port_rcv_data: Total number of data octets, divided by 4 (counting in 32-bit double words), received on all VLs from the port. (Spec: PortRcvData; Informative)
  • port_rcv_packets: Total number of packets received (may include packets containing errors). This is a 64-bit counter. (Spec: PortRcvPkts; Informative)
  • port_multicast_rcv_packets: Total number of multicast packets received, including multicast packets containing errors. (Spec: PortMultiCastRcvPkts; Informative)
  • port_unicast_rcv_packets: Total number of unicast packets received, including unicast packets containing errors. (Spec: PortUnicastRcvPkts; Informative)
  • port_xmit_data: Total number of data octets, divided by 4 (counting in 32-bit double words), transmitted on all VLs from the port. (Spec: PortXmitData; Informative)
  • port_xmit_packets / port_xmit_packets_64: Total number of packets transmitted on all VLs from this port, possibly including packets with errors. This is a 64-bit counter. (Spec: PortXmitPkts; Informative)
  • port_rcv_switch_relay_errors: Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. (Spec: PortRcvSwitchRelayErrors; Error)
  • port_rcv_errors: Total number of packets containing an error that were received on the port. (Spec: PortRcvErrors; Informative)
  • port_rcv_constraint_errors: Total number of packets received on the switch physical port that were discarded. (Spec: PortRcvConstraintErrors; Error)
  • local_link_integrity_errors: Number of times the count of local physical errors exceeded the threshold specified by LocalPhyErrors. (Spec: LocalLinkIntegrityErrors; Error)
  • port_xmit_wait: Number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or lack of arbitration). (Spec: PortXmitWait; Informative)
  • port_multicast_xmit_packets: Total number of multicast packets transmitted on all VLs from the port, possibly including multicast packets with errors. (Spec: PortMultiCastXmitPkts; Informative)
  • port_unicast_xmit_packets: Total number of unicast packets transmitted on all VLs from the port, possibly including unicast packets with errors. (Spec: PortUnicastXmitPkts; Informative)
  • port_xmit_discards: Total number of outbound packets discarded by the port because the port was down or congested. (Spec: PortXmitDiscards; Error)
  • port_xmit_constraint_errors: Total number of packets not transmitted from the switch physical port. (Spec: PortXmitConstraintErrors; Error)
  • port_rcv_remote_physical_errors: Total number of packets marked with the EBP delimiter received on the port. (Spec: PortRcvRemotePhysicalErrors; Error)
  • symbol_error: Total number of minor link errors detected on one or more physical lanes. (Spec: SymbolErrorCounter; Error)
  • VL15_dropped: Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) of the port. (Spec: VL15Dropped; Error)
  • link_error_recovery: Total number of times the Port Training state machine has successfully completed the link error recovery process. (Spec: LinkErrorRecoveryCounter; Error)
  • link_downed: Total number of times the Port Training state machine has failed the link error recovery process and downed the link. (Spec: LinkDownedCounter; Error)

hw_counters

# ls -lsh /sys/class/infiniband/mlx5_0/ports/1/hw_counters/
total 0
0 -r--r--r-- 1 root root 4.0K May 24 15:28 duplicate_request
0 -r--r--r-- 1 root root 4.0K May 24 15:28 implied_nak_seq_err
0 -rw-r--r-- 1 root root 4.0K May 28 16:42 lifespan
0 -r--r--r-- 1 root root 4.0K May 24 15:28 local_ack_timeout_err
0 -r--r--r-- 1 root root 4.0K May 24 15:28 np_cnp_sent
0 -r--r--r-- 1 root root 4.0K May 24 15:28 np_ecn_marked_roce_packets
0 -r--r--r-- 1 root root 4.0K May 24 15:28 out_of_buffer
0 -r--r--r-- 1 root root 4.0K May 24 15:28 out_of_sequence
0 -r--r--r-- 1 root root 4.0K May 24 15:28 packet_seq_err
0 -r--r--r-- 1 root root 4.0K May 24 15:28 req_cqe_error
0 -r--r--r-- 1 root root 4.0K May 24 15:28 req_cqe_flush_error
0 -r--r--r-- 1 root root 4.0K May 24 15:28 req_remote_access_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 req_remote_invalid_request
0 -r--r--r-- 1 root root 4.0K May 24 15:28 resp_cqe_error
0 -r--r--r-- 1 root root 4.0K May 24 15:28 resp_cqe_flush_error
0 -r--r--r-- 1 root root 4.0K May 24 15:28 resp_local_length_error
0 -r--r--r-- 1 root root 4.0K May 24 15:28 resp_remote_access_errors
0 -r--r--r-- 1 root root 4.0K May 24 15:28 rnr_nak_retry_err
0 -r--r--r-- 1 root root 4.0K May 24 15:28 rp_cnp_handled
0 -r--r--r-- 1 root root 4.0K May 24 15:28 rp_cnp_ignored
0 -r--r--r-- 1 root root 4.0K May 24 15:28 rx_atomic_requests
0 -r--r--r-- 1 root root 4.0K May 24 15:28 rx_icrc_encapsulated
0 -r--r--r-- 1 root root 4.0K May 24 15:28 rx_read_requests
0 -r--r--r-- 1 root root 4.0K May 24 15:28 rx_write_requests

HW Counters Description:

  • duplicate_request: Number of duplicate request packets received; a duplicate request is a request that had previously been executed. (Error)
  • implied_nak_seq_err: Number of times the requester detected an ACK with a PSN larger than the expected PSN for an RDMA read or response. (Error)
  • lifespan: The maximum period, in ms, that defines the aging of counter reads; two consecutive reads within this period may return the same values. (Informative)
  • local_ack_timeout_err: Number of times a QP's ACK timer expired, at the sender side, for RC, XRC, and DCT QPs. The QP retry limit was not exceeded, so this is still a recoverable error. (Error)
  • np_cnp_sent: Number of CNP packets sent by the Notification Point when it noticed congestion experienced in the RoCEv2 IP header (ECN bits). Added in MLNX_OFED 4.1. (Informative)
  • np_ecn_marked_roce_packets: Number of RoCEv2 packets received by the Notification Point that were marked as having experienced congestion (ECN bits '11' on ingress RoCE traffic). Added in MLNX_OFED 4.1. (Informative)
  • out_of_buffer: Number of drops that occurred due to a lack of WQEs for the associated QPs. (Error)
  • out_of_sequence: Number of out-of-sequence packets received. (Error)
  • packet_seq_err: Number of NAK sequence error packets received. The QP retry limit was not exceeded. (Error)
  • req_cqe_error: Number of times the requester detected CQEs completed with errors. Added in MLNX_OFED 4.1. (Error)
  • req_cqe_flush_error: Number of times the requester detected CQEs completed with flushed errors. Added in MLNX_OFED 4.1. (Error)
  • req_remote_access_errors: Number of times the requester detected remote access errors. Added in MLNX_OFED 4.1. (Error)
  • req_remote_invalid_request: Number of times the requester detected remote invalid request errors. Added in MLNX_OFED 4.1. (Error)
  • resp_cqe_error: Number of times the responder detected CQEs completed with errors. Added in MLNX_OFED 4.1. (Error)
  • resp_cqe_flush_error: Number of times the responder detected CQEs completed with flushed errors. Added in MLNX_OFED 4.1. (Error)
  • resp_local_length_error: Number of times the responder detected local length errors. Added in MLNX_OFED 4.1. (Error)
  • resp_remote_access_errors: Number of times the responder detected remote access errors. Added in MLNX_OFED 4.1. (Error)
  • rnr_nak_retry_err: Number of RNR NAK packets received. The QP retry limit was not exceeded. (Error)
  • rp_cnp_handled: Number of CNP packets handled by the Reaction Point HCA to throttle the transmission rate. Added in MLNX_OFED 4.1. (Informative)
  • rp_cnp_ignored: Number of CNP packets received and ignored by the Reaction Point HCA. This counter should not rise if RoCE Congestion Control is enabled in the network; if it does, verify that ECN is enabled on the adapter (see HowTo Configure DCQCN (RoCE CC) values for ConnectX-4 (Linux)). Added in MLNX_OFED 4.1. (Error)
  • rx_atomic_requests: Number of received ATOMIC requests for the associated QPs. (Informative)
  • rx_dct_connect: Number of received connection requests for the associated DCTs. (Informative)
  • rx_read_requests: Number of received READ requests for the associated QPs. (Informative)
  • rx_write_requests: Number of received WRITE requests for the associated QPs. (Informative)
  • rx_icrc_encapsulated: Number of RoCE packets with ICRC errors. Added in MLNX_OFED 4.4 and kernel 4.19. (Error)
  • roce_adp_retrans: Number of adaptive retransmissions for RoCE traffic. Added in MLNX_OFED 5.0-1.0.0.0 and kernel 5.6.0. (Informative)
  • roce_adp_retrans_to: Number of times RoCE traffic reached a timeout due to adaptive retransmission. Added in MLNX_OFED 5.0-1.0.0.0 and kernel 5.6.0. (Informative)
  • roce_slow_restart: Number of times RoCE slow restart was used. Added in MLNX_OFED 5.0-1.0.0.0 and kernel 5.6.0. (Informative)
  • roce_slow_restart_cnps: Number of times RoCE slow restart generated CNP packets. Added in MLNX_OFED 5.0-1.0.0.0 and kernel 5.6.0. (Informative)
  • roce_slow_restart_trans: Number of times RoCE slow restart changed state to slow restart. Added in MLNX_OFED 5.0-1.0.0.0 and kernel 5.6.0. (Informative)
  • duplicate_request (duplicated packets): number of duplicate requests received; a duplicate request is one that has already been executed.
  • out_of_sequence (drop out of sequence): number of out-of-sequence packets received, which means packet loss has occurred (a monitoring sketch follows this list).
  • packet_seq_err (NAK sequence rcvd): number of NAK sequence error packets received; the QP retry limit was not exceeded.
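
As a minimal sketch for watching these counters during a test; device mlx5_0 and port 1 are assumptions carried over from the earlier examples:

# Print the loss-related hw_counters of mlx5_0 port 1.
cd /sys/class/infiniband/mlx5_0/ports/1/hw_counters
for f in out_of_sequence packet_seq_err duplicate_request; do
    printf '%s: %s\n' "$f" "$(cat $f)"
done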

Bandwidth Monitoring Tool: netdata

netdata can chart RDMA NIC bandwidth, but the send and receive figures it displays are read from the nodes under /sys/class/infiniband, so the actual bandwidth is 4x the displayed value.

[Figure: netdata_rdma_ib, netdata chart of RDMA/IB bandwidth]

Plugin source: https://github.com/netdata/netdata/blob/master/collectors/proc.plugin/sys_class_infiniband.c

NIC Operating Mode

# ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:0e42:a1ff:fe41:2d36
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 25 Gb/sec (1X EDR)
link_layer: Ethernet

Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:0e42:a1ff:fe41:2d37
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 25 Gb/sec (1X EDR)
link_layer: Ethernet
  • link_layer: the port's operating mode; Ethernet means IP mode, and IB (InfiniBand) mode is the other option.
  • Switching modes: HowTo Change Port Type in Mellanox ConnectX-3 Adapter (a sketch follows below)
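
As a hedged sketch of switching port type with mlxconfig (the device name mlx5_1 is reused from the Common Commands section below; LINK_TYPE_P1 values are 1 = IB, 2 = ETH, and the change only takes effect after a reboot or firmware reset):

# Query the current NIC configuration, including LINK_TYPE_P1.
mlxconfig -d mlx5_1 q

# Set port 1 to Ethernet mode (use LINK_TYPE_P1=1 for InfiniBand).
mlxconfig -d mlx5_1 set LINK_TYPE_P1=2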

Common Commands

  • ibstat: query the basic status of InfiniBand devices
  • ibstatus: NIC status information
  • ibv_devinfo: NIC device details (ibv_devinfo -d mlx5_0 -v)
  • ibv_devices: list the InfiniBand devices on this host
  • ibnodes: list the InfiniBand devices in the network
  • show_gids: show the RoCE versions the NIC supports
  • show_counters: NIC port statistics, e.g. bytes sent and received
  • mlxconfig: NIC configuration (mlxconfig -d mlx5_1 q queries the NIC's configuration)

What Dual Ports Are For

Dual ports: two network interfaces on one physical NIC.

  1. The two ports can be bonded, which outperforms a single port; they can also attach to two different networks at once, so if one link fails the other keeps working, providing failover (a bonding sketch follows this list).
  2. A server typically needs two or more ports: one for the external uplink and another for internal traffic.
  3. A home PC with a dual-port NIC can play the role of a basic server: one port joins the network, the other serves and manages the downstream network and its traffic.
  4. A dual-port card can do load balancing; a single-port card cannot.
  5. A dual-port card can join two networks and act as a gateway, which a single-port card cannot do on its own. Two single-port NICs can achieve much the same effect, though with slightly slower switching than a true dual-port card.
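
As a hedged sketch of bonding the two ports with iproute2 (interface names ens2f0/ens2f1 are taken from the examples above; the mode and the address are assumptions):

# Create an active-backup bond from the NIC's two ports (run as root).
ip link add bond0 type bond mode active-backup miimon 100
ip link set ens2f0 down; ip link set ens2f0 master bond0
ip link set ens2f1 down; ip link set ens2f1 master bond0
ip link set bond0 up
ip addr add 192.168.2.10/24 dev bond0   # address is hypothetical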

References

  • How to install support for Mellanox Infiniband hardware on RHEL6
  • Mellanox Technologies Ltd. Public Repository
  • InfiniBand bandwidth test methods, part 1: ib_read/write_bw/lat
  • Measuring RDMA read/write bandwidth with ib_write_bw and ib_read_bw
  • ibverbs documentation (Chinese translation)
  • Introduction to Programming Infiniband RDMA
  • NFSv4 RDMA and Session Extensions
  • RDMA_Aware_Programming_user_manual.pdf
  • Notes on installing and using InfiniBand NICs
  • Port Type Management