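Introduction to RDMA Networking
RDMA (Remote Direct Memory Access) was created to reduce the latency of server-side data processing during network transfers. RDMA moves data over the network directly into the memory of a remote system, bypassing the remote operating system, so very little CPU processing is required. It eliminates extra memory copies and context switches, freeing memory bandwidth and CPU cycles to improve application performance.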
Driver download
https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed
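To check whether the driver was installed successfully:
    ibdev2netdev
If the ibdev2netdev command lists the RDMA devices together with their network interfaces, the NIC driver was installed successfully.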
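The check above can be scripted. This is a minimal sketch, assuming ibdev2netdev prints one line per port in the form `mlx5_0 port 1 ==> eth2 (Up)`; the function name `up_rdma_devices` and the interface names are illustrative, not part of any tool.

```shell
# Hypothetical helper: read ibdev2netdev-style lines on stdin and print
# "<rdma device> <net interface>" for every port reported as Up.
# Assumed line format: mlx5_0 port 1 ==> eth2 (Up)
up_rdma_devices() {
    awk '$4 == "==>" && $6 == "(Up)" { print $1 " " $5 }'
}
```

A typical call would be `ibdev2netdev | up_rdma_devices`; empty output suggests that no port is up or that the driver is not loaded.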
Installing and removing the open-source driver on CentOS 7
Working with RDMA in RedHat/CentOS 7.*
- Install:
    yum groupinfo "Infiniband Support"
    yum groupinstall "Infiniband Support"
    yum --setopt=group_package_types=optional groupinstall "Infiniband Support"
- Remove:
    yum -y groupremove "Infiniband Support"
- Start the RDMA service:
    systemctl start rdma
    systemctl enable rdma
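Throughput tests
Write throughput
Installing the RDMA driver also installs a set of RDMA tools. Use ib_write_bw to measure write throughput.
Server A (server):
    ib_write_bw -a -d mlx5_0
Server B (client), where 192.168.2.1 is the server's IP address:
    ib_write_bw -a -d mlx5_0 192.168.2.1
Read throughput
Read throughput is measured the same way as write throughput, with ib_read_bw in place of ib_write_bw.
Latency tests
Latency is likewise measured separately for reads and writes, using ib_read_lat and ib_write_lat.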
- Performance Tuning for Mellanox Adapters
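Bandwidth statistics
When using RDMA, an application can count the data it sends and receives itself, which gives an exact picture of the program's own traffic.
To find out how much data the RDMA NIC as a whole has sent and received, read the corresponding nodes under sysfs.
- Transmitted data (in units of 4 bytes):
/sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
- Received data (in units of 4 bytes):
/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data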
Note: port_xmit_data and port_rcv_data hold 1/4 of the actual byte count, so multiply them by 4 to get the real amount of data (and hence the real bandwidth). The scaling is presumably there to delay counter overflow.
port_xmit_data: (RO) Total number of data octets, divided by 4 (lanes), transmitted on all VLs. This is 64 bit counter.
port_rcv_data: (RO) Total number of data octets, divided by 4 (lanes), received on all VLs. This is 64 bit counter.
Source: Documentation/ABI/stable/sysfs-class-infiniband
    pma_cnt_ext->port_xmit_data =
File: drivers/infiniband/hw/mlx5/mad.c
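The multiply-by-4 rule can be turned into a small throughput estimate. The following is a sketch only: the function names are hypothetical, the arithmetic is integer (results are truncated), and counter wrap-around is not handled.

```shell
# Convert a raw port_xmit_data / port_rcv_data value (4-octet units) to bytes.
counter_to_bytes() {
    echo $(( $1 * 4 ))
}

# Average throughput in bytes per second between two raw counter samples.
# $1 = earlier raw value, $2 = later raw value, $3 = seconds between samples
throughput_bytes_per_sec() {
    echo $(( ($2 - $1) * 4 / $3 ))
}

# Read a counter file twice and print the throughput in bytes per second.
# $1 = counter file, $2 = sampling interval in seconds (default 1)
sample_throughput() {
    interval=${2:-1}
    first=$(cat "$1") || return 1
    sleep "$interval"
    second=$(cat "$1") || return 1
    throughput_bytes_per_sec "$first" "$second" "$interval"
}
```

For example, `sample_throughput /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data 5` would report the average transmit rate over five seconds, assuming the device is named mlx5_0 as elsewhere in this article.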
Network connectivity test
Because the NIC in this setup supports only Ethernet mode, ping tests have to be done with ibv_rc_pingpong.
- https://community.mellanox.com/s/article/RoCE-Debug-Flow-for-Linux
Server
    ibdev2netdev
    ibv_rc_pingpong -d mlx5_0 -g 0
Client
    ibdev2netdev
    ibv_rc_pingpong -d mlx5_1 -g 0 192.168.2.4
ibping
Tests network connectivity in IB mode.
mlx5 counters and status parameters
The counters and status parameters are exposed through sysfs under /sys/class/infiniband/.
- Understanding mlx5 Linux Counters and Status Parameters
- InfiniBand Port Counters
Linux kernel documentation: https://www.kernel.org/doc/html/latest/admin-guide/abi-stable.html#abi-file-stable-sysfs-class-infiniband
counters
    ls -lsh /sys/class/infiniband/mlx5_0/ports/1/counters/
Counter Description:
Counter | Description | InfiniBand Spec Name | Group |
---|---|---|---|
port_rcv_data | The total number of data octets, divided by 4, (counting in double words, 32 bits), received on all VLs from the port. | PortRcvData | Informative |
port_rcv_packets | Total number of packets received (this may include packets containing errors). This is a 64-bit counter. | PortRcvPkts | Informative |
port_multicast_rcv_packets | Total number of multicast packets, including multicast packets containing errors. | PortMultiCastRcvPkts | Informative |
port_unicast_rcv_packets | Total number of unicast packets, including unicast packets containing errors. | PortUnicastRcvPkts | Informative |
port_xmit_data | The total number of data octets, divided by 4, (counting in double words, 32 bits), transmitted on all VLs from the port. | PortXmitData | Informative |
port_xmit_packets, port_xmit_packets_64 | Total number of packets transmitted on all VLs from this port. This may include packets with errors. This is a 64-bit counter. | PortXmitPkts | Informative |
port_rcv_switch_relay_errors | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay. | PortRcvSwitchRelayErrors | Error |
port_rcv_errors | Total number of packets containing an error that were received on the port. | PortRcvErrors | Informative |
port_rcv_constraint_errors | Total number of packets received on the switch physical port that are discarded. | PortRcvConstraintErrors | Error |
local_link_integrity_errors | The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors. | LocalLinkIntegrityErrors | Error |
port_xmit_wait | The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | PortXmitWait | Informative |
port_multicast_xmit_packets | Total number of multicast packets transmitted on all VLs from the port. This may include multicast packets with errors. | PortMultiCastXmitPkts | Informative |
port_unicast_xmit_packets | Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors. | PortUnicastXmitPkts | Informative |
port_xmit_discards | Total number of outbound packets discarded by the port because the port is down or congested. | PortXmitDiscards | Error |
port_xmit_constraint_errors | Total number of packets not transmitted from the switch physical port. | PortXmitConstraintErrors | Error |
port_rcv_remote_physical_errors | Total number of packets marked with the EBP delimiter received on the port. | PortRcvRemotePhysicalErrors | Error |
symbol_error | Total number of minor link errors detected on one or more physical lanes. | SymbolErrorCounter | Error |
VL15_dropped | Number of incoming VL15 packets dropped due to resource limitations (e.g., lack of buffers) of the port. | VL15Dropped | Error |
link_error_recovery | Total number of times the Port Training state machine has successfully completed the link error recovery process. | LinkErrorRecoveryCounter | Error |
link_downed | Total number of times the Port Training state machine has failed the link error recovery process and downed the link. | LinkDownedCounter | Error |
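To view all of these counters at once, the directory can be walked with a short loop. This is a sketch under the assumption that every counter is a plain readable file holding one value; `dump_counters` is a hypothetical name.

```shell
# Print "name value" for every readable counter file in a directory.
# $1 = counters directory, e.g. /sys/class/infiniband/mlx5_0/ports/1/counters
dump_counters() {
    for f in "$1"/*; do
        # Skip anything that is not a readable regular file.
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f" 2>/dev/null)"
    done
}
```

Calling `dump_counters /sys/class/infiniband/mlx5_0/ports/1/counters` before and after a test run and comparing the two outputs is one simple way to see which counters moved.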
hw_counters
    ls -lsh /sys/class/infiniband/mlx5_0/ports/1/hw_counters/
HW Counters Description:
Counter | Description | Group |
---|---|---|
duplicate_request | Number of duplicate request packets received. A duplicate request is a request that had been previously executed. | Error |
implied_nak_seq_err | Number of times the requester detected an ACK with a PSN larger than the expected PSN for an RDMA read or response. | Error |
lifespan | The maximum period, in ms, that defines the aging of the counter reads. Two consecutive reads within this period might return the same values. | Informative |
local_ack_timeout_err | The number of times the QP's ACK timer expired for RC, XRC, and DCT QPs at the sender side. The QP retry limit was not exceeded, so the error is still recoverable. | Error |
np_cnp_sent | The number of CNP packets sent by the Notification Point when it noticed congestion experienced in the RoCEv2 IP header (ECN bits). This counter was added in MLNX_OFED 4.1. | Informative |
np_ecn_marked_roce_packets | The number of RoCEv2 packets received by the Notification Point that were marked as having experienced congestion (ECN bits were '11' on the ingress RoCE traffic). This counter was added in MLNX_OFED 4.1. | Informative |
out_of_buffer | The number of drops that occurred due to a lack of WQEs for the associated QPs. | Error |
out_of_sequence | The number of out of sequence packets received. | Error |
packet_seq_err | The number of received NAK sequence error packets. The QP retry limit was not exceeded. | Error |
req_cqe_error | The number of times the requester detected CQEs completed with errors. This counter was added in MLNX_OFED 4.1. | Error |
req_cqe_flush_error | The number of times the requester detected CQEs completed with flushed errors. This counter was added in MLNX_OFED 4.1. | Error |
req_remote_access_errors | The number of times the requester detected remote access errors. This counter was added in MLNX_OFED 4.1. | Error |
req_remote_invalid_request | The number of times the requester detected remote invalid request errors. This counter was added in MLNX_OFED 4.1. | Error |
resp_cqe_error | The number of times the responder detected CQEs completed with errors. This counter was added in MLNX_OFED 4.1. | Error |
resp_cqe_flush_error | The number of times the responder detected CQEs completed with flushed errors. This counter was added in MLNX_OFED 4.1. | Error |
resp_local_length_error | The number of times the responder detected local length errors. This counter was added in MLNX_OFED 4.1. | Error |
resp_remote_access_errors | The number of times the responder detected remote access errors. This counter was added in MLNX_OFED 4.1. | Error |
rnr_nak_retry_err | The number of received RNR NAK packets. The QP retry limit was not exceeded. | Error |
rp_cnp_handled | The number of CNP packets handled by the Reaction Point HCA to throttle the transmission rate. This counter was added in MLNX_OFED 4.1. | Informative |
rp_cnp_ignored | The number of CNP packets received and ignored by the Reaction Point HCA. This counter should not increase if RoCE Congestion Control is enabled in the network. If it does increase, verify that ECN is enabled on the adapter. See HowTo Configure DCQCN (RoCE CC) values for ConnectX-4 (Linux). This counter was added in MLNX_OFED 4.1. | Error |
rx_atomic_requests | The number of received ATOMIC requests for the associated QPs. | Informative |
rx_dct_connect | The number of received connection requests for the associated DCTs. | Informative |
rx_read_requests | The number of received READ requests for the associated QPs. | Informative |
rx_write_requests | The number of received WRITE requests for the associated QPs. | Informative |
rx_icrc_encapsulated | The number of RoCE packets with ICRC errors. This counter was added in MLNX_OFED 4.4 and kernel 4.19. | Error |
roce_adp_retrans | Counts the number of adaptive retransmissions for RoCE traffic. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
roce_adp_retrans_to | Counts the number of times RoCE traffic reached timeout due to adaptive retransmission. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
roce_slow_restart | Counts the number of times RoCE slow restart was used. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
roce_slow_restart_cnps | Counts the number of times RoCE slow restart generated CNP packets. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
roce_slow_restart_trans | Counts the number of times RoCE slow restart changed state to slow restart. The counter was added in MLNX_OFED rev 5.0-1.0.0.0 and kernel v5.6.0. | Informative |
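- duplicate_request (Duplicated packets): the number of duplicate requests received, that is, requests that had already been executed.
- out_of_sequence (Drop out of sequence): the number of out-of-sequence packets received. A non-zero value means packet loss has already occurred.
- packet_seq_err (NAK sequence rcvd): the number of NAK sequence error packets received. The QP retry limit was not exceeded.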
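These three counters are a quick packet-loss indicator, so it can be handy to print only those that are non-zero. The sketch below assumes each counter is a file holding a single integer; `nonzero_counters` is a hypothetical helper name.

```shell
# Print "name value" for each listed counter whose value is non-zero.
# $1 = hw_counters directory; remaining arguments = counter names to check
nonzero_counters() {
    dir=$1
    shift
    for name in "$@"; do
        # Counters that do not exist on this device are skipped silently.
        [ -r "$dir/$name" ] || continue
        value=$(cat "$dir/$name")
        # Non-numeric content makes the test fail and is skipped as well.
        [ "$value" -ne 0 ] 2>/dev/null && printf '%s %s\n' "$name" "$value"
    done
    return 0
}
```

For instance, `nonzero_counters /sys/class/infiniband/mlx5_0/ports/1/hw_counters duplicate_request out_of_sequence packet_seq_err` prints nothing on a clean link.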
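Bandwidth monitoring tool: netdata
netdata can display the bandwidth of RDMA NICs. The transmit and receive figures it shows are read from the nodes under /sys/class/infiniband, so if the raw counter values are displayed unscaled, the actual bandwidth is 4 times the displayed figure.
Plugin source: https://github.com/netdata/netdata/blob/master/collectors/proc.plugin/sys_class_infiniband.c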
NIC working mode
    ibstatus
link_layer: the working mode. Ethernet means Ethernet (IP) mode; the other option is IB (InfiniBand) mode.
- Switching the working mode: HowTo Change Port Type in Mellanox ConnectX-3 Adapter
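The same link_layer value that ibstatus reports is also available as a sysfs attribute for each port, so the mode of every port can be listed without the tool. This is a sketch: `list_link_layers` is a hypothetical name, and the optional base-directory argument exists only to make the function easy to try against a test directory.

```shell
# Print "<device> port <n>: <link layer>" for every port of every RDMA device.
# $1 = sysfs base directory (default /sys/class/infiniband)
list_link_layers() {
    base=${1:-/sys/class/infiniband}
    for f in "$base"/*/ports/*/link_layer; do
        [ -r "$f" ] || continue
        port_dir=$(dirname "$f")                      # .../<device>/ports/<n>
        dev_dir=$(dirname "$(dirname "$port_dir")")   # .../<device>
        printf '%s port %s: %s\n' \
            "$(basename "$dev_dir")" "$(basename "$port_dir")" "$(cat "$f")"
    done
}
```

Running `list_link_layers` with no argument reads the live sysfs tree.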
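Common commands
- ibstat: query the basic status of InfiniBand devices
- ibstatus: NIC information
- ibv_devinfo: NIC device information (ibv_devinfo -d mlx5_0 -v)
- ibv_devices: list the InfiniBand devices on this host
- ibnodes: list the InfiniBand devices in the network
- show_gids: show which RoCE versions the NIC supports
- show_counters: NIC port statistics, such as the amount of data sent and received
- mlxconfig: NIC configuration (mlxconfig -d mlx5_1 q queries the NIC configuration)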
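Which of these commands are present depends on the driver packages installed, so a quick availability check can save some guessing. A minimal sketch, with `missing_tools` as a hypothetical helper name:

```shell
# Print the names of the given commands that are not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
    return 0
}
```

For example, `missing_tools ibstat ibstatus ibv_devinfo ibv_devices ibnodes show_gids show_counters mlxconfig` prints one line per command that is not installed.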
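What dual ports are for
Dual port: two network interfaces on one physical NIC.
- The two ports can be bonded, which gives higher throughput than a single port. They can also be attached to two different networks at once, so that if one link goes down the other keeps working, providing network redundancy.
- A server typically needs two or more ports: one for network access and another for the input side.
- A home PC with a dual-port NIC can provide basic server functions: it connects to the network on one port, and serves and manages the network and data on the other.
- A dual-port NIC can do load balancing; a single-port NIC cannot.
- A dual-port NIC can connect to two networks and act as a gateway, which a single-port NIC cannot do on its own. Two single-port NICs can achieve much the same effect, although forwarding speed may differ slightly from that of a dual-port NIC.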