Benchmark ConnectX®-3 Pro VMA

RDMA support

RDMA allows communication between systems while bypassing the overhead of the operating system kernel, so applications get lower latency and much lower CPU utilization.

libvma
A Linux user-space library for network socket acceleration on RDMA-capable network adapters. It is preloaded into an application with LD_PRELOAD and intercepts the standard socket calls, handling TCP/UDP traffic at the socket/IP layer in user space instead of in the kernel.
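Because libvma works through LD_PRELOAD, an unmodified socket application can be accelerated without recompiling. A minimal sketch (the application name is just a placeholder):

# run any TCP/UDP socket application on top of libvma
LD_PRELOAD=libvma.so ./my_socket_app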

Update driver

apt install module-init-tools dkms
modprobe -r mlx4_en
modprobe -r mlx4_core
modprobe mlx4_core
modprobe mlx4_en
modinfo mlx4_core
filename: /lib/modules/4.4.0-22-generic/updates/dkms/mlx4_core.ko
version: 3.3-1.0.0
license: Dual BSD/GPL
description: Mellanox ConnectX HCA low-level driver
author: Roland Dreier
srcversion: E63C21909524B057DF41E97

NIC info

ethtool -i enp5s0
driver: mlx4_en
version: 3.3-1.0.0 (31 May 2016)
firmware-version: 2.36.5000
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

NIC offload

Features
TSO (TCP Segmentation Offload) on
GSO (Generic Segmentation Offload) on
GRO (Generic Receive Offload) on
rx-checksumming on
tx-checksumming on
scatter-gather on
$ ethtool --offload enp5s0 rx on tx on
Actual changes:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ipv6: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp6-segmentation: on

$ ethtool -k enp4s0d1 | grep -v fixed
Features for enp4s0d1:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ipv6: on
scatter-gather: on
tx-scatter-gather: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
tx-nocache-copy: off
loopback: off
rx-fcs: off
tx-vlan-stag-hw-insert: off
rx-vlan-stag-hw-parse: on

#enable TSO (hardware segmentation) and disable GSO (software segmentation)
$ ethtool -K enp4s0 tso on
$ ethtool -K enp4s0 gso off

Transmit:

  • TSO (hardware segmentation)
  • GSO (software segmentation, used when TSO is unavailable; see the quick check below)

Receive:

  • GRO
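A quick way to confirm which of these offloads are currently active (interface name is an example):

ethtool -k enp5s0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|scatter-gather:|checksumming:'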

Interrupt mitigation or interrupt coalescing

rx-frames[-irq] rx-usecs[-irq] tx-frames[-irq] tx-usecs[-irq]

The frames parameters specify how many packets are received/transmitted before an interrupt is generated. The usecs parameters specify how many microseconds to wait after at least one packet is received/transmitted before generating an interrupt. For example, rx-usecs 8 with rx-frames 16 raises an RX interrupt after 16 packets or 8 microseconds, whichever comes first. The [-irq] variants are the corresponding delays for updating the status while interrupts are disabled.

$ ethtool -c enp4s0d1
Coalesce parameters for enp4s0d1:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 400000
pkt-rate-high: 450000

rx-usecs: 0
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 8
tx-frames: 16
tx-usecs-irq: 0
tx-frames-irq: 256

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 128
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

#reduce latency (interrupts fire per packet, so CPU usage will increase)
$ ethtool -C enp5s0 adaptive-rx off rx-usecs 0 rx-frames 0
$ ethtool -C enp5s0d1 adaptive-rx off rx-usecs 0 rx-frames 0

If you instead raise rx-usecs to trade latency for higher throughput, it is suggested to also raise
net.ipv4.tcp_tso_win_divisor to 30 (default 3).
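For reference, a throughput-oriented setting might look like the following (the coalescing values are only illustrative, not the ones used in this benchmark):

ethtool -C enp5s0 rx-usecs 64 rx-frames 64
sysctl -w net.ipv4.tcp_tso_win_divisor=30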

What is Scatter-Gather DMA
Replace LRO with GRO
Scatter-gather (vectored I/O) DMA allows data to be transferred to and from multiple memory areas in a single DMA transaction. It is equivalent to chaining together multiple simple DMA requests; the motivation is to offload multiple I/O interrupts and data-copy tasks from the CPU.
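Scatter-gather is exposed as a normal ethtool feature flag, so it can be checked and toggled like the other offloads (interface name is an example):

ethtool -k enp5s0 | grep scatter-gather   # show current state
ethtool -K enp5s0 sg on                   # enable scatter-gather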

About GSO

Many people have observed that a lot of the savings in TSO come from traversing the networking stack once rather than many times for each super-packet. These savings can be obtained without hardware support. GSO, like TSO, is only effective if the MTU is significantly less than the maximum value of 64K, so only the case where the MTU is set to 1500 is of interest.

TSO lets the network stack push a large buffer down to the NIC and have the NIC perform the segmentation, which reduces CPU load, but it requires hardware support for segmentation. Since the performance gain mainly comes from deferring segmentation, the technique can be generalized: in Linux this is called GSO (Generic Segmentation Offload). GSO is more general than TSO because it does not require hardware segmentation. For NICs that support TSO, packets first go through the GSO path and then the NIC's hardware segmentation is used; for NICs without TSO, segmentation is deferred until just before the data is handed to the NIC, i.e. right before the driver's xmit function is called. (https://www.ibm.com/developerworks/cn/linux/l-cn-network-pt/#icomments)

With an MTU of 1500 the kernel normally splits a large send into 1448-byte segments. After enabling GSO with hardware segmentation (TSO), tcpdump on the sender shows some 2896-byte packets, because the split is performed in the NIC.

The IPv4 header takes 20 bytes and the TCP header another 20 bytes; Linux and macOS are further limited to 1448 bytes of payload because they also carry a 12-byte TCP timestamp option:
1500 - 20 - 20 - 12 = 1448
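To see the super-packets described above, tcpdump can filter on frame length; anything larger than the MTU in a capture on the sending host is a GSO/TSO aggregate (interface name and length threshold are examples):

tcpdump -i enp5s0 -nn -c 20 'tcp and greater 2000'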



Segmentation and Checksum Offloading: Turning Off with ethtool

DMA Ring Buffer Sizes and statistics

The pre-set maximum is 8192 packets and the default is 256 packets. For LAN testing, using the smallest possible ring size provides the best results.

ethtool -S enp5s0

# ethtool -S eth0 | grep rx_no_buffer_count
rx_no_buffer_count: 100590
# ethtool -S eth0 | grep rx_missed_errors
rx_missed_errors: 9188
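# non-zero values for these counters mean the NIC ran out of receive buffers,
# i.e. the ring buffer and/or the CPU could not keep up with incoming traffic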

# reduce the ring buffer; on this older machine 4096 is the maximum

#1GbE
$ ethtool -G eth0 rx 2048 tx 2048

$ ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 2048
RX Mini: 0
RX Jumbo: 0
TX: 2048

#10GbE
$ ethtool -g enp4s0
Ring parameters for enp4s0:
Pre-set maximums:
RX: 8192
RX Mini: 0
RX Jumbo: 0
TX: 8192
Current hardware settings:
RX: 1024
RX Mini: 0
RX Jumbo: 0
TX: 512

Enable RSS/RFS

RSS: contemporary NICs support multiple receive and transmit descriptor queues (multi-queue); the queues and their interrupts can be seen in /proc/interrupts.

Receive Flow Steering (RFS) extends RPS behavior to increase the CPU cache hit rate and thereby reduce network latency.
Enabling accelerated RFS requires enabling the 'ntuple' flag via ethtool; RFS also requires a kernel compiled with CONFIG_RFS_ACCEL (available in kernels 2.6.39 and above) and, furthermore, NIC support for Device Managed Flow Steering.

ethtool -K enp5s0 ntuple on
ethtool -K enp5s0d1 ntuple on
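Beyond the ntuple flag, (software) RFS also needs its flow tables sized; a minimal sketch with illustrative sizes (a global table of 32768 entries divided across the rx queues):

sysctl -w net.core.rps_sock_flow_entries=32768
for q in /sys/class/net/enp5s0/queues/rx-*;
do
echo 2048 > $q/rps_flow_cnt;
done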

RSS RFS XPS

Added Transmit Packet Steering (XPS) support
RFS does not support UDP
RSS support of fragmented IP datagram
XOR RSS Hash Function

Enable Receive Packet Steering

Mellanox may not support hardware-accelerated flow steering here.
Receive Packet Steering (RPS) is logically a software implementation of RSS.

RPS has some advantages over RSS:

  • it can be used with any NIC,
  • software filters can easily be added to hash over new protocols,
  • it does not increase the hardware device interrupt rate (although it does introduce inter-processor interrupts (IPIs)).
    (see the kernel networking scaling documentation)
printf '%x\n' "$((2#10101))" # cores 0,2,4
15
printf '%x\n' "$((2#10101000000))" # cores 6,8,10
540

for i in $(ls /sys/class/net/enp5s0/queues/ | grep rx-);
do
echo 15 > /sys/class/net/enp5s0/queues/$i/rps_cpus;
cat /sys/class/net/enp5s0/queues/$i/rps_cpus;
done

for i in $(ls /sys/class/net/enp5s0d1/queues/ | grep rx-);
do
echo 540 > /sys/class/net/enp5s0d1/queues/$i/rps_cpus;
cat /sys/class/net/enp5s0d1/queues/$i/rps_cpus;
done

Enable flow control (pause frames)

ethtool -A enp5s0 autoneg on # did not work for Mellanox
Cannot set device pause parameters: Invalid argument

ethtool -A enp5s0 rx on
ethtool -A enp5s0 tx on

ethtool -a enp5s0
Pause parameters for enp5s0:
Autonegotiate: off
RX: on
TX: on

CPU info

root@ubuntu-16:/sys/class/net/enp5s0/device# cat /sys/class/net/enp5s0/device/numa_node 
0
root@ubuntu-16:/sys/class/net/enp5s0d1/device# cat /sys/class/net/enp5s0d1/device/numa_node
0

cat /sys/class/net/enp5s0/device/local_cpus
0000,00000000,00000000,00000000,00000555

#same with
printf '%x\n' "$((2#10101010101))"
555

# CPU Turbo mode
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
3200000

# Disable cstate in grub.conf
intel_idle.max_cstate=0 processor.max_cstate=1

# disable irqbalance
/etc/init.d/irqbalance stop
[ ok ] Stopping irqbalance (via systemctl): irqbalance.service.

# install numactl collectl irqstat
apt install numactl collectl
git clone https://github.com/lanceshelton/irqstat

#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 1200.187
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 4795.72
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0,2,4,6,8,10
NUMA node1 CPU(s): 1,3,5,7,9,11

That means it is better to use the CPUs in NUMA node 0: if NUMA node 1 reads/writes data to the PCIe device, the data has to cross QPI (Intel QuickPath Interconnect).
In my benchmark, numactl binds all applications to NUMA node 0.
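For example, a benchmark process can be pinned to node 0 for both CPU and memory allocation (the application name is a placeholder):

numactl --cpunodebind=0 --membind=0 <benchmark_app>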

OPS

  • Disable hyper-threading
  • Enable turbo mode
  • Enable NUMA and prefer the NIC's local node
  • Disable irqbalance

CPU Affinity

printf '%x\n' "$((2#10101010101))"
555 #numa_node0

awk -F: '$0~/enp5/ || $0~/mlx4/ {print $1}' /proc/interrupts | while read line
do
echo 555 > /proc/irq/$line/smp_affinity
done

sysctl.conf for ipv4

net.ipv4.tcp_sack = 0 #Disable TCP selective acknowledgments (SACK)
net.core.netdev_budget = 300
net.ipv4.tcp_timestamps=0 #Disable the TCP timestamps option for better CPU utilization
net.core.netdev_max_backlog=250000 #Increase the maximum length of processor input queues

#Increase the TCP maximum and default buffer sizes
net.core.rmem_max=4194304
net.core.wmem_max=4194304
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=4194304

#Increase memory thresholds to prevent packet dropping
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304

net.ipv4.tcp_low_latency=1 #Enable low latency mode for TCP
net.ipv4.tcp_adv_win_scale=1 #A value of 1 means the socket buffer will be divided evenly between TCP windows size and application.
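These settings go into /etc/sysctl.conf (or a file under /etc/sysctl.d/) and can be applied without a reboot:

sysctl -p                        # reload /etc/sysctl.conf
sysctl net.ipv4.tcp_low_latency  # verify a single key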

txqueuelen

This queue parameter is mostly applicable for high-speed WAN transfers. For low-latency networks, the default setting of 1000 is sufficient. The receiving end is configured with the sysctl setting net.core.netdev_max_backlog. The default for this setting is also 1000 and does not need to be modified unless there is significant latency.

ip link set dev enp5s0 txqueuelen 1000
ip link set dev enp5s0d1 txqueuelen 1000
ip link set dev bond0 txqueuelen 2000

Enable jumbo frames

net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_base_mss = 512
net.ipv4.ip_no_pmtu_disc = 0
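These sysctls only control path-MTU probing; to actually use jumbo frames, the interface MTU also has to be raised on both hosts and on the switch (9000 is a common value, adjust to your fabric):

ip link set dev enp5s0 mtu 9000
ip link set dev enp5s0d1 mtu 9000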


Configure libvma

Install libvma in linux
libvma.conf
Using ConnectX-3 Pro with VMA over Ubuntu 16.04 Inbox Driver
Mellanox OpenFabrics Enterprise Distribution for Linux MLNX_OFED

 apt-get install dkms infiniband-diags libibverbs* ibacm librdmacm* libmlx4* libmlx5* mstflint libibcm.* libibmad.* libibumad* opensm srptools libmlx4-dev librdmacm-dev rdmacm-utils ibverbs-utils perftest vlan ibutils
apt-get install libnl-3-200 libnl-route-3-200 libnl-route-3-dev libnl-utils
git clone https://github.com/Mellanox/libvma.git
cd libvma
./autogen.sh
./configure --prefix=/usr
make -j 12
make install
ldconfig
echo options mlx4_core log_num_mgm_entry_size=-1 > /etc/modprobe.d/mlnx.conf
echo 1000000000 > /proc/sys/kernel/shmmax
echo 800 > /proc/sys/vm/nr_hugepages
ulimit -l unlimited
LD_PRELOAD=libvma.so <test_app>

$ cat /etc/modprobe.d/mlx4.conf
# mlx4_core gets automatically loaded, load mlx4_en also (LP: #1115710)
softdep mlx4_core post: mlx4_en
#options mlx4_en inline_thold=0
options mlx4_core port_type_array=2,2 num_vfs=2 probe_vf=0 enable_64b_cqe_eqe=0 log_num_mgm_entry_size=-1
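Changed mlx4_core options only take effect after the module is reloaded (or after the initramfs is rebuilt and the host rebooted), for example:

update-initramfs -u
modprobe -r mlx4_en mlx4_core
modprobe mlx4_core && modprobe mlx4_en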

Benchmark

Architecture

(Network architecture diagram: Network_Arch)
The CPU is not the bottleneck; an E5-2620 v3 was enough for dual/quad-port 10GbE.

sockperf

sockperf: == version #2.7-53.git6f0f32c1fb02 ==
MTU 1500
single core server/client, single port

./sockperf server --tcp -p 11111
./sockperf server -p 11111 #udp
numactl -C 2 /opt/sockperf/bin/sockperf tp -m 256 -i 192.168.12.201 -p 11111 -t 20
numactl -C 2 /opt/sockperf/bin/sockperf pp -m 256 -i 192.168.12.201 -p 11111 -t 20

Throughput (MBps)

| msg size (bytes) | TCP TP | TCP TP VMA | UDP TP | UDP TP VMA |
|---|---|---|---|---|
| 256 | 289.160 | 695.808 | 78.632 | 766.952 |
| 512 | 420.863 | 1125.157 | 154.019 | 1055.240 |
| 1024 | 1129.297 | 1130.277 | 301.421 | 1119.500 |
| 2048 | 1131.170 | 1130.346 | 381.049 | 1123.660 |
| 4096 | 1131.160 | 1130.387 | 564.496 | 1141.008 |

Message rate (msg/s)
msg/s = (sent + received messages) per second

| msg size (bytes) | TCP PP | TCP PP VMA | TCP PP 3x | TCP PP 3x VMA | Intel X540 | ConnectX-3 | ConnectX-3 VMA |
|---|---|---|---|---|---|---|---|
| 256 | 61813 | 261712 | 181845 | 702355 | 53663 | 63698 | 258556 |
| 512 | 65832 | 225369 | 154585 | 615563 | 53414 | 52425 | 223479 |
| 1024 | 51411 | 182045 | 158244 | 503944 | 53067 | 56812 | 179406 |
| 2048 | 40918 | 130620 | 130946 | 353721 | 36214 | 35772 | 127449 |

AVG latency (us)

| msg size (bytes) | TCP PP | TCP PP VMA | TCP PP 3x | TCP PP 3x VMA | Intel X540 | ConnectX-3 | ConnectX-3 VMA |
|---|---|---|---|---|---|---|---|
| 256 | 14.611 | 3.798 | 16.7707 | 4.268 | 17.072 | 15.644 | 3.845 |
| 512 | 16.162 | 4.434 | 19.8913 | 4.866 | 18.667 | 19.013 | 4.456 |
| 1024 | 17.433 | 5.455 | 30.7607 | 5.9526 | 18.789 | 17.547 | 5.576 |
| 2048 | 24.768 | 7.849 | 28.7957 | 8.49 | 27.556 | 27.347 | 7.857 |

UDP performance

Message rate (msg/s):

| msg size (bytes) | UDP PP | UDP PP VMA |
|---|---|---|
| 256 | 65921 | 311823 |
| 512 | 68160 | 261528 |
| 1024 | 62480 | 203539 |
| 1472 | 62233 | 171311 |
| 2048 | 49457 | 46734 |

Average latency (us):

| msg size (bytes) | UDP PP | UDP PP VMA |
|---|---|---|
| 256 | 9.474 | 3.201 |
| 512 | 9.635 | 3.757 |
| 1024 | 10.800 | 4.822 |
| 1472 | 16.016 | 5.815 |
| 2048 | 20.161 | 19.953 |

netperf

numactl -C 2 netserver -D -4 -v 2 -p 5000 -f -L 192.168.12.201
numactl -N 0 ./netperf -n 6 -p 5000 -H 192.168.12.201 -c -C -t TCP_STREAM -l 15 -T 2,2 -- -m $((2**$i))
numactl -N 0 ./netperf -n 6 -p 5000 -H 192.168.12.201 -c -C -t TCP_RR -l 20 -T 2,2 -- -m $((2**$i))
numactl -N 0 ./netperf -n 6 -p 5000 -H 192.168.12.201 -c -C -t UDP_RR -l 20 -T 2,2 -- -m $((2**$i))

| msg size (bytes) | TCP RR | TCP RR VMA | UDP RR | UDP RR VMA |
|---|---|---|---|---|
| 256 | 43428.69 | 129390.97 | 49805.09 | 146545.15 |
| 512 | 37705.07 | 107808.85 | 46393.12 | 125839.11 |
| 1024 | 35005.28 | 90787.06 | 39591.57 | 98669.99 |
| 2048 | 23688.22 | 63992.85 | 22633.36 | 19961.70 |
| 4096 | 22933.14 | 51851.17 | 20772.47 | 18604.53 |

redis-benchmark

LD_PRELOAD=libvma.so VMA_STATS_FD_NUM=500 redis-server --port 7777 --protected-mode no --maxmemory 12000mb --maxmemory-samples 10 --maxmemory-policy allkeys-lru
numactl -C 2 redis-benchmark -r 10000000 -n 20000000 -t get,set,lpush,lpop -P 16 -q -h 192.168.12.201 -p 7777 -d 16

set/get/lpush/lpop records/s

| size (bytes) | set | get | lpush | lpop | set vma | get vma | lpush vma | lpop vma |
|---|---|---|---|---|---|---|---|---|
| 2 | 909628.44 | 995668.88 | 1125492.38 | 1092836.38 | 1051801.25 | 1383604.25 | 1427857.50 | 1412230.00 |
| 16 | 877077.62 | 975181.56 | 1126506.75 | 1058537.12 | 960937.88 | 1332001.25 | 1403213.38 | 1408054.00 |
| 64 | 765814.06 | 963205.56 | 881834.19 | 1040203.94 | 785360.88 | 1114082.00 | 1179384.38 | 1258890.88 |
| 256 | 596801.12 | 714030.69 | 787928.94 | 864416.31 | 614420.50 | 943129.31 | 868847.50 | 970167.38 |

| size (bytes) | set x4 | get x4 | lpush x4 | lpop x4 | set x4 vma | get x4 vma | lpush x4 vma | lpop x4 vma |
|---|---|---|---|---|---|---|---|---|
| 2 | 1179965.09 | 1580834.26 | 2055718.79 | 2001333.44 | 2432077.24 | 3466123.12 | 3255039.99 | 3349659.25 |
| 16 | 1205777.63 | 1656787.75 | 1685172.72 | 2167017.68 | 2212556.32 | 3204048.56 | 3243653 | 3355042 |
| 64 | 1125920.34 | 1491794.08 | 1603009.35 | 2040705.93 | 2004204.13 | 3079092.74 | 2744756.94 | 3079859.44 |
| 256 | 1172390.68 | 1236795.19 | 1024634.36 | 1474094.21 | 1717803 | 2521588 | 1659403.82 | 2402496.31 |

Intel X540

| size (bytes) | set x4 | get x4 | lpush x4 | lpop x4 |
|---|---|---|---|---|
| 2 | 1512821.64 | 2118883.93 | 1834559.35 | 1960360.28 |
| 16 | 1449761.53 | 1851285.18 | 1910151.44 | 2017602.97 |
| 64 | 1199011.03 | 1773449.28 | 1612987.21 | 1806331.65 |
| 256 | 1097519.72 | 1121691.17 | 1145548.86 | 1398369.94 |

ConnectX-3

| size (bytes) | set x4 VMA | get x4 VMA | lpush x4 VMA | lpop x4 VMA |
|---|---|---|---|---|
| 2 | 2455271.19 | 3476625.19 | 3360328.38 | 3401881.00 |
| 16 | 2461504.31 | 3309726.63 | 3224818.00 | 3322884.63 |
| 64 | 1954063.35 | 2876987.38 | 2729820.24 | 3035352.45 |
| 256 | 1498470.37 | 2356658.57 | 1650766.50 | 2362893.44 |

Because there is not enough memory in a single NUMA node, I could not run the 6x redis-server vs 6x redis-benchmark test.
echo 1024 > /proc/sys/net/core/somaxconn
vm.overcommit_memory = 1
echo never > /sys/kernel/mm/transparent_hugepage/enabled (this reduced performance with VMA enabled, so it was reverted)
13562:M 10 Jun 01:02:39.054 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
13562:M 10 Jun 01:02:39.054 # Server started, Redis version 3.2.0
13562:M 10 Jun 01:02:39.054 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
13562:M 10 Jun 01:02:39.054 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

  • In a VMA environment, disabling Transparent Huge Pages decreases Redis performance.

qperf

 VMA PANIC: ib_ctx_handler73:ib_ctx_handler() ibv device 0x1523ee0 pd allocation failure (ibv context 0x15266d0) (errno=13 Permission denied)
terminate called without an active exception
VMA PANIC: ib_ctx_handler73:ib_ctx_handler() ibv device 0x1523ee0 pd allocation failure (ibv context 0x15266d0) (errno=13 Permission denied)
terminate called without an active exception
VMA PANIC: ib_ctx_handler73:ib_ctx_handler() ibv device 0x1523ee0 pd allocation failure (ibv context 0x15266d0) (errno=13 Permission denied)
terminate called without an active exception

Another function
ethernet brief

Comparison of 40G RDMA and Traditional Ethernet Technologies
Performance Tuning Guidelines for Mellanox Network Adapters
10G 82599EB NIC testing and optimization (10G 82599EB 网卡测试优化)