Bug #6512
closed
Upgrade to 2.3.1 causes network performance degradation (with high CPU usage by NIC kernel tasks)
Description
Hi,
We currently have a 2-node pfSense system working in active/passive HA. The cluster was running pfSense v2.2.3; we recently upgraded the slave node to v2.3.1-1 and forced a failover from master to slave in order to test it.
The system runs a mix of normal pf filtering, some IPsec tunnels, and an HAProxy instance exposing a few (about 10) frontends and backends.
However, the upgraded node, when running as master, shows a clear network performance degradation: node-1 (still running v2.2.3) can easily forward traffic at over 250 Mb/s, while node-2 (running v2.3.1) tops out at roughly 80 Mb/s.
While diagnosing the issue we found that the node running pfSense v2.3.1 has a high load under this comparatively low traffic (i.e. 80 Mb/s), with high CPU usage by the network driver kernel tasks, as shown below:
[2.3.1-RELEASE][root@]/root: top -nCHSIzs1
last pid: 28317;  load averages:  4.07,  4.23,  4.37  up 2+11:40:04    16:22:50
311 processes: 9 running, 282 sleeping, 20 waiting
Mem: 31M Active, 502M Inact, 385M Wired, 883M Buf, 5020M Free
Swap:

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
    0 root     -92    -     0K   240K CPU1    1  21.8H  99.37% kernel{nfe0 taskq}
    0 root     -92    -     0K   240K CPU2    2  29.4H  73.29% kernel{em0 taskq}
    0 root     -92    -     0K   240K CPU0    0  18.6H  44.78% kernel{em1 taskq}
   12 root     -72    -     0K   336K WAIT    0  65:15  14.60% intr{swi1: netisr 0}
  438 nobody    22    0 30184K  4404K select  3 121:18   4.79% dnsmasq
28430 root      21    0 43756K 17440K kqread  3  51:40   1.46% haproxy
   12 root     -72    -     0K   336K WAIT    0  31:51   1.37% intr{swi1: pfsync}
90479 root      20    0 25720K  7176K select  2  23:43   0.59% openvpn
49607 root      20    0 14516K  2320K select  0  28:31   0.29% syslogd
30713 root      20    0 16676K  2736K bpf     0  18:55   0.10% filterlog
28317 root      21    0 21856K  2992K CPU2    2   0:00   0.10% top
Firewall rules, service configuration, IPsec tunnels, etc. are configured the same on both nodes. We have also compared system values (tunables, /boot/loader.conf*, and runtime sysctl values) across both nodes. So this looks like an issue related to kernel code changes/enhancements in pfSense 2.3.
Both nodes are identical hardware, consisting of:
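For anyone wanting to repeat the sysctl comparison between the two nodes, here is a minimal sketch in Python. It assumes each node's `sysctl -a` output has been saved as text; the function names and the sample values in the usage part are made up for illustration and are not taken from these machines. Since the nodes run different FreeBSD releases (10.1 vs 10.3), some names will naturally exist on one side only, so those are reported separately.

```python
def parse_sysctl(text):
    """Map each `name: value` line of a `sysctl -a` dump to {name: value}."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        # Skip continuation lines of multi-line values and malformed lines.
        if not sep or " " in name:
            continue
        values[name] = value.strip()
    return values


def diff_sysctl(text_a, text_b):
    """Return (changed, only_in_a, only_in_b) for two sysctl dumps."""
    a, b = parse_sysctl(text_a), parse_sysctl(text_b)
    changed = {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return changed, only_a, only_b


if __name__ == "__main__":
    # Hypothetical dumps; in practice read the files saved on each node.
    node1 = "kern.ipc.nmbclusters: 131072\nnet.inet.ip.forwarding: 1\n"
    node2 = "kern.ipc.nmbclusters: 262144\nnet.inet.ip.forwarding: 1\n"
    changed, only_a, only_b = diff_sysctl(node1, node2)
    for name, (old, new) in sorted(changed.items()):
        print(f"{name}: node-1={old} node-2={new}")
```

The sketch ignores multi-line sysctl values, which is usually acceptable when the goal is only to spot differing tunables.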
FreeBSD 10.3-RELEASE-p3 #2 1988fec(RELENG_2_3_1): Wed May 25 14:14:46 CDT 2016
    root@ce23-amd64-builder:/builder/pfsense-231/tmp/obj/builder/pfsense-231/tmp/FreeBSD-src/sys/pfSense amd64
FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
CPU: Dual-Core AMD Opteron(tm) Processor 2216 (2393.69-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x40f12  Family=0xf  Model=0x41  Stepping=2
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x2001<SSE3,CX16>
  AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x1f<LAHF,CMP,SVM,ExtAPIC,CR8>
  SVM: NAsids=64
real memory  = 6442450944 (6144 MB)
avail memory = 6194679808 (5907 MB)
Event timer "LAPIC" quality 400
ACPI APIC Table: <SUN X4200 M2>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 2 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
 cpu2 (AP): APIC ID: 2
 cpu3 (AP): APIC ID: 3
The network interface cards available on both nodes are as follows:
nfe0: NVIDIA nForce4 CK804 MCP9 Networking Adapter
nfe1: NVIDIA nForce4 CK804 MCP9 Networking Adapter
em0: Intel(R) PRO/1000 (82546EB)
em1: Intel(R) PRO/1000 (82546EB)
lagg0: LACP lagg with em0, em1 & nfe0 attached.
The pfSense version running at node-1 (the one not yet upgraded) is:
[2.2.3-RELEASE][root@]/root: uname -a
FreeBSD 10.1-RELEASE-p13 FreeBSD 10.1-RELEASE-p13 #0 c77d1b2(releng/10.1)-dirty: Tue Jun 23 17:00:47 CDT 2015 root@pfs22-amd64-builder:/usr/obj.amd64/usr/pfSensesrc/src/sys/pfSense_SMP.10 amd64
The pfSense version running at node-2 (the upgraded one) is:
[2.3.1-RELEASE][root@]/root: uname -a
FreeBSD xxxx.aaa.com 10.3-RELEASE-p3 FreeBSD 10.3-RELEASE-p3 #2 1988fec(RELENG_2_3_1): Wed May 25 14:14:46 CDT 2016 root@ce23-amd64-builder:/builder/pfsense-231/tmp/obj/builder/pfsense-231/tmp/FreeBSD-src/sys/pfSense amd64
Here are some additional statistics we’ve collected, in case they help with the diagnosis:
- Packets with errors on the em0 and em1 interfaces
[2.3.1-RELEASE][root@]/root: sysctl dev.em.0.mac_stats. | grep 'buff\|missed'
dev.em.0.mac_stats.recv_no_buff: 28924720
dev.em.0.mac_stats.missed_packets: 1109472
[2.3.1-RELEASE][root@]/root: sysctl dev.em.1.mac_stats. | grep 'buff\|missed'
dev.em.1.mac_stats.recv_no_buff: 2873803
dev.em.1.mac_stats.missed_packets: 79003
- Network errors
[2.3.1-RELEASE][root@]/root: netstat -ihw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
       52k    30      0        32M        55k     0        36M     0
       46k     0      0        24M        48k     0        30M     0
       50k    64      0        31M        54k     0        35M     0
       45k    35      0        26M        48k     0        31M     0
       48k    19      0        28M        52k     0        33M     0
       45k     2      0        29M        48k     0        33M     0
       50k     0      0        30M        53k     0        35M     0
       50k     0      0        33M        53k     0        37M     0
       43k     9      0        28M        45k     0        32M     0
       53k    12      0        34M        56k     0        39M     0
       50k     0      0        30M        53k     0        34M     0
       44k     0      0        26M        47k     0        30M     0
- The number of interrupts is very high for the NICs
[2.3.1-RELEASE][root@]/root: vmstat -i
interrupt                          total       rate
irq44: nfe1                     68575583        318
irq4: uart0                         2298          0
irq14: ata0                       143914          0
irq20: ohci0                          26          0
irq21: ehci0                           2          0
irq22: nfe0                    225192527       1044
irq56: em0                     121230546        562
irq57: em1                     305005131       1414
cpu0:timer                     242940061       1126
irq256: mpt0                      763114          3
cpu1:timer                      95960989        445
cpu2:timer                     135271696        627
cpu3:timer                     133771488        620
Total                         1328857375       6164
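To put the interrupt figures in proportion, the `vmstat -i` output above can be summarised per driver. The following is a rough Python sketch, not part of the original diagnosis: the function names are hypothetical, and the driver prefixes (`em`, `nfe`) are simply the NICs listed in this report. The sample data is the output shown above.

```python
import re

# `vmstat -i` output from node-2, as posted in this report.
SAMPLE = """\
interrupt                          total       rate
irq44: nfe1                     68575583        318
irq4: uart0                         2298          0
irq14: ata0                       143914          0
irq20: ohci0                          26          0
irq21: ehci0                           2          0
irq22: nfe0                    225192527       1044
irq56: em0                     121230546        562
irq57: em1                     305005131       1414
cpu0:timer                     242940061       1126
irq256: mpt0                      763114          3
cpu1:timer                      95960989        445
cpu2:timer                     135271696        627
cpu3:timer                     133771488        620
Total                         1328857375       6164
"""


def parse_vmstat_i(text):
    """Return {source_name: (total, rate)} for each data row."""
    rows = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)
        # Header and blank lines do not end in two integers; skip them.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        rows[parts[0].strip()] = (int(parts[1]), int(parts[2]))
    return rows


def nic_rate(rows, drivers=("em", "nfe")):
    """Sum the interrupt rate of sources whose device name matches a driver."""
    pattern = re.compile(r"\b(?:%s)\d+$" % "|".join(drivers))
    return sum(rate for name, (_, rate) in rows.items() if pattern.search(name))


if __name__ == "__main__":
    rows = parse_vmstat_i(SAMPLE)
    nics = nic_rate(rows)
    total = rows["Total"][1]
    print(f"NIC interrupts/s: {nics} of {total} ({nics / total:.0%})")
```

Note that the rate column of `vmstat -i` is an average since boot, so it can understate the rate during a traffic peak; sampling the totals twice and taking the difference would give a current figure.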
We have been investigating whether a package in pfSense 2.3.1 or FreeBSD 10.3-RELEASE-p3 could be causing this, but we have been unable to determine the cause of the problem.