Bug #7166

During bandwidth test on 4860 with 2.4, got Fatal trap 12: page fault while in kernel mode

Added by Constantine Kormashev 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Operating System
Target version:
Start date:
01/27/2017
Due date:
% Done:

100%

Affected version:
2.4
Affected Architecture:

Description

During a bandwidth test on a 4860 running today's 2.4 build, I got `Fatal trap 12: page fault while in kernel mode`
FreeBSD pfSense.localdomain 11.0-RELEASE-p6 FreeBSD 11.0-RELEASE-p6 #85 8370c2ed409(RELENG_2_4): Thu Jan 26 14:39:07 CST 2017 :/builder/ce/tmp/obj/builder/ce/tmp/FreeBSD-src/sys/pfSense amd64
The trace is attached.
There are no settings besides IPs on LAN/WAN, 1-2 rules on both interfaces, and a couple of routes.
Perhaps this is the same as https://redmine.pfsense.org/issues/6257

4860-trace.log (96.8 KB) Constantine Kormashev, 01/27/2017 06:51 AM

dns.pcap (226 Bytes) Constantine Kormashev, 01/27/2017 09:14 AM

exchange.pcap (10.5 KB) Constantine Kormashev, 01/27/2017 09:14 AM

citrix.pcap (87.8 KB) Constantine Kormashev, 01/27/2017 09:14 AM

http_browsing.pcap (34.6 KB) Constantine Kormashev, 01/27/2017 09:14 AM

http_get.pcap (41.8 KB) Constantine Kormashev, 01/27/2017 09:14 AM

https.pcap (170 KB) Constantine Kormashev, 01/27/2017 09:14 AM

Oracle.pcap (60.7 KB) Constantine Kormashev, 01/27/2017 09:14 AM

mail_pop.pcap (15.9 KB) Constantine Kormashev, 01/27/2017 09:14 AM

rtp_250k_rtp_only_1.pcap (164 KB) Constantine Kormashev, 01/27/2017 09:14 AM

rtp_160k.pcap (1.1 MB) Constantine Kormashev, 01/27/2017 09:14 AM

History

#1 Updated by Constantine Kormashev 8 months ago

I can reproduce this bug.
It happens when I use a special traffic pattern for Cisco TRex which includes several pcaps with real traffic:

Oracle.pcap
Video_Calls.pcap
rtp_160k.pcap
rtp_250k_rtp_only_1.pcap
rtp_250k_rtp_only_2.pcap
smtp.pcap
Voice_calls_rtp_only.pcap
citrix.pcap
dns.pcap
exchange.pcap
http_browsing.pcap
http_get.pcap
http_post.pcap
https.pcap
mail_pop.pcap

I also noticed that simple UDP packets of varying sizes, even a huge number of them, cannot reproduce this error.
For the traffic pattern above, the greater the traffic volume, the faster the bug happens.

#2 Updated by Constantine Kormashev 8 months ago

Adding "hw.igb.num_queues=1" to /boot/local.conf works around this issue:
sysctl hw.igb.num_queues
hw.igb.num_queues: 1
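As a minimal sketch of applying the workaround (a scratch file stands in for the loader configuration file named above, which only takes effect after a reboot):

```shell
# Persist the tunable so the igb(4) driver uses a single queue.
# A temporary file is used here for illustration; on the real device,
# append the line to the loader config file named in the comment above
# and reboot.
conf="$(mktemp)"
echo 'hw.igb.num_queues=1' >> "$conf"

# Confirm the line is present; on a live system, verify the running value
# with: sysctl hw.igb.num_queues
grep -c '^hw.igb.num_queues=1$' "$conf"   # → 1
rm -f "$conf"
```

On the running system, the `sysctl hw.igb.num_queues` output shown above confirms the tunable took effect.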

#3 Updated by Luiz Souza 8 months ago

Seems like a known bug in FreeBSD (or something like it): https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208409#c11

This is also a duplicate of #7149 (I'll keep both open for now since they contain different information about the bug).

#4 Updated by Luiz Souza 8 months ago

The FreeBSD PR also suggests that disabling LEGACY_TX support (and ALTQ support altogether) would fix the crashes.

#5 Updated by Luiz Souza 8 months ago

  • Status changed from New to Feedback

This commit fixes a few obvious issues in igb: https://github.com/pfsense/FreeBSD-src/commit/215ddb035593bc4cee275b9dbbf8fc3a7579aee1

Please update and test for regressions.

#6 Updated by Vladimir Lind 8 months ago

Tests repeated as instructed by Constantine: the SG4860 did not crash with the 2.4 build from Mon Jan 30 22:08:41 CST 2017.

#7 Updated by Constantine Kormashev 8 months ago

I noticed that with the new firmware the SG4860 uses about 25% more CPU than the previous version.
Now it is at 185% CPU idle, but earlier it was 305% CPU idle. Interrupts consume 212% per 1 Gbit/s flow.

Typical `top -CSP` output for the new firmware:

last pid: 13314; load averages: 2.12, 2.18, 1.85 up 1+23:27:59 06:55:00
70 processes: 2 running, 65 sleeping, 2 zombie, 1 waiting
CPU 0: 0.0% user, 0.0% nice, 0.0% system, 80.9% interrupt, 19.1% idle
CPU 1: 0.0% user, 0.0% nice, 0.0% system, 82.7% interrupt, 17.3% idle
CPU 2: 0.0% user, 0.0% nice, 0.6% system, 29.0% interrupt, 70.4% idle
CPU 3: 0.0% user, 0.0% nice, 0.6% system, 19.8% interrupt, 79.6% idle
Mem: 7600K Active, 103M Inact, 260M Wired, 26M Buf, 7511M Free
Swap: 1459M Total, 1459M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   12 root         45 -64    -     0K   720K WAIT   -1 333:05 211.69% intr
   11 root          4 155 ki31     0K    64K RUN     0 183.0H 186.44% idle
    7 root          1 -16    -     0K    16K -       2   1:55   0.80% rand_harv
    0 root         37 -16    -     0K   592K swapin  3  10:56   0.62% kernel
57695 root          1  20    0 20012K  3320K CPU3    3   0:02   0.18% top
    6 root          1 -16    -     0K    16K pftm    2   1:05   0.15% pf purge
   15 root          5 -68    -     0K    80K -       0   0:19   0.03% usb
    4 root          2 -16    -     0K    32K -       2   0:05   0.01% cam
18020 root          1  20    0 37616K  8032K kqread  1   0:07   0.01% nginx
   24 root          1  16    -     0K    16K syncer  3   0:07   0.01% syncer
   22 root          2 -16    -     0K    32K psleep  2   0:03   0.01% bufdaemon
18743 root          2  20    0 29120K 12888K select  2   0:16   0.01% ntpd
18284 root          1  20    0 12468K  2348K nanslp  3   0:00   0.01% cron
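A rough way to see how many igb(4) queue vectors are active is to count the per-queue interrupt lines in `vmstat -i` output. A sketch, using made-up sample lines rather than output from this device:

```shell
# Count igb(4) per-queue interrupt vectors. The sample text below is
# illustrative only; on a live FreeBSD system, pipe `vmstat -i` in
# instead of the here-string.
sample='irq264: igb0:que 0
irq265: igb0:que 1
irq266: igb0:link
irq267: igb1:que 0
irq268: igb1:que 1'
printf '%s\n' "$sample" | grep -c 'igb0:que'   # → 2 queues active on igb0
```

With `hw.igb.num_queues=1` in effect, one would expect a single `que` line per interface.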

#8 Updated by Constantine Kormashev 8 months ago

Constantine Kormashev wrote:

I noticed that with the new firmware the SG4860 uses about 25% more CPU than the previous version.
Now it is at 185% CPU idle, but earlier it was 305% CPU idle. Interrupts consume 212% per 1 Gbit/s flow.

And there is the same picture for other traffic types: 290% idle instead of 360% idle for 1518-byte frames.

And there is huge performance degradation for small frames (64 bytes): 105000 pps instead of 198000 pps; and for random frame sizes (64-1518 bytes): 110000 pps instead of 133000 pps.

#9 Updated by Luiz Souza 8 months ago

This may be the tradeoff of the fix. It does not actually disable the multiple queues, but only one of them is going to be used, and because of that there are cases where you have to drop the lock on one CPU and acquire it on another CPU; this has a price...

I'll check the fix with the FreeBSD people; maybe someone will come up with a better fix.

#10 Updated by Luiz Souza 7 months ago

The next build has a different fix for this issue; it probably has better performance too.

Could you please check what the performance degradation, if any, is with this new fix?

#11 Updated by Constantine Kormashev 7 months ago

I updated the 4860 to the latest firmware and ran the tests, and I got very good results.
There is no problem with performance, and I could not reproduce the issue that led to the kernel panic.
I tested the device for several hours and did not notice any trouble.

#12 Updated by Luiz Souza 7 months ago

Thank you again Constantine!

I'll upstream this fix.

#13 Updated by Luiz Souza 7 months ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100

Fixed.
