Bug #7166
closed
During bandwidth test 4860 with 2.4 got Fatal trap 12: page fault while in kernel mode
Added by Constantine Kormashev almost 8 years ago. Updated almost 8 years ago.
100%
Description
During a bandwidth test of the 4860 on today's 2.4 build, I got `Fatal trap 12: page fault while in kernel mode`
FreeBSD pfSense.localdomain 11.0-RELEASE-p6 FreeBSD 11.0-RELEASE-p6 #85 8370c2ed409(RELENG_2_4): Thu Jan 26 14:39:07 CST 2017 root@buildbot2.netgate.com:/builder/ce/tmp/obj/builder/ce/tmp/FreeBSD-src/sys/pfSense amd64
The trace is attached.
No settings are configured besides IP addresses on LAN/WAN, 1-2 rules on each interface, and a couple of routes.
Possibly the same as https://redmine.pfsense.org/issues/6257
Files
4860-trace.log (96.8 KB) | Constantine Kormashev, 01/27/2017 06:51 AM
dns.pcap (226 Bytes) | Constantine Kormashev, 01/27/2017 09:14 AM
exchange.pcap (10.5 KB) | Constantine Kormashev, 01/27/2017 09:14 AM
citrix.pcap (87.8 KB) | Constantine Kormashev, 01/27/2017 09:14 AM
http_browsing.pcap (34.6 KB) | Constantine Kormashev, 01/27/2017 09:14 AM
http_get.pcap (41.8 KB) | Constantine Kormashev, 01/27/2017 09:14 AM
https.pcap (170 KB) | Constantine Kormashev, 01/27/2017 09:14 AM
Oracle.pcap (60.7 KB) | Constantine Kormashev, 01/27/2017 09:14 AM
mail_pop.pcap (15.9 KB) | Constantine Kormashev, 01/27/2017 09:14 AM
rtp_250k_rtp_only_1.pcap (164 KB) | Constantine Kormashev, 01/27/2017 09:14 AM
rtp_160k.pcap (1.1 MB) | Constantine Kormashev, 01/27/2017 09:14 AM
Updated by Constantine Kormashev almost 8 years ago
- File citrix.pcap added
- File dns.pcap added
- File exchange.pcap added
- File http_browsing.pcap added
- File http_get.pcap added
- File https.pcap added
- File mail_pop.pcap added
- File Oracle.pcap added
- File rtp_160k.pcap added
- File rtp_250k_rtp_only_1.pcap added
I can reproduce this bug.
It happens when I use a special traffic pattern for Cisco TRex that includes several pcaps with real traffic:
Oracle.pcap
Video_Calls.pcap
rtp_160k.pcap
rtp_250k_rtp_only_1.pcap
rtp_250k_rtp_only_2.pcap
smtp.pcap
Voice_calls_rtp_only.pcap
citrix.pcap
dns.pcap
exchange.pcap
http_browsing.pcap
http_get.pcap
http_post.pcap
https.pcap
mail_pop.pcap
I also noticed that plain UDP packets of different sizes, even in huge numbers, cannot reproduce this error.
With the traffic pattern listed above, the higher the volume, the faster the bug occurs.
Updated by Constantine Kormashev almost 8 years ago
Adding "hw.igb.num_queues=1" to /boot/local.conf helps resolve this issue.
sysctl hw.igb.num_queues
hw.igb.num_queues: 1
Updated by Luiz Souza almost 8 years ago
Seems like a known bug in FreeBSD (or sort of): https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208409#c11
This is also a duplicate of #7149 (I'll keep both open for now, as they have different information about the bug).
Updated by Luiz Souza almost 8 years ago
The FreeBSD PR also suggests that disabling LEGACY_TX support (and ALTQ support altogether) would fix the crashes.
Updated by Luiz Souza almost 8 years ago
- Status changed from New to Feedback
This commit fixes a few obvious issues in igb: https://github.com/pfsense/FreeBSD-src/commit/215ddb035593bc4cee275b9dbbf8fc3a7579aee1
Please update and test for regressions.
Updated by Vladimir Lind almost 8 years ago
Tests were repeated as instructed by Constantine: the SG4860 did not crash with 2.4 built on Mon Jan 30 22:08:41 CST 2017
Updated by Constantine Kormashev almost 8 years ago
I noticed that with the new firmware the SG4860 uses about 25% more CPU resources than with the previous version.
Now it shows 185% CPU idle, but earlier it was 305% CPU idle. Interrupts take 212% CPU per 1G/s flow.
Typical `top -CSP` output for the new firmware:
last pid: 13314; load averages: 2.12, 2.18, 1.85 up 1+23:27:59 06:55:00
70 processes: 2 running, 65 sleeping, 2 zombie, 1 waiting
CPU 0: 0.0% user, 0.0% nice, 0.0% system, 80.9% interrupt, 19.1% idle
CPU 1: 0.0% user, 0.0% nice, 0.0% system, 82.7% interrupt, 17.3% idle
CPU 2: 0.0% user, 0.0% nice, 0.6% system, 29.0% interrupt, 70.4% idle
CPU 3: 0.0% user, 0.0% nice, 0.6% system, 19.8% interrupt, 79.6% idle
Mem: 7600K Active, 103M Inact, 260M Wired, 26M Buf, 7511M Free
Swap: 1459M Total, 1459M Free
PID USERNAME THR PRI NICE SIZE RES STATE C TIME CPU COMMAND
12 root 45 -64 - 0K 720K WAIT -1 333:05 211.69% intr
11 root 4 155 ki31 0K 64K RUN 0 183.0H 186.44% idle
7 root 1 -16 - 0K 16K - 2 1:55 0.80% rand_harv
0 root 37 -16 - 0K 592K swapin 3 10:56 0.62% kernel
57695 root 1 20 0 20012K 3320K CPU3 3 0:02 0.18% top
6 root 1 -16 - 0K 16K pftm 2 1:05 0.15% pf purge
15 root 5 -68 - 0K 80K - 0 0:19 0.03% usb
4 root 2 -16 - 0K 32K - 2 0:05 0.01% cam
18020 root 1 20 0 37616K 8032K kqread 1 0:07 0.01% nginx
24 root 1 16 - 0K 16K syncer 3 0:07 0.01% syncer
22 root 2 -16 - 0K 32K psleep 2 0:03 0.01% bufdaemon
18743 root 2 20 0 29120K 12888K select 2 0:16 0.01% ntpd
18284 root 1 20 0 12468K 2348K nanslp 3 0:00 0.01% cron
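As a rough cross-check of the figures quoted above, the per-CPU lines of the `top -CSP` output can be summed to see where the roughly 212% interrupt and 186% idle totals come from. This is only a sketch: the helper name `sum_cpu_states` is hypothetical, and the input is the four per-CPU lines copied from the output above.

```python
import re

# Per-CPU lines copied verbatim from the top -CSP output in this comment.
TOP_CPU_LINES = """\
CPU 0: 0.0% user, 0.0% nice, 0.0% system, 80.9% interrupt, 19.1% idle
CPU 1: 0.0% user, 0.0% nice, 0.0% system, 82.7% interrupt, 17.3% idle
CPU 2: 0.0% user, 0.0% nice, 0.6% system, 29.0% interrupt, 70.4% idle
CPU 3: 0.0% user, 0.0% nice, 0.6% system, 19.8% interrupt, 79.6% idle
"""


def sum_cpu_states(text):
    """Sum each CPU state (user, nice, system, interrupt, idle) across all CPUs.

    Hypothetical helper for illustration; it is not part of pfSense or top.
    """
    totals = {}
    for line in text.splitlines():
        if not line.startswith("CPU "):
            continue
        # Each state appears as "<number>% <state>", e.g. "80.9% interrupt".
        for value, state in re.findall(r"([\d.]+)% (\w+)", line):
            totals[state] = totals.get(state, 0.0) + float(value)
    return totals


totals = sum_cpu_states(TOP_CPU_LINES)
for state, value in totals.items():
    print(f"{state}: {value:.1f}%")
```

On a 4-core box the maximum for each state is 400%, so the summed interrupt and idle values line up with the `intr` and `idle` rows in the process list.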
Updated by Constantine Kormashev almost 8 years ago
Constantine Kormashev wrote:
I noticed that with the new firmware the SG4860 uses about 25% more CPU resources than with the previous version.
Now it shows 185% CPU idle, but earlier it was 305% CPU idle. Interrupts take 212% CPU per 1G/s flow.
The picture is the same for other traffic types: 290% idle instead of 360% idle for 1518-byte frames.
There is also a huge performance degradation for small frames (64 bytes): 105000 pps instead of 198000 pps, and for random frame sizes (64-1518 bytes): 110000 pps instead of 133000 pps.
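To put the numbers above on a common scale, the relative drop for each measurement can be computed as (before - after) / before. This is a minimal sketch using only the figures reported in this comment; the helper name `percent_drop` and the labels are illustrative.

```python
def percent_drop(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# (previous firmware, new firmware), taken from the figures reported above.
measurements = {
    "64-byte frames (pps)": (198000, 105000),
    "random 64-1518-byte frames (pps)": (133000, 110000),
    "idle CPU, pcap traffic mix (%)": (305, 185),
    "idle CPU, 1518-byte frames (%)": (360, 290),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {percent_drop(before, after):.1f}% lower")
```

The small-frame case is by far the worst hit, which is where per-packet overhead dominates.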
Updated by Luiz Souza almost 8 years ago
This may be the tradeoff of the fix. In reality it won't disable the multiple queues, but only one is going to be used, and because of that there are cases where you have to drop the lock on one CPU and acquire it on another CPU. This has a price...
I'll check the fix with FreeBSD; maybe someone will come up with a better fix.
Updated by Luiz Souza almost 8 years ago
The next build has a different fix for this issue; it probably has better performance too.
Could you please check what the degradation is, if any, with this new fix?
Updated by Constantine Kormashev almost 8 years ago
I updated the 4860 to the latest firmware and ran the tests, and I got very good results.
There is no performance problem, and I could not reproduce the issue that led to the kernel panic.
I tested the device for several hours and did not notice any trouble.
Updated by Luiz Souza almost 8 years ago
Thank you again Constantine!
I'll upstream this fix.
Updated by Luiz Souza almost 8 years ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
Fixed.