Bug #7166

During bandwidth test on 4860 with 2.4, got Fatal trap 12: page fault while in kernel mode

Added by Constantine Kormashev 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Operating System
Target version:
Start date:
01/27/2017
Due date:
% Done:

100%

Affected version:
2.4
Affected Architecture:

Description

During a bandwidth test on a 4860 running today's 2.4 build, I got `Fatal trap 12: page fault while in kernel mode`
FreeBSD pfSense.localdomain 11.0-RELEASE-p6 FreeBSD 11.0-RELEASE-p6 #85 8370c2ed409(RELENG_2_4): Thu Jan 26 14:39:07 CST 2017 :/builder/ce/tmp/obj/builder/ce/tmp/FreeBSD-src/sys/pfSense amd64
The trace is attached.
There are no settings besides IPs on LAN/WAN, 1-2 rules on both interfaces, and a couple of routes.
Perhaps this is the same as https://redmine.pfsense.org/issues/6257

4860-trace.log (96.8 KB) Constantine Kormashev, 01/27/2017 06:51 AM

dns.pcap (226 Bytes) Constantine Kormashev, 01/27/2017 09:14 AM

exchange.pcap (10.5 KB) Constantine Kormashev, 01/27/2017 09:14 AM

citrix.pcap (87.8 KB) Constantine Kormashev, 01/27/2017 09:14 AM

http_browsing.pcap (34.6 KB) Constantine Kormashev, 01/27/2017 09:14 AM

http_get.pcap (41.8 KB) Constantine Kormashev, 01/27/2017 09:14 AM

https.pcap (170 KB) Constantine Kormashev, 01/27/2017 09:14 AM

Oracle.pcap (60.7 KB) Constantine Kormashev, 01/27/2017 09:14 AM

mail_pop.pcap (15.9 KB) Constantine Kormashev, 01/27/2017 09:14 AM

rtp_250k_rtp_only_1.pcap (164 KB) Constantine Kormashev, 01/27/2017 09:14 AM

rtp_160k.pcap (1.1 MB) Constantine Kormashev, 01/27/2017 09:14 AM

History

#1 Updated by Constantine Kormashev 8 months ago

I can reproduce this bug.
It happens when I use a special traffic pattern for Cisco TRex which includes several pcaps with real traffic:

Oracle.pcap
Video_Calls.pcap
rtp_160k.pcap
rtp_250k_rtp_only_1.pcap
rtp_250k_rtp_only_2.pcap
smtp.pcap
Voice_calls_rtp_only.pcap
citrix.pcap
dns.pcap
exchange.pcap
http_browsing.pcap
http_get.pcap
http_post.pcap
https.pcap
mail_pop.pcap

I also noticed that simple UDP packets of varying sizes, even a huge number of them, cannot reproduce this error.
For the traffic pattern above, the greater the traffic volume, the faster the bug happens.

#2 Updated by Constantine Kormashev 8 months ago

Adding "hw.igb.num_queues=1" to /boot/local.conf works around this issue:
sysctl hw.igb.num_queues
hw.igb.num_queues: 1
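As a minimal sketch of applying the workaround (a scratch file stands in for the loader configuration file named above, which only takes effect after a reboot):

```shell
# Persist the tunable so the igb(4) driver uses a single queue.
# A temporary file is used here for illustration; on the real device,
# append the line to the loader config file named in the comment above
# and reboot.
conf="$(mktemp)"
echo 'hw.igb.num_queues=1' >> "$conf"

# Confirm the line is present; on a live system, verify the running value
# with: sysctl hw.igb.num_queues
grep -c '^hw.igb.num_queues=1$' "$conf"   # → 1
rm -f "$conf"
```

On the running system, the `sysctl hw.igb.num_queues` output shown above confirms the tunable took effect.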

#3 Updated by Luiz Souza 8 months ago

Seems like a known bug in FreeBSD (or something like it): https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=208409#c11

This is also a duplicate of #7149 (I'll keep both open for now since they contain different information about the bug).

#4 Updated by Luiz Souza 8 months ago

The FreeBSD PR also suggests that disabling LEGACY_TX support (and ALTQ support altogether) would fix the crashes.

#5 Updated by Luiz Souza 8 months ago

  • Status changed from New to Feedback

This commit fixes a few obvious issues in igb: https://github.com/pfsense/FreeBSD-src/commit/215ddb035593bc4cee275b9dbbf8fc3a7579aee1

Please update and test for regressions.

#6 Updated by Vladimir Lind 8 months ago

Tests repeated as instructed by Constantine: the SG4860 did not crash with the 2.4 build from Mon Jan 30 22:08:41 CST 2017.

#7 Updated by Constantine Kormashev 8 months ago

I noticed that with the new firmware the SG4860 uses about 25% more CPU than the previous version.
Now it is at 185% CPU idle, but earlier it was 305% CPU idle. Interrupts consume 212% per 1 Gbit/s flow.

Typical `top -CSP` output for the new firmware:

last pid: 13314; load averages: 2.12, 2.18, 1.85 up 1+23:27:59 06:55:00
70 processes: 2 running, 65 sleeping, 2 zombie, 1 waiting
CPU 0: 0.0% user, 0.0% nice, 0.0% system, 80.9% interrupt, 19.1% idle
CPU 1: 0.0% user, 0.0% nice, 0.0% system, 82.7% interrupt, 17.3% idle
CPU 2: 0.0% user, 0.0% nice, 0.6% system, 29.0% interrupt, 70.4% idle
CPU 3: 0.0% user, 0.0% nice, 0.6% system, 19.8% interrupt, 79.6% idle
Mem: 7600K Active, 103M Inact, 260M Wired, 26M Buf, 7511M Free
Swap: 1459M Total, 1459M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   12 root         45 -64    -     0K   720K WAIT   -1 333:05 211.69% intr
   11 root          4 155 ki31     0K    64K RUN     0 183.0H 186.44% idle
    7 root          1 -16    -     0K    16K -       2   1:55   0.80% rand_harv
    0 root         37 -16    -     0K   592K swapin  3  10:56   0.62% kernel
57695 root          1  20    0 20012K  3320K CPU3    3   0:02   0.18% top
    6 root          1 -16    -     0K    16K pftm    2   1:05   0.15% pf purge
   15 root          5 -68    -     0K    80K -       0   0:19   0.03% usb
    4 root          2 -16    -     0K    32K -       2   0:05   0.01% cam
18020 root          1  20    0 37616K  8032K kqread  1   0:07   0.01% nginx
   24 root          1  16    -     0K    16K syncer  3   0:07   0.01% syncer
   22 root          2 -16    -     0K    32K psleep  2   0:03   0.01% bufdaemon
18743 root          2  20    0 29120K 12888K select  2   0:16   0.01% ntpd
18284 root          1  20    0 12468K  2348K nanslp  3   0:00   0.01% cron
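A rough way to see how many igb(4) queue vectors are active is to count the per-queue interrupt lines in `vmstat -i` output. A sketch, using made-up sample lines rather than output from this device:

```shell
# Count igb(4) per-queue interrupt vectors. The sample text below is
# illustrative only; on a live FreeBSD system, pipe `vmstat -i` in
# instead of the here-string.
sample='irq264: igb0:que 0
irq265: igb0:que 1
irq266: igb0:link
irq267: igb1:que 0
irq268: igb1:que 1'
printf '%s\n' "$sample" | grep -c 'igb0:que'   # → 2 queues active on igb0
```

With `hw.igb.num_queues=1` in effect, one would expect a single `que` line per interface.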

#8 Updated by Constantine Kormashev 8 months ago

Constantine Kormashev wrote:

I noticed that with the new firmware the SG4860 uses about 25% more CPU than the previous version.
Now it is at 185% CPU idle, but earlier it was 305% CPU idle. Interrupts consume 212% per 1 Gbit/s flow.

And there is the same picture for other traffic types: 290% idle instead of 360% idle for 1518-byte frames.

And there is huge performance degradation for small frames (64 bytes): 105000 pps instead of 198000 pps; and for random frame sizes (64-1518 bytes): 110000 pps instead of 133000 pps.

#9 Updated by Luiz Souza 8 months ago

This may be the tradeoff of the fix. It does not actually disable the multiple queues, but only one of them is going to be used, and because of that there are cases where you have to drop the lock on one CPU and acquire it on another CPU; this has a price...

I'll check the fix with the FreeBSD people; maybe someone will come up with a better fix.

#10 Updated by Luiz Souza 7 months ago

The next build has a different fix for this issue; it probably has better performance too.

Could you please check what the performance degradation, if any, is with this new fix?

#11 Updated by Constantine Kormashev 7 months ago

I updated the 4860 to the latest firmware and ran the tests, and I got very good results.
There is no problem with performance, and I could not reproduce the issue that led to the kernel panic.
I tested the device for several hours and did not notice any trouble.

#12 Updated by Luiz Souza 7 months ago

Thank you again Constantine!

I'll upstream this fix.

#13 Updated by Luiz Souza 7 months ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100

Fixed.
