Bug #7149

closed

igb driver queue related crashes

Added by Tobias Wigand about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Interfaces
Target version:
Start date: 01/22/2017
Due date:
% Done: 100%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: 2.4
Affected Architecture: amd64

Description

Some 2.4 installations crash seemingly at random; the crashes appear to be related to igb driver queues.
Setting

hw.igb.num_queues=1

seems to stabilize the systems.
At least the SG 2440 (https://forum.pfsense.org/index.php?topic=124225.0) and Supermicro C2758 boards (https://forum.pfsense.org/index.php?topic=123957.0) are affected.


Files

pfsensedmesg.txt (13.6 KB), added by Philipp Haefelfinger, 01/25/2017 04:45 PM
Actions #1

Updated by Rolf Sommerhalder about 7 years ago

On Supermicro SuperServers 5018D-FN8T with X10SDV-TP8F motherboards, which have six igb and two ix NICs, we also experience random crashes once every one or two days.

Until now, we suspected that OpenBGP might trigger these crashes, as we receive full feeds via BGP and inject and update on the order of 700k routes in the kernel routing table.

This morning, we have added this potential work around on three systems. No crashes yet.

We will also cross-check if it also has an effect on the issue with Link Aggregation (LAGG) using igb NICs
https://redmine.pfsense.org/issues/7119

Actions #2

Updated by Rolf Sommerhalder about 7 years ago

Rolf Sommerhalder wrote:
...

This morning, we have added this potential work around on three systems. No crashes yet.

Fortunately, no crashes to report since adding the following work-around almost 24 hours ago, based also on this discussion https://forum.pfsense.org/index.php?topic=121212.0

[2.4.0-BETA][admin@fw]/boot: cat /boot/loader.conf.local 
hw.igb.num_queues="1" 
hw.ix.num_queues="1" 

Notes:
- By default, hw.igb.num_queues="0" which means that the igb driver configures the number of queues automatically
https://www.freebsd.org/cgi/man.cgi?query=igb&apropos=0&sektion=0&manpath=FreeBSD+11.0-RELEASE+and+Ports&arch=default&format=html
- The X10SDV-TP8F motherboard has two ix NICs. As the default is hw.ix.num_queues="8", we also restrict it to one queue per ix NIC, as for the igb NICs.

Important: Reboot after this change, then verify:

[2.4.0-BETA][admin@fw]/boot: sysctl -a | grep num_queue
vfs.aio.num_queue_count: 0
hw.ix.num_queues: 1
hw.igb.num_queues: 1
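
To make the check above repeatable, the value configured in the loader file can be compared with the value the running kernel reports. This is only a minimal sketch: the function names loader_tunable and check_tunable are made up for illustration, and it assumes the file uses simple name="value" lines as in the example above.

```sh
#!/bin/sh
# Hypothetical helpers, not part of pfSense or FreeBSD.

# Print the value of one tunable from a loader.conf-style file.
# Usage: loader_tunable FILE NAME
loader_tunable() {
    awk -F= -v name="$2" '
        {
            key = $1; val = $2
            # Trim surrounding blanks from the name.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            # Trim surrounding blanks and double quotes from the value.
            gsub(/^[ \t"]+|[ \t"]+$/, "", val)
            # The last matching line wins.
            if (key == name) found = val
        }
        END { if (found != "") print found }
    ' "$1"
}

# Compare the configured value with the value the running kernel reports.
# Usage: check_tunable FILE NAME
check_tunable() {
    want=$(loader_tunable "$1" "$2")
    have=$(sysctl -n "$2" 2>/dev/null)
    if [ "$want" = "$have" ]; then
        echo "OK $2=$have"
    else
        echo "MISMATCH $2: file=$want running=$have"
    fi
}
```

After the reboot, one would call for example: check_tunable /boot/loader.conf.local hw.igb.num_queues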

Also, you may want to watch interrupt rates, CPU usage per igb0:que, errors, etc.:

 vmstat -i
 top -H -S
 netstat -ni
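
As a rough way to confirm that each igb NIC really ended up with a single queue, the interrupt list can be reduced to a per-NIC queue count. This is a sketch under assumptions: igb_queue_count is a made-up helper name, and it assumes that vmstat -i lists one line per queue containing a name such as igb0:que, as referred to above.

```sh
#!/bin/sh
# Hypothetical helper: read `vmstat -i` style output on stdin and print
# how many queue interrupt lines each igb NIC has, one "igbN:que COUNT"
# pair per line. The "igbN:que" naming is an assumption about the output.
igb_queue_count() {
    grep -Eo 'igb[0-9]+:que' | sort | uniq -c | awk '{ print $2, $1 }'
}
```

It would be used as: vmstat -i | igb_queue_count
With the work-around active, every NIC listed should show a count of 1.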

We will also cross-check if it also has an effect on the issue with Link Aggregation (LAGG) using igb NICs
https://redmine.pfsense.org/issues/7119

Unfortunately, this work-around does not solve that issue. Changes to LAGG interfaces still frequently "wedge" the firewall.

Actions #3

Updated by Philipp Haefelfinger about 7 years ago

I can confirm this issue on my box as well.

I have 6 igb (Intel PRO/1000) interfaces (4 on the ASUS mainboard and 2 on an Intel 2-port NIC).
The box randomly crashed without a trace. In many cases, there was no information about what happened, and I had to hard reset the box via IPMI.
Reproducing this behavior was easy: I just had to run a speedtest of my WAN connection to trigger the crash / hang.
As soon as I added queues=1 to loader.conf.local, the crash was no longer reproducible.

I submitted the few crash reports I got, so you may find a clue in there. The dmesg output is attached below.
If I can help with more information, please let me know.

Hope this helps.

Actions #4

Updated by Renato Botelho about 7 years ago

  • Assignee set to Luiz Souza
Actions #6

Updated by Luiz Souza about 7 years ago

  • Status changed from New to Feedback

This commit fixes a few obvious issues in igb: https://github.com/pfsense/FreeBSD-src/commit/215ddb035593bc4cee275b9dbbf8fc3a7579aee1

Please update and test for regressions.

Actions #7

Updated by Tobias Wigand about 7 years ago

Updated to the latest snapshot (Mon Jan 30 22:08:41 CST 2017), set queues to 2 and tried this on a DMZ host for a few minutes:

hping3 -c 100 -d 120 -S -w 64 -p 443 --flood 192.168.xx.10

No crash this time. I'll monitor the system closely and report back should it crash again.

Actions #8

Updated by Tobias Wigand about 7 years ago

After completely removing the queues entry from loader.conf.local and more than 5 days of uptime, I think this issue is resolved, at least for my Supermicro board.

Actions #9

Updated by Luiz Souza about 7 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100

Fixed.
