Bug #7149

igb driver queue related crashes

Added by Tobias Wigand 8 months ago. Updated 7 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Interfaces
Target version:
Start date: 01/22/2017
Due date:
% Done: 100%
Affected version: 2.4
Affected Architecture: amd64

Description

Some 2.4 installations crash at random, apparently in connection with igb driver queues.
Setting

hw.igb.num_queues=1

seems to stabilize the systems.
At least SG 2440 (https://forum.pfsense.org/index.php?topic=124225.0) and Supermicro C2758 boards (https://forum.pfsense.org/index.php?topic=123957.0) are affected.

pfsensedmesg.txt (13.6 KB) Philipp Haefelfinger, 01/25/2017 04:45 PM

History

#1 Updated by Rolf Sommerhalder 8 months ago

On Supermicro SuperServers 5018D-FN8T with X10SDV-TP8F motherboards, which feature six igb and two ix NICs, we also experience random crashes once every one or two days.

So far, we suspected that OpenBGP might trigger these crashes, as we get full feeds via BGP and inject and update on the order of 700k routes in the kernel routing table.

This morning, we added this potential workaround on three systems. No crashes yet.

We will also check whether it affects the issue with Link Aggregation (LAGG) using igb NICs:
https://redmine.pfsense.org/issues/7119

#2 Updated by Rolf Sommerhalder 8 months ago

Rolf Sommerhalder wrote:
...

This morning, we added this potential workaround on three systems. No crashes yet.

Fortunately, no crashes to report since adding the following work-around almost 24 hours ago, based also on this discussion https://forum.pfsense.org/index.php?topic=121212.0

[2.4.0-BETA][admin@fw]/boot: cat /boot/loader.conf.local 
hw.igb.num_queues="1" 
hw.ix.num_queues="1" 

Notes:
- By default, hw.igb.num_queues="0" which means that the igb driver configures the number of queues automatically
https://www.freebsd.org/cgi/man.cgi?query=igb&apropos=0&sektion=0&manpath=FreeBSD+11.0-RELEASE+and+Ports&arch=default&format=html
- The X10SDV-TP8F motherboard has two ix NICs. As the default is hw.ix.num_queues="8", we also restrict it to one queue per ix NIC, as for the igb NICs.
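
For anyone applying the same entries on several systems, the edit to loader.conf.local can be made repeatable. The following is only a sketch: set_tunable is a hypothetical helper name, not part of pfSense or FreeBSD, and it assumes the file uses the plain name="value" form shown above.

```shell
#!/bin/sh
# set_tunable FILE NAME VALUE  (hypothetical helper, not a pfSense tool)
# Ensures FILE contains exactly one line NAME="VALUE", replacing any
# earlier assignment of NAME and leaving all other lines untouched.
set_tunable() {
    file=$1
    name=$2
    value=$3
    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    # Keep every line that does not start with NAME= (literal match, no regex).
    awk -v n="${name}=" 'index($0, n) != 1' "$file" > "$tmp"
    # Append the new assignment in loader.conf syntax.
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example use on the firewall (run as root):
# set_tunable /boot/loader.conf.local hw.igb.num_queues 1
# set_tunable /boot/loader.conf.local hw.ix.num_queues 1
```

Running it twice with the same arguments leaves a single entry, so it is safe to re-run.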

Important: Reboot after this change, then verify:

[2.4.0-BETA][admin@fw]/boot: sysctl -a | grep num_queue
vfs.aio.num_queue_count: 0
hw.ix.num_queues: 1
hw.igb.num_queues: 1
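
The same check can be scripted, which helps when several firewalls need verifying after the reboot. This is a minimal sketch: check_num_queues is a hypothetical helper name, and it assumes sysctl prints the "name: value" form shown above.

```shell
#!/bin/sh
# check_num_queues EXPECTED  (hypothetical helper)
# Reads sysctl-style "name: value" lines on stdin and succeeds only if at
# least one hw.<driver>.num_queues tunable is present and all of them
# equal EXPECTED. Mismatches are printed as "MISMATCH name: value".
check_num_queues() {
    awk -v want="$1" '
        /^hw\.[a-z]+\.num_queues: / {
            seen++
            if ($2 != want) { printf "MISMATCH %s %s\n", $1, $2; bad++ }
        }
        END { exit (bad > 0 || seen == 0) ? 1 : 0 }
    '
}

# Example use on the firewall:
# sysctl -a | check_num_queues 1 && echo "queues OK"
```

Unrelated names such as vfs.aio.num_queue_count are ignored because only hw.*.num_queues lines are matched.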

Also, you may want to watch interrupt rates, CPU usage per igb0:que, errors, etc.:

 vmstat -i
 top -H -S
 netstat -ni
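
To pick out just the per-queue interrupt counters from vmstat -i, a small filter can help. This is a sketch under an assumption: that the interrupt lines look like "irq264: igb0:que 0 <total> <rate>". The exact format may differ between FreeBSD versions, and igb_queue_irqs is a hypothetical name.

```shell
#!/bin/sh
# igb_queue_irqs  (hypothetical helper)
# Reads `vmstat -i` output on stdin and prints one line per igb queue:
#   <nic>:que <queue-number> <total-interrupts> <rate>
# Assumes the second field has the form igbN:que, as in "igb0:que".
igb_queue_irqs() {
    awk '$2 ~ /^igb[0-9]+:que$/ { print $2, $3, $(NF-1), $NF }'
}

# Example use on the firewall:
# vmstat -i | igb_queue_irqs
```

With hw.igb.num_queues=1 in effect, each NIC should show only its "que 0" line.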

We will also check whether it affects the issue with Link Aggregation (LAGG) using igb NICs:
https://redmine.pfsense.org/issues/7119

Unfortunately, this workaround does not solve that issue. Changes to LAGG interfaces still frequently "wedge" the firewall.

#3 Updated by Philipp Haefelfinger 8 months ago

I can confirm this issue on my box as well.

I have 6 igb (Intel PRO/1000) interfaces (4 on the ASUS mainboard and 2 on an Intel 2-port NIC).
The box randomly crashed without a trace. In many cases, there was no information about what had happened, and I had to hard reset the box via IPMI.
Reproducing this behavior was easy: I just had to run a speed test of my WAN connection to trigger the crash / hang.
As soon as I added the queues=1 setting to loader.conf.local, the crash was no longer reproducible.

I submitted the few crash reports I got, so you may find a clue in there. The dmesg output is attached to this issue.
If I can help with more information, please let me know.

Hope this helps.

#4 Updated by Renato Botelho 8 months ago

  • Assignee set to Luiz Souza

#6 Updated by Luiz Souza 8 months ago

  • Status changed from New to Feedback

This commit fixes a few obvious issues in igb: https://github.com/pfsense/FreeBSD-src/commit/215ddb035593bc4cee275b9dbbf8fc3a7579aee1

Please update and test for regressions.

#7 Updated by Tobias Wigand 8 months ago

Updated to the latest snapshot (Mon Jan 30 22:08:41 CST 2017), set queues to 2, and ran this against a DMZ host for a few minutes:

hping3 -c 100 -d 120 -S -w 64 -p 443 --flood 192.168.xx.10

No crash this time. I'll monitor the system closely and report back should it crash again.

#8 Updated by Tobias Wigand 8 months ago

After completely removing the queues entry from loader.conf.local and more than 5 days of uptime, I think this issue is resolved, at least for my Supermicro board.

#9 Updated by Luiz Souza 7 months ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100

Fixed.
