Bug #7149 (closed)
igb driver queue related crashes
% Done: 100%
Description
Some 2.4 installations crash seemingly at random, and the crashes appear to be related to the igb driver queues.
Setting
hw.igb.num_queues=1
seems to stabilize the systems.
Affected are at least SG 2440 (https://forum.pfsense.org/index.php?topic=124225.0) and Supermicro C2758 boards (https://forum.pfsense.org/index.php?topic=123957.0)
Updated by Rolf Sommerhalder almost 8 years ago
On Supermicro SuperServers 5018D-FN8T with X10SDV-TP8F motherboards, which feature six igb and two ix NICs, we also experience random crashes about once every one or two days.
So far, we suspected that OpenBGP might trigger these crashes, as we receive full feeds via BGP and inject and update on the order of 700k routes in the kernel routing table.
This morning, we have added this potential work around on three systems. No crashes yet.
We will also cross-check if it also has an effect on the issue with Link Aggregation (LAGG) using igb NICs
https://redmine.pfsense.org/issues/7119
Updated by Rolf Sommerhalder almost 8 years ago
Rolf Sommerhalder wrote:
...
This morning, we have added this potential work around on three systems. No crashes yet.
Fortunately, there are no crashes to report since adding the following work-around almost 24 hours ago, which is also based on this discussion: https://forum.pfsense.org/index.php?topic=121212.0
[2.4.0-BETA][admin@fw]/boot: cat /boot/loader.conf.local
hw.igb.num_queues="1"
hw.ix.num_queues="1"
Notes:
- By default, hw.igb.num_queues="0" which means that the igb driver configures the number of queues automatically
https://www.freebsd.org/cgi/man.cgi?query=igb&apropos=0&sektion=0&manpath=FreeBSD+11.0-RELEASE+and+Ports&arch=default&format=html
- The X10SDV-TP8F motherboard has two ix NICs. As the default is hw.ix.num_queues="8", we also restrict it to one queue per ix NIC, as for the igb NICs.
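For anyone applying this to several boxes, here is a minimal sketch of setting the tunables idempotently, so that running it twice does not leave duplicate lines. The function name set_loader_tunable is made up for this example and is not a pfSense tool; it only assumes a POSIX sh and awk, and that the file uses the name="value" form shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of pfSense): set NAME="VALUE" in a
# loader.conf-style FILE, replacing any earlier assignment of NAME.
set_loader_tunable() {
    file="$1"
    name="$2"
    value="$3"
    tmp="${file}.tmp.$$"

    # Create the file if it does not exist yet.
    [ -f "$file" ] || : > "$file"

    # Keep every line that does not start with NAME= (plain string
    # comparison, so the dots in the tunable name are not wildcards).
    awk -v prefix="${name}=" 'index($0, prefix) != 1' "$file" > "$tmp"

    # Append the new assignment and replace the original file.
    printf '%s="%s"\n' "$name" "$value" >> "$tmp"
    mv "$tmp" "$file"
}
```

Usage would be something like set_loader_tunable /boot/loader.conf.local hw.igb.num_queues 1, and the same again for hw.ix.num_queues. Other lines in the file are left untouched.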
Important: Reboot after this change, then verify:
[2.4.0-BETA][admin@fw]/boot: sysctl -a | grep num_queue
vfs.aio.num_queue_count: 0
hw.ix.num_queues: 1
hw.igb.num_queues: 1
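If you want to script this check instead of reading the output by eye, something along these lines might do. It is only a sketch: check_num_queues is a hypothetical name, and it assumes the "name: value" output format shown above, read from stdin.

```shell
#!/bin/sh
# Hypothetical helper: read `sysctl -a`-style lines on stdin and check that
# hw.igb.num_queues and hw.ix.num_queues (whichever are present) equal the
# expected value given as the first argument.
# Exit status: 0 = all match, 1 = at least one mismatch, 2 = none found.
check_num_queues() {
    awk -v want="$1" '
        /^hw\.(igb|ix)\.num_queues: / {
            seen++
            if ($2 != want) {
                printf "%s %s (expected %s)\n", $1, $2, want
                bad++
            }
        }
        END {
            if (seen == 0) {
                print "no igb/ix num_queues tunables found"
                exit 2
            }
            exit (bad > 0)
        }'
}
```

After the reboot, run it as sysctl -a | check_num_queues 1. It prints nothing and returns 0 when both drivers report one queue.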
Also, you may want to watch interrupt rates, CPU usage per igb0:que, errors, etc.:
vmstat -i
top -H -S
netstat -ni
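To see at a glance how many queues each NIC is actually using, the per-queue interrupt lines can be summed up per interface. This is a rough sketch only: queue_interrupt_totals is a made-up name, and it assumes that vmstat -i prints lines of the form "irqNNN: igb0:que 0 <total> <rate>", so check the column layout on your own system first.

```shell
#!/bin/sh
# Hypothetical helper: read `vmstat -i` output on stdin and print, per NIC,
# the number of queue interrupt lines and the sum of their totals.
# Assumed line layout: "irq264: igb0:que 0 <total> <rate>".
queue_interrupt_totals() {
    awk '
        $2 ~ /:que$/ {
            split($2, parts, ":")            # parts[1] is the NIC, e.g. igb0
            queues[parts[1]]++
            total[parts[1]] += $(NF - 1)     # second-to-last column: total
        }
        END {
            for (nic in total)
                printf "%s queues=%d interrupts=%.0f\n", nic, queues[nic], total[nic]
        }' | sort
}
```

Run it as vmstat -i | queue_interrupt_totals. With the work-around in place, each igb and ix interface should show up with queues=1.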
We will also cross-check if it also has an effect on the issue with Link Aggregation (LAGG) using igb NICs
https://redmine.pfsense.org/issues/7119
Unfortunately, this work-around does not solve that issue. Changes to LAGG interfaces still frequently "wedge" the firewall.
Updated by Philipp Haefelfinger almost 8 years ago
- File pfsensedmesg.txt added
I can confirm this issue on my box as well.
I have 6 igb (Intel PRO/1000) interfaces (4 on the ASUS mainboard and 2 on an Intel 2-port NIC).
The box randomly crashed without a trace. In many cases, there was no information about what happened and I had to hard reset the box via IPMI.
Reproducing this behavior was easy. I just had to run a speedtest of my WAN connection to trigger the crash / hang.
As soon as I added the num_queues=1 settings to loader.conf.local, the crash was no longer reproducible.
I submitted the few crash reports I got, so you may find a clue in there. The dmesg output is attached.
If I can help with more information, please let me know.
Hope this helps.
Updated by Luiz Souza almost 8 years ago
- Status changed from New to Feedback
This commit fixes a few obvious issues in igb: https://github.com/pfsense/FreeBSD-src/commit/215ddb035593bc4cee275b9dbbf8fc3a7579aee1
Please update and test for regressions.
Updated by Anonymous almost 8 years ago
Updated to the latest snapshot (Mon Jan 30 22:08:41 CST 2017), set queues to 2, and ran this against a DMZ host for a few minutes:
hping3 -c 100 -d 120 -S -w 64 -p 443 --flood 192.168.xx.10
No crash this time. I'll monitor the system closely and report back should it crash again.
Updated by Anonymous almost 8 years ago
After completely removing the queues entry in loader.conf.local and more than 5 days uptime, I think this issue is resolved. At least for my Supermicro board.
Updated by Luiz Souza almost 8 years ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
Fixed.