Bug #4310
closedLimiters + HA results in hangs on secondary
Added by Chris Buechler almost 10 years ago. Updated over 6 years ago.
100%
Description
Configuring limiters on a firewall rule in 2.2 on a system using HA results in a kernel panic reboot loop. To replicate, on a basic HA setup with config sync and pfsync enabled, add a pair of limiters and put them on the default LAN rule. It'll panic upon applying the changes, and do so again after rebooting in an endless loop.
Files
redmine4310-crash.txt (167 KB) redmine4310-crash.txt | Chris Buechler, 06/20/2015 07:38 PM | ||
crash report 2.txt (162 KB) crash report 2.txt | Bernardo Pádua, 07/09/2015 06:38 PM | ||
crash report 1.txt (302 KB) crash report 1.txt | Bernardo Pádua, 07/09/2015 06:38 PM | ||
02.03.2016 22_18.txt (160 KB) 02.03.2016 22_18.txt | crash report | Manfred Bongard, 03/03/2016 04:30 AM | |
02.03.2016 22_18.txt (160 KB) 02.03.2016 22_18.txt | William St.Denis, 03/04/2016 10:06 AM | ||
Crash_02.04.2016.txt (150 KB) Crash_02.04.2016.txt | William St.Denis, 03/04/2016 10:10 AM | ||
pfsense_crashlog.txt (188 KB) pfsense_crashlog.txt | crashlog | Fabrizio Pappolla, 02/19/2018 05:29 AM |
Updated by Ermal Luçi almost 10 years ago
I think this happens because CARP packets are being sent to dummynet.
Before the kernel patch prevented this from happening.
Will investigate and fix it accordingly.
Updated by Ermal Luçi almost 10 years ago
- Status changed from Confirmed to Feedback
Patch committed.
Updated by Vitaliy Isarev almost 10 years ago
Hi, I have the same issue. I tried to update to the latest maintance version, but receive error after upgrade: "shared object libpcre.so.1 not found required by php"
Updated by Vitaliy Isarev almost 10 years ago
Ermal Luçi wrote:
Patch committed.
Can you post a link to the patch
Updated by Steve Wheeler over 9 years ago
We are seeing a number of reports that this is still an issue in 2.2.1. At least one customer ticket and also: https://forum.pfsense.org/index.php?topic=87541.0
Even if there is no longer a panic, though there is for some, the machine is no longer usable with the logs spammed with the erro message in the patch.
Updated by Jim Pingle over 9 years ago
- Status changed from Resolved to Confirmed
- Target version changed from 2.2.1 to 2.2.2
This is still a problem. Some cases still work but with TONS of console/log spam about pfsync_undefer_state rendering the console and logs practically unusable. There are still a couple people reporting panics as well, though they don't seem to be making it to the crash reporter (other reports are submitting fine, however).
Somewhat related, there appears to be a logic issue in the pfsync enable checkbox for those who upgraded from 2.0.x or before (before the HA settings were moved). The GUI shows the option unchecked, but the empty tag is in the config which is causing pfsync to be enabled. This can lead to these errors showing up on non-HA units until they enable and then disable pfsync.
Updated by Ermal Luçi over 9 years ago
- Status changed from Confirmed to Feedback
I pushed the messages under debug misc level and also another change to fix the root cause for it.
Updated by Chris Linstruth over 9 years ago
Looks good here. Not stressing it but enabling/disabling limiters on the cluster works, the limiters are doing what I ask, and the limited states are syncing. Thanks.
Updated by Chris Buechler over 9 years ago
Chris: that still working fine for you?
After running for a few hours, the secondary still hangs in one of our test setups. Console is non-responsive, ESX shows VM using 100% CPU. That happens after about 4 hours of run time. The primary is fine throughout, and the limiters work on both v4 and v6.
Updated by Chris Linstruth over 9 years ago
I haven't seen anything else but please understand that this is on a test bench not in production and I am not stressing it at all. If you look at ticket UJN-78146 you will see my description of a crash I have seen since starting to test yesterday's snapshot. Not sure if it is related. The HA pair has been up since but with no more than about 150 states.
Updated by Chris Linstruth over 9 years ago
A bit more info. See this thread:
https://forum.pfsense.org/index.php?topic=92128.0
Turning off the limiters makes that NAT translation work.
Updated by Chris Buechler over 9 years ago
- Target version changed from 2.2.2 to 2.2.3
this is better, though still the issue where the secondary may hit 100% CPU and hang in some circumstance. We'll revisit.
The issue with reflection and limiters is #4590
Updated by Ermal Luçi over 9 years ago
Patch was committed for this on tools repo and also the defer option in pfsync is now not used.
Both can be considered as the root cause of the issue here.
Updated by Chris Buechler over 9 years ago
- Subject changed from Limiters + HA results in kernel panic to Limiters + HA results in hangs on secondary
- Status changed from Feedback to Confirmed
- Affected Version changed from 2.2 to 2.2.x
no change, still hangs secondary within a couple hours
Updated by Ermal Luçi over 9 years ago
- Assignee changed from Ermal Luçi to Chris Buechler
Chris need to confirm this happens still or not.
Updated by Chris Buechler over 9 years ago
- Status changed from Confirmed to Feedback
I'm pretty sure it doesn't happen anymore, still have the test setup running to make sure. Given another ~48 hours, if it's still not an issue, this will be safe to consider fixed.
Updated by Chris Buechler over 9 years ago
- Status changed from Feedback to Confirmed
- Assignee changed from Chris Buechler to Ermal Luçi
no change, as long as you have some traffic passing through a limiter, the secondary hangs within ~1-4 hours.
Updated by Chris Buechler over 9 years ago
- File redmine4310-crash.txt redmine4310-crash.txt added
- Target version changed from 2.2.3 to 2.3
Tried after changing both hosts to use unicast pfsync, which had no impact. It seems to alternate between hanging the secondary, and triggering a kernel panic. Thus far using unicast, the secondary has only kernel paniced, not hung consuming 100% CPU. Same kernel panic happened on occasion when using multicast pfsync so not sure that's actually changed.
crash report attached
Updated by Bernardo Pádua over 9 years ago
- File crash report 2.txt crash report 2.txt added
- File crash report 1.txt crash report 1.txt added
This is also happening to me. I though the issue with the limiters was fixed in 2.2.2 and 2.2.3, so I posted a duplicate ticket on #4823. But I've now disabled my limiters and saw that the backup/secondary firewall stopped crashing.
I'm posting my crash dumps (two different times the backup crashed) here in case they are of any help.
Updated by Jim Thompson about 9 years ago
- Assignee changed from Ermal Luçi to Luiz Souza
Updated by James Starowitz almost 9 years ago
lastnight rolled out our 2.2.5 units using c2758s in HA, the units worked fine in a test lab, although once i put it into production the backup router would hang to the point that it could not access the webui, the hang occurs nearly every 5-10minutes, the backup router reboots and then crashs again soon after.
i disabled "state sync" on both the master and the backup and it stopped crashing.
because we use limiters per source ip each client ip has its own limit, as a result the HFSC work around cant really do what im doing with limiters
we rarely go into failover, but when we do seamless state transfers are a life saver.
i submitted the crash reports and opened a support ticket if you need the details.
Updated by Lee Shiry almost 9 years ago
This problem seems to get worse after upgrading to 2.2.6. Now the secondary still hangs even with the limiters and state sync disabled.
Updated by Manfred Bongard over 8 years ago
- File 02.03.2016 22_18.txt 02.03.2016 22_18.txt added
I have the same issue.
Panic on my HA-Backup between 10 minutes and 3 hours. I have send you several crash reports via gui.
Limiter + Vlan + NAT
Updated by William St.Denis over 8 years ago
- File 02.03.2016 22_18.txt 02.03.2016 22_18.txt added
I have noticed this issue as well. We have to disable sync when using limiters because it's crashing the system. I have attached my log as well
I am running Limiter + Vlan + NAT as well. When we started running limiters we noticed the WEB UI started to slow down and were getting an error /rc.filter_synchronize: An authentication failure occurred then we got the kernel panic.
Updated by William St.Denis over 8 years ago
- File Crash_02.04.2016.txt Crash_02.04.2016.txt added
Sorry wrong log. Here is the correct one
Updated by Luiz Souza over 8 years ago
- Target version changed from 2.3 to 2.3.1
Updated by Mikhail Platonov over 8 years ago
I have the same issue.
ha-backup crashed after 7 min
Updated by William St.Denis over 8 years ago
Does anyone have a work around to keep limiters and sync working? The only option I have come up with is to disable limiters or disable sync both aren't great.
Updated by Chris Buechler over 8 years ago
William St.Denis wrote:
Does anyone have a work around to keep limiters and sync working? The only option I have come up with is to disable limiters or disable sync both aren't great.
Those are your only two options for the time being. ALTQ can often be used in the same way as limiters and doesn't have such issues. Post to the forum if you'd like to discuss further.
Updated by Chris Buechler over 8 years ago
- Target version changed from 2.3.1 to 2.3.2
Updated by Jose Duarte over 8 years ago
From the tests we ran for the last couple of days we saw kernel panic using limiters in multiple vlans but no impact when using different queues inside those limiters.
Updated by Chris Buechler over 8 years ago
- Target version changed from 2.3.2 to 2.4.0
Updated by Luiz Souza almost 8 years ago
2.4 has a few new fixes for use-after-free pfsync states. The limiters issue is also fixed.
Updated by Jim Pingle almost 8 years ago
I updated a test cluster to a snapshot from a couple hours ago, which from the timestamp looks like it should have this fix, and both nodes got stuck in a panic loop.
Version String: FreeBSD 11.0-RELEASE-p3 #233 8ae63e9(RELENG_2_4): Sat Dec 10 03:56:41 CST 2016 root@buildbot2.netgate.com:/builder/ce/tmp/obj/builder/ce/tmp/FreeBSD-src/sys/pfSense Panic String: pfsync_undefer_state: unable to find deferred state
Same panic string on both nodes, slightly different backtrace.
Primary:
db:0:kdb.enter.default> bt Tracing pid 12 tid 100044 td 0xfffff80003500500 kdb_enter() at kdb_enter+0x3b/frame 0xfffffe001a62e430 vpanic() at vpanic+0x19f/frame 0xfffffe001a62e4b0 panic() at panic+0x43/frame 0xfffffe001a62e510 pfsync_update_state() at pfsync_update_state+0x45b/frame 0xfffffe001a62e560 pf_test() at pf_test+0x1bcc/frame 0xfffffe001a62e7d0 pf_check_in() at pf_check_in+0x1d/frame 0xfffffe001a62e7f0 pfil_run_hooks() at pfil_run_hooks+0x8c/frame 0xfffffe001a62e880 ip_input() at ip_input+0x3eb/frame 0xfffffe001a62e8e0 netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a62e940 ether_demux() at ether_demux+0x15c/frame 0xfffffe001a62e970 ether_nh_input() at ether_nh_input+0x317/frame 0xfffffe001a62e9d0 netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a62ea30 ether_input() at ether_input+0x26/frame 0xfffffe001a62ea50 vmxnet3_rxq_eof() at vmxnet3_rxq_eof+0x708/frame 0xfffffe001a62eae0 vmxnet3_legacy_intr() at vmxnet3_legacy_intr+0x110/frame 0xfffffe001a62eb20 intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe001a62eb60 ithread_loop() at ithread_loop+0xc6/frame 0xfffffe001a62ebb0 fork_exit() at fork_exit+0x85/frame 0xfffffe001a62ebf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001a62ebf0
Secondary:
db:0:kdb.enter.default> bt Tracing pid 12 tid 100032 td 0xfffff800034c1500 kdb_enter() at kdb_enter+0x3b/frame 0xfffffe001a3cf240 vpanic() at vpanic+0x19f/frame 0xfffffe001a3cf2c0 panic() at panic+0x43/frame 0xfffffe001a3cf320 pfsync_update_state() at pfsync_update_state+0x45b/frame 0xfffffe001a3cf370 pf_test() at pf_test+0x245b/frame 0xfffffe001a3cf5e0 pf_check_out() at pf_check_out+0x1d/frame 0xfffffe001a3cf600 pfil_run_hooks() at pfil_run_hooks+0x8c/frame 0xfffffe001a3cf690 ip_output() at ip_output+0xd8b/frame 0xfffffe001a3cf7e0 ip_forward() at ip_forward+0x36b/frame 0xfffffe001a3cf880 ip_input() at ip_input+0x6da/frame 0xfffffe001a3cf8e0 netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a3cf940 ether_demux() at ether_demux+0x15c/frame 0xfffffe001a3cf970 ether_nh_input() at ether_nh_input+0x317/frame 0xfffffe001a3cf9d0 netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a3cfa30 ether_input() at ether_input+0x26/frame 0xfffffe001a3cfa50 vmxnet3_rxq_eof() at vmxnet3_rxq_eof+0x708/frame 0xfffffe001a3cfae0 vmxnet3_legacy_intr() at vmxnet3_legacy_intr+0x110/frame 0xfffffe001a3cfb20 intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe001a3cfb60 ithread_loop() at ithread_loop+0xc6/frame 0xfffffe001a3cfbb0 fork_exit() at fork_exit+0x85/frame 0xfffffe001a3cfbf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001a3cfbf0 --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Updated by Vladimir Usov almost 8 years ago
Dear Luiz! Can we expect real fix in 2.4? We are waiting for it too long, and this is a really critical problem, since in a corporate environment you will always use both - HA and limiters. br Vladimir
Updated by James Kohout almost 8 years ago
I would agree with Vladimir. Just would like to know if this will be definitely be fixed in 2.4 or pushed out further.
Thanks
Updated by Jose Duarte almost 8 years ago
One more here, we always have limiters and HA and we are forced to use the queues. If someone makes a mistake of assigning a main limiter to a rule instant kernel panic...
Updated by Steve Y over 7 years ago
We are not noticing our secondary (which is also a VM) hang. However, our one limited rule traffic ends overnight, so possibly it recovers after the messages end?
Reiterating from the referenced forum thread, there is a checkbox "No pfSync" on firewall rules, but checking that doesn't avoid the error message. Nor does setting "State type" to None.
We didn't see this issue until upgrading from 2.2.6 to 2.3.1_5.
Updated by James Webb over 7 years ago
Still Producing issues for me. Had to re-install pfSense on both devices in HA after encountering this bug.
Updated by Sean Huggans over 7 years ago
Experiencing this after updating from 2.1.5 to 2.3.4. Constant Kernel messages in system logs as: "pfsync_undefer_state: unable to find deferred state".
We had a limiter in place to limit bandwidth of our backup server when replicating through an IPSec tunnel to a backup server offsite in order to prevent packet loss caused by taking up all the bandwidth of our WAN.
Didn't actually notice until users reported not being able to access resources on the other side of the tunnel - apparently once backup replications started to the remote host, it killed the tunnel it was replicating through/being limited on somehow.
Any plans to resolve this issue? Limiters are a very useful feature, as is HA obviously.
Updated by Matthew Brown over 7 years ago
Hmmm.... this is very much no not ideal. :( I was going to do this in a new environment as we have soft limits in our datacenters. It would be very useful to simply limit our incoming and outgoing speed to a set amount. If we are over our allotted speed for more than x number of hours we will be forced to upgrade our connection Speed.
Does anyone know if this issue was fixed with the release of 2.4? I don't really want to install "bleeding edge" tech, but I also don't want to have to tell my boss our database networking costs will double because we are 1MBps's over limit. ^^;;;
Updated by Scott Rosenberg over 7 years ago
Has this had any development recently?
This is the primary reason I can't use limiters in my HA setup, and the assignee hasn't commented in 6 months.
Updated by Jose Duarte over 7 years ago
For those still with problems you can use limiters in HA with any version w/out kernel panic but for that you need additional configuration.
1. Create a new limiter for both upload and download with the bandwidth limit. Name it with the name you want and _donotuse at the end (just for safety)
2. Create a new Queue inside of each limiter (When inside of the limiter "Add New Queue" green button)
3. Name the queues with the vlan/rule name and the bandwidth you set in the limit and with _up or _down (for reference) and set the weight to 100 for that queue to use 100% of the limiter
4. Assign the queues you created to the rules you want to limit the bandwidth. MAKE SURE YOU ASSIGN THE QUEUE AND NOT THE LIMITER, IF YOU CHOOSE THE LIMITER YOU WILL HAVE THE KERNEL PANIC IN THE 2nd MEMBER. That's why it's a better practice to use the name _donotuse in the limiters.
Notes:
You still need to create 1xlimiter + 1xqueue per each flow per rule
If you assign the same queues to multiple rules they will share the same "roof" defined in the limiter
You can create multiple queues for one limiter with different weight, very useful if you want to have, for example, a top limit of 400Mbit and give rule1 guaranteed 10% of those, rule2 50% and rule3 40%. If all of the rules/queues are being maxed out you will have a perfect bandwidth balance. If for example rule 2 and 3 don't have any traffic rule1 will be able to use the 400Mbit since we only define a guaranteed minimum.
Cheers.
Updated by Lars Jorgensen over 7 years ago
Jose Duarte wrote:
For those still with problems you can use limiters in HA with any version w/out kernel panic but for that you need additional configuration.
Thank you!
Confirmed working here. Great load off my chest as running without HA was never very fun.
Lars
Updated by Renato Botelho about 7 years ago
- Target version changed from 2.4.0 to 2.4.1
Updated by Jim Pingle about 7 years ago
- Target version changed from 2.4.1 to 2.4.2
Updated by Sander Naudts about 7 years ago
Why not change target version to 2.9.9... sorry just little frustrating that this doesn't get fixed.
Updated by Jim Pingle about 7 years ago
We expected to have more time before 2.4.1 but we need to have it out in a week or so, there isn't time to get to this and the other things we have to address for it.
And if you read above, there is a viable workaround if you use queues/child limiters and not the limiters directly.
Updated by Lars Jorgensen about 7 years ago
Sander Naudts wrote:
Why not change target version to 2.9.9... sorry just little frustrating that this doesn't get fixed.
It's not that much of a problem as long as you use the workaround described in comment #44. I've been running HA with limiters without any problems for three months now.
Updated by Jim Pingle about 7 years ago
- Target version changed from 2.4.2 to 2.4.3
Updated by Eero Volotinen almost 7 years ago
Lars Jorgensen wrote:
Sander Naudts wrote:
Why not change target version to 2.9.9... sorry just little frustrating that this doesn't get fixed.
It's not that much of a problem as long as you use the workaround described in comment #44. I've been running HA with limiters without any problems for three months now.
still issue with 2.4.2 .. please at least add note to gui that you cannot use pfsync with limiters. It saves lot of time .. it took something like 2 days to figure, why ha units were crashing..
Updated by Luiz Souza almost 7 years ago
- Status changed from Confirmed to Feedback
- % Done changed from 0 to 100
The crash is fixed on the last snapshot.
Tests are welcome.
Updated by Fabrizio Pappolla almost 7 years ago
- File pfsense_crashlog.txt pfsense_crashlog.txt added
Before open a new ticket, i will try here since the error looks really similar. My pfSense got bootloop, the problem was caused by a black out, the error was: "kernel panic pfsync_undefer_state: unable to find deferred state". I have not HA on, only limiter and PRIQ. Attacched you can find the crash log. pfSense Version 2.4.2-RELEASE-p1 (amd64)
Updated by Jim Pingle almost 7 years ago
Fabrizio Pappolla wrote:
Before open a new ticket, i will try here since the error looks really similar. My pfSense got bootloop, the problem was caused by a black out, the error was: "kernel panic pfsync_undefer_state: unable to find deferred state". I have not HA on, only limiter and PRIQ. Attacched you can find the crash log. pfSense Version 2.4.2-RELEASE-p1 (amd64)
The backtrace shows pfsync, so you must have that active. This has been fixed on 2.4.3, so additional problem reports on anything older are not helpful. Upgrade to a 2.4.3 snapshot and see if it is more stable there.
Updated by Jim Pingle over 6 years ago
- Status changed from Feedback to Resolved
Confirmed working by multiple tests and users.