Bug #4310

Limiters + HA results in hangs on secondary

Added by Chris Buechler over 2 years ago. Updated 15 days ago.

Status: Confirmed
Priority: Very High
Category: Limiters
Target version:
Start date: 01/27/2015
Due date:
% Done: 0%
Affected version: 2.2.x
Affected Architecture:

Description

Configuring limiters on a firewall rule in 2.2 on a system using HA results in a kernel panic reboot loop. To replicate, on a basic HA setup with config sync and pfsync enabled, add a pair of limiters and put them on the default LAN rule. It'll panic upon applying the changes, and do so again after rebooting in an endless loop.
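
For anyone who wants to check an exported config.xml for this combination before applying it to a live pair, here is a minimal, purely illustrative Python sketch. The tag names (filter/rule/dnpipe, filter/rule/pdnpipe, hasync/pfsyncenabled) are assumptions drawn from this thread and may differ between versions; treat it as a sketch, not a supported tool.

import sys
import xml.etree.ElementTree as ET

def check(path):
    root = ET.parse(path).getroot()

    # Rules that reference a limiter (in or out pipe). Tag names are assumed.
    limited_rules = [
        r for r in root.findall("./filter/rule")
        if r.find("dnpipe") is not None or r.find("pdnpipe") is not None
    ]

    # Comment #7 below notes that the mere presence of the (possibly empty)
    # pfsync tag is what enables state sync, so test presence, not value.
    pfsync_on = root.find("./hasync/pfsyncenabled") is not None

    if limited_rules and pfsync_on:
        print("limiters + pfsync both active: matches the scenario in this bug")
    else:
        print("combination not present")

if __name__ == "__main__":
    check(sys.argv[1] if len(sys.argv) > 1 else "config.xml")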

redmine4310-crash.txt (167 KB) Chris Buechler, 06/20/2015 07:38 PM

crash report 2.txt (162 KB) Bernardo Pádua, 07/09/2015 06:38 PM

crash report 1.txt (302 KB) Bernardo Pádua, 07/09/2015 06:38 PM

02.03.2016 22_18.txt - crash report (160 KB) Dirk Bongard, 03/03/2016 04:30 AM

02.03.2016 22_18.txt (160 KB) William St.Denis, 03/04/2016 10:06 AM

Crash_02.04.2016.txt (150 KB) William St.Denis, 03/04/2016 10:10 AM

History

#1 Updated by Ermal Luçi about 2 years ago

I think this happens because CARP packets are being sent to dummynet.
Previously, a kernel patch prevented this from happening.

Will investigate and fix it accordingly.
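
Purely as an illustration of the idea above (the real change is in the pf/dummynet kernel code, not in Python): CARP runs over IP protocol 112, and the fix amounts to never classifying such packets into a limiter pipe. A hypothetical sketch of that classification rule:

from typing import Optional

IPPROTO_CARP = 112  # CARP shares the VRRP protocol number

def pipe_for_packet(ip_proto: int, rule_pipe: Optional[int]) -> Optional[int]:
    """Return the dummynet pipe a packet should enter, or None to bypass."""
    if ip_proto == IPPROTO_CARP:
        return None      # HA heartbeats must bypass limiters entirely
    return rule_pipe     # all other traffic follows the matching rule's pipe

# A CARP advertisement bypasses the limiter; TCP (protocol 6) does not.
assert pipe_for_packet(IPPROTO_CARP, rule_pipe=1) is None
assert pipe_for_packet(6, rule_pipe=1) == 1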

#2 Updated by Ermal Luçi about 2 years ago

  • Status changed from Confirmed to Feedback

Patch committed.

#3 Updated by Vitaliy Isarev about 2 years ago

Hi, I have the same issue. I tried to update to the latest maintenance version, but received an error after the upgrade: "shared object libpcre.so.1 not found required by php"

#4 Updated by Vitaliy Isarev about 2 years ago

Ermal Luçi wrote:

Patch committed.

Can you post a link to the patch?

#5 Updated by Chris Buechler about 2 years ago

  • Status changed from Feedback to Resolved

fixed

#6 Updated by Steve Wheeler about 2 years ago

We are seeing a number of reports that this is still an issue in 2.2.1. At least one customer ticket and also: https://forum.pfsense.org/index.php?topic=87541.0
Even where there is no longer a panic (some users still see one), the machine is no longer usable because the logs are spammed with the error message added in the patch.

#7 Updated by Jim Pingle about 2 years ago

  • Status changed from Resolved to Confirmed
  • Target version changed from 2.2.1 to 2.2.2

This is still a problem. Some cases still work but with TONS of console/log spam about pfsync_undefer_state rendering the console and logs practically unusable. There are still a couple people reporting panics as well, though they don't seem to be making it to the crash reporter (other reports are submitting fine, however).

Somewhat related, there appears to be a logic issue in the pfsync enable checkbox for those who upgraded from 2.0.x or before (before the HA settings were moved). The GUI shows the option unchecked, but the empty tag is in the config, which causes pfsync to be enabled. This can lead to these errors showing up on non-HA units until they enable and then disable pfsync.
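
To make the checkbox mismatch concrete, here is a small hypothetical Python sketch of the assumed behaviour (the actual code is PHP): the GUI keys off the tag's value while the backend keys off its presence, so an empty leftover tag reads as unchecked yet still enables pfsync.

# Assumed behaviour only; illustrates the mismatch, not the real pfSense code.
config = {"hasync": {"pfsyncenabled": ""}}  # empty tag left by a 2.0.x upgrade

gui_checked = config["hasync"].get("pfsyncenabled") == "on"  # what the GUI shows
backend_on = "pfsyncenabled" in config["hasync"]             # what enables pfsync

print(gui_checked, backend_on)  # False True: unchecked in the GUI, pfsync still on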

#8 Updated by Ermal Luçi about 2 years ago

  • Status changed from Confirmed to Feedback

I moved the messages down to the debug/misc log level and also pushed another change to fix the root cause.

#9 Updated by Chris Linstruth about 2 years ago

Looks good here. Not stressing it but enabling/disabling limiters on the cluster works, the limiters are doing what I ask, and the limited states are syncing. Thanks.

#10 Updated by Chris Buechler about 2 years ago

Chris: is that still working fine for you?

After running for a few hours, the secondary still hangs in one of our test setups. The console is non-responsive, and ESX shows the VM using 100% CPU. That happens after about 4 hours of run time. The primary is fine throughout, and the limiters work on both v4 and v6.

#11 Updated by Chris Linstruth about 2 years ago

I haven't seen anything else, but please understand that this is on a test bench, not in production, and I am not stressing it at all. If you look at ticket UJN-78146 you will see my description of a crash I have seen since starting to test yesterday's snapshot. Not sure if it is related. The HA pair has been up since then, but with no more than about 150 states.

#12 Updated by Chris Linstruth about 2 years ago

A bit more info. See this thread:

https://forum.pfsense.org/index.php?topic=92128.0

Turning off the limiters makes that NAT translation work.

#13 Updated by Chris Buechler about 2 years ago

  • Target version changed from 2.2.2 to 2.2.3

This is better, though there is still an issue where the secondary may hit 100% CPU and hang in some circumstances. We'll revisit.

The issue with reflection and limiters is #4590

#14 Updated by Ermal Luçi almost 2 years ago

A patch was committed for this in the tools repo, and the defer option in pfsync is no longer used.
Both can be considered root causes of the issue here.

#15 Updated by Chris Buechler almost 2 years ago

  • Subject changed from Limiters + HA results in kernel panic to Limiters + HA results in hangs on secondary
  • Status changed from Feedback to Confirmed
  • Affected version changed from 2.2 to 2.2.x

No change; it still hangs the secondary within a couple of hours.

#16 Updated by Ermal Luçi almost 2 years ago

  • Assignee changed from Ermal Luçi to Chris Buechler

Chris needs to confirm whether this still happens.

#17 Updated by Chris Buechler almost 2 years ago

  • Status changed from Confirmed to Feedback

I'm pretty sure it doesn't happen anymore; I still have the test setup running to make sure. Give it another ~48 hours: if it's still not an issue, it will be safe to consider this fixed.

#18 Updated by Chris Buechler almost 2 years ago

  • Status changed from Feedback to Confirmed
  • Assignee changed from Chris Buechler to Ermal Luçi

No change: as long as some traffic is passing through a limiter, the secondary hangs within ~1-4 hours.

#19 Updated by Chris Buechler almost 2 years ago

Tried again after changing both hosts to use unicast pfsync, which had no impact. It seems to alternate between hanging the secondary and triggering a kernel panic. Thus far with unicast, the secondary has only kernel panicked, not hung consuming 100% CPU. The same kernel panic happened on occasion when using multicast pfsync, so I'm not sure that has actually changed.

crash report attached

#20 Updated by Bernardo Pádua almost 2 years ago

This is also happening to me. I thought the issue with the limiters was fixed in 2.2.2 and 2.2.3, so I posted a duplicate ticket as #4823. But I've now disabled my limiters and the backup/secondary firewall stopped crashing.

I'm posting my crash dumps (two different times the backup crashed) here in case they are of any help.

#21 Updated by Jim Thompson over 1 year ago

  • Assignee changed from Ermal Luçi to Luiz Otavio O Souza

#22 Updated by James Starowitz over 1 year ago

Last night we rolled out our 2.2.5 units using C2758s in HA. The units worked fine in a test lab, but once I put them into production the backup router would hang to the point that the web UI could not be accessed. The hang occurs nearly every 5-10 minutes; the backup router reboots and then crashes again soon after.

I disabled "state sync" on both the master and the backup and it stopped crashing.

Because we use limiters per source IP, each client IP has its own limit; as a result, the HFSC workaround can't really do what I'm doing with limiters.
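
For readers unfamiliar with per-source limiting: the point above is that every client IP gets its own independent cap (pfSense limiters do this in dummynet with a source-address mask). A rough, purely illustrative token-bucket sketch of that per-client bookkeeping, not how dummynet is actually implemented:

import time
from collections import defaultdict

RATE_BYTES_PER_SEC = 1_000_000  # hypothetical per-client cap (~8 Mbit/s)

class Bucket:
    """One token bucket per client, refilled at RATE_BYTES_PER_SEC."""
    def __init__(self):
        self.tokens = RATE_BYTES_PER_SEC
        self.last = time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        self.tokens = min(RATE_BYTES_PER_SEC,
                          self.tokens + (now - self.last) * RATE_BYTES_PER_SEC)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

buckets = defaultdict(Bucket)  # keyed by source IP: each client limited separately

def forward(src_ip, nbytes):
    return buckets[src_ip].allow(nbytes)  # True = pass now, False = queue or drop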

We rarely go into failover, but when we do, seamless state transfers are a lifesaver.

I submitted the crash reports and opened a support ticket if you need the details.

#23 Updated by Lee Shiry about 1 year ago

This problem seems to have gotten worse after upgrading to 2.2.6. Now the secondary still hangs even with the limiters and state sync disabled.

#24 Updated by Dirk Bongard about 1 year ago

I have the same issue.
The HA backup panics after anywhere between 10 minutes and 3 hours. I have sent you several crash reports via the GUI.
Limiter + VLAN + NAT

#25 Updated by William St.Denis about 1 year ago

I have noticed this issue as well. We have to disable sync when using limiters because it crashes the system. I have attached my log as well.
I am also running Limiter + VLAN + NAT. When we started running limiters, we noticed the web UI started to slow down and we were getting the error "/rc.filter_synchronize: An authentication failure occurred"; then we got the kernel panic.

#26 Updated by William St.Denis about 1 year ago

Sorry, wrong log. Here is the correct one.

#27 Updated by Luiz Otavio O Souza about 1 year ago

  • Target version changed from 2.3 to 2.3.1

#28 Updated by Mikhail Platonov about 1 year ago

I have the same issue.
The HA backup crashed after 7 minutes.

#29 Updated by William St.Denis about 1 year ago

Does anyone have a workaround to keep limiters and sync working? The only options I have come up with are to disable limiters or to disable sync; neither is great.

#30 Updated by Chris Buechler about 1 year ago

William St.Denis wrote:

Does anyone have a workaround to keep limiters and sync working? The only options I have come up with are to disable limiters or to disable sync; neither is great.

Those are your only two options for the time being. ALTQ can often be used in the same way as limiters and doesn't have such issues. Post to the forum if you'd like to discuss further.

#31 Updated by Chris Buechler about 1 year ago

  • Target version changed from 2.3.1 to 2.3.2

#32 Updated by Jose Duarte 11 months ago

From the tests we ran over the last couple of days, we saw kernel panics when using limiters in multiple VLANs, but no impact when using different queues inside those limiters.

#33 Updated by Chris Buechler 10 months ago

  • Target version changed from 2.3.2 to 2.4.0

#34 Updated by Luiz Otavio O Souza 5 months ago

2.4 has a few new fixes for use-after-free pfsync states. The limiters issue is also fixed.

#35 Updated by Jim Pingle 5 months ago

I updated a test cluster to a snapshot from a couple of hours ago, which from the timestamp looks like it should have this fix, and both nodes got stuck in a panic loop.

  Version String: FreeBSD 11.0-RELEASE-p3 #233 8ae63e9(RELENG_2_4): Sat Dec 10 03:56:41 CST 2016
    root@buildbot2.netgate.com:/builder/ce/tmp/obj/builder/ce/tmp/FreeBSD-src/sys/pfSense
  Panic String: pfsync_undefer_state: unable to find deferred state

Same panic string on both nodes, slightly different backtrace.

Primary:

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100044 td 0xfffff80003500500
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe001a62e430
vpanic() at vpanic+0x19f/frame 0xfffffe001a62e4b0
panic() at panic+0x43/frame 0xfffffe001a62e510
pfsync_update_state() at pfsync_update_state+0x45b/frame 0xfffffe001a62e560
pf_test() at pf_test+0x1bcc/frame 0xfffffe001a62e7d0
pf_check_in() at pf_check_in+0x1d/frame 0xfffffe001a62e7f0
pfil_run_hooks() at pfil_run_hooks+0x8c/frame 0xfffffe001a62e880
ip_input() at ip_input+0x3eb/frame 0xfffffe001a62e8e0
netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a62e940
ether_demux() at ether_demux+0x15c/frame 0xfffffe001a62e970
ether_nh_input() at ether_nh_input+0x317/frame 0xfffffe001a62e9d0
netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a62ea30
ether_input() at ether_input+0x26/frame 0xfffffe001a62ea50
vmxnet3_rxq_eof() at vmxnet3_rxq_eof+0x708/frame 0xfffffe001a62eae0
vmxnet3_legacy_intr() at vmxnet3_legacy_intr+0x110/frame 0xfffffe001a62eb20
intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe001a62eb60
ithread_loop() at ithread_loop+0xc6/frame 0xfffffe001a62ebb0
fork_exit() at fork_exit+0x85/frame 0xfffffe001a62ebf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001a62ebf0

Secondary:

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100032 td 0xfffff800034c1500
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe001a3cf240
vpanic() at vpanic+0x19f/frame 0xfffffe001a3cf2c0
panic() at panic+0x43/frame 0xfffffe001a3cf320
pfsync_update_state() at pfsync_update_state+0x45b/frame 0xfffffe001a3cf370
pf_test() at pf_test+0x245b/frame 0xfffffe001a3cf5e0
pf_check_out() at pf_check_out+0x1d/frame 0xfffffe001a3cf600
pfil_run_hooks() at pfil_run_hooks+0x8c/frame 0xfffffe001a3cf690
ip_output() at ip_output+0xd8b/frame 0xfffffe001a3cf7e0
ip_forward() at ip_forward+0x36b/frame 0xfffffe001a3cf880
ip_input() at ip_input+0x6da/frame 0xfffffe001a3cf8e0
netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a3cf940
ether_demux() at ether_demux+0x15c/frame 0xfffffe001a3cf970
ether_nh_input() at ether_nh_input+0x317/frame 0xfffffe001a3cf9d0
netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a3cfa30
ether_input() at ether_input+0x26/frame 0xfffffe001a3cfa50
vmxnet3_rxq_eof() at vmxnet3_rxq_eof+0x708/frame 0xfffffe001a3cfae0
vmxnet3_legacy_intr() at vmxnet3_legacy_intr+0x110/frame 0xfffffe001a3cfb20
intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe001a3cfb60
ithread_loop() at ithread_loop+0xc6/frame 0xfffffe001a3cfbb0
fork_exit() at fork_exit+0x85/frame 0xfffffe001a3cfbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001a3cfbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

#36 Updated by Vladimir Usov 4 months ago

Dear Luiz! Can we expect a real fix in 2.4? We have been waiting for it too long, and this is a really critical problem, since in a corporate environment you will always use both HA and limiters. Br, Vladimir

#37 Updated by James Kohout 4 months ago

I agree with Vladimir. I would just like to know whether this will definitely be fixed in 2.4 or pushed out further.
Thanks

#38 Updated by Jose Duarte 3 months ago

One more here: we always have limiters and HA, and we are forced to use the queues. If someone makes the mistake of assigning a main limiter to a rule, there is an instant kernel panic...

#39 Updated by Steve Yates about 1 month ago

We are not noticing our secondary (which is also a VM) hang. However, traffic on our one limited rule ends overnight, so possibly it recovers after the messages stop?

Reiterating from the referenced forum thread, there is a checkbox "No pfSync" on firewall rules, but checking that doesn't avoid the error message. Nor does setting "State type" to None.

We didn't see this issue until upgrading from 2.2.6 to 2.3.1_5.

#40 Updated by James Webb 15 days ago

Still producing issues for me. I had to reinstall pfSense on both devices in HA after encountering this bug.
