Limiters + HA results in hangs on secondary
Configuring limiters on a firewall rule in 2.2 on a system using HA results in a kernel panic reboot loop. To replicate, on a basic HA setup with config sync and pfsync enabled, add a pair of limiters and put them on the default LAN rule. It'll panic upon applying the changes, and do so again after rebooting in an endless loop.
#6 Updated by Steve Wheeler about 2 years ago
We are seeing a number of reports that this is still an issue in 2.2.1. There is at least one customer ticket, and also this forum thread: https://forum.pfsense.org/index.php?topic=87541.0
Even where there is no longer a panic (there still is for some), the machine is no longer usable, with the logs spammed with the error message in the patch.
#7 Updated by Jim Pingle about 2 years ago
- Status changed from Resolved to Confirmed
- Target version changed from 2.2.1 to 2.2.2
This is still a problem. Some cases still work but with TONS of console/log spam about pfsync_undefer_state rendering the console and logs practically unusable. There are still a couple people reporting panics as well, though they don't seem to be making it to the crash reporter (other reports are submitting fine, however).
Somewhat related, there appears to be a logic issue in the pfsync enable checkbox for those who upgraded from 2.0.x or before (before the HA settings were moved). The GUI shows the option unchecked, but the empty tag is in the config which is causing pfsync to be enabled. This can lead to these errors showing up on non-HA units until they enable and then disable pfsync.
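To illustrate how an empty tag can read as "unchecked" in one place and "enabled" in another, here is a minimal Python sketch. It is not pfSense code (pfSense is PHP); the element names and both helper functions are assumptions made up for this example. It only shows that a presence test and a value test disagree on an empty tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical config fragments. The element names are assumptions for
# illustration, not copied from a real config.xml.
UPGRADED_CONFIG = "<hasync><pfsyncenabled/></hasync>"  # empty tag left behind
CLEAN_CONFIG = "<hasync></hasync>"                     # tag absent


def enabled_by_presence(xml_text, tag="pfsyncenabled"):
    """Treat the option as on if the tag exists at all, even when empty."""
    root = ET.fromstring(xml_text)
    return root.find(tag) is not None


def enabled_by_value(xml_text, tag="pfsyncenabled"):
    """Treat the option as on only if the tag exists and has non-empty text."""
    node = ET.fromstring(xml_text).find(tag)
    return node is not None and bool((node.text or "").strip())


if __name__ == "__main__":
    for name, cfg in (("upgraded", UPGRADED_CONFIG), ("clean", CLEAN_CONFIG)):
        print(name, enabled_by_presence(cfg), enabled_by_value(cfg))
```

With the empty tag, the presence test reports the option as on while the value test reports it as off, which matches the symptom described: the checkbox looks unchecked, yet pfsync is enabled until the setting is saved on and then off again.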
#10 Updated by Chris Buechler about 2 years ago
Chris: that still working fine for you?
After running for a few hours, the secondary still hangs in one of our test setups. Console is non-responsive, ESX shows VM using 100% CPU. That happens after about 4 hours of run time. The primary is fine throughout, and the limiters work on both v4 and v6.
#11 Updated by Chris Linstruth about 2 years ago
I haven't seen anything else, but please understand that this is on a test bench, not in production, and I am not stressing it at all. If you look at ticket UJN-78146 you will see my description of a crash I have seen since starting to test yesterday's snapshot. Not sure if it is related. The HA pair has been up since, but with no more than about 150 states.
#19 Updated by Chris Buechler almost 2 years ago
- File redmine4310-crash.txt added
- Target version changed from 2.2.3 to 2.3
Tried after changing both hosts to use unicast pfsync, which had no impact. It seems to alternate between hanging the secondary and triggering a kernel panic. Thus far using unicast, the secondary has only kernel panicked, not hung consuming 100% CPU. The same kernel panic happened on occasion when using multicast pfsync, so not sure that's actually changed.
Crash report attached.
#20 Updated by Bernardo Pádua almost 2 years ago
This is also happening to me. I thought the issue with the limiters was fixed in 2.2.2 and 2.2.3, so I posted a duplicate ticket on #4823. But I've now disabled my limiters and saw that the backup/secondary firewall stopped crashing.
I'm posting my crash dumps (two different times the backup crashed) here in case they are of any help.
#22 Updated by James Starowitz over 1 year ago
Last night we rolled out our 2.2.5 units using C2758s in HA. The units worked fine in a test lab, but once I put them into production the backup router would hang to the point that the web UI could not be reached. The hang occurs nearly every 5-10 minutes; the backup router reboots and then crashes again soon after.
I disabled "state sync" on both the master and the backup and it stopped crashing.
Because we use limiters per source IP, each client IP has its own limit. As a result, the HFSC workaround can't really do what I'm doing with limiters.
We rarely go into failover, but when we do, seamless state transfers are a life saver.
I submitted the crash reports and opened a support ticket if you need the details.
#25 Updated by William St.Denis about 1 year ago
- File 02.03.2016 22_18.txt added
I have noticed this issue as well. We have to disable sync when using limiters because it's crashing the system. I have attached my log as well.
I am running Limiter + VLAN + NAT as well. When we started running limiters we noticed the web UI started to slow down and we were getting the error "/rc.filter_synchronize: An authentication failure occurred", then we got the kernel panic.
#30 Updated by Chris Buechler about 1 year ago
William St.Denis wrote:
Does anyone have a work around to keep limiters and sync working? The only option I have come up with is to disable limiters or disable sync both aren't great.
Those are your only two options for the time being. ALTQ can often be used in the same way as limiters and doesn't have such issues. Post to the forum if you'd like to discuss further.
#35 Updated by Jim Pingle 5 months ago
I updated a test cluster to a snapshot from a couple hours ago, which from the timestamp looks like it should have this fix, and both nodes got stuck in a panic loop.
Version String: FreeBSD 11.0-RELEASE-p3 #233 8ae63e9(RELENG_2_4): Sat Dec 10 03:56:41 CST 2016 firstname.lastname@example.org:/builder/ce/tmp/obj/builder/ce/tmp/FreeBSD-src/sys/pfSense
Panic String: pfsync_undefer_state: unable to find deferred state
Same panic string on both nodes, slightly different backtrace.
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100044 td 0xfffff80003500500
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe001a62e430
vpanic() at vpanic+0x19f/frame 0xfffffe001a62e4b0
panic() at panic+0x43/frame 0xfffffe001a62e510
pfsync_update_state() at pfsync_update_state+0x45b/frame 0xfffffe001a62e560
pf_test() at pf_test+0x1bcc/frame 0xfffffe001a62e7d0
pf_check_in() at pf_check_in+0x1d/frame 0xfffffe001a62e7f0
pfil_run_hooks() at pfil_run_hooks+0x8c/frame 0xfffffe001a62e880
ip_input() at ip_input+0x3eb/frame 0xfffffe001a62e8e0
netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a62e940
ether_demux() at ether_demux+0x15c/frame 0xfffffe001a62e970
ether_nh_input() at ether_nh_input+0x317/frame 0xfffffe001a62e9d0
netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a62ea30
ether_input() at ether_input+0x26/frame 0xfffffe001a62ea50
vmxnet3_rxq_eof() at vmxnet3_rxq_eof+0x708/frame 0xfffffe001a62eae0
vmxnet3_legacy_intr() at vmxnet3_legacy_intr+0x110/frame 0xfffffe001a62eb20
intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe001a62eb60
ithread_loop() at ithread_loop+0xc6/frame 0xfffffe001a62ebb0
fork_exit() at fork_exit+0x85/frame 0xfffffe001a62ebf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001a62ebf0

db:0:kdb.enter.default> bt
Tracing pid 12 tid 100032 td 0xfffff800034c1500
kdb_enter() at kdb_enter+0x3b/frame 0xfffffe001a3cf240
vpanic() at vpanic+0x19f/frame 0xfffffe001a3cf2c0
panic() at panic+0x43/frame 0xfffffe001a3cf320
pfsync_update_state() at pfsync_update_state+0x45b/frame 0xfffffe001a3cf370
pf_test() at pf_test+0x245b/frame 0xfffffe001a3cf5e0
pf_check_out() at pf_check_out+0x1d/frame 0xfffffe001a3cf600
pfil_run_hooks() at pfil_run_hooks+0x8c/frame 0xfffffe001a3cf690
ip_output() at ip_output+0xd8b/frame 0xfffffe001a3cf7e0
ip_forward() at ip_forward+0x36b/frame 0xfffffe001a3cf880
ip_input() at ip_input+0x6da/frame 0xfffffe001a3cf8e0
netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a3cf940
ether_demux() at ether_demux+0x15c/frame 0xfffffe001a3cf970
ether_nh_input() at ether_nh_input+0x317/frame 0xfffffe001a3cf9d0
netisr_dispatch_src() at netisr_dispatch_src+0xa5/frame 0xfffffe001a3cfa30
ether_input() at ether_input+0x26/frame 0xfffffe001a3cfa50
vmxnet3_rxq_eof() at vmxnet3_rxq_eof+0x708/frame 0xfffffe001a3cfae0
vmxnet3_legacy_intr() at vmxnet3_legacy_intr+0x110/frame 0xfffffe001a3cfb20
intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe001a3cfb60
ithread_loop() at ithread_loop+0xc6/frame 0xfffffe001a3cfbb0
fork_exit() at fork_exit+0x85/frame 0xfffffe001a3cfbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001a3cfbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
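
The difference between the two backtraces is easier to see once the frame names are pulled out and compared. A rough Python sketch follows; the helper names are made up for illustration, and the two sample strings are excerpts of the frames above between pfsync_update_state() and ip_input(), not the full dumps.

```python
import re

# Matches the "name() at " prefix that starts each frame in a DDB "bt" dump.
FRAME_RE = re.compile(r"(\w+)\(\) at ")


def frame_names(backtrace):
    """Return the function names in the order they appear in the dump."""
    return FRAME_RE.findall(backtrace)


def differing_frames(trace_a, trace_b):
    """Return (frames only in trace_a, frames only in trace_b)."""
    names_a, names_b = frame_names(trace_a), frame_names(trace_b)
    only_a = [n for n in names_a if n not in names_b]
    only_b = [n for n in names_b if n not in names_a]
    return only_a, only_b


# Excerpts from the two backtraces (first node, then second node).
NODE_A = (
    "pfsync_update_state() at pfsync_update_state+0x45b/frame 0xfffffe001a62e560 "
    "pf_test() at pf_test+0x1bcc/frame 0xfffffe001a62e7d0 "
    "pf_check_in() at pf_check_in+0x1d/frame 0xfffffe001a62e7f0 "
    "pfil_run_hooks() at pfil_run_hooks+0x8c/frame 0xfffffe001a62e880 "
    "ip_input() at ip_input+0x3eb/frame 0xfffffe001a62e8e0"
)
NODE_B = (
    "pfsync_update_state() at pfsync_update_state+0x45b/frame 0xfffffe001a3cf370 "
    "pf_test() at pf_test+0x245b/frame 0xfffffe001a3cf5e0 "
    "pf_check_out() at pf_check_out+0x1d/frame 0xfffffe001a3cf600 "
    "pfil_run_hooks() at pfil_run_hooks+0x8c/frame 0xfffffe001a3cf690 "
    "ip_output() at ip_output+0xd8b/frame 0xfffffe001a3cf7e0 "
    "ip_forward() at ip_forward+0x36b/frame 0xfffffe001a3cf880 "
    "ip_input() at ip_input+0x6da/frame 0xfffffe001a3cf8e0"
)

if __name__ == "__main__":
    only_a, only_b = differing_frames(NODE_A, NODE_B)
    print("only in first trace: ", only_a)
    print("only in second trace:", only_b)
```

On these excerpts the first trace is unique only in pf_check_in, while the second is unique in pf_check_out, ip_output, and ip_forward. In other words, both reach pfsync_update_state() through pf_test(), one from the inbound filter hook and the other from the forwarding/outbound path.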
#39 Updated by Steve Yates about 1 month ago
We are not noticing our secondary (which is also a VM) hang. However, traffic for our one limited rule ends overnight, so possibly it recovers after the messages end?
Reiterating from the referenced forum thread, there is a checkbox "No pfSync" on firewall rules, but checking that doesn't avoid the error message. Nor does setting "State type" to None.
We didn't see this issue until upgrading from 2.2.6 to 2.3.1_5.