Bug #14373
closed
System crashes or may become unresponsive with Captive Portal
Description
Symptoms
Captive Portal gets stuck (no internet or network access). Sometimes a service restart fixes it; sometimes that doesn't help at all and the firewall has to be rebooted. Sometimes the system is unresponsive even through the serial console.
Crash dump attached. The device in question is a 7100, if that is relevant.
Files
Updated by Kristof Provost over 1 year ago
Summarising the discussions we've had so far: it appears that the issue is that something is holding the PF_RULES lock. There's a configuration thread trying to take or having taken the PF_RULES write lock, but the real culprit may be this one:
Tracing command kernel pid 0 tid 100008 td 0xfffffe0011fdc000
cpustop_handler() at cpustop_handler+0x28/frame 0xfffffe0011ddbdf0
ipi_nmi_handler() at ipi_nmi_handler+0x39/frame 0xfffffe0011ddbe00
trap() at trap+0x3f/frame 0xfffffe0011ddbf20
nmi_calltrap() at nmi_calltrap+0x8/frame 0xfffffe0011ddbf20
--- trap 0x13, rip = 0xffffffff80dd9992, rsp = 0xfffffe00107a4cc0, rbp = 0xfffffe00107a4cc0 ---
lock_delay() at lock_delay+0x12/frame 0xfffffe00107a4cc0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xc1/frame 0xfffffe00107a4d30
cnputsn() at cnputsn+0xd8/frame 0xfffffe00107a4d70
putchar() at putchar+0x14a/frame 0xfffffe00107a4e00
kvprintf() at kvprintf+0xf5/frame 0xfffffe00107a4f20
_vprintf() at _vprintf+0x8c/frame 0xfffffe00107a5010
printf() at printf+0x53/frame 0xfffffe00107a5070
_mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x5a/frame 0xfffffe00107a5080
thread_lock_flags_() at thread_lock_flags_+0xeb/frame 0xfffffe00107a50e0
propagate_priority() at propagate_priority+0x58/frame 0xfffffe00107a5120
turnstile_wait() at turnstile_wait+0x323/frame 0xfffffe00107a5160
__rw_wlock_hard() at __rw_wlock_hard+0x3f8/frame 0xfffffe00107a5210
inp_smr_lock() at inp_smr_lock+0xa9/frame 0xfffffe00107a5240
in_pcblookup_hash() at in_pcblookup_hash+0x6b/frame 0xfffffe00107a5280
in_pcblookup_mbuf() at in_pcblookup_mbuf+0x18/frame 0xfffffe00107a52a0
tcp_input_with_port() at tcp_input_with_port+0x5d4/frame 0xfffffe00107a5400
tcp_input() at tcp_input+0xb/frame 0xfffffe00107a5410
ip_input() at ip_input+0x229/frame 0xfffffe00107a5470
netisr_dispatch_src() at netisr_dispatch_src+0x2a6/frame 0xfffffe00107a54c0
ether_demux() at ether_demux+0x144/frame 0xfffffe00107a54f0
dummynet_send() at dummynet_send+0x12d/frame 0xfffffe00107a5530
dummynet_io() at dummynet_io+0x3db/frame 0xfffffe00107a5580
pf_test_eth() at pf_test_eth+0x12f4/frame 0xfffffe00107a5a20
pf_eth_check_in() at pf_eth_check_in+0x25/frame 0xfffffe00107a5a40
pfil_run_hooks() at pfil_run_hooks+0x97/frame 0xfffffe00107a5a80
ether_demux() at ether_demux+0x4c/frame 0xfffffe00107a5ab0
ether_nh_input() at ether_nh_input+0x353/frame 0xfffffe00107a5b10
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00107a5b60
ether_input() at ether_input+0x69/frame 0xfffffe00107a5bc0
ether_demux() at ether_demux+0x9e/frame 0xfffffe00107a5bf0
ether_nh_input() at ether_nh_input+0x353/frame 0xfffffe00107a5c50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00107a5ca0
ether_input() at ether_input+0x69/frame 0xfffffe00107a5d00
iflib_rxeof() at iflib_rxeof+0xbdb/frame 0xfffffe00107a5e00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00107a5e40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe00107a5ec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe00107a5ef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00107a5f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00107a5f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Updated by Jim Pingle over 1 year ago
- Assignee set to Kristof Provost
- Target version set to 2.7.0
- Plus Target Version set to 23.09
Updated by Kristof Provost over 1 year ago
That backtrace has me suspecting that this may actually be a fix: https://cgit.freebsd.org/src/commit/?id=7b92493ab1d464263ccdf4494b187edbe19864dc
Sadly that didn't land in time to make it into 23.05.
Updated by Lev Prokofev over 1 year ago
The config has been uploaded to the file drop for internal testing (folder 1328742557).
Updated by Kristof Provost over 1 year ago
Mark doesn't think his fix would affect this.
Having looked a bit more, I have a different theory.
Thread 100008 holds the PF_RULES read lock, and is now blocked waiting for the inp_lock.
Thread 100125 is trying to get the PF_RULES write lock, so no more read locks can be acquired.
Thread 100153 holds the inp_lock, and is stuck waiting for the PF_RULES read lock.
That neatly deadlocks us. We can break this deadlock by releasing the PF_RULES lock before calling dummynet_io, which removes the PF_RULES -> inp_lock dependency in thread 100008. That's pretty easy to do (and a good idea all by itself anyway). Proposed fix in https://reviews.freebsd.org/D40067
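The three-thread cycle described above can be checked with a toy wait-for graph. This is only an illustrative sketch, not kernel code. It assumes that a pending writer blocks new readers on PF_RULES (as the note about thread 100125 implies), it treats inp_lock as a plain exclusive lock, and the helper names `build_wait_for` and `find_cycle` are made up for this example.

```python
# Toy wait-for-graph model of the deadlock; not kernel code.
# Assumptions: a pending PF_RULES writer blocks new readers, and
# inp_lock is modelled as an exclusive ("write") lock.

def build_wait_for(holds, wants):
    """Return {tid: set of tids it is waiting on}."""
    edges = {tid: set() for tid in set(holds) | set(wants)}
    for tid, (lock, mode) in wants.items():
        # Conflict with a current holder: any read/write or write/write pair.
        for other, held in holds.items():
            if other == tid:
                continue
            for held_lock, held_mode in held:
                if held_lock == lock and "write" in (mode, held_mode):
                    edges[tid].add(other)
        # Writer preference: a new reader queues behind a pending writer.
        if mode == "read":
            for other, (other_lock, other_mode) in wants.items():
                if other != tid and other_lock == lock and other_mode == "write":
                    edges[tid].add(other)
    return edges


def find_cycle(edges):
    """Depth-first search; return one cycle as a list of tids, or None."""
    visiting, done = [], set()

    def dfs(node):
        if node in visiting:
            return visiting[visiting.index(node):]
        if node in done:
            return None
        visiting.append(node)
        for nxt in sorted(edges[node]):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(edges):
        cycle = dfs(start)
        if cycle:
            return cycle
    return None


# What each thread is blocked on, per the analysis in this ticket.
wants = {
    100008: ("inp_lock", "write"),   # blocked in inp_smr_lock()
    100125: ("PF_RULES", "write"),   # configuration thread
    100153: ("PF_RULES", "read"),
}

# Before the fix: 100008 still holds PF_RULES (read) inside dummynet_io.
holds_before = {
    100008: {("PF_RULES", "read")},
    100125: set(),
    100153: {("inp_lock", "write")},
}

# After the fix: PF_RULES is released before calling dummynet_io.
holds_after = {
    100008: set(),
    100125: set(),
    100153: {("inp_lock", "write")},
}

print("before fix:", find_cycle(build_wait_for(holds_before, wants)))
print("after fix: ", find_cycle(build_wait_for(holds_after, wants)))
```

In this model the first call reports a cycle through all three threads and the second reports none: once 100008 no longer holds PF_RULES, the writer 100125 is not waiting on anyone, so it can take the lock, finish and let the others proceed.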
Updated by Kristof Provost over 1 year ago
- Status changed from New to Feedback
Fixed upstream in https://cgit.freebsd.org/src/commit/?id=bdd47177528b5beacabb4837bfac0e9de92aae74 and cherry-picked into devel-main (not yet to plus-devel-main, that'll come with future merges of devel-main to plus-devel-main).
Updated by Flole Systems over 1 year ago
So long story short: 23.05 is another release that's broken at kernel level? 23.01 was the one with the IPv6 crashes, so the last one that isn't broken is the now-unsupported 22.05?
Any plans to fix it/re-release so there's at least a single plus release that is not randomly crashing/deadlocking and supported?
Updated by Gerhard Gröschl over 1 year ago
yeah, just as a reminder:
Captive Portal started crashing on our sites with 22.05 already. We waited eagerly for two updates but it only got worse unfortunately.
Updated by Jim Pingle over 1 year ago
- Plus Target Version changed from 23.09 to 23.05.1
Updated by Jim Pingle over 1 year ago
- Subject changed from System crashed and sometimes became unresponsive with enabled Captive portal. to System crashes or may become unresponsive with Captive Portal
Updating subject for release notes.
Updated by Christian McDonald over 1 year ago
- Status changed from Feedback to Resolved
Updated by Jim Thompson over 1 year ago
Gerhard Gröschl wrote in #note-8:
yeah, just as a reminder:
Captive Portal started crashing on our sites with 22.05 already. We waited eagerly for two updates but it only got worse unfortunately.
there is a 23.05.1 coming
Updated by Gerhard Gröschl over 1 year ago
thx guys, we really appreciate your work very much!