Bug #14373

closed

System crashes or may become unresponsive with Captive Portal

Added by Lev Prokofev 11 months ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Category: Captive Portal
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Plus Target Version: 23.05.1
Release Notes: Default
Affected Version: 2.7.0
Affected Architecture:

Description

Symptoms

Captive Portal gets stuck (no internet or network access). Sometimes restarting the service fixes it; sometimes that does not help at all and the firewall has to be rebooted. Sometimes the system is unresponsive even through the serial console.

Crash dump attached. The device in question is a 7100, in case that is relevant.


Files

textdump.tar-1.0 (375 KB) Lev Prokofev, 05/11/2023 11:27 AM
Actions #1

Updated by Kristof Provost 11 months ago

Summarising the discussions we've had so far: it appears that the issue is that something is holding the PF_RULES lock. There's a configuration thread trying to take or having taken the PF_RULES write lock, but the real culprit may be this one:

Tracing command kernel pid 0 tid 100008 td 0xfffffe0011fdc000
cpustop_handler() at cpustop_handler+0x28/frame 0xfffffe0011ddbdf0
ipi_nmi_handler() at ipi_nmi_handler+0x39/frame 0xfffffe0011ddbe00
trap() at trap+0x3f/frame 0xfffffe0011ddbf20
nmi_calltrap() at nmi_calltrap+0x8/frame 0xfffffe0011ddbf20
--- trap 0x13, rip = 0xffffffff80dd9992, rsp = 0xfffffe00107a4cc0, rbp = 0xfffffe00107a4cc0 ---
lock_delay() at lock_delay+0x12/frame 0xfffffe00107a4cc0
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xc1/frame 0xfffffe00107a4d30
cnputsn() at cnputsn+0xd8/frame 0xfffffe00107a4d70
putchar() at putchar+0x14a/frame 0xfffffe00107a4e00
kvprintf() at kvprintf+0xf5/frame 0xfffffe00107a4f20
_vprintf() at _vprintf+0x8c/frame 0xfffffe00107a5010
printf() at printf+0x53/frame 0xfffffe00107a5070
_mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x5a/frame 0xfffffe00107a5080
thread_lock_flags_() at thread_lock_flags_+0xeb/frame 0xfffffe00107a50e0
propagate_priority() at propagate_priority+0x58/frame 0xfffffe00107a5120
turnstile_wait() at turnstile_wait+0x323/frame 0xfffffe00107a5160
__rw_wlock_hard() at __rw_wlock_hard+0x3f8/frame 0xfffffe00107a5210
inp_smr_lock() at inp_smr_lock+0xa9/frame 0xfffffe00107a5240
in_pcblookup_hash() at in_pcblookup_hash+0x6b/frame 0xfffffe00107a5280
in_pcblookup_mbuf() at in_pcblookup_mbuf+0x18/frame 0xfffffe00107a52a0
tcp_input_with_port() at tcp_input_with_port+0x5d4/frame 0xfffffe00107a5400
tcp_input() at tcp_input+0xb/frame 0xfffffe00107a5410
ip_input() at ip_input+0x229/frame 0xfffffe00107a5470
netisr_dispatch_src() at netisr_dispatch_src+0x2a6/frame 0xfffffe00107a54c0
ether_demux() at ether_demux+0x144/frame 0xfffffe00107a54f0
dummynet_send() at dummynet_send+0x12d/frame 0xfffffe00107a5530
dummynet_io() at dummynet_io+0x3db/frame 0xfffffe00107a5580
pf_test_eth() at pf_test_eth+0x12f4/frame 0xfffffe00107a5a20
pf_eth_check_in() at pf_eth_check_in+0x25/frame 0xfffffe00107a5a40
pfil_run_hooks() at pfil_run_hooks+0x97/frame 0xfffffe00107a5a80
ether_demux() at ether_demux+0x4c/frame 0xfffffe00107a5ab0
ether_nh_input() at ether_nh_input+0x353/frame 0xfffffe00107a5b10
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00107a5b60
ether_input() at ether_input+0x69/frame 0xfffffe00107a5bc0
ether_demux() at ether_demux+0x9e/frame 0xfffffe00107a5bf0
ether_nh_input() at ether_nh_input+0x353/frame 0xfffffe00107a5c50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00107a5ca0
ether_input() at ether_input+0x69/frame 0xfffffe00107a5d00
iflib_rxeof() at iflib_rxeof+0xbdb/frame 0xfffffe00107a5e00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00107a5e40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe00107a5ec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe00107a5ef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00107a5f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00107a5f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Actions #2

Updated by Jim Pingle 11 months ago

  • Assignee set to Kristof Provost
  • Target version set to 2.7.0
  • Plus Target Version set to 23.09
Actions #3

Updated by Kristof Provost 11 months ago

That backtrace has me suspecting that this may actually be a fix: https://cgit.freebsd.org/src/commit/?id=7b92493ab1d464263ccdf4494b187edbe19864dc

Sadly that didn't land in time to make it into 23.05.

Actions #4

Updated by Lev Prokofev 11 months ago

The config has been uploaded to the file drop for internal testing (folder 1328742557).

Actions #5

Updated by Kristof Provost 11 months ago

Mark doesn't think his fix would affect this.

Having looked a bit more, I have a different theory.
Thread 100008 holds the PF_RULES read lock, and is now blocked waiting for the inp_lock.
Thread 100125 is trying to get the PF_RULES write lock, so no more read locks can be acquired.
Thread 100153 holds the inp_lock, and is stuck waiting for the PF_RULES read lock.

That neatly deadlocks us. We can break this lock by avoiding the PF_RULES -> inp_lock dependency in thread 100008 if we release the PF_RULES lock before calling dummynet_io. That's pretty easy to do (and a good idea all by itself anyway). Proposed fix in https://reviews.freebsd.org/D40067
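
To make the cycle described above easier to see, here is a minimal sketch that models the three threads as a wait-for graph and checks it for a cycle. The thread IDs are taken from the comment above; the graph model, the `find_cycle` helper, and the `before`/`after` names are illustrative assumptions and are not part of pf or the FreeBSD kernel. The "after" graph assumes the proposed change, in which thread 100008 no longer holds the PF_RULES read lock while it waits for the inp_lock, so the writer is no longer blocked on it.

```python
def find_cycle(waits_for):
    """Return the nodes of one cycle in a wait-for graph, or None if there is none."""
    visiting, done, path = set(), set(), []

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # We reached a node that is already on the current path: a cycle.
            return path[path.index(node):] + [node]
        visiting.add(node)
        path.append(node)
        for nxt in waits_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Each thread maps to the thread(s) it is blocked on (hypothetical model).
before = {
    100008: [100153],  # holds PF_RULES (read), waits for inp_lock held by 100153
    100153: [100125],  # holds inp_lock, its PF_RULES read is blocked by the pending writer
    100125: [100008],  # wants PF_RULES (write), blocked by reader 100008
}

# Assumed effect of the fix: 100008 drops PF_RULES before dummynet_io,
# so the writer 100125 no longer waits on it.
after = {
    100008: [100153],
    100153: [100125],
    100125: [],
}

print(find_cycle(before))
print(find_cycle(after))
```

With the edge from the writer back to thread 100008 removed, the graph has no cycle: the writer can take and release the lock, thread 100153 can then take its read lock and release the inp_lock, and thread 100008 can proceed.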

Actions #6

Updated by Kristof Provost 11 months ago

  • Status changed from New to Feedback

Fixed upstream in https://cgit.freebsd.org/src/commit/?id=bdd47177528b5beacabb4837bfac0e9de92aae74 and cherry-picked into devel-main (not yet to plus-devel-main, that'll come with future merges of devel-main to plus-devel-main).

Actions #7

Updated by Flole Systems 11 months ago

So long story short: 23.05 is another release that's broken at kernel level? 23.01 was the one with the IPv6 crashes, so the last one that isn't broken is the now-unsupported 22.05?

Any plans to fix it/re-release so there's at least a single plus release that is not randomly crashing/deadlocking and supported?

Actions #8

Updated by Gerhard Gröschl 11 months ago

yeah, just as a reminder:
Captive Portal started crashing on our sites with 22.05 already. We waited eagerly for two updates but it only got worse unfortunately.

Actions #9

Updated by Jim Pingle 10 months ago

  • Plus Target Version changed from 23.09 to 23.05.1
Actions #10

Updated by Jim Pingle 10 months ago

  • Subject changed from System crashed and sometimes became unresponsive with enabled Captive portal. to System crashes or may become unresponsive with Captive Portal

Updating subject for release notes.

Actions #11

Updated by Jim Pingle 10 months ago

  • Affected Version set to 2.7.0
Actions #12

Updated by Christian McDonald 10 months ago

  • Status changed from Feedback to Resolved
Actions #13

Updated by Jim Thompson 10 months ago

Gerhard Gröschl wrote in #note-8:

yeah, just as a reminder:
Captive Portal started crashing on our sites with 22.05 already. We waited eagerly for two updates but it only got worse unfortunately.

there is a 23.05.1 coming

Actions #14

Updated by Gerhard Gröschl 10 months ago

thx guys, we really appreciate your work very much!
