Bug #14804
Closed: Panic when pfsync attempts to synchronize states between hosts with different rulesets
100%
Description
Additional discussion:
https://forum.netgate.com/topic/182442/
db:1:pfs> bt
Tracing pid 12 tid 100062 td 0xfffffe00c498b560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe001b1f6610
vpanic() at vpanic+0x163/frame 0xfffffe001b1f6740
panic() at panic+0x43/frame 0xfffffe001b1f67a0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe001b1f6800
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe001b1f6860
calltrap() at calltrap+0x8/frame 0xfffffe001b1f6860
--- trap 0xc, rip = 0xffffffff80fb86d7, rsp = 0xfffffe001b1f6930, rbp = 0xfffffe001b1f69f0 ---
pf_route() at pf_route+0x4e7/frame 0xfffffe001b1f69f0
pf_test() at pf_test+0xd7b/frame 0xfffffe001b1f6b90
pf_check_out() at pf_check_out+0x22/frame 0xfffffe001b1f6bb0
pfil_mbuf_out() at pfil_mbuf_out+0x38/frame 0xfffffe001b1f6be0
ip_output() at ip_output+0xb4a/frame 0xfffffe001b1f6ce0
ip_forward() at ip_forward+0x3c2/frame 0xfffffe001b1f6d90
ip_input() at ip_input+0x6e9/frame 0xfffffe001b1f6df0
swi_net() at swi_net+0x128/frame 0xfffffe001b1f6e60
ithread_loop() at ithread_loop+0x257/frame 0xfffffe001b1f6ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe001b1f6f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001b1f6f30
--- trap 0x24f03cbe, rip = 0x3e5ce0ab5d44269a, rsp = 0x9a1bada3ca8f0cb6, rbp = 0x2a37be9697d8c544 ---
db:1:pfs> bt
Tracing pid 0 tid 100009 td 0xfffffe00205eec80
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00c676e390
vpanic() at vpanic+0x163/frame 0xfffffe00c676e4c0
panic() at panic+0x43/frame 0xfffffe00c676e520
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00c676e580
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00c676e5e0
calltrap() at calltrap+0x8/frame 0xfffffe00c676e5e0
--- trap 0xc, rip = 0xffffffff80fb86d7, rsp = 0xfffffe00c676e6b0, rbp = 0xfffffe00c676e770 ---
pf_route() at pf_route+0x4e7/frame 0xfffffe00c676e770
pf_test() at pf_test+0xd7b/frame 0xfffffe00c676e910
pf_check_out() at pf_check_out+0x22/frame 0xfffffe00c676e930
pfil_mbuf_out() at pfil_mbuf_out+0x38/frame 0xfffffe00c676e960
ip_output() at ip_output+0xb4a/frame 0xfffffe00c676ea60
ip_forward() at ip_forward+0x3c2/frame 0xfffffe00c676eb10
ip_input() at ip_input+0x6e9/frame 0xfffffe00c676eb70
netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe00c676ebc0
ether_demux() at ether_demux+0x149/frame 0xfffffe00c676ebf0
ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe00c676ec50
netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe00c676eca0
ether_input() at ether_input+0x69/frame 0xfffffe00c676ed00
iflib_rxeof() at iflib_rxeof+0xc46/frame 0xfffffe00c676ee00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00c676ee40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe00c676eec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe00c676eef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00c676ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00c676ef30
--- trap 0x7cfdf9fb, rip = 0x16592cb25965b2cb, rsp = 0x29d353a6a74c4e99, rbp = 0xde82bd057a0af415 ---
Updated by Marcos M over 1 year ago
- Description updated (diff)
Updated by Kristof Provost over 1 year ago
Those additional backtraces in comment #1 look totally different, and there's no indication that these are the same issue.
Updated by Kristof Provost over 1 year ago
The address suggests we're crashing on `ifp = r->rpool.cur->kif ? r->rpool.cur->kif->pfik_ifp : NULL;` in pf_route(), where we dereference the r->rpool.cur->kif pointer because it's not NULL, but 0x12. That's very, very odd: we ought to have a valid rule pointer (and almost certainly would have crashed earlier if we didn't), yet somehow we also seem to have an invalid kpool.
I'd be very interested in being able to reproduce this, or at least in getting a full kernel core dump to investigate.
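For readers following along, here is a minimal userland sketch of the pointer chain in the quoted expression. The struct names ending in `_model` and the helper `route_ifp()` are hypothetical stand-ins, not the real pf definitions, which carry many more fields. The sketch only illustrates that the quoted line tests `kif` against NULL while trusting `r->rpool.cur` to be valid.

```c
#include <stddef.h>

/*
 * Simplified, hypothetical stand-ins for the pf structures named in the
 * comment above. Only the fields on the dereference chain are modelled.
 */
struct ifnet_model    { int if_index; };
struct kif_model      { struct ifnet_model *pfik_ifp; };
struct pooladdr_model { struct kif_model *kif; };
struct kpool_model    { struct pooladdr_model *cur; };
struct rule_model     { struct kpool_model rpool; };

/*
 * Hypothetical defensive lookup: walk r->rpool.cur->kif->pfik_ifp and
 * return NULL instead of faulting when any link in the chain is NULL.
 */
static struct ifnet_model *
route_ifp(const struct rule_model *r)
{
	if (r == NULL || r->rpool.cur == NULL || r->rpool.cur->kif == NULL)
		return (NULL);
	return (r->rpool.cur->kif->pfik_ifp);
}
```

Note that a NULL check like this would not catch a stray non-NULL value such as the 0x12 mentioned above, so it is an illustration of the chain rather than a proposed fix.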
Updated by Kristof Provost about 1 year ago
The affected user has very helpfully provided a core dump, which shows a couple of things.
Firstly it confirms what I gathered from the minidump: we have a rule with a NULL rpool.cur. This rule has rule number -1, which indicates that it's the V_pf_default_rule.
That happens when the rules on the pfsync hosts are different (see pfsync_state_import()). It looks like V_pf_default_rule doesn't get its rpool correctly initialised, which in turn causes this panic.
That's a fairly straightforward fix, which I'm working on, as well as a test case to reproduce this crash (and to ensure we don't regress on this).
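As a rough illustration of the fallback described above, the following userland sketch models how an imported state might end up attached to the default rule. Everything here is an assumption made for illustration: `ruleset_model`, `select_rule()`, the `rulesets_match` flag, and `rule_is_default()` are hypothetical names, and the real logic in pfsync_state_import() is more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified model of a rule and a ruleset. */
struct rule_model {
	int32_t nr;		/* rule number; -1 marks the default rule */
};

struct ruleset_model {
	struct rule_model *rules;	/* active rules, indexed by number */
	uint32_t rcount;		/* number of active rules */
	struct rule_model default_rule;	/* stand-in for V_pf_default_rule */
};

/*
 * Pick the rule for an imported state. If the peers' rulesets do not
 * match (modelled by the assumed rulesets_match flag), or the rule
 * number is out of range, fall back to the default rule.
 */
static struct rule_model *
select_rule(struct ruleset_model *rs, int rulesets_match, uint32_t nr)
{
	if (rulesets_match && nr < rs->rcount)
		return (&rs->rules[nr]);
	return (&rs->default_rule);
}

/* True when the rule is the default rule (rule number -1). */
static int
rule_is_default(const struct rule_model *r)
{
	return (r->nr == -1);
}
```

The point of the sketch is that once the rulesets differ, every imported state shares the default rule, so any field of that rule that pf_route() later reads has to be in a usable state.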
Updated by Jim Pingle about 1 year ago
- Subject changed from Panic in HA setup to Panic when pfsync attempts to synchronize states between hosts with different rulesets
- Status changed from New to In Progress
- Assignee set to Kristof Provost
- Target version set to 2.8.0
- Plus Target Version set to 23.09
Updated by Kristof Provost about 1 year ago
- Status changed from In Progress to Feedback
I've cherry-picked the upstream fix into our branches. The fix will be part of the next snapshot builds.
Updated by Vladimir Suhhanov about 1 year ago
There are no more crashes on the latest snapshots. Many thanks to all participants.
Updated by Vladimir Suhhanov about 1 year ago
Does this patch apply to the current beta builds? I have tried one beta build from 13 Oct and it crashes the same way.
Updated by Kristof Provost about 1 year ago
Yes, the relevant patch is in the 23.09 branch. What version are you running and what is the full backtrace you're getting?
Updated by Vladimir Suhhanov about 1 year ago
db:1:pfs> bt
Tracing pid 12 tid 100062 td 0xfffffe00c641f560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe001b1e2600
vpanic() at vpanic+0x163/frame 0xfffffe001b1e2730
panic() at panic+0x43/frame 0xfffffe001b1e2790
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe001b1e27f0
trap_pfault() at trap_pfault+0xae/frame 0xfffffe001b1e2860
calltrap() at calltrap+0x8/frame 0xfffffe001b1e2860
--- trap 0xc, rip = 0xffffffff80fd28d8, rsp = 0xfffffe001b1e2930, rbp = 0xfffffe001b1e29e0 ---
pf_route() at pf_route+0x768/frame 0xfffffe001b1e29e0
pf_test() at pf_test+0x1014/frame 0xfffffe001b1e2b80
pf_check_out() at pf_check_out+0x22/frame 0xfffffe001b1e2ba0
pfil_mbuf_out() at pfil_mbuf_out+0x58/frame 0xfffffe001b1e2bd0
ip_output() at ip_output+0xce6/frame 0xfffffe001b1e2cd0
ip_forward() at ip_forward+0x413/frame 0xfffffe001b1e2d80
ip_input() at ip_input+0x814/frame 0xfffffe001b1e2de0
swi_net() at swi_net+0x19b/frame 0xfffffe001b1e2e60
ithread_loop() at ithread_loop+0x266/frame 0xfffffe001b1e2ef0
fork_exit() at fork_exit+0x82/frame 0xfffffe001b1e2f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001b1e2f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs> show registers
cs 0x20
ds 0x3b
es 0x3b
fs 0x13
gs 0x1b
ss 0x28
rax 0x12
rcx 0xffffffff814b1a68
rdx 0xffffffff814a2042
rbx 0x100
rsp 0xfffffe001b1e2600
rbp 0xfffffe001b1e2600
rsi 0x80
rdi 0xffffffff83024090 cnputs_mtx
r8 0
r9 0
r10 0
r11 0
r12 0
r13 0
r14 0xffffffff81416f35
r15 0xfffffe00c641f560
rip 0xffffffff80d33612 kdb_enter+0x32
rflags 0x86
kdb_enter+0x32: movq $0,0x234c273(%rip)
db:1:pfs> show pcpu
cpuid = 5
dynamic pcpu = 0xfffffe009ca4fcc0
curthread = 0xfffffe00c641f560: pid 12 tid 100062 critnest 1 "swi1: netisr 5"
curpcb = 0xfffffe00c641fa80
fpcurthread = none
idlethread = 0xfffffe00c6369000: tid 100008 "idle: cpu5"
self = 0xffffffff84215000
curpmap = 0xffffffff83023ab0
tssp = 0xffffffff84215384
rsp0 = 0xfffffe001b1e3000
kcr3 = 0x80000000c5c5c001
ucr3 = 0xffffffffffffffff
scr3 = 0x23472bb5d
gs32p = 0xffffffff84215404
ldt = 0xffffffff84215444
tss = 0xffffffff84215434
curvnet = 0xfffff80001177f40
spin locks held:
Updated by Kristof Provost about 1 year ago
Yes, but what version are you running?
Post the output of "uname -a" and "pkg info pfSense-kernel-pfSense".
Updated by Vladimir Suhhanov about 1 year ago
Sorry, that just slipped my mind…
FreeBSD 14.0-CURRENT amd64 1400094 #1 plus-RELENG_23_09-n256151-106588946ac: Mon Oct 16 03:09:09 UTC 2023 root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/nZsGV1oA/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/sources/FreeBSD-src-plus-RELENG_23_09/amd64.amd64/sys/pfSense-DEBUG amd64
pfSense-kernel-pfSense-23.09.b.20231016.0231
Name : pfSense-kernel-pfSense
Version : 23.09.b.20231016.0231
Installed on : Mon Oct 16 14:34:32 2023 EEST
Origin : security/pfSense-kernel
Architecture : FreeBSD:14:amd64
Prefix : /
Categories : security
Licenses : APACHE20
Maintainer : development@pfsense.org
WWW : http://www.pfsense.org/
Comment : pfSense kernel (pfSense)
Annotations :
repo_type : binary
repository : pfSense-core
Flat size : 87.4MiB
Description :
pfSense kernel (pfSense)
Updated by Kristof Provost about 1 year ago
Cheers, that helped!
I think I see what happened here. Basically, I fixed the problem upstream and missed a case in the pfSense tree. At least it ought to be a trivial fix.
Updated by Jim Pingle about 1 year ago
- Status changed from Resolved to In Progress
Updated by Kristof Provost about 1 year ago
- Status changed from In Progress to Feedback
I've pushed a fix to all relevant branches (including 23.09). It'll be part of the next snapshot builds.
Updated by Vladimir Suhhanov about 1 year ago
Yes, looks like it is ok now. No more crashes on beta 23.09
Updated by Jim Pingle about 1 year ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
Updated by Jim Pingle about 1 year ago
- Target version changed from 2.8.0 to 2.7.1