Bug #14804

closed

Panic when pfsync attempts to synchronize states between hosts with different rulesets

Added by Marcos M over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Category: Operating System
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 23.09
Release Notes: Default
Affected Version:
Affected Architecture:

Description

Additional discussion:
https://forum.netgate.com/topic/182442/

db:1:pfs> bt
Tracing pid 12 tid 100062 td 0xfffffe00c498b560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe001b1f6610
vpanic() at vpanic+0x163/frame 0xfffffe001b1f6740
panic() at panic+0x43/frame 0xfffffe001b1f67a0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe001b1f6800
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe001b1f6860
calltrap() at calltrap+0x8/frame 0xfffffe001b1f6860
--- trap 0xc, rip = 0xffffffff80fb86d7, rsp = 0xfffffe001b1f6930, rbp = 0xfffffe001b1f69f0 ---
pf_route() at pf_route+0x4e7/frame 0xfffffe001b1f69f0
pf_test() at pf_test+0xd7b/frame 0xfffffe001b1f6b90
pf_check_out() at pf_check_out+0x22/frame 0xfffffe001b1f6bb0
pfil_mbuf_out() at pfil_mbuf_out+0x38/frame 0xfffffe001b1f6be0
ip_output() at ip_output+0xb4a/frame 0xfffffe001b1f6ce0
ip_forward() at ip_forward+0x3c2/frame 0xfffffe001b1f6d90
ip_input() at ip_input+0x6e9/frame 0xfffffe001b1f6df0
swi_net() at swi_net+0x128/frame 0xfffffe001b1f6e60
ithread_loop() at ithread_loop+0x257/frame 0xfffffe001b1f6ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe001b1f6f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001b1f6f30
--- trap 0x24f03cbe, rip = 0x3e5ce0ab5d44269a, rsp = 0x9a1bada3ca8f0cb6, rbp = 0x2a37be9697d8c544 ---
db:1:pfs> bt
Tracing pid 0 tid 100009 td 0xfffffe00205eec80
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00c676e390
vpanic() at vpanic+0x163/frame 0xfffffe00c676e4c0
panic() at panic+0x43/frame 0xfffffe00c676e520
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00c676e580
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00c676e5e0
calltrap() at calltrap+0x8/frame 0xfffffe00c676e5e0
--- trap 0xc, rip = 0xffffffff80fb86d7, rsp = 0xfffffe00c676e6b0, rbp = 0xfffffe00c676e770 ---
pf_route() at pf_route+0x4e7/frame 0xfffffe00c676e770
pf_test() at pf_test+0xd7b/frame 0xfffffe00c676e910
pf_check_out() at pf_check_out+0x22/frame 0xfffffe00c676e930
pfil_mbuf_out() at pfil_mbuf_out+0x38/frame 0xfffffe00c676e960
ip_output() at ip_output+0xb4a/frame 0xfffffe00c676ea60
ip_forward() at ip_forward+0x3c2/frame 0xfffffe00c676eb10
ip_input() at ip_input+0x6e9/frame 0xfffffe00c676eb70
netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe00c676ebc0
ether_demux() at ether_demux+0x149/frame 0xfffffe00c676ebf0
ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe00c676ec50
netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe00c676eca0
ether_input() at ether_input+0x69/frame 0xfffffe00c676ed00
iflib_rxeof() at iflib_rxeof+0xc46/frame 0xfffffe00c676ee00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00c676ee40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe00c676eec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe00c676eef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00c676ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00c676ef30
--- trap 0x7cfdf9fb, rip = 0x16592cb25965b2cb, rsp = 0x29d353a6a74c4e99, rbp = 0xde82bd057a0af415 ---
Actions #1

Updated by Marcos M over 1 year ago

  • Description updated (diff)

Potentially related:

https://forum.netgate.com/topic/176596/

https://forum.netgate.com/topic/169168/

https://forum.netgate.com/topic/167949/

Actions #2

Updated by Kristof Provost over 1 year ago

Those additional backtraces in comment #1 look totally different, and there's no indication that these are the same issue.

Actions #3

Updated by Kristof Provost over 1 year ago

The address suggests we're crashing on `ifp = r->rpool.cur->kif ? r->rpool.cur->kif->pfik_ifp : NULL;` in pf_route(), where we dereference the r->rpool.cur->kif pointer because it's not NULL, but 0x12. That's very, very odd, in that we ought to have a valid rule pointer (and almost certainly would have crashed earlier if we didn't), but somehow we also seem to have an invalid kpool.

I'd be very interested in being able to reproduce this, or at least in getting a full kernel core dump to investigate.

Actions #4

Updated by Kristof Provost about 1 year ago

The affected user has very helpfully provided a core dump, which shows a couple of things.
Firstly it confirms what I gathered from the minidump: we have a rule with a NULL rpool.cur. This rule has rule number -1, which indicates that it's the V_pf_default_rule.

That happens when the rules on the pfsync hosts are different (see pfsync_state_import()). It looks like V_pf_default_rule doesn't get its rpool correctly initialised, which in turn causes this panic.
That's a fairly straightforward fix, which I'm working on, as well as a test case to reproduce this crash (and to ensure we don't regress on this).
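
For readers following along, here is a rough userland sketch of the failure mode described in this comment. It is not the actual pf code and not the actual fix: every `sketch_*` name is a hypothetical, heavily simplified stand-in for the real kernel structures, and the `rulesets_match` flag is an assumed simplification of the checks done in pfsync_state_import(). It only illustrates how an imported state can end up attached to a default rule (number -1) whose rpool.cur is NULL, and why the unguarded `r->rpool.cur->kif` access in pf_route() then faults.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the pf kernel structures. */
struct sketch_ifnet { int if_index; };
struct sketch_kif { struct sketch_ifnet *pfik_ifp; };
struct sketch_pooladdr { struct sketch_kif *kif; };
struct sketch_pool { struct sketch_pooladdr *cur; };
struct sketch_rule { int nr; struct sketch_pool rpool; };

/* Stand-in for V_pf_default_rule: rule number -1, pool never populated. */
static struct sketch_rule sketch_default_rule = {
    .nr = -1,
    .rpool = { .cur = NULL },
};

/*
 * Assumed simplification of the rule selection on state import: the
 * peer's rule number is only trusted when the rulesets match and the
 * number is in range. Otherwise the state falls back to the default rule.
 */
static struct sketch_rule *
sketch_import_rule(struct sketch_rule *rules, size_t nrules,
    size_t peer_rule_nr, int rulesets_match)
{
    if (rulesets_match && peer_rule_nr < nrules)
        return &rules[peer_rule_nr];
    return &sketch_default_rule;
}

/*
 * Guarded version of the interface lookup. The crashing expression reads
 * r->rpool.cur->kif directly, which dereferences cur even when it is NULL.
 * Checking cur first makes the lookup return NULL instead of faulting.
 */
static struct sketch_ifnet *
sketch_route_ifp(const struct sketch_rule *r)
{
    if (r->rpool.cur == NULL || r->rpool.cur->kif == NULL)
        return NULL;
    return r->rpool.cur->kif->pfik_ifp;
}
```

With mismatched rulesets, `sketch_import_rule()` hands back the default rule, and `sketch_route_ifp()` returns NULL for it rather than dereferencing the empty pool. The guard is only one possible way to show the problem; the fix described in this comment is about initialising the default rule's rpool correctly.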

Actions #5

Updated by Jim Pingle about 1 year ago

  • Subject changed from Panic in HA setup to Panic when pfsync attempts to synchronize states between hosts with different rulesets
  • Status changed from New to In Progress
  • Assignee set to Kristof Provost
  • Target version set to 2.8.0
  • Plus Target Version set to 23.09
Actions #6

Updated by Kristof Provost about 1 year ago

  • Status changed from In Progress to Feedback

I've cherry-picked the upstream fix into our branches. The fix will be part of the next snapshot builds.

Actions #7

Updated by Vladimir Suhhanov about 1 year ago

There are no more crashes on the latest snapshots. Many thanks to all participants.

Actions #8

Updated by Marcos M about 1 year ago

  • Status changed from Feedback to Resolved
Actions #9

Updated by Vladimir Suhhanov about 1 year ago

Does this patch apply to the current beta builds? I have tried one beta build from 13 Oct and it crashes the same way.

Actions #10

Updated by Kristof Provost about 1 year ago

Yes, the relevant patch is in the 23.09 branch. What version are you running and what is the full backtrace you're getting?

Actions #11

Updated by Vladimir Suhhanov about 1 year ago

db:1:pfs> bt
Tracing pid 12 tid 100062 td 0xfffffe00c641f560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe001b1e2600
vpanic() at vpanic+0x163/frame 0xfffffe001b1e2730
panic() at panic+0x43/frame 0xfffffe001b1e2790
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe001b1e27f0
trap_pfault() at trap_pfault+0xae/frame 0xfffffe001b1e2860
calltrap() at calltrap+0x8/frame 0xfffffe001b1e2860
--- trap 0xc, rip = 0xffffffff80fd28d8, rsp = 0xfffffe001b1e2930, rbp = 0xfffffe001b1e29e0 ---
pf_route() at pf_route+0x768/frame 0xfffffe001b1e29e0
pf_test() at pf_test+0x1014/frame 0xfffffe001b1e2b80
pf_check_out() at pf_check_out+0x22/frame 0xfffffe001b1e2ba0
pfil_mbuf_out() at pfil_mbuf_out+0x58/frame 0xfffffe001b1e2bd0
ip_output() at ip_output+0xce6/frame 0xfffffe001b1e2cd0
ip_forward() at ip_forward+0x413/frame 0xfffffe001b1e2d80
ip_input() at ip_input+0x814/frame 0xfffffe001b1e2de0
swi_net() at swi_net+0x19b/frame 0xfffffe001b1e2e60
ithread_loop() at ithread_loop+0x266/frame 0xfffffe001b1e2ef0
fork_exit() at fork_exit+0x82/frame 0xfffffe001b1e2f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001b1e2f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs> show registers
cs 0x20
ds 0x3b
es 0x3b
fs 0x13
gs 0x1b
ss 0x28
rax 0x12
rcx 0xffffffff814b1a68
rdx 0xffffffff814a2042
rbx 0x100
rsp 0xfffffe001b1e2600
rbp 0xfffffe001b1e2600
rsi 0x80
rdi 0xffffffff83024090 cnputs_mtx
r8 0
r9 0
r10 0
r11 0
r12 0
r13 0
r14 0xffffffff81416f35
r15 0xfffffe00c641f560
rip 0xffffffff80d33612 kdb_enter+0x32
rflags 0x86
kdb_enter+0x32: movq $0,0x234c273(%rip)
db:1:pfs> show pcpu
cpuid = 5
dynamic pcpu = 0xfffffe009ca4fcc0
curthread = 0xfffffe00c641f560: pid 12 tid 100062 critnest 1 "swi1: netisr 5"
curpcb = 0xfffffe00c641fa80
fpcurthread = none
idlethread = 0xfffffe00c6369000: tid 100008 "idle: cpu5"
self = 0xffffffff84215000
curpmap = 0xffffffff83023ab0
tssp = 0xffffffff84215384
rsp0 = 0xfffffe001b1e3000
kcr3 = 0x80000000c5c5c001
ucr3 = 0xffffffffffffffff
scr3 = 0x23472bb5d
gs32p = 0xffffffff84215404
ldt = 0xffffffff84215444
tss = 0xffffffff84215434
curvnet = 0xfffff80001177f40
spin locks held:

Actions #12

Updated by Kristof Provost about 1 year ago

Yes, but what version are you running?
Post the output of "uname -a" and "pkg info pfSense-kernel-pfSense".

Actions #13

Updated by Vladimir Suhhanov about 1 year ago

Sorry, it just slipped my mind…

FreeBSD 14.0-CURRENT amd64 1400094 #1 plus-RELENG_23_09-n256151-106588946ac: Mon Oct 16 03:09:09 UTC 2023 root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/nZsGV1oA/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/sources/FreeBSD-src-plus-RELENG_23_09/amd64.amd64/sys/pfSense-DEBUG amd64

pfSense-kernel-pfSense-23.09.b.20231016.0231
Name : pfSense-kernel-pfSense
Version : 23.09.b.20231016.0231
Installed on : Mon Oct 16 14:34:32 2023 EEST
Origin : security/pfSense-kernel
Architecture : FreeBSD:14:amd64
Prefix : /
Categories : security
Licenses : APACHE20
Maintainer :
WWW : http://www.pfsense.org/
Comment : pfSense kernel (pfSense)
Annotations :
repo_type : binary
repository : pfSense-core
Flat size : 87.4MiB
Description :
pfSense kernel (pfSense)

WWW: http://www.pfsense.org/

Actions #14

Updated by Kristof Provost about 1 year ago

Cheers, that helped!

I think I see what happened here. Basically, I fixed the problem upstream and missed a case in the pfSense tree. At least it ought to be a trivial fix.

Actions #15

Updated by Jim Pingle about 1 year ago

  • Status changed from Resolved to In Progress
Actions #16

Updated by Kristof Provost about 1 year ago

  • Status changed from In Progress to Feedback

I've pushed a fix to all relevant branches (including 23.09). It'll be part of the next snapshot builds.

Actions #17

Updated by Vladimir Suhhanov about 1 year ago

Yes, looks like it is ok now. No more crashes on beta 23.09

Actions #18

Updated by Jim Pingle about 1 year ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100
Actions #19

Updated by Jim Pingle about 1 year ago

  • Target version changed from 2.8.0 to 2.7.1