Regression #12217

open

Kernel panic in IPFW when using Captive Portal

Added by Jim Pingle 3 months ago. Updated about 1 month ago.

Status: Feedback
Priority: Very High
Category: Captive Portal
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 21.09
Release Notes: Force Exclusion
Affected Version: 2.6.0
Affected Architecture:

Description

Starting around the 2.6.0 snapshot on August 3rd (20210803*), a VM configured for HA with Captive Portal experiences a kernel panic at boot. The same VM with the same config is stable using a snapshot from a few days prior, 20210731*.

If I disable Captive Portal, the system boots successfully and does not panic. The portal has very few settings active: only local authentication and vouchers are enabled.

After inspecting the textdump contents, Kristof suggested the following patch:

diff --git a/sys/netpfil/ipfw/ip_fw2.c b/sys/netpfil/ipfw/ip_fw2.c
index 7b3038b8f1c..50ff6676d55 100644
--- a/sys/netpfil/ipfw/ip_fw2.c
+++ b/sys/netpfil/ipfw/ip_fw2.c
@@ -1928,7 +1928,8 @@ do {                                              \
                        }

                        case O_MACADDR2_LOOKUP:
-                               if (args->eh != NULL) { /* have MAC header */
+                               if ((args->flags & IPFW_ARGS_ETHER) &&
+                                   args->eh != NULL) { /* have MAC header */
                                        uint32_t v = 0;
                                        match = ipfw_lookup_table(chain,
                                            cmd->arg1, 0, args->eh, &v, NULL,

Textdumps from two panics are attached. They contain the same backtrace and panic message (aside from time values and slight differences in some memory addresses):

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address    = 0x3
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff84346fd2
stack pointer            = 0x28:0xfffffe000e7b7590
frame pointer            = 0x28:0xfffffe000e7b7610
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 12 (swi4: clock (0))
trap number        = 12
panic: page fault
cpuid = 1
time = 1628171656
KDB: enter: panic
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100028 td 0xfffff8000516f740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000e7b7250
vpanic() at vpanic+0x197/frame 0xfffffe000e7b72a0
panic() at panic+0x43/frame 0xfffffe000e7b7300
trap_fatal() at trap_fatal+0x391/frame 0xfffffe000e7b7360
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000e7b73b0
trap() at trap+0x286/frame 0xfffffe000e7b74c0
calltrap() at calltrap+0x8/frame 0xfffffe000e7b74c0
--- trap 0xc, rip = 0xffffffff84346fd2, rsp = 0xfffffe000e7b7590, rbp = 0xfffffe000e7b7610 ---
ta_lookup_mhash() at ta_lookup_mhash+0x62/frame 0xfffffe000e7b7610
ipfw_chk() at ipfw_chk+0x226f/frame 0xfffffe000e7b7840
ipfw_check_packet() at ipfw_check_packet+0xf0/frame 0xfffffe000e7b7920
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe000e7b79b0
ip_output() at ip_output+0xb4f/frame 0xfffffe000e7b7af0
carp_send_ad_locked() at carp_send_ad_locked+0x26a/frame 0xfffffe000e7b7b90
carp_send_ad() at carp_send_ad+0x33/frame 0xfffffe000e7b7bc0
softclock_call_cc() at softclock_call_cc+0x141/frame 0xfffffe000e7b7c70
softclock() at softclock+0x79/frame 0xfffffe000e7b7c90
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe000e7b7cf0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000e7b7d30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000e7b7d30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Files

textdump.0.tar (154 KB), Jim Pingle, 08/05/2021 09:11 AM
textdump.1.tar (154 KB), Jim Pingle, 08/05/2021 09:11 AM
textdump.2.tar (154 KB), textdump from test VM without CARP, Jim Pingle, 08/05/2021 09:44 AM
textdump.3.tar (154 KB), Jim Pingle, 09/07/2021 07:51 AM

Actions #1

Updated by Jim Pingle 3 months ago

This is actually easier to reproduce than I thought. If I take a fresh install of pfSense CE on a current snapshot (2.6.0.a.20210805.0500) and configure Captive Portal on LAN (no authentication, no other options enabled), then add a CARP VIP to LAN, it panics while applying the CARP VIP. Same backtrace as above.

Looking closer, the backtrace does differ slightly. At the point where it happened, the CARP VIP had not yet been applied, only saved:

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100026 td 0xfffff8000515b740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000e7a5070
vpanic() at vpanic+0x197/frame 0xfffffe000e7a50c0
panic() at panic+0x43/frame 0xfffffe000e7a5120
vm_fault() at vm_fault+0x24f2/frame 0xfffffe000e7a5270
vm_fault_trap() at vm_fault_trap+0x60/frame 0xfffffe000e7a52b0
trap_pfault() at trap_pfault+0x19c/frame 0xfffffe000e7a5300
trap() at trap+0x286/frame 0xfffffe000e7a5410
calltrap() at calltrap+0x8/frame 0xfffffe000e7a5410
--- trap 0xc, rip = 0xffffffff8433efd2, rsp = 0xfffffe000e7a54e0, rbp = 0xfffffe000e7a5560 ---
ta_lookup_mhash() at ta_lookup_mhash+0x62/frame 0xfffffe000e7a5560
ipfw_chk() at ipfw_chk+0x226f/frame 0xfffffe000e7a5790
ipfw_check_packet() at ipfw_check_packet+0xf0/frame 0xfffffe000e7a5870
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe000e7a5900
ip6_input() at ip6_input+0x70e/frame 0xfffffe000e7a59e0
swi_net() at swi_net+0x12b/frame 0xfffffe000e7a5a50
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe000e7a5ab0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000e7a5af0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000e7a5af0

There are no CARP functions in that backtrace, so there may be some other factor that isn't obvious yet.

Actions #2

Updated by Jim Pingle 3 months ago

  • Subject changed from Kernel panic in IPFW when using Captive Portal with CARP to Kernel panic in IPFW when using Captive Portal

Removing CARP from the subject since it doesn't appear to be a requirement to reproduce.

Actions #3

Updated by Jim Pingle 3 months ago

Attaching textdump from test VM without CARP.

Actions #4

Updated by Kristof Provost 2 months ago

  • Status changed from New to Feedback

Fix pushed to https://gitlab.netgate.com/pfSense/FreeBSD-src/-/commit/41d976b3b37dfcc66b14c67f610474e94b3d49dd (devel-12). I expect it to be merged to the plus-devel-12 branch as part of the regular merges.

struct ip_fw_args can contain a pointer to the Ethernet header of the packet,
but to know if it's safe to dereference the 'eh' pointer we cannot simply
NULL check it. The pointer lives in a union with other data, and moreover,
since e1075a56bca3dc9b7307b9f4813e6abedf4f8788 ipfw_check_packet() no longer
fully zeroes struct ip_fw_args before handing it to ipfw_chk().
In other words: it's possible for this pointer to contain junk data. We must
check for the IPFW_ARGS_ETHER flag instead.
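
To illustrate the reasoning above, here is a minimal, self-contained sketch in C. It is not the actual struct ip_fw_args: the names demo_args, DEMO_ARGS_ETHER, and the helper functions are hypothetical and exist only for this example. It shows a pointer that shares union storage with unrelated data, so it can be non-NULL without pointing at an Ethernet header, and why the flag has to be tested before the pointer is trusted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag, standing in for the role IPFW_ARGS_ETHER plays. */
#define DEMO_ARGS_ETHER 0x1u

/* Hypothetical stand-in for a struct whose pointer shares storage. */
struct demo_args {
	uint32_t flags;
	union {
		const void *eh;    /* meaningful only if DEMO_ARGS_ETHER is set */
		const char *label; /* unrelated data in the same bytes */
	} u;
};

/* Build an args struct that carries a label, not an Ethernet header. */
struct demo_args
make_labelled(const char *label)
{
	struct demo_args a;

	a.flags = 0;
	a.u.label = label;
	return (a);
}

/* Build an args struct that really carries an Ethernet header. */
struct demo_args
make_ether(const void *hdr)
{
	struct demo_args a;

	a.flags = DEMO_ARGS_ETHER;
	a.u.eh = hdr;
	return (a);
}

/* Unsafe: any non-NULL bit pattern in the union looks like a header. */
bool
has_mac_header_unsafe(const struct demo_args *a)
{
	return (a->u.eh != NULL);
}

/* Safer: trust the pointer only when the flag says it is valid. */
bool
has_mac_header_checked(const struct demo_args *a)
{
	return ((a->flags & DEMO_ARGS_ETHER) != 0 && a->u.eh != NULL);
}
```

With these definitions, a demo_args built by make_labelled() passes the NULL-only check but fails the flag-based check, while one built by make_ether() passes both. The same applies when the storage simply holds stale bytes because the struct was not zeroed.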

Actions #5

Updated by Jim Pingle 2 months ago

  • % Done changed from 0 to 100
  • Plus Target Version set to 21.09

So far, so good with the latest snapshot (2.6.0.a.20210817.0500). I've updated several systems which easily crashed at boot or within moments of turning on Captive Portal before, and thus far they have not had a panic.

I'll keep it open for another day or so to be certain but at the moment it looks like it can be closed.

Actions #6

Updated by Jim Pingle 2 months ago

  • Status changed from Feedback to Resolved

Things are still stable here after running a couple days and also updating again. Closing this out for now, will reopen if anything related comes up.

Actions #7

Updated by Jim Pingle 2 months ago

  • Release Notes changed from Default to Force Exclusion
  • Affected Version set to 2.6.0

Actions #8

Updated by Jim Pingle about 1 month ago

Not sure if the original fix got dropped somehow or if this is new, but the backtrace is slightly different. It's crashing again on current snapshots:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address    = 0x9e
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff81381630
stack pointer            = 0x28:0xfffffe000043c0c0
frame pointer            = 0x28:0xfffffe000043c0c0
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 12 (irq264: virtio_pci4)
trap number        = 12
panic: page fault
cpuid = 0
time = 1631018367
KDB: enter: panic
db:0:kdb.enter.default>  show pcpu
cpuid        = 0
dynamic pcpu = 0xd5c340
curthread    = 0xfffff80005494740: pid 12 tid 100056 "irq264: virtio_pci4" 
curpcb       = 0xfffff80005494ce0
fpcurthread  = none
idlethread   = 0xfffff8000512a000: tid 100003 "idle: cpu0" 
curpmap      = 0xffffffff8368f568
tssp         = 0xffffffff837196a0
commontssp   = 0xffffffff837196a0
rsp0         = 0xfffffe000043cbc0
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff8371feb8
ldt          = 0xffffffff8371fef8
tss          = 0xffffffff8371fee8
tlb gen      = 0
curvnet      = 0xfffff80005069c00
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100056 td 0xfffff80005494740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000043bd80
vpanic() at vpanic+0x197/frame 0xfffffe000043bdd0
panic() at panic+0x43/frame 0xfffffe000043be30
trap_fatal() at trap_fatal+0x391/frame 0xfffffe000043be90
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000043bee0
trap() at trap+0x286/frame 0xfffffe000043bff0
calltrap() at calltrap+0x8/frame 0xfffffe000043bff0
--- trap 0xc, rip = 0xffffffff81381630, rsp = 0xfffffe000043c0c0, rbp = 0xfffffe000043c0c0 ---
memcmp() at memcmp+0x60/frame 0xfffffe000043c0c0
ta_lookup_radix() at ta_lookup_radix+0x7b/frame 0xfffffe000043c120
ipfw_chk() at ipfw_chk+0x2864/frame 0xfffffe000043c360
ipfw_check_packet() at ipfw_check_packet+0xf0/frame 0xfffffe000043c440
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe000043c4d0
ip_input() at ip_input+0x475/frame 0xfffffe000043c580
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe000043c5d0
ether_demux() at ether_demux+0x16a/frame 0xfffffe000043c600
dummynet_send() at dummynet_send+0x135/frame 0xfffffe000043c640
dummynet_io() at dummynet_io+0x391/frame 0xfffffe000043c690
ipfw_check_frame() at ipfw_check_frame+0x2f9/frame 0xfffffe000043c780
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe000043c810
ether_demux() at ether_demux+0x5c/frame 0xfffffe000043c840
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe000043c8a0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe000043c8f0
ether_input() at ether_input+0x89/frame 0xfffffe000043c950
vtnet_rxq_eof() at vtnet_rxq_eof+0x7a5/frame 0xfffffe000043ca10
vtnet_rx_vq_process() at vtnet_rx_vq_process+0xb7/frame 0xfffffe000043ca50
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe000043cab0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000043caf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000043caf0

The backtrace mentions dummynet but there are no limiters configured on this setup.

Actions #9

Updated by Jim Pingle about 1 month ago

Forgot to mention in the previous update but this crash happens when a user logs in, not as early as before.

Actions #11

Updated by Jim Pingle about 1 month ago

  • Status changed from Confirmed to Feedback

Kristof merged the request. Should be in snapshots tomorrow.
