Regression #12217
closed
Kernel panic in IPFW when using Captive Portal
100%
Description
Starting around the 2.6.0 snapshot on August 3rd (20210803*), a VM configured for HA with Captive Portal experiences a kernel panic at boot. The same VM with the same config is stable using a snapshot from a few days prior, 20210731*.
If I disable Captive Portal, the system boots successfully and does not panic. The portal has very few settings active: only local authentication and vouchers are enabled.
After inspecting the textdump contents, Kristof suggested the following patch:
diff --git a/sys/netpfil/ipfw/ip_fw2.c b/sys/netpfil/ipfw/ip_fw2.c
index 7b3038b8f1c..50ff6676d55 100644
--- a/sys/netpfil/ipfw/ip_fw2.c
+++ b/sys/netpfil/ipfw/ip_fw2.c
@@ -1928,7 +1928,8 @@ do { \
 				}
 			case O_MACADDR2_LOOKUP:
-				if (args->eh != NULL) {	/* have MAC header */
+				if ((args->flags & IPFW_ARGS_ETHER) &&
+				    args->eh != NULL) {	/* have MAC header */
 					uint32_t v = 0;
 					match = ipfw_lookup_table(chain,
 					    cmd->arg1, 0, args->eh, &v, NULL,
Textdumps from two panics attached, but they contain the same backtrace and panic message (aside from time values and slight differences in some memory addresses):
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x3
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff84346fd2
stack pointer = 0x28:0xfffffe000e7b7590
frame pointer = 0x28:0xfffffe000e7b7610
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock (0))
trap number = 12
panic: page fault
cpuid = 1
time = 1628171656
KDB: enter: panic
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100028 td 0xfffff8000516f740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000e7b7250
vpanic() at vpanic+0x197/frame 0xfffffe000e7b72a0
panic() at panic+0x43/frame 0xfffffe000e7b7300
trap_fatal() at trap_fatal+0x391/frame 0xfffffe000e7b7360
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000e7b73b0
trap() at trap+0x286/frame 0xfffffe000e7b74c0
calltrap() at calltrap+0x8/frame 0xfffffe000e7b74c0
--- trap 0xc, rip = 0xffffffff84346fd2, rsp = 0xfffffe000e7b7590, rbp = 0xfffffe000e7b7610 ---
ta_lookup_mhash() at ta_lookup_mhash+0x62/frame 0xfffffe000e7b7610
ipfw_chk() at ipfw_chk+0x226f/frame 0xfffffe000e7b7840
ipfw_check_packet() at ipfw_check_packet+0xf0/frame 0xfffffe000e7b7920
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe000e7b79b0
ip_output() at ip_output+0xb4f/frame 0xfffffe000e7b7af0
carp_send_ad_locked() at carp_send_ad_locked+0x26a/frame 0xfffffe000e7b7b90
carp_send_ad() at carp_send_ad+0x33/frame 0xfffffe000e7b7bc0
softclock_call_cc() at softclock_call_cc+0x141/frame 0xfffffe000e7b7c70
softclock() at softclock+0x79/frame 0xfffffe000e7b7c90
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe000e7b7cf0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000e7b7d30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000e7b7d30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Updated by Jim Pingle about 3 years ago
This is actually easier to reproduce than I thought. If I take a fresh install of pfSense CE on a current snapshot (2.6.0.a.20210805.0500) and configure Captive Portal on LAN (no authentication, no other options enabled), then add a CARP VIP to LAN, it panics while applying the CARP VIP. Same backtrace as above.
Looking closer, the backtrace does differ slightly. At the point where it happened, the CARP VIP had not yet been applied, only saved:
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100026 td 0xfffff8000515b740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000e7a5070
vpanic() at vpanic+0x197/frame 0xfffffe000e7a50c0
panic() at panic+0x43/frame 0xfffffe000e7a5120
vm_fault() at vm_fault+0x24f2/frame 0xfffffe000e7a5270
vm_fault_trap() at vm_fault_trap+0x60/frame 0xfffffe000e7a52b0
trap_pfault() at trap_pfault+0x19c/frame 0xfffffe000e7a5300
trap() at trap+0x286/frame 0xfffffe000e7a5410
calltrap() at calltrap+0x8/frame 0xfffffe000e7a5410
--- trap 0xc, rip = 0xffffffff8433efd2, rsp = 0xfffffe000e7a54e0, rbp = 0xfffffe000e7a5560 ---
ta_lookup_mhash() at ta_lookup_mhash+0x62/frame 0xfffffe000e7a5560
ipfw_chk() at ipfw_chk+0x226f/frame 0xfffffe000e7a5790
ipfw_check_packet() at ipfw_check_packet+0xf0/frame 0xfffffe000e7a5870
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe000e7a5900
ip6_input() at ip6_input+0x70e/frame 0xfffffe000e7a59e0
swi_net() at swi_net+0x12b/frame 0xfffffe000e7a5a50
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe000e7a5ab0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000e7a5af0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000e7a5af0
There are no CARP functions in that backtrace, so there may be some other factor that isn't obvious yet.
Updated by Jim Pingle about 3 years ago
- Subject changed from Kernel panic in IPFW when using Captive Portal with CARP to Kernel panic in IPFW when using Captive Portal
Removing CARP from the subject since it doesn't appear to be a requirement to reproduce.
Updated by Jim Pingle about 3 years ago
- File textdump.2.tar textdump.2.tar added
Attaching textdump from test VM without CARP.
Updated by Kristof Provost about 3 years ago
- Status changed from New to Feedback
Fix pushed to https://gitlab.netgate.com/pfSense/FreeBSD-src/-/commit/41d976b3b37dfcc66b14c67f610474e94b3d49dd (devel-12). I expect it to get merged to the plus-devel-12 branch as part of the regular merges.
struct ip_fw_args can contain a pointer to the Ethernet header of the packet,
but to know if it's safe to dereference the 'eh' pointer we cannot simply
NULL check it. The pointer lives in a union with other data, and moreover,
since e1075a56bca3dc9b7307b9f4813e6abedf4f8788 ipfw_check_packet() no longer
fully zeroes struct ip_fw_args before handing it to ipfw_chk().
In other words: it's possible for this pointer to contain junk data. We must
check for the IPFW_ARGS_ETHER flag instead.
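To make the commit message concrete, here is a minimal user-space sketch of why a NULL check alone is not enough when a pointer shares storage with other data. Everything in it is a hypothetical stand-in: `struct demo_args`, `DEMO_ARGS_ETHER`, and the helper functions are invented for illustration and do not reflect the real layout of struct ip_fw_args or the actual kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: set only when u.eh really holds an Ethernet header. */
#define DEMO_ARGS_ETHER 0x01u

/* Hypothetical stand-in for an args struct whose pointer lives in a union. */
struct demo_args {
	uint32_t flags;
	union {
		const uint8_t *eh;	/* valid only with DEMO_ARGS_ETHER */
		uintptr_t other;	/* unrelated data in the same storage */
	} u;
};

/* Layer 3 style caller: never sets the flag, leaves other data in the union. */
static struct demo_args
make_stale_args(uintptr_t leftover)
{
	struct demo_args a;

	a.flags = 0;
	a.u.other = leftover;
	return (a);
}

/* Layer 2 style caller: stores a real header pointer and sets the flag. */
static struct demo_args
make_ether_args(const uint8_t *hdr)
{
	struct demo_args a;

	a.flags = DEMO_ARGS_ETHER;
	a.u.eh = hdr;
	return (a);
}

/* Unsafe: a non-NULL value says nothing about whether it is a pointer. */
static bool
has_mac_header_unsafe(const struct demo_args *a)
{
	return (a->u.eh != NULL);
}

/* Safe: trust the pointer only when the flag says it was stored. */
static bool
has_mac_header(const struct demo_args *a)
{
	return ((a->flags & DEMO_ARGS_ETHER) != 0 && a->u.eh != NULL);
}

/* Small self-check exercising both callers. */
void
demo_check(void)
{
	static const uint8_t hdr[14] = { 0 };
	struct demo_args stale = make_stale_args((uintptr_t)0x3);
	struct demo_args ether = make_ether_args(hdr);

	assert(has_mac_header_unsafe(&stale));	/* junk passes the NULL check */
	assert(!has_mac_header(&stale));	/* flag check rejects it */
	assert(has_mac_header(&ether));		/* real header is accepted */
}
```

In this sketch the stale value 0x3 passes the plain NULL check, so code relying on that check alone would go on to dereference it. The flag-based check rejects it without ever touching the pointer.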
Updated by Jim Pingle about 3 years ago
- % Done changed from 0 to 100
- Plus Target Version set to 21.09
So far, so good with the latest snapshot (2.6.0.a.20210817.0500). I've updated several systems which easily crashed at boot or within moments of turning on Captive Portal before, and thus far they have not had a panic.
I'll keep it open for another day or so to be certain but at the moment it looks like it can be closed.
Updated by Jim Pingle about 3 years ago
- Status changed from Feedback to Resolved
Things are still stable here after running a couple days and also updating again. Closing this out for now, will reopen if anything related comes up.
Updated by Jim Pingle about 3 years ago
- Release Notes changed from Default to Force Exclusion
- Affected Version set to 2.6.0
Updated by Jim Pingle about 3 years ago
- File textdump.3.tar textdump.3.tar added
- Status changed from Resolved to Confirmed
Not sure if the original fix got dropped somehow or if this is new, but the backtrace is slightly different. It's crashing again on current snapshots:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x9e
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff81381630
stack pointer = 0x28:0xfffffe000043c0c0
frame pointer = 0x28:0xfffffe000043c0c0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq264: virtio_pci4)
trap number = 12
panic: page fault
cpuid = 0
time = 1631018367
KDB: enter: panic
db:0:kdb.enter.default> show pcpu
cpuid = 0
dynamic pcpu = 0xd5c340
curthread = 0xfffff80005494740: pid 12 tid 100056 "irq264: virtio_pci4"
curpcb = 0xfffff80005494ce0
fpcurthread = none
idlethread = 0xfffff8000512a000: tid 100003 "idle: cpu0"
curpmap = 0xffffffff8368f568
tssp = 0xffffffff837196a0
commontssp = 0xffffffff837196a0
rsp0 = 0xfffffe000043cbc0
kcr3 = 0xffffffffffffffff
ucr3 = 0xffffffffffffffff
scr3 = 0x0
gs32p = 0xffffffff8371feb8
ldt = 0xffffffff8371fef8
tss = 0xffffffff8371fee8
tlb gen = 0
curvnet = 0xfffff80005069c00
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100056 td 0xfffff80005494740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000043bd80
vpanic() at vpanic+0x197/frame 0xfffffe000043bdd0
panic() at panic+0x43/frame 0xfffffe000043be30
trap_fatal() at trap_fatal+0x391/frame 0xfffffe000043be90
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000043bee0
trap() at trap+0x286/frame 0xfffffe000043bff0
calltrap() at calltrap+0x8/frame 0xfffffe000043bff0
--- trap 0xc, rip = 0xffffffff81381630, rsp = 0xfffffe000043c0c0, rbp = 0xfffffe000043c0c0 ---
memcmp() at memcmp+0x60/frame 0xfffffe000043c0c0
ta_lookup_radix() at ta_lookup_radix+0x7b/frame 0xfffffe000043c120
ipfw_chk() at ipfw_chk+0x2864/frame 0xfffffe000043c360
ipfw_check_packet() at ipfw_check_packet+0xf0/frame 0xfffffe000043c440
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe000043c4d0
ip_input() at ip_input+0x475/frame 0xfffffe000043c580
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe000043c5d0
ether_demux() at ether_demux+0x16a/frame 0xfffffe000043c600
dummynet_send() at dummynet_send+0x135/frame 0xfffffe000043c640
dummynet_io() at dummynet_io+0x391/frame 0xfffffe000043c690
ipfw_check_frame() at ipfw_check_frame+0x2f9/frame 0xfffffe000043c780
pfil_run_hooks() at pfil_run_hooks+0xb0/frame 0xfffffe000043c810
ether_demux() at ether_demux+0x5c/frame 0xfffffe000043c840
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe000043c8a0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe000043c8f0
ether_input() at ether_input+0x89/frame 0xfffffe000043c950
vtnet_rxq_eof() at vtnet_rxq_eof+0x7a5/frame 0xfffffe000043ca10
vtnet_rx_vq_process() at vtnet_rx_vq_process+0xb7/frame 0xfffffe000043ca50
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe000043cab0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000043caf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000043caf0
The backtrace mentions dummynet but there are no limiters configured on this setup.
Updated by Jim Pingle about 3 years ago
Forgot to mention in the previous update, but this crash happens when a user logs in, not as early as before.
Updated by Jim Pingle about 3 years ago
MR with fix from Kristof: https://gitlab.netgate.com/pfSense/FreeBSD-src/-/merge_requests/24
Updated by Jim Pingle about 3 years ago
- Status changed from Confirmed to Feedback
Kristof merged the request. Should be in snapshots tomorrow.
Updated by Jim Pingle about 3 years ago
- Plus Target Version changed from 21.09 to 22.01
Updated by Jim Pingle almost 3 years ago
- Status changed from Feedback to Resolved
Captive portal has been stable without crashing since this went in. No further sign of problems.