Regression #16790: Kernel panic due to race condition on a ``bpf`` device - pfSense - pfSense bugtracker

Added by Azamat Khakimyanov 3 months ago. Updated about 2 months ago.

Status: Closed
Priority: Normal
Assignee: Mateusz Guzik
Category: Operating System
Target version: 2.9.0
Plus Target Version: 26.03.1
Release Notes: Default

Description

Two customers have already reported crashes that occurred after upgrading to pfSense Plus 26.03.
The crashes happened on two different hardware models: 7100 (HS #44225707405) and 4100 (HS #44386928982).

The crash report showed:

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 18
fault virtual address = 0x30
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80eb2b66
stack pointer        = 0x0:0xfffffe0067e2f7c0
frame pointer        = 0x0:0xfffffe0067e2f830
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_1)
rdi: fffffe0008760a00 rsi: fffffe0067e2f7e0 rdx: 0000000000000001
rcx: 00000000000042a7  r8: 0000000000000002  r9: fffff80149a99360
rax: 0000000000000000 rbx: fffff80149a99300 rbp: fffffe0067e2f830
r10: fffffe0067e2f6f0 r11: 0000000000000001 r12: 0000000000000060
r13: fffff80008751800 r14: 0000000000000000 r15: fffff8017099e800
trap number = 12
panic: page fault
cpuid = 1
time = 1776065740
KDB: enter: panic

db:1:pfs> bt
Tracing pid 0 tid 100008 td 0xfffff80100d0b000
kdb_enter() at kdb_enter+0x33/frame 0xfffffe0067e2f640
panic() at panic+0x43/frame 0xfffffe0067e2f6a0
trap_pfault() at trap_pfault+0x3cf/frame 0xfffffe0067e2f6f0
calltrap() at calltrap+0x8/frame 0xfffffe0067e2f6f0
--- trap 0xc, rip = 0xffffffff80eb2b66, rsp = 0xfffffe0067e2f7c0, rbp = 0xfffffe0067e2f830 ---
bpf_mtap() at bpf_mtap+0x86/frame 0xfffffe0067e2f830
vlan_transmit() at vlan_transmit+0x42/frame 0xfffffe0067e2f880
ether_output_frame() at ether_output_frame+0xd3/frame 0xfffffe0067e2f8b0
ether_output() at ether_output+0x697/frame 0xfffffe0067e2f940
ip_tryforward() at ip_tryforward+0x4fc/frame 0xfffffe0067e2f9e0
ip_input() at ip_input+0x31d/frame 0xfffffe0067e2fa40
netisr_dispatch_src() at netisr_dispatch_src+0x1fa/frame 0xfffffe0067e2fa90
ether_demux() at ether_demux+0x194/frame 0xfffffe0067e2fac0
ether_nh_input() at ether_nh_input+0x32b/frame 0xfffffe0067e2fb20
netisr_dispatch_src() at netisr_dispatch_src+0x9f/frame 0xfffffe0067e2fb70
ether_input() at ether_input+0x56/frame 0xfffffe0067e2fbc0
ether_demux() at ether_demux+0x8e/frame 0xfffffe0067e2fbf0
ether_nh_input() at ether_nh_input+0x32b/frame 0xfffffe0067e2fc50
netisr_dispatch_src() at netisr_dispatch_src+0x9f/frame 0xfffffe0067e2fca0
ether_input() at ether_input+0x56/frame 0xfffffe0067e2fcf0
iflib_rxeof() at iflib_rxeof+0xa6f/frame 0xfffffe0067e2fe00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe0067e2fe40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe0067e2fec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe0067e2fef0
fork_exit() at fork_exit+0x7b/frame 0xfffffe0067e2ff30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0067e2ff30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

System logs showed nothing informative.

Files

4100_textdump.tar.0 (247 KB), Azamat Khakimyanov, 04/14/2026 09:34 AM
7100_textdump.tar.0 (302 KB), Azamat Khakimyanov, 04/14/2026 09:35 AM
7100_status_output.tgz (831 KB), Azamat Khakimyanov, 04/14/2026 09:36 AM

Updated by Mateusz Guzik 3 months ago

Assignee set to Mateusz Guzik

The panicking instruction is:

ffffffff80eacb66: 48 8b 48 30           movq    0x30(%rax), %rcx

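According to the register dump in the panic, rax is NULL, so the load from 0x30(%rax) reads address 0x30, which matches the reported fault virtual address.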
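To make the arithmetic concrete, here is a minimal user-space sketch of how a NULL base plus a 0x30 displacement yields the fault address. The structure `pointee_model`, its members, and `effective_address()` are hypothetical names invented for illustration; only the 0x30 displacement comes from the disassembly, and nothing here claims to match the real kernel structure layout.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* uintptr_t */

/*
 * Hypothetical stand-in for whatever structure rax was expected to point at.
 * Only the 0x30 displacement comes from the disassembly; the member names
 * and padding are assumptions for illustration.
 */
struct pointee_model {
	unsigned char	 pad[0x30];
	void		*member_at_0x30;
};

static_assert(offsetof(struct pointee_model, member_at_0x30) == 0x30,
    "model must place the member at the displacement seen in the panic");

/* Effective address read by `movq 0x30(%rax), %rcx` for a given rax. */
static uintptr_t
effective_address(uintptr_t rax)
{
	return (rax + offsetof(struct pointee_model, member_at_0x30));
}
```

With rax equal to 0, `effective_address(0)` evaluates to 0x30, the same value the trap frame reports as the fault virtual address.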

addr2line on the rip shows:

static inline bool
bpf_chkdir(struct bpf_d *d, struct mbuf *m)
{
        return (d->bd_bif->bif_methods->bif_chkdir(d->bd_bif->bif_softc, m,
            d->bd_direction));
}

Upstream has the following commit not present in 26.03:

commit 5937e1cdc99180b4adae2cf20cabd75dd9f45546
Author: Gleb Smirnoff <glebius@FreeBSD.org>
Date:   Wed Feb 4 14:07:11 2026 -0800

    bpf: don't clear pointer from descriptor to the tap on descriptor close

    During packet processing the descriptor is looked up using epoch(9) and it
    can be accessed after bpf_detachd().  In scenario of descriptor close the
    tap point is alive (it actually produces packets) and thus the pointer can
    be legitimately dereferenced.  This fixes a race on a bpf(4) device close
    that would otherwise result in panic.

    Differential Revision:  https://reviews.freebsd.org/D55064

diff --git a/sys/net/bpf.c b/sys/net/bpf.c
index 9f0b57728e88..228ac9867bd7 100644
--- a/sys/net/bpf.c
+++ b/sys/net/bpf.c
@@ -678,8 +678,8 @@ bpf_detachd(struct bpf_d *d, bool detached_ifp)
        BPFD_LOCK(d);
        CK_LIST_REMOVE(d, bd_next);
        writer = (d->bd_writer > 0);
-       d->bd_bif = NULL;
        if (detached_ifp) {
+               d->bd_bif = NULL;
                /*
                 * Notify descriptor as it's detached, so that any
                 * sleepers wake up and get ENXIO.

Looks like a fix to this very problem.
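The effect of the one-line move in the diff can be sketched with a single-threaded user-space model. This is only an illustration under stated assumptions: `model_bif`, `model_d`, `model_detachd_old()`, `model_detachd_new()` and `reader_survives_close()` are hypothetical names, the real code relies on epoch(9) and the descriptor lock, and the model replays just one interleaving (a reader that found the descriptor before it was closed).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real structures live in sys/net/bpf.c. */
struct model_bif {
	int	bif_id;
};

struct model_d {
	struct model_bif *bd_bif;
};

/* Behaviour before the commit: pointer cleared on every detach. */
static void
model_detachd_old(struct model_d *d, bool detached_ifp)
{
	(void)detached_ifp;
	d->bd_bif = NULL;
}

/* Behaviour after the commit: cleared only when the interface goes away. */
static void
model_detachd_new(struct model_d *d, bool detached_ifp)
{
	if (detached_ifp)
		d->bd_bif = NULL;
}

/*
 * Replays one interleaving: a tap-side reader already holds `d`, the
 * descriptor is then closed (detached_ifp == false), and the reader goes on
 * to use d->bd_bif. Returns true if that use is safe, false where the
 * kernel would have faulted on a NULL pointer.
 */
static bool
reader_survives_close(void (*detachd)(struct model_d *, bool))
{
	struct model_bif tap = { .bif_id = 1 };
	struct model_d d = { .bd_bif = &tap };
	struct model_d *seen = &d;	/* reader's lookup, before close */

	assert(detachd != NULL);
	detachd(&d, false);		/* descriptor close, tap still alive */
	return (seen->bd_bif != NULL && seen->bd_bif->bif_id == 1);
}
```

In this model `reader_survives_close(model_detachd_old)` returns false and `reader_survives_close(model_detachd_new)` returns true, which mirrors the commit message: on descriptor close the tap point is still alive, so keeping the pointer lets a late reader dereference it legitimately.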

Updated by Mateusz Guzik 3 months ago

Status changed from New to Feedback

Updated by Craig Coonrad 2 months ago

There look to be two more cases this week:
6100 - HS 45137024843
8300 - HS 45133424386

Updated by Lev Prokofev 2 months ago

Hm, could it be related to https://redmine.pfsense.org/issues/16828 ?

Updated by Mateusz Guzik 2 months ago

I checked one of the textdumps in #16828; it is indeed the same problem. I think we can close that one as a duplicate.

Updated by Danilo Zrenjanin 2 months ago

Another case.
5100 - HS 45223009297

Updated by Lev Prokofev 2 months ago

One more case: 45279747897 - Whitebox

Updated by Jim Pingle 2 months ago

Target version set to 26.03.1

The fix for this was picked back to the 26.03.1 branch.

Updated by Jim Pingle 2 months ago

Subject changed from "Possible regression on 26.03 which causes firewalls to crash" to "Kernel panic due to race condition on a ``bpf`` device"

Updated by Jim Pingle 2 months ago

Project changed from pfSense Plus to pfSense
Category changed from Operating System to Operating System
Target version changed from 26.03.1 to 2.9.0
Plus Target Version set to 26.03.1

Updated by Jim Pingle 2 months ago

Private changed from Yes to No

Updated by Jim Pingle 2 months ago

Has duplicate Bug #16828: Kernel panic (page fault) in ``bpf_mtap`` via ``vlan_transmit`` with Suricata BPF listeners active on VLAN interfaces added

Updated by Jim Pingle about 2 months ago

Status changed from Feedback to Closed
