Bug #14917

Multicast traffic on a detached interface causes a panic

Added by Steve Wheeler about 1 year ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Category: Interfaces
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 23.09.1
Release Notes: Default
Affected Version: 2.7.0
Affected Architecture: All

Description

Multicast traffic can be sent over an interface that is down, which triggers a panic.

Here pimd is routing multicast traffic and the interface loses link:
Panic:

fault virtual address    = 0x0
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff80f051cb
stack pointer            = 0x28:0xfffffe0137f7cb30
frame pointer            = 0x28:0xfffffe0137f7cb60
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 58230 (pimd)
rdi: fffffe00c6a48f18 rsi:                4 rdx:                1
rcx:                0  r8:                0  r9: fffff8000b173000
rax:              100 rbx: fffffe013781dc80 rbp: fffffe0137f7cb60
r10:                0 r11: 80000003151cee01 r12: fffffe013781dc80
r13:                0 r14: fffff80141883300 r15:                0
trap number        = 12
panic: page fault
cpuid = 11
time = 1697857988
KDB: enter: panic

Backtrace:

db:1:pfs> bt
Tracing pid 58230 tid 100925 td 0xfffffe013781dc80
kdb_enter() at kdb_enter+0x32/frame 0xfffffe0137f7c8f0
vpanic() at vpanic+0x183/frame 0xfffffe0137f7c940
panic() at panic+0x43/frame 0xfffffe0137f7c9a0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe0137f7ca00
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0137f7ca60
calltrap() at calltrap+0x8/frame 0xfffffe0137f7ca60
--- trap 0xc, rip = 0xffffffff80f051cb, rsp = 0xfffffe0137f7cb30, rbp = 0xfffffe0137f7cb60 ---
X_ip_mrouter_done() at X_ip_mrouter_done+0x32b/frame 0xfffffe0137f7cb60
rip_detach() at rip_detach+0x3f/frame 0xfffffe0137f7cb90
sorele_locked() at sorele_locked+0x89/frame 0xfffffe0137f7cbb0
soclose() at soclose+0x14a/frame 0xfffffe0137f7cc10
_fdrop() at _fdrop+0x11/frame 0xfffffe0137f7cc30
closef() at closef+0x24b/frame 0xfffffe0137f7ccc0
fdescfree() at fdescfree+0x516/frame 0xfffffe0137f7cd90
exit1() at exit1+0x4c6/frame 0xfffffe0137f7cdf0
sys_exit() at sys_exit+0xd/frame 0xfffffe0137f7ce00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe0137f7cf30

See: https://forum.netgate.com/topic/183587/crash-when-rebooting-lan-side-switch

Actions #2

Updated by Kristof Provost 12 months ago

One report decodes to FreeBSD-src-RELENG_2_7_1/sys/netinet/ip_mroute.c:815, or `LIST_FOREACH_SAFE(rt, &V_mfchashtbl[i], mfc_hash, nrt)`.

We also know we're getting a NULL (read) dereference with no offset (because `fault virtual address = 0x0`), which as far as I can tell can only happen there if V_mfchashtbl is NULL.
LIST_FOREACH_SAFE() calls LIST_FIRST() on &V_mfchashtbl[i], which reads lh_first. That is the first member of the list head, so for i = 0 it is the very first word of V_mfchashtbl, hence the fault address of exactly 0x0.

I can think of two ways this could happen, but can't immediately prove either one is truly responsible.
The first is the allocation of V_mfchashtbl in ip_mrouter_init(). It's done with HASH_NOWAIT, which translates to M_NOWAIT, so the allocation might fail. That's never checked though, so we might end up cleaning up with it being NULL, which could produce this crash. It's quite odd that we don't see other panics resulting from this though.

The second potential cause would be a race in cleanup, because I don't trust the `if (V_ip_mrouter == NULL)` / `V_ip_mrouter = NULL` construct to reliably prevent races there. I'd expect to see different backtraces for this if there were multiple paths to get here though, and it ought to also be very rare. I'm also not sure this is the first thing that'd break in such a scenario. It'd certainly fail differently on a debug (i.e. with INVARIANTS) kernel.

Actions #3

Updated by Kristof Provost 12 months ago

Forcing V_mfchashtbl to NULL produces a panic on that exact line in X_ip_mrouter_done, with the same `fault virtual address = 0x0` read failure, so I'm relatively confident that is indeed the cause of the panic.
The fix is trivial, so I'll push that in so affected users can see if the new snapshots really fix the problem.

Actions #4

Updated by Kristof Provost 12 months ago

  • Status changed from New to Feedback
  • Assignee set to Kristof Provost

I've picked the relevant change (https://cgit.freebsd.org/src/commit/?id=b01cad6d3a8523101e7915809144f47e3045067f) to devel-main and plus-devel-main.
Tomorrow's snapshot builds should have that, and will hopefully no longer panic.

Actions #5

Updated by Marcos M 12 months ago

  • Target version changed from 2.8.0 to 2.7.2
  • Plus Target Version changed from 24.03 to 23.09.1

Actions #6

Updated by Kristof Provost 12 months ago

The relevant commit has also been cherry-picked to 2.7.2 and 23.09.1.

Actions #7

Updated by Jim Pingle 12 months ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

The original issue here is rare and difficult to reproduce, only affecting a small number of users. Since we don't have a viable way to confirm it's fixed, we can close this out for now. Should the problem recur in the wild on a version with this fix in place, we can reopen the issue if needed or start a new issue and cross-reference this one.

Actions #8

Updated by Daniel Ben-Zvi 8 months ago

Jim Pingle wrote in #note-7:

The original issue here is rare and difficult to reproduce, only affecting a small number of users. Since we don't have a viable way to confirm it's fixed, we can close this out for now. Should the problem recur in the wild on a version with this fix in place, we can reopen the issue if needed or start a new issue and cross-reference this one.

I can confirm that this bug still exists under version 23.09.1:


Tracing pid 29404 tid 365896 td 0xfffffe00baae8ac0
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00d9c46820
vpanic() at vpanic+0x163/frame 0xfffffe00d9c46950
panic() at panic+0x43/frame 0xfffffe00d9c469b0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00d9c46a10
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00d9c46a70
calltrap() at calltrap+0x8/frame 0xfffffe00d9c46a70
--- trap 0xc, rip = 0xffffffff80ef899a, rsp = 0xfffffe00d9c46b40, rbp = 0xfffffe00d9c46b70 ---
X_ip_mrouter_done() at X_ip_mrouter_done+0x32a/frame 0xfffffe00d9c46b70
rip_detach() at rip_detach+0x3f/frame 0xfffffe00d9c46ba0
sorele_locked() at sorele_locked+0x89/frame 0xfffffe00d9c46bc0
soclose() at soclose+0x14a/frame 0xfffffe00d9c46c20
_fdrop() at _fdrop+0x11/frame 0xfffffe00d9c46c40
closef() at closef+0x24a/frame 0xfffffe00d9c46cd0
fdescfree() at fdescfree+0x4c6/frame 0xfffffe00d9c46d90
exit1() at exit1+0x49e/frame 0xfffffe00d9c46df0
sys_exit() at sys_exit+0xd/frame 0xfffffe00d9c46e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00d9c46f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d9c46f30
