Bug #14917
Closed
Multicast traffic on a detached interface causes a panic
100%
Description
Multicast traffic can be sent over an interface that is down, which triggers a panic.
Here pimd is routing multicast traffic and the interface loses link:
Panic:
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80f051cb
stack pointer = 0x28:0xfffffe0137f7cb30
frame pointer = 0x28:0xfffffe0137f7cb60
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 58230 (pimd)
rdi: fffffe00c6a48f18 rsi: 4 rdx: 1
rcx: 0 r8: 0 r9: fffff8000b173000
rax: 100 rbx: fffffe013781dc80 rbp: fffffe0137f7cb60
r10: 0 r11: 80000003151cee01 r12: fffffe013781dc80
r13: 0 r14: fffff80141883300 r15: 0
trap number = 12
panic: page fault
cpuid = 11
time = 1697857988
KDB: enter: panic
Backtrace:
db:1:pfs> bt
Tracing pid 58230 tid 100925 td 0xfffffe013781dc80
kdb_enter() at kdb_enter+0x32/frame 0xfffffe0137f7c8f0
vpanic() at vpanic+0x183/frame 0xfffffe0137f7c940
panic() at panic+0x43/frame 0xfffffe0137f7c9a0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe0137f7ca00
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0137f7ca60
calltrap() at calltrap+0x8/frame 0xfffffe0137f7ca60
--- trap 0xc, rip = 0xffffffff80f051cb, rsp = 0xfffffe0137f7cb30, rbp = 0xfffffe0137f7cb60 ---
X_ip_mrouter_done() at X_ip_mrouter_done+0x32b/frame 0xfffffe0137f7cb60
rip_detach() at rip_detach+0x3f/frame 0xfffffe0137f7cb90
sorele_locked() at sorele_locked+0x89/frame 0xfffffe0137f7cbb0
soclose() at soclose+0x14a/frame 0xfffffe0137f7cc10
_fdrop() at _fdrop+0x11/frame 0xfffffe0137f7cc30
closef() at closef+0x24b/frame 0xfffffe0137f7ccc0
fdescfree() at fdescfree+0x516/frame 0xfffffe0137f7cd90
exit1() at exit1+0x4c6/frame 0xfffffe0137f7cdf0
sys_exit() at sys_exit+0xd/frame 0xfffffe0137f7ce00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe0137f7cf30
See: https://forum.netgate.com/topic/183587/crash-when-rebooting-lan-side-switch
Updated by Kristof Provost 12 months ago
One report decodes to FreeBSD-src-RELENG_2_7_1/sys/netinet/ip_mroute.c:815, or `LIST_FOREACH_SAFE(rt, &V_mfchashtbl[i], mfc_hash, nrt)`.
We also know we're getting a NULL (read) dereference, with no offset (because `fault virtual address = 0x0`), which as far as I can tell can only happen there if V_mfchashtbl is NULL.
LIST_FOREACH_SAFE() calls LIST_FIRST() on &V_mfchashtbl[i], which reads lh_first. That is the first member of the list head, so for bucket 0 it sits at the very start of V_mfchashtbl.
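To make the "no offset" argument concrete, here is a small userspace sketch of the address arithmetic. The struct below is only a stand-in for what LIST_HEAD() expands to, and `first_read_addr()` is a hypothetical helper, not kernel code. It assumes the cleanup loop starts at bucket 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel types; the names are illustrative only. */
struct mfc;                         /* opaque table entry */
struct mfchashhdr {                 /* roughly what LIST_HEAD() produces */
	struct mfc *lh_first;       /* sole member, so it sits at offset 0 */
};

/* lh_first must be at offset 0 for the reasoning above to hold. */
static_assert(offsetof(struct mfchashhdr, lh_first) == 0,
    "lh_first is expected to be the first member");

/*
 * Address that LIST_FIRST(&tbl[i]) would read from, computed with
 * integer arithmetic so that a NULL table can be passed safely.
 */
static uintptr_t
first_read_addr(const struct mfchashhdr *tbl, size_t i)
{
	return ((uintptr_t)tbl + i * sizeof(struct mfchashhdr) +
	    offsetof(struct mfchashhdr, lh_first));
}
```

With a NULL table and bucket 0 the computed address is 0, matching the reported fault address. Any later bucket would fault at a small non-zero address instead.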
I can think of two ways this could happen, but can't immediately prove either one is truly responsible.
The first is the allocation of V_mfchashtbl in ip_mrouter_init(). It's done with HASH_NOWAIT, which translates to M_NOWAIT, so the allocation might fail. That's never checked though, so we might end up cleaning up with it being NULL, which could produce this crash. It's quite odd that we don't see other panics resulting from this though.
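As a rough illustration of the first scenario, the userspace sketch below checks the result of a no-wait allocation and makes cleanup tolerate a table that was never allocated. Everything in it (`struct mrouter`, `table_alloc_nowait()`, `mrouter_init()`, `mrouter_done()`) is a hypothetical stand-in, not the code in ip_mroute.c, and it is not necessarily how the actual fix is written.

```c
#include <assert.h>
#include <stdlib.h>

struct entry { struct entry *next; };
struct bucket { struct entry *first; };

struct mrouter {
	struct bucket *tbl;     /* NULL if the allocation failed */
	size_t nbuckets;
};

/* Stand-in for an M_NOWAIT-style allocation, which may return NULL. */
static struct bucket *
table_alloc_nowait(size_t n, int simulate_failure)
{
	if (simulate_failure)
		return (NULL);
	return (calloc(n, sizeof(struct bucket)));
}

/* Returns 0 on success, -1 if the table could not be allocated. */
static int
mrouter_init(struct mrouter *mr, size_t n, int simulate_failure)
{
	assert(mr != NULL);
	mr->tbl = table_alloc_nowait(n, simulate_failure);
	mr->nbuckets = (mr->tbl != NULL) ? n : 0;
	return (mr->tbl != NULL ? 0 : -1);
}

/* Frees all entries and the table; returns the number of entries freed. */
static int
mrouter_done(struct mrouter *mr)
{
	int freed = 0;

	assert(mr != NULL);
	if (mr->tbl == NULL)    /* nothing was allocated, nothing to walk */
		return (0);
	for (size_t i = 0; i < mr->nbuckets; i++) {
		struct entry *e = mr->tbl[i].first;
		while (e != NULL) {
			struct entry *next = e->next;
			free(e);
			freed++;
			e = next;
		}
	}
	free(mr->tbl);
	mr->tbl = NULL;
	mr->nbuckets = 0;
	return (freed);
}
```

Either check alone would avoid the NULL walk: failing init stops the router from being set up at all, and the guard in cleanup covers any path that still reaches it.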
The second potential cause would be a race in cleanup, because I don't trust the if (V_ip_mrouter == NULL) / V_ip_mrouter = NULL construct to reliably prevent races there. I'd expect to see different backtraces for this if there were multiple paths to get here though, and it ought to also be very rare. I'm also not sure this is the first thing that'd break in such a scenario. It'd certainly fail differently on a debug (i.e. with INVARIANTS) kernel.
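For the second scenario, the concern is that a separate "check, then clear" leaves a window in which two callers both see a non-NULL value and both run cleanup. One generic way to close that window is to claim the pointer with a single atomic exchange. The sketch below uses C11 atomics in userspace; `claim_teardown()` is a hypothetical helper, and the kernel would use its own locking or atomic primitives rather than this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Atomically take ownership of whatever is stored in *slot and leave
 * NULL behind. Only the caller that gets a non-NULL result should run
 * cleanup; every other caller sees NULL and backs off.
 */
static void *
claim_teardown(_Atomic(void *) *slot)
{
	assert(slot != NULL);
	return (atomic_exchange(slot, (void *)NULL));
}
```

Because the read and the clear happen in one step, at most one caller can ever observe the old non-NULL value.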
Updated by Kristof Provost 12 months ago
Forcing V_mfchashtbl to NULL produces a panic on that exact line in X_ip_mrouter_done, with the same `fault virtual address = 0x0` read failure, so I'm relatively confident that is indeed the cause of the panic.
The fix is trivial, so I'll push that in so affected users can see if the new snapshots really fix the problem.
Updated by Kristof Provost 12 months ago
- Status changed from New to Feedback
- Assignee set to Kristof Provost
I've picked the relevant change (https://cgit.freebsd.org/src/commit/?id=b01cad6d3a8523101e7915809144f47e3045067f) to devel-main and plus-devel-main.
Tomorrow's snapshot builds should have that, and will hopefully no longer panic.
Updated by Kristof Provost 12 months ago
The relevant commit has also been cherry-picked into 2.7.2 and 23.09.1.
Updated by Jim Pingle 12 months ago
- Status changed from Feedback to Closed
- % Done changed from 0 to 100
The original issue here is rare and difficult to reproduce, only affecting a small number of users. Since we don't have a viable way to confirm it's fixed, we can close this out for now. Should the problem recur in the wild on a version with this fix in place, we can reopen the issue if needed or start a new issue and cross-reference this one.
Updated by Daniel Ben-Zvi 8 months ago
Jim Pingle wrote in #note-7:
> The original issue here is rare and difficult to reproduce, only affecting a small number of users. Since we don't have a viable way to confirm it's fixed, we can close this out for now. Should the problem recur in the wild on a version with this fix in place, we can reopen the issue if needed or start a new issue and cross-reference this one.
I can confirm that this bug still exists under version 23.09.1:
Tracing pid 29404 tid 365896 td 0xfffffe00baae8ac0
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00d9c46820
vpanic() at vpanic+0x163/frame 0xfffffe00d9c46950
panic() at panic+0x43/frame 0xfffffe00d9c469b0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00d9c46a10
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00d9c46a70
calltrap() at calltrap+0x8/frame 0xfffffe00d9c46a70
--- trap 0xc, rip = 0xffffffff80ef899a, rsp = 0xfffffe00d9c46b40, rbp = 0xfffffe00d9c46b70 ---
X_ip_mrouter_done() at X_ip_mrouter_done+0x32a/frame 0xfffffe00d9c46b70
rip_detach() at rip_detach+0x3f/frame 0xfffffe00d9c46ba0
sorele_locked() at sorele_locked+0x89/frame 0xfffffe00d9c46bc0
soclose() at soclose+0x14a/frame 0xfffffe00d9c46c20
_fdrop() at _fdrop+0x11/frame 0xfffffe00d9c46c40
closef() at closef+0x24a/frame 0xfffffe00d9c46cd0
fdescfree() at fdescfree+0x4c6/frame 0xfffffe00d9c46d90
exit1() at exit1+0x49e/frame 0xfffffe00d9c46df0
sys_exit() at sys_exit+0xd/frame 0xfffffe00d9c46e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00d9c46f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d9c46f30