Regression #14431

Sending IPv6 traffic on a disabled interface can trigger a kernel panic

Added by Steve Wheeler 4 months ago. Updated 7 days ago.

Status:
New
Priority:
High
Assignee:
Category:
Interfaces
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
24.03
Release Notes:
Default
Affected Version:
Affected Architecture:

Description

This issue was hidden by https://redmine.pfsense.org/issues/14164, but now that that issue is solved in 23.05, this one is being seen.

db:1:pfs> bt
Tracing pid 93402 tid 103857 td 0xfffffe00cf7cac80
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00cf8a0800
vpanic() at vpanic+0x183/frame 0xfffffe00cf8a0850
panic() at panic+0x43/frame 0xfffffe00cf8a08b0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00cf8a0910
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00cf8a0970
calltrap() at calltrap+0x8/frame 0xfffffe00cf8a0970
--- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00cf8a0a40, rbp = 0xfffffe00cf8a0a70 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00cf8a0a70
tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00cf8a0c60
tcp_output() at tcp_output+0x14/frame 0xfffffe00cf8a0c80
tcp6_usr_connect() at tcp6_usr_connect+0x2f4/frame 0xfffffe00cf8a0d10
soconnectat() at soconnectat+0x9e/frame 0xfffffe00cf8a0d60
kern_connectat() at kern_connectat+0xc9/frame 0xfffffe00cf8a0dc0
sys_connect() at sys_connect+0x75/frame 0xfffffe00cf8a0e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00cf8a0f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00cf8a0f30
--- syscall (98, FreeBSD ELF64, connect), rip = 0x800fddc8a, rsp = 0x7fffdf5f8c98, rbp = 0x7fffdf5f8cd0 ---
db:1:pfs> bt
Tracing pid 68614 tid 100330 td 0xfffffe00cf325720
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00c7d955f0
vpanic() at vpanic+0x183/frame 0xfffffe00c7d95640
panic() at panic+0x43/frame 0xfffffe00c7d956a0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00c7d95700
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00c7d95760
calltrap() at calltrap+0x8/frame 0xfffffe00c7d95760
--- trap 0xc, rip = 0xffffffff80f63aa4, rsp = 0xfffffe00c7d95830, rbp = 0xfffffe00c7d95a50 ---
ip6_output() at ip6_output+0xb74/frame 0xfffffe00c7d95a50
udp6_send() at udp6_send+0x78e/frame 0xfffffe00c7d95c10
sosend_dgram() at sosend_dgram+0x357/frame 0xfffffe00c7d95c70
sousrsend() at sousrsend+0x5f/frame 0xfffffe00c7d95cd0
kern_sendit() at kern_sendit+0x132/frame 0xfffffe00c7d95d60
sendit() at sendit+0xb7/frame 0xfffffe00c7d95db0
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe00c7d95e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00c7d95f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c7d95f30
--- syscall (133, FreeBSD ELF64, sendto), rip = 0x823f95f2a, rsp = 0x8202cea88, rbp = 0x8202cead0 ---
db:1:pfs> bt
Tracing pid 2 tid 100041 td 0xfffffe0085264560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850ad910
vpanic() at vpanic+0x183/frame 0xfffffe00850ad960
panic() at panic+0x43/frame 0xfffffe00850ad9c0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00850ada20
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850ada80
calltrap() at calltrap+0x8/frame 0xfffffe00850ada80
--- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00850adb50, rbp = 0xfffffe00850adb80 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850adb80
tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00850add70
tcp_timer_rexmt() at tcp_timer_rexmt+0x514/frame 0xfffffe00850addd0
tcp_timer_enter() at tcp_timer_enter+0x102/frame 0xfffffe00850ade10
softclock_call_cc() at softclock_call_cc+0x13c/frame 0xfffffe00850adec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850adef0
fork_exit() at fork_exit+0x7d/frame 0xfffffe00850adf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850adf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs>

Related issues

Related to Bug #14575: Renewing the PPPoE WAN causes a crash if Tailscale is enabled (Duplicate)

Actions #1

Updated by Marcos M 4 months ago

  • Description updated (diff)
Actions #2

Updated by Rob A 4 months ago

To add additional context that may aid in diagnostics:

  • The issue presents with any change in WAN interface status, virtual or physical (e.g. an unplug event)
  • This can include a PPPoE tunnel to an ISP being bounced or refreshed by the ISP
  • The example backtraces above were triggered by the PPPoE WAN link being 'disconnected' whilst the physical interface was still up
  • The issue can be replicated by 'disconnecting' the established PPPoE link via the GUI button on Status / Interfaces
  • The issue is intermittent and is triggered more than 50% of the time under test

Full logs available on request.

Actions #3

Updated by Mateusz Guzik 4 months ago

All the above crashes are in IPv6 code, most likely racing against an interface and/or address removal.

Given your description, though, I have to ask whether you are explicitly using IPv6 for anything -- a temporary workaround may be to not set IPv6 addresses on any of the interfaces.

Actions #4

Updated by Kristof Provost 4 months ago

The addresses in both the ip6_output() and in6_selecthlim() panics suggest that fib6_lookup() returned an nhop_object with a struct ifnet where if_afdata[AF_INET6] is NULL.
That shouldn't happen, and it's not clear to me how it can happen.
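
To make that concrete, here is a minimal kernel-style sketch of the suspected path. fib6_lookup(), nh_ifp, ND_IFINFO() and IPV6_DEFHLIM are real FreeBSD names; the wrapper function and its fallback are invented for illustration only:

/*
 * Hypothetical sketch of the suspected path, not a verbatim excerpt.
 * The route lookup hands back a nexthop whose ifnet is still visible,
 * but whose per-AF data may already have been torn down.
 */
#include <sys/types.h>
#include <netinet/in.h>
#include <netinet/ip6.h>
#include <net/if_var.h>
#include <net/route/nhop.h>
#include <netinet6/in6_fib.h>
#include <netinet6/nd6.h>

static int
suspected_path(uint32_t fibnum, const struct in6_addr *dst, uint32_t scopeid)
{
        struct nhop_object *nh;
        struct ifnet *ifp;

        nh = fib6_lookup(fibnum, dst, scopeid, NHR_NONE, 0);
        if (nh == NULL)
                return (IPV6_DEFHLIM);
        ifp = nh->nh_ifp;
        /*
         * Nothing here guarantees ifp->if_afdata[AF_INET6] is still
         * non-NULL; if the interface is detached in parallel,
         * ND_IFINFO(ifp) below dereferences NULL -- matching the
         * in6_selecthlim()/ip6_output() panics above.
         */
        return (ND_IFINFO(ifp)->chlim);
}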

I've attempted to replicate this setup, with PPPoE carrying IPv6 traffic (iperf3 --bidir -P 4, with GUA and LL addresses) and hitting the 'Disconnect WAN' button on the Status/Interfaces page. Over 10+ attempts the system remained up, returned errors to iperf3 and re-connected without issue (when triggered via the status page).

Actions #5

Updated by Kristof Provost 4 months ago

I should add that I've been running iperf3 on the pfSense device itself. The backtraces show locally originated traffic, so that seemed like the setup most likely to trigger this issue.

Actions #6

Updated by Rob A 4 months ago

Mateusz Guzik wrote in #note-3:

All the above crashes are in IPv6 code, most likely racing against an interface and/or address removal.

Given your description, though, I have to ask whether you are explicitly using IPv6 for anything -- a temporary workaround may be to not set IPv6 addresses on any of the interfaces.

In my specific use case, yes: I have equipment that I link to which is IPv6-only.

I run an exceptionally common config for the UK market: a standard IPv4 and IPv6 WAN connection over PPPoE to the main UK backbone (Openreach). The PPPoE MTU is a little higher than in some countries, as it runs a standard 1500 MTU through the tunnel, with the physical link running a bit higher (i.e. a minimum of 1508 MTU) to absorb the typical 8-byte PPPoE overhead. In my case it is FTTP via an ONT running a 1000/115 Mbit service.

I can run additional tests or diagnostics once outside of normal business hours. Just point me where I need to go.

Actions #7

Updated by Rob A 4 months ago

This may or may not be relevant to the underlying fault, but combing through other logs I can see multiple WAN PPPoE connection attempts and failures. The ppp.log fragment below is a successful PPPoE handshake. Even here I can see some hiccups; if I substitute a different router, the PPPoE connection logs are clean every single time.

[collapsed ppp.log excerpt omitted]

☕️

Actions #8

Updated by Mateusz Guzik 4 months ago

Kristof Provost wrote in #note-4:

The addresses in both the ip6_output() and in6_selecthlim() panics suggest that fib6_lookup() returned an nhop_object with a struct ifnet where if_afdata[AF_INET6] is NULL.
That shouldn't happen, and it's not clear to me how it can happen.

This is probably a race: execution gets far enough that any sanity checks have already passed, but nothing guarantees the already-checked conditions remain stable for the duration, and with sufficiently unlucky timing they change late enough to crash.
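
The shape of the bug is the classic check-then-use race. A self-contained userspace illustration (all names here are invented; this is not the kernel code):

/* Check-then-use race in miniature; names are invented for illustration. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *afdata;            /* stand-in for ifp->if_afdata[AF_INET6] */

static void *
detach_thread(void *arg)
{
        (void)arg;
        /* The "if_detach_internal()" side: free and clear the pointer. */
        free(afdata);
        afdata = NULL;          /* no synchronization with readers */
        return (NULL);
}

int
main(void)
{
        pthread_t td;

        afdata = malloc(64);

        if (afdata != NULL) {                   /* the sanity check passes */
                /* Deterministically open the race window for the demo. */
                pthread_create(&td, NULL, detach_thread, NULL);
                pthread_join(&td, NULL);
                /* Nothing kept the checked condition stable: */
                if (afdata == NULL)
                        printf("pointer cleared after the check passed\n");
        }
        return (0);
}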

Rob A wrote in #note-6:

Mateusz Guzik wrote in #note-3:

All the above crashes are in IPv6 code, most likely racing against an interface and/or address removal.

Given your description, though, I have to ask whether you are explicitly using IPv6 for anything -- a temporary workaround may be to not set IPv6 addresses on any of the interfaces.

In my specific use case, yes: I have equipment that I link to which is IPv6-only.

Thanks, I'll look into it. This is most likely either a 1h job OR several days, no in-between.

Actions #9

Updated by Rob A 4 months ago

PPPoE reconnection WITHOUT triggering a pfSense Crash

From the 2am time slot, this looks like an ISP-triggered reconnection, and it did not cause pfSense to crash. I have added the ppp.log from this otherwise 'non-event' as it captures the multiple failed attempts to complete a full PPPoE handshake. I have never experienced anything other than a first-time connection with my other, non-Netgate/pfSense, router hardware. Perhaps these failed and repeated attempts are part of the issue?

[collapsed ppp.log excerpt omitted]

[Is there a code or button to add a spoiler type of truncation to logs such as this, as they break up the readability of the thread somewhat?]

Found it, the collapse code. I will try harder, just getting used to Redmine.

☕️

Actions #10

Updated by Mateusz Guzik 4 months ago

After poking around, here is my analysis, which confirms my preliminary suspicion:

All of the crash sites are invoking if_getafdata(ifp, AF_INET6), typically through the ND_IFINFO macro. The routine returns a pointer to protocol-specific if_data, which in the crashing cases happens to be NULL.
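
For reference, ND_IFINFO in the FreeBSD sources expands roughly like this (paraphrased from netinet6/nd6.h):

/* Paraphrased from netinet6/nd6.h: */
#define ND_IFINFO(ifp) \
        (((struct in6_ifextra *)(ifp)->if_afdata[AF_INET6])->nd_ifinfo)

/*
 * With if_afdata[AF_INET6] == NULL this reads through
 * ((struct in6_ifextra *)NULL)->nd_ifinfo, i.e. a load from a small
 * offset off address zero -- the trap 0xc page faults in the
 * backtraces above.
 */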

It stems from a race against interface destruction, which eventually does:

static int
if_detach_internal(struct ifnet *ifp, bool vmove)
{
[...]
        IF_AFDATA_LOCK(ifp);
        i = ifp->if_afdata_initialized;
        ifp->if_afdata_initialized = 0;
        IF_AFDATA_UNLOCK(ifp);
        if (i == 0)
                return (0);
        SLIST_FOREACH(dp, &domains, dom_next) {
                if (dp->dom_ifdetach && ifp->if_afdata[dp->dom_family]) {
                        (*dp->dom_ifdetach)(ifp,
                            ifp->if_afdata[dp->dom_family]); // uaf after this
                        ifp->if_afdata[dp->dom_family] = NULL; // NULL pointer deref after this
                }
        }
[...]

As IPv6 support has a dom_ifdetach handler, this results in a use-after-free or a NULL pointer dereference, depending on the timing of the parallel access.

The entire thing is rather seriously misdesigned, but as a damage-controlling measure I think one can split dom_ifdetach into two parts, where the last one only executes when the interface itself is going away. I'm going to have to flame this over with network people.
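
A rough sketch of that damage-control idea, purely hypothetical: the dom_ifdetach_quiesce/dom_ifdetach_free callbacks do not exist in the tree and are invented here for illustration.

/*
 * Hypothetical two-phase teardown in if_detach_internal(); the
 * callback names are invented for illustration only.
 */
SLIST_FOREACH(dp, &domains, dom_next) {
        if (dp->dom_ifdetach_quiesce != NULL &&
            ifp->if_afdata[dp->dom_family] != NULL) {
                /* Phase 1: stop new users, leave if_afdata in place. */
                (*dp->dom_ifdetach_quiesce)(ifp,
                    ifp->if_afdata[dp->dom_family]);
        }
}

if (!vmove) {
        /*
         * Phase 2: only when the ifnet itself is going away, and no
         * more references can exist, free the per-AF data.
         */
        SLIST_FOREACH(dp, &domains, dom_next) {
                if (dp->dom_ifdetach_free != NULL &&
                    ifp->if_afdata[dp->dom_family] != NULL) {
                        (*dp->dom_ifdetach_free)(ifp,
                            ifp->if_afdata[dp->dom_family]);
                        ifp->if_afdata[dp->dom_family] = NULL;
                }
        }
}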

Actions #11

Updated by Jim Pingle 4 months ago

  • Target version changed from 2.7.0 to CE-Next
Actions #12

Updated by Jim Pingle 3 months ago

  • Related to Bug #14575 (Renewing the PPPoE WAN causes a crash if Tailscale is enabled) added
Actions #13

Updated by Kristof Provost 2 months ago

Is Tailscale also in play here? I've been trying and failing to reproduce this again. No matter what I try, I simply cannot get the system to panic. Clearly I'm missing some factor, but right now I have no idea what that'd be.

Actions #14

Updated by Rob A about 2 months ago

In my case there is no involvement of Tailscale as I do not use it.

Regards.

☕️

Actions #15

Updated by Rob A 27 days ago

I have switched to 23.09 dev, as that is where most of the activity is focused. I will monitor and report whether this issue has carried across to 23.09 dev or not.

☕️

Actions #16

Updated by Rob A 27 days ago

Issue remains 'live' with 23.09 dev. Details of the first crash on this version, triggered this time by taking the WAN interface down and then up via the GUI:

Crash report begins.  Anonymous machine information:

amd64
14.0-ALPHA2
FreeBSD 14.0-ALPHA2 amd64 1400094 #1 plus-devel-main-n256133-bef8dca4536: Tue Sep  5 06:26:19 UTC 2023     root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/obj/amd64/fWgcJpOQ/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/s

Crash report details:

No PHP errors found.

Filename: /var/crash/info.0
Dump header from device: /dev/nda0p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 228864
  Blocksize: 512
  Compression: none
  Dumptime: 2023-09-06 18:00:47 +0100
  Hostname: Router-8.redacted.me
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 14.0-ALPHA2 amd64 1400094 #1 plus-devel-main-n256133-bef8dca4536: Tue Sep  5 06:26:19 UTC 2023
    root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/obj/amd64/fW
  Panic String: page fault
  Dump Parity: 954902300
  Bounds: 0
  Dump Status: good
db:1:pfs> bt
Tracing pid 2 tid 100041 td 0xfffffe0085272560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850c5840
vpanic() at vpanic+0x163/frame 0xfffffe00850c5970
panic() at panic+0x43/frame 0xfffffe00850c59d0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00850c5a30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850c5a90
calltrap() at calltrap+0x8/frame 0xfffffe00850c5a90
--- trap 0xc, rip = 0xffffffff80f4d9e6, rsp = 0xfffffe00850c5b60, rbp = 0xfffffe00850c5b90 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850c5b90
tcp_default_output() at tcp_default_output+0x1d97/frame 0xfffffe00850c5d70
tcp_timer_rexmt() at tcp_timer_rexmt+0x52f/frame 0xfffffe00850c5dd0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe00850c5e10
softclock_call_cc() at softclock_call_cc+0x134/frame 0xfffffe00850c5ec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850c5ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00850c5f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850c5f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Full logs available on request.

Actions #17

Updated by Rob A 15 days ago

Core dump provided to Christian McDonald for the related ndp issue.

☕️

Actions #18

Updated by Jim Pingle 14 days ago

  • Subject changed from IPv6: Sending traffic on a disabled interface cause a kernel panic. to Sending IPv6 traffic on a disabled interface can trigger a kernel panic
Actions #19

Updated by Jim Pingle 7 days ago

  • Plus Target Version changed from 23.09 to 24.03

Moving the target ahead for now, but if we do manage to solve it before release we can always move it back.

Actions #20

Updated by Rob A 7 days ago

Understood, and thanks for the heads-up that the fix may be six months away. I'll have to find a new router solution in the interim, meaning different hardware too, as I run pfSense on a Netgate 6100. 🤕

Standing by as a test bed in the interim period.

☕️
