Regression #14431
Sending IPv6 traffic on a disabled interface can trigger a kernel panic
Status: Open
Description
This issue was hidden by https://redmine.pfsense.org/issues/14164, but now that that is solved in 23.05, this one is being seen.
db:1:pfs> bt
Tracing pid 93402 tid 103857 td 0xfffffe00cf7cac80
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00cf8a0800
vpanic() at vpanic+0x183/frame 0xfffffe00cf8a0850
panic() at panic+0x43/frame 0xfffffe00cf8a08b0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00cf8a0910
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00cf8a0970
calltrap() at calltrap+0x8/frame 0xfffffe00cf8a0970
--- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00cf8a0a40, rbp = 0xfffffe00cf8a0a70 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00cf8a0a70
tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00cf8a0c60
tcp_output() at tcp_output+0x14/frame 0xfffffe00cf8a0c80
tcp6_usr_connect() at tcp6_usr_connect+0x2f4/frame 0xfffffe00cf8a0d10
soconnectat() at soconnectat+0x9e/frame 0xfffffe00cf8a0d60
kern_connectat() at kern_connectat+0xc9/frame 0xfffffe00cf8a0dc0
sys_connect() at sys_connect+0x75/frame 0xfffffe00cf8a0e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00cf8a0f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00cf8a0f30
--- syscall (98, FreeBSD ELF64, connect), rip = 0x800fddc8a, rsp = 0x7fffdf5f8c98, rbp = 0x7fffdf5f8cd0 ---

db:1:pfs> bt
Tracing pid 68614 tid 100330 td 0xfffffe00cf325720
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00c7d955f0
vpanic() at vpanic+0x183/frame 0xfffffe00c7d95640
panic() at panic+0x43/frame 0xfffffe00c7d956a0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00c7d95700
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00c7d95760
calltrap() at calltrap+0x8/frame 0xfffffe00c7d95760
--- trap 0xc, rip = 0xffffffff80f63aa4, rsp = 0xfffffe00c7d95830, rbp = 0xfffffe00c7d95a50 ---
ip6_output() at ip6_output+0xb74/frame 0xfffffe00c7d95a50
udp6_send() at udp6_send+0x78e/frame 0xfffffe00c7d95c10
sosend_dgram() at sosend_dgram+0x357/frame 0xfffffe00c7d95c70
sousrsend() at sousrsend+0x5f/frame 0xfffffe00c7d95cd0
kern_sendit() at kern_sendit+0x132/frame 0xfffffe00c7d95d60
sendit() at sendit+0xb7/frame 0xfffffe00c7d95db0
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe00c7d95e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00c7d95f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c7d95f30
--- syscall (133, FreeBSD ELF64, sendto), rip = 0x823f95f2a, rsp = 0x8202cea88, rbp = 0x8202cead0 ---

db:1:pfs> bt
Tracing pid 2 tid 100041 td 0xfffffe0085264560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850ad910
vpanic() at vpanic+0x183/frame 0xfffffe00850ad960
panic() at panic+0x43/frame 0xfffffe00850ad9c0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00850ada20
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850ada80
calltrap() at calltrap+0x8/frame 0xfffffe00850ada80
--- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00850adb50, rbp = 0xfffffe00850adb80 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850adb80
tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00850add70
tcp_timer_rexmt() at tcp_timer_rexmt+0x514/frame 0xfffffe00850addd0
tcp_timer_enter() at tcp_timer_enter+0x102/frame 0xfffffe00850ade10
softclock_call_cc() at softclock_call_cc+0x13c/frame 0xfffffe00850adec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850adef0
fork_exit() at fork_exit+0x7d/frame 0xfffffe00850adf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850adf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs>
Updated by Rob A 4 months ago
To add additional context that may aid in diagnostics:
- The issue presents with any change in WAN interface status, virtual or physical (e.g. an unplug event)
- This can include a PPPoE tunnel to an ISP being bounced or refreshed by the ISP
- The example back-traces above were triggered by the PPPoE WAN link being 'disconnected' whilst the physical interface is still up
- The issue can be replicated by 'disconnecting' the established PPPoE link via the GUI button on Status / Interfaces
- The issue is intermittent and is triggered >50% of the time when under test
Full logs available on request.
Updated by Mateusz Guzik 4 months ago
All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.
Given your description though I have to ask if you are explicitly using ipv6 for anything -- a temporary workaround may be to not set ipv6 addresses on any of the interfaces.
Updated by Kristof Provost 4 months ago
The addresses in both the ip6_output() and in6_selecthlim() panics suggest that fib6_lookup() returned an nhop_object with a struct ifnet where if_afdata[AF_INET6] is NULL.
That shouldn't happen, and it's not clear to me how it can happen.
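For illustration only, here is a heavily simplified standalone model of that dereference chain (names such as my_ifnet, nd_ifinfo and select_hlim are invented stand-ins, not the real kernel types); it mirrors the if_getafdata(ifp, AF_INET6)/ND_IFINFO access described in the analysis further down:

/*
 * Hypothetical, heavily simplified model of the suspected failure:
 * a lookup hands back an interface whose per-AF slot has already
 * been cleared, and the hop-limit selection dereferences it without
 * re-checking.  Names are invented; this is not kernel code.
 */
#include <stdio.h>

#define AF_INET6_SLOT 1
#define AF_SLOTS      4

struct my_nd_ifinfo {                    /* stand-in for struct nd_ifinfo */
	int chlim;                       /* per-interface hop limit       */
};

struct my_ifnet {                        /* stand-in for struct ifnet     */
	void *if_afdata[AF_SLOTS];
};

/* stand-in for ND_IFINFO(ifp)/if_getafdata(): no NULL check */
static struct my_nd_ifinfo *
nd_ifinfo(struct my_ifnet *ifp)
{
	return (ifp->if_afdata[AF_INET6_SLOT]);
}

/* stand-in for an in6_selecthlim()-style consumer */
static int
select_hlim(struct my_ifnet *ifp)
{
	return (nd_ifinfo(ifp)->chlim);  /* faults if the slot is NULL */
}

int
main(void)
{
	struct my_ifnet ifp = { .if_afdata = { 0 } };
	struct my_nd_ifinfo nd = { .chlim = 64 };

	ifp.if_afdata[AF_INET6_SLOT] = &nd;
	printf("hlim while attached: %d\n", select_hlim(&ifp));

	/* what detach effectively does to the AF_INET6 slot */
	ifp.if_afdata[AF_INET6_SLOT] = NULL;
	printf("hlim after detach:   %d\n", select_hlim(&ifp)); /* page fault */
	return (0);
}

In the real path nothing re-validates the slot between the route lookup and the dereference, which would be consistent with the NULL if_afdata[AF_INET6] hypothesis above.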
I've attempted to replicate this setup, with PPPoE carrying IPv6 traffic (iperf3 --bidir -P 4, with GUA and LL addresses) and hitting the 'Disconnect WAN' button on the Status/Interfaces page. Over 10+ attempts the system remained up, returned errors to iperf3 and re-connected without issue (when triggered via the status page).
Updated by Kristof Provost 4 months ago
I should add that I've been running iperf3 on the pfsense device. The backtraces show locally originated traffic, so that seemed like the setup most likely to trigger this issue.
Updated by Rob A 4 months ago
Mateusz Guzik wrote in #note-3:
All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.
Given your description though I have to ask if you are explicitly using ipv6 for anything -- a temporary workaround may be to not set ipv6 addresses on any of the interfaces.
In my specific use-case yes, I have equipment that I link to which is pure IPv6.
I run an exceptionally common config for the UK market. It is a standard IPv4 and IPv6 WAN connection over PPPoE to the main UK backbone (Openreach). The PPPoE MTU is a little higher than in some countries as it runs a standard 1500 MTU through the tunnel, with the physical link running a bit higher (i.e. a minimum of 1508 MTU) to absorb the typical 8-byte PPPoE overhead. In my case it is FTTP via an ONT running a 1000/115 Mbit service.
I can run additional tests or diagnostics once outside of normal business hours. Just point me where I need to go.
Updated by Rob A 4 months ago
This may or may not be relevant to the underlying fault, but combing through other logs I can see multiple WAN PPPoE connection attempts and failures. The ppp.log fragment below is a successful PPPoE handshake. Even here I can see some hiccups; if I substitute a different router the PPPoE connection logs are clean, each and every time.
☕️
Updated by Mateusz Guzik 4 months ago
Kristof Provost wrote in #note-4:
The addresses in both the ip6_output() and in6_selecthlim() panics suggest that fib6_lookup() returned an nhop_object with a struct ifnet where if_afdata[AF_INET6] is NULL.
That shouldn't happen, and it's not clear to me how it can happen.
This is probably a race where execution gets far enough that any sanity checks have already passed, but nothing guarantees the checked conditions remain stable for the duration, so with unlucky enough timing they change late enough to cause the crash.
Rob A wrote in #note-6:
Mateusz Guzik wrote in #note-3:
All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.
Given your description though I have to ask if you are explicitly using ipv6 for anything -- a temporary workaround may be to not set ipv6 addresses on any of the interfaces.
In my specific use-case yes, I have equipment that I link to which is pure IPv6.
Thanks, I'll look into it. This is most likely either a 1h job OR several days, no in-between.
Updated by Rob A 4 months ago
PPPoE reconnection WITHOUT triggering a pfSense Crash
From the 2am time slot this looks like an ISP-triggered reconnection and it did not cause pfSense to crash. I have added the ppp.log from this otherwise 'non-event' as it captures the multiple failed attempts to complete a full PPPoE handshake. I have never experienced anything other than a first-time connection with my other non-Netgate/pfSense router hardware. Perhaps these failed and repeated attempts are part of the issue?
[Is there a code or button to add a spoiler-type truncation to logs such as this, as they break up the readability of the thread somewhat?]
Found it, the collapse code. I will try harder, just getting used to Redmine.
☕️
Updated by Mateusz Guzik 4 months ago
After poking around here is my analysis, which confirms my preliminary suspicion:
All of the crash sites are invoking if_getafdata(ifp, AF_INET6), typically through the ND_IFINFO macro. The routine returns a pointer to protocol-specific if_data, which when crashing happens to be NULL.
It stems from a race against interface destruction, which eventually does:
static int
if_detach_internal(struct ifnet *ifp, bool vmove)
{
	[...]
	IF_AFDATA_LOCK(ifp);
	i = ifp->if_afdata_initialized;
	ifp->if_afdata_initialized = 0;
	IF_AFDATA_UNLOCK(ifp);
	if (i == 0)
		return (0);
	SLIST_FOREACH(dp, &domains, dom_next) {
		if (dp->dom_ifdetach && ifp->if_afdata[dp->dom_family]) {
			(*dp->dom_ifdetach)(ifp,
			    ifp->if_afdata[dp->dom_family]); // uaf after this
			ifp->if_afdata[dp->dom_family] = NULL; // NULL pointer deref after this
		}
	}
	[...]
As ipv6 support has a dom_ifdetach, this results in a use-after-free or a NULL pointer dereference, depending on timing of parallel access.
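As a rough, purely illustrative model of that window (userland pthreads, invented names; not kernel code), the pattern is check-then-use against a slot that a concurrent teardown frees and then clears:

/*
 * Standalone model of the race window: the sender checks the per-AF
 * slot, then uses it a moment later, while a concurrent "detach"
 * frees and then clears the slot.  Depending on timing this is a
 * use-after-free or a NULL pointer dereference.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct afdata { int hoplimit; };

static struct afdata *volatile slot;     /* models ifp->if_afdata[AF_INET6] */

static void *
sender(void *arg)
{
	(void)arg;
	for (;;) {
		if (slot == NULL)        /* "sanity check"                 */
			continue;
		usleep(1);               /* window in which detach can run */
		printf("hoplimit %d\n", slot->hoplimit); /* UAF or NULL deref */
	}
	return (NULL);
}

static void *
detacher(void *arg)
{
	(void)arg;
	for (;;) {
		struct afdata *p = malloc(sizeof(*p));
		p->hoplimit = 64;
		slot = p;
		usleep(10);
		/* dom_ifdetach-style teardown: free, then clear */
		free(p);                 /* use-after-free window opens here */
		slot = NULL;             /* NULL-deref window opens here     */
		usleep(10);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, sender, NULL);
	pthread_create(&b, NULL, detacher, NULL);
	pthread_join(a, NULL);           /* typically faults long before this */
	return (0);
}

The model only illustrates the interleaving; in the kernel the equivalent consumer is the locally originated IPv6 send path shown in the backtraces above.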
The entire thing is rather seriously misdesigned, but as a damage-controlling measure I think one can split dom_ifdetach into two parts, where the last one only executes when the interface itself is going away. I'm going to have to flame this over with network people.
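To make the proposal slightly more concrete, here is a purely hypothetical standalone sketch of such a split (all names invented; this is not a patch): the quiesce half runs on detach, while the free half is deferred until the interface itself is going away and the last reference has been dropped.

/*
 * Hypothetical sketch of a two-phase teardown: phase 1 ("quiesce")
 * stops new work but keeps the per-AF data valid for senders already
 * in flight; phase 2 ("free") only runs once the interface object is
 * going away and the last reference is gone.  Names are invented.
 */
#include <stdio.h>
#include <stdlib.h>

struct afdata { int hoplimit; };

struct model_ifnet {
	struct afdata *af6;      /* models if_afdata[AF_INET6] */
	int            quiesced;
	int            refs;     /* owner + in-flight senders  */
};

/* phase 1: run on interface detach/disable */
static void
af6_ifdetach_quiesce(struct model_ifnet *ifp)
{
	ifp->quiesced = 1;       /* reject new senders from here on...    */
	/* ...but ifp->af6 stays valid for senders already running        */
}

/* phase 2: run only when the last reference goes away */
static void
af6_ifdetach_free(struct model_ifnet *ifp)
{
	free(ifp->af6);
	ifp->af6 = NULL;
}

static void
ifp_release(struct model_ifnet *ifp)
{
	if (--ifp->refs == 0 && ifp->quiesced)
		af6_ifdetach_free(ifp);
}

int
main(void)
{
	struct model_ifnet ifp = { .refs = 1 };     /* owner reference */

	ifp.af6 = malloc(sizeof(*ifp.af6));
	ifp.af6->hoplimit = 64;

	ifp.refs++;                       /* a sender grabs the interface */
	af6_ifdetach_quiesce(&ifp);       /* detach runs concurrently     */
	printf("hoplimit %d\n", ifp.af6->hoplimit); /* still valid        */
	ifp_release(&ifp);                /* sender done                  */
	ifp_release(&ifp);                /* final teardown frees af6     */
	return (0);
}

The point of the split is that clearing if_afdata[] stops being something in-flight senders can observe; whether the real fix takes this shape is up to the discussion with the network people mentioned above.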
Updated by Jim Pingle 4 months ago
- Target version changed from 2.7.0 to CE-Next
Updated by Jim Pingle 3 months ago
- Related to Bug #14575: Renewing the pppoe WAN cause crash if the Tailscale enabled added
Updated by Kristof Provost 2 months ago
Is Tailscale also in play here? I've been trying and failing to reproduce this again. No matter what I try, I simply cannot get the system to panic. Clearly I'm missing some factor, but right now I have no idea what that'd be.
Updated by Rob A about 2 months ago
In my case there is no involvement of Tailscale as I do not use it.
Regards.
☕️
Updated by Rob A 27 days ago
Issue remains 'live' with 23.09 dev. Details of the first crash on this version, triggered this time by taking the WAN interface down and then up via the GUI:
Crash report begins. Anonymous machine information:
amd64
14.0-ALPHA2
FreeBSD 14.0-ALPHA2 amd64 1400094 #1 plus-devel-main-n256133-bef8dca4536: Tue Sep 5 06:26:19 UTC 2023 root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/obj/amd64/fWgcJpOQ/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/s
Crash report details:
No PHP errors found.
Filename: /var/crash/info.0
Dump header from device: /dev/nda0p3
Architecture: amd64
Architecture Version: 4
Dump Length: 228864
Blocksize: 512
Compression: none
Dumptime: 2023-09-06 18:00:47 +0100
Hostname: Router-8.redacted.me
Magic: FreeBSD Text Dump
Version String: FreeBSD 14.0-ALPHA2 amd64 1400094 #1 plus-devel-main-n256133-bef8dca4536: Tue Sep 5 06:26:19 UTC 2023
root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/obj/amd64/fW
Panic String: page fault
Dump Parity: 954902300
Bounds: 0
Dump Status: good
db:1:pfs> bt
Tracing pid 2 tid 100041 td 0xfffffe0085272560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850c5840
vpanic() at vpanic+0x163/frame 0xfffffe00850c5970
panic() at panic+0x43/frame 0xfffffe00850c59d0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00850c5a30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850c5a90
calltrap() at calltrap+0x8/frame 0xfffffe00850c5a90
--- trap 0xc, rip = 0xffffffff80f4d9e6, rsp = 0xfffffe00850c5b60, rbp = 0xfffffe00850c5b90 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850c5b90
tcp_default_output() at tcp_default_output+0x1d97/frame 0xfffffe00850c5d70
tcp_timer_rexmt() at tcp_timer_rexmt+0x52f/frame 0xfffffe00850c5dd0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe00850c5e10
softclock_call_cc() at softclock_call_cc+0x134/frame 0xfffffe00850c5ec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850c5ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00850c5f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850c5f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Full logs available on request.
Updated by Jim Pingle 14 days ago
- Subject changed from IPv6: Sending traffic on a disabled interface cause a kernel panic. to Sending IPv6 traffic on a disabled interface can trigger a kernel panic
Updated by Jim Pingle 7 days ago
- Plus Target Version changed from 23.09 to 24.03
Moving the target ahead for now but if we do manage to solve it before release we can always move it back.