Regression #14431
closedSending IPv6 traffic on a disabled interface can trigger a kernel panic
100%
Description
This issue was hidden by https://redmine.pfsense.org/issues/14164 but now that is solved in 23.05 is being seen.
db:1:pfs> bt Tracing pid 93402 tid 103857 td 0xfffffe00cf7cac80 kdb_enter() at kdb_enter+0x32/frame 0xfffffe00cf8a0800 vpanic() at vpanic+0x183/frame 0xfffffe00cf8a0850 panic() at panic+0x43/frame 0xfffffe00cf8a08b0 trap_fatal() at trap_fatal+0x409/frame 0xfffffe00cf8a0910 trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00cf8a0970 calltrap() at calltrap+0x8/frame 0xfffffe00cf8a0970 --- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00cf8a0a40, rbp = 0xfffffe00cf8a0a70 --- in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00cf8a0a70 tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00cf8a0c60 tcp_output() at tcp_output+0x14/frame 0xfffffe00cf8a0c80 tcp6_usr_connect() at tcp6_usr_connect+0x2f4/frame 0xfffffe00cf8a0d10 soconnectat() at soconnectat+0x9e/frame 0xfffffe00cf8a0d60 kern_connectat() at kern_connectat+0xc9/frame 0xfffffe00cf8a0dc0 sys_connect() at sys_connect+0x75/frame 0xfffffe00cf8a0e00 amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00cf8a0f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00cf8a0f30 --- syscall (98, FreeBSD ELF64, connect), rip = 0x800fddc8a, rsp = 0x7fffdf5f8c98, rbp = 0x7fffdf5f8cd0 ---
db:1:pfs> bt Tracing pid 68614 tid 100330 td 0xfffffe00cf325720 kdb_enter() at kdb_enter+0x32/frame 0xfffffe00c7d955f0 vpanic() at vpanic+0x183/frame 0xfffffe00c7d95640 panic() at panic+0x43/frame 0xfffffe00c7d956a0 trap_fatal() at trap_fatal+0x409/frame 0xfffffe00c7d95700 trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00c7d95760 calltrap() at calltrap+0x8/frame 0xfffffe00c7d95760 --- trap 0xc, rip = 0xffffffff80f63aa4, rsp = 0xfffffe00c7d95830, rbp = 0xfffffe00c7d95a50 --- ip6_output() at ip6_output+0xb74/frame 0xfffffe00c7d95a50 udp6_send() at udp6_send+0x78e/frame 0xfffffe00c7d95c10 sosend_dgram() at sosend_dgram+0x357/frame 0xfffffe00c7d95c70 sousrsend() at sousrsend+0x5f/frame 0xfffffe00c7d95cd0 kern_sendit() at kern_sendit+0x132/frame 0xfffffe00c7d95d60 sendit() at sendit+0xb7/frame 0xfffffe00c7d95db0 sys_sendto() at sys_sendto+0x4d/frame 0xfffffe00c7d95e00 amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00c7d95f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c7d95f30 --- syscall (133, FreeBSD ELF64, sendto), rip = 0x823f95f2a, rsp = 0x8202cea88, rbp = 0x8202cead0 ---
db:1:pfs> bt Tracing pid 2 tid 100041 td 0xfffffe0085264560 kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850ad910 vpanic() at vpanic+0x183/frame 0xfffffe00850ad960 panic() at panic+0x43/frame 0xfffffe00850ad9c0 trap_fatal() at trap_fatal+0x409/frame 0xfffffe00850ada20 trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850ada80 calltrap() at calltrap+0x8/frame 0xfffffe00850ada80 --- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00850adb50, rbp = 0xfffffe00850adb80 --- in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850adb80 tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00850add70 tcp_timer_rexmt() at tcp_timer_rexmt+0x514/frame 0xfffffe00850addd0 tcp_timer_enter() at tcp_timer_enter+0x102/frame 0xfffffe00850ade10 softclock_call_cc() at softclock_call_cc+0x13c/frame 0xfffffe00850adec0 softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850adef0 fork_exit() at fork_exit+0x7d/frame 0xfffffe00850adf30 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850adf30 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- db:1:pfs>
Related issues
Updated by Rob A over 1 year ago
To add additional context that may aid in diagnostics:
- The issue presents with any change in WAN interface status - virtual or physical (eg unplug event)
- This can include a PPPoE tunnel to an ISP being bounced or refreshed by the ISP
- The example back-traces above were triggered by the PPPoE WAN link being 'disconnected' whilst the physical interface is still up
- The issue can be replicated by 'disconnecting' the established PPPoE link via the GUI button on Status / Interfaces
- The issue is intermittent and is triggered >50% of the time when under test
Full logs available, on request.
Updated by Mateusz Guzik over 1 year ago
All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.
Given your description though I have to ask if you are explicitly using ipv6 for anything -- it may be a temporary workaround is to not set ipv6 addresses on any of the interfaces.
Updated by Kristof Provost over 1 year ago
The addresses in both the ip6_output() and in6_selecthlim() panics suggest that fib6_lookup() returned an nhop_object with a struct ifnet where if_afdata[AF_INET6] is NULL.
That shouldn't happen, and it's not clear to me how it can happen.
I've attempted to replicate this setup, with PPPoE carrying IPv6 traffic (iperf3 --bidir -P 4, with GUA and LL addresses) and hitting the 'Disconnect WAN' button on the Status/Interfaces page. Over 10+ attempts the system remained up, returned errors to iperf3 and re-connected without issue (when triggered via the status page).
Updated by Kristof Provost over 1 year ago
I should add that I've been running iperf3 on the pfsense device. The backtraces show locally originated traffic, so that seemed like the setup most likely to trigger this issue.
Updated by Rob A over 1 year ago
Mateusz Guzik wrote in #note-3:
All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.
Given your description though I have to ask if you are explicitly using ipv6 for anything -- it may be a temporary workaround is to not set ipv6 addresses on any of the interfaces.
In my specific use-case yes, I have equipment that I link to which are pure IPv6.
I run an exceptionally common config for the UK market. It is a standard IPv4 and IPv6 WAN connection over PPPoE to the main UK backbone (Openreach). The PPPoE MTU is a little higher than in some counties as it runs a standard 1500 MTU through the tunnel, with the physical link running a bit higher (ie a minimum of 1508 MTU) to absorb the typical PPPoE 8-byte overhead. In my case it is FTTP via an ONT running a 1000/115 Mbit service.
I can run additional tests or diagnostics once outside of normal business hours. Just point me where I need to go.
Updated by Rob A over 1 year ago
This may or may not be irrelevant to the underlying fault but combing through other logs I can multiple WAN PPPoE connection attempts and failures. The ppp.log frag below is a successful PPPoE handshake. Even here I can see some hiccups; if I substitute a different router the PPPoE connection logs are clean, each and every time.
☕️
Updated by Mateusz Guzik over 1 year ago
Kristof Provost wrote in #note-4:
The addresses in both the ip6_output() and in6_selecthlim() panics suggest that fib6_lookup() returned an nhop_object with a struct ifnet where if_afdata[AF_INET6] is NULL.
That shouldn't happen, and it's not clear to me how it can happen.
This is probably a race where you get far enough that any sanity checks already passed, but nothing guaranteed the already checked conditions remain stable for the duration and if you are unlucky enough they did change late enough.
Rob A wrote in #note-6:
Mateusz Guzik wrote in #note-3:
All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.
Given your description though I have to ask if you are explicitly using ipv6 for anything -- it may be a temporary workaround is to not set ipv6 addresses on any of the interfaces.
In my specific use-case yes, I have equipment that I link to which are pure IPv6.
Thanks, I'll look into it. This is most likely either a 1h job OR several days, no in-between.
Updated by Rob A over 1 year ago
PPPoE reconnection WITHOUT triggering a pfSense Crash
From the 2am time slot this looks like an ISP-triggered reconnection and it did not cause pfSense to crash. I have added the ppp.log from this otherwise 'non-event' as it captures the multiple failed attempts to complete a full PPPoE handshake. I have never experienced anything other than a first-time connection with my other non-Netgate/pfSense router hardware. Perhaps these failed and repeated attempts are part of the issue?
[Is there a code or button to add a spoiler type of truncation to logs such as this, as they break-up the readability of the thread somewhat?]
Found it, the collapse code. I will try harder, just getting used to Redmine.
☕️
Updated by Mateusz Guzik over 1 year ago
After poking around here is my analysis, which confirms my preliminary suspicion:
All of the crash sites are invoking if_getafdata(ifp, AF_INET6)
, typically through the ND_IFINFO
macro. The routine returns a pointer to protocol-specific if_data, which when crashing happens to be NULL.
It stems from a race against interface destruction, which eventually does:
static int
if_detach_internal(struct ifnet *ifp, bool vmove)
{
[...]
IF_AFDATA_LOCK(ifp);
i = ifp->if_afdata_initialized;
ifp->if_afdata_initialized = 0;
IF_AFDATA_UNLOCK(ifp);
if (i == 0)
return (0);
SLIST_FOREACH(dp, &domains, dom_next) {
if (dp->dom_ifdetach && ifp->if_afdata[dp->dom_family]) {
(*dp->dom_ifdetach)(ifp,
ifp->if_afdata[dp->dom_family]); // uaf after this
ifp->if_afdata[dp->dom_family] = NULL; // NULL pointer deref after this
}
}
[...]
As ipv6 support has a dom_ifdetach, this results in a use-after-free or a NULL pointer dereference, depending on timing of parallel access.
The entire thing is rather seriously misdesigned, but as a damage-controlling measure I think one can split dom_ifdetach into two parts, where the last one only executes when the interface itself is going away. I'm going to have to flame this over with network people.
Updated by Jim Pingle over 1 year ago
- Target version changed from 2.7.0 to CE-Next
Updated by Jim Pingle over 1 year ago
- Related to Bug #14575: Renewing the pppoe WAN cause crash if the Tailscale enabled added
Updated by Kristof Provost over 1 year ago
Is Tailscale also in play here? I've trying and failing to reproduce this again. No matter what I try to do, I simply cannot get the system to panic. Clearly I'm missing some factor, but right now I have no idea what that'd be.
Updated by Rob A over 1 year ago
In my case there is no involvement of Tailscale as I do not use it.
Regards.
☕️
Updated by Rob A over 1 year ago
I have switched to 23.09 dev as that is where most of the activity is focused. I will monitor and update if this issue has moved across to 23.09 dev or not.
☕️
Updated by Rob A over 1 year ago
Issue remains 'live' with 23.09 dev. Details of the first crash on this version, triggered this time by taking the WAN interface down and then up via the GUI:
Crash report begins. Anonymous machine information:
amd64
14.0-ALPHA2
FreeBSD 14.0-ALPHA2 amd64 1400094 #1 plus-devel-main-n256133-bef8dca4536: Tue Sep 5 06:26:19 UTC 2023 root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/obj/amd64/fWgcJpOQ/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/s
Crash report details:
No PHP errors found.
Filename: /var/crash/info.0
Dump header from device: /dev/nda0p3
Architecture: amd64
Architecture Version: 4
Dump Length: 228864
Blocksize: 512
Compression: none
Dumptime: 2023-09-06 18:00:47 +0100
Hostname: Router-8.redacted.me
Magic: FreeBSD Text Dump
Version String: FreeBSD 14.0-ALPHA2 amd64 1400094 #1 plus-devel-main-n256133-bef8dca4536: Tue Sep 5 06:26:19 UTC 2023
root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/obj/amd64/fW
Panic String: page fault
Dump Parity: 954902300
Bounds: 0
Dump Status: good
db:1:pfs> bt
Tracing pid 2 tid 100041 td 0xfffffe0085272560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850c5840
vpanic() at vpanic+0x163/frame 0xfffffe00850c5970
panic() at panic+0x43/frame 0xfffffe00850c59d0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00850c5a30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850c5a90
calltrap() at calltrap+0x8/frame 0xfffffe00850c5a90
--- trap 0xc, rip = 0xffffffff80f4d9e6, rsp = 0xfffffe00850c5b60, rbp = 0xfffffe00850c5b90 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850c5b90
tcp_default_output() at tcp_default_output+0x1d97/frame 0xfffffe00850c5d70
tcp_timer_rexmt() at tcp_timer_rexmt+0x52f/frame 0xfffffe00850c5dd0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe00850c5e10
softclock_call_cc() at softclock_call_cc+0x134/frame 0xfffffe00850c5ec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850c5ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00850c5f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850c5f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Full logs available on request.
Updated by Rob A over 1 year ago
Core dump provided to Christian McDonald for the related ndp issue.
☕️
Updated by Jim Pingle over 1 year ago
- Subject changed from IPv6: Sending traffic on a disabled interface cause a kernel panic. to Sending IPv6 traffic on a disabled interface can trigger a kernel panic
Updated by Jim Pingle about 1 year ago
- Plus Target Version changed from 23.09 to 24.03
Moving the target ahead for now but if we do manage to solve it before release we can always move it back.
Updated by Rob A about 1 year ago
Understood and thanks for the heads-up that the fix may be 6 months away. I'll have to find a new router solution in the interim, meaning different hardware too as I run pfSense on a Netgate 6100. 🤕
Standing-by as a test-bed in the interim period.
☕️
Updated by Kristof Provost about 1 year ago
Unfortunately both Steve and I have been unable to reproduce this problem.
We could try to see if a full core dumps helps, but I suspect this to be a race condition and even the core file might not help.
Still, if you're game there are instructions on how to get a full core dump here: https://forum.netgate.com/post/1127502
We also need that to be done with a kernel we have debug symbols for. Can you install the latest snapshot kernel-debug (which includes symbols files)? Or I can provide a kernel I have symbols for, if that's easier.
Updated by Rob A about 1 year ago
Thanks Kristof, as it happens I had a crash today:
db:1:pfs> bt
Tracing pid 2 tid 100043 td 0xfffffe0085271720
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850bb840
vpanic() at vpanic+0x163/frame 0xfffffe00850bb970
panic() at panic+0x43/frame 0xfffffe00850bb9d0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00850bba30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850bba90
calltrap() at calltrap+0x8/frame 0xfffffe00850bba90
--- trap 0xc, rip = 0xffffffff80f4e116, rsp = 0xfffffe00850bbb60, rbp = 0xfffffe00850bbb90 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850bbb90
tcp_default_output() at tcp_default_output+0x1d97/frame 0xfffffe00850bbd70
tcp_timer_rexmt() at tcp_timer_rexmt+0x52f/frame 0xfffffe00850bbdd0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe00850bbe10
softclock_call_cc() at softclock_call_cc+0x134/frame 0xfffffe00850bbec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850bbef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00850bbf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850bbf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs> show registers
cs 0x20
ds 0x3b
es 0x3b
fs 0x13
gs 0x1b
ss 0x28
rax 0x12
rcx 0xffffffff81457767
rdx 0x3f8
rbx 0x100
rsp 0xfffffe00850bb840
rbp 0xfffffe00850bb840
rsi 0xc75d684a
rdi 0x4
r8 0xb3c75d684a
r9 0xfffffe0085271720
r10 0xfffffe00850bb720
r11 0xcedfc2df9afff59c
r12 0
r13 0
r14 0xffffffff813db3a9
r15 0xfffffe0085271720
rip 0xffffffff80d388c2 kdb_enter+0x32
rflags 0x86
kdb_enter+0x32: movq $0,0x2344f43(%rip)
db:1:pfs>......
It does have the feel of a race-condition and with Steve stuck on G.fast he is probably sheltered from it.
Happy to provide a kernel dump, just in case it helps. I didn't quite understand the instructions as part of that thread though. I have managed one previously where Christian provided everything I needed - presumably the 'kernel + symbols' you mentioned. I seemed to follow those instructions ok and emailed the results to him (ndp issue) https://redmine.pfsense.org/issues/14755#change-69465. So if it can be simplified I can run tests over this weekend. I'm running the latest 23.09d but I have a snapshot back to 23.05.1 if required.
Updated by Kristof Provost about 1 year ago
Can you install and run this kernel and try to get a core dump?
https://www.codepro.be/files/pfSense-kernel-debug-pfSense-23.09.a.20231005.1555.pkg
Updated by Rob A about 1 year ago
No problem. Best method to install this in a recoverable way?
Updated by Kristof Provost about 1 year ago
You can just pkg install and pkg remove it later.
As usual, make a config backup just in case, but this ought to be pretty safe. Make a boot environment to be extra super-duper safe and for very convenient revert later.
Updated by Rob A about 1 year ago
No joy. With pkg install I get the error:
pkg: No packages available to install matching 'pfSense-23.09.a.20231005.1555.pkg' have been found in the repositories
Tried with the full url, just the package name and from a local copy of the file only to get a variation of the error above in the CLI.
Using the GUI I loaded it via System Patches... not sure if that is a genuine method or not.
Updated by Rob A about 1 year ago
I ran the system (In whatever state I achieved above) but I was fighting other issues such as Kea and pfBlocker not running. Kea could usually be relaunched but the system was a bit buggy. I rolled back to my snapshot to await whatever I did wrong with the pkg.
Updated by Kristof Provost about 1 year ago
Download the file to your device, and install with `pkg install -U <filename>`, via the device CLI.
Updated by Rob A about 1 year ago
Ok, your job would be easy if it wasn't for these dull customers!
[23.09-DEVELOPMENT][admin@Router-8.redacted.me]/root: pkg install -U /tmp/pfSense-kernel-debug-pfSense-23.09.a.20231005.1555.pkg
Checking integrity... done (0 conflicting)
The following 1 package(s) will be affected (of 0 checked):
New packages to be INSTALLED:
pfSense-kernel-debug-pfSense: 23.09.a.20231005.1555 [unknown-repository]
Number of packages to be installed: 1
The process will require 305 MiB more space.
Proceed with this action? [y/N]: y
[1/1] Installing pfSense-kernel-debug-pfSense-23.09.a.20231005.1555...
Extracting pfSense-kernel-debug-pfSense-23.09.a.20231005.1555: 100%
[23.09-DEVELOPMENT][admin@Router-8.redacted.me]/root:
☕️
Updated by Rob A about 1 year ago
With the debug kernel running I triggered a crash and have the regular crash report. I did not see a core dump file in root though and could not find it. Where would the core dump be located?
Updated by Rob A about 1 year ago
Regrettably no:
/root: ls /var/crash
bounds info.0 info.last textdump.tar.0 textdump.tar.last
Updated by Kristof Provost about 1 year ago
Did you edit the pfSense-ddb.conf file and add a swap partition for it to dump to?
Christian is working on a shiny webui button to make this all easier, but sadly that’s not ready yet.
Updated by Rob A about 1 year ago
Not sure if I have done so previously. Currently the file looks like this:
# $FreeBSD$
#
# This file is read when going to multi-user and its contents piped thru
# ``ddb'' to define debugging scripts.
#
# see ``man 4 ddb'' and ``man 8 ddb'' for details.
#
script lockinfo=show locks; show alllocks; show lockedvnods
script pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace
# kdb.enter.panic panic(9) was called.
script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset
# kdb.enter.witness witness(4) detected a locking error.
script kdb.enter.witness=run lockinfo
Updated by Marcos M about 1 year ago
To clarify, make sure that after installing the kernel-debug package, you reboot and select the debug kernel (option 6) in the boot loader. Then follow the steps
on the link provided previously:
Short version: add a device for a swap partition, ideally at least as large as system RAM. A USB stick should work. (Note you'll lose all data on the stick!) If the USB (or other) swap device is da0 do: gpart destroy -F da0 gpart create -s gpt da0 gpart add -t freebsd-swap da0 Add /dev/da0p1 none swap sw 0 0 to /etc/fstab. Edit /etc/pfSense-ddb.conf and change the script kdb.enter.default to script kdb.enter.default=bt ; show registers ; dump ; reset. Reboot.
Updated by Rob A about 1 year ago
I'm finding it hard to distinguish 'the steps' in that thread from the normal noise and I don't know how to enter the boot loader for option 6 either.
Perhaps I am in too deep for my level as I am struggling with the 'assumed knowledge' that I don't seem to have when following this issue thread. It felt easier when I was asked for a ndp.core dump last time around.
Updated by Rob A about 1 year ago
Hopefully I have captured the panic:
<118>Netgate pfSense Plus 23.09-BETA amd64 20231020-0600
<118>Bootup complete
<6>ng0: changing name to 'pppoe0'
pf_test6: kif == NULL, if_xname pppoe0
<6>ng0: changing name to 'pppoe0'
Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 18
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80f4e116
stack pointer = 0x0:0xfffffe00850b6b60
frame pointer = 0x0:0xfffffe00850b6b90
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2 (clock (3))
rdi: fffff80203712800 rsi: 000000000000001c rdx: fffff8013760d878
rcx: fffff8013760d878 r8: 00000000ffffffbd r9: 0000000000000018
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00850b6b90
r10: fffff802033dd8c0 r11: fffff8016f5e5000 r12: 0000000000010300
r13: fffff80203676b98 r14: fffffe00850b6b68 r15: 0000000000000018
trap number = 12
panic: page fault
cpuid = 3
time = 1697905286
KDB: enter: panic
vmcore dump available on request ~ 800 MB.
Updated by Rob A about 1 year ago
Dump header from device: /dev/nda0p3
Architecture: amd64
Architecture Version: 2
Dump Length: 791597056
Blocksize: 512
Compression: none
Dumptime: 2023-10-21 17:21:26 +0100
Hostname: Router-8
Magic: FreeBSD Kernel Dump
Version String: FreeBSD 14.0-CURRENT amd64 1400094 #1 plus-RELENG_23_09-n256155-ccf460c6e58: Fri Oct 20 06:38:47 UTC 2023
root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/
Panic String: page fault
Dump Parity: 2695527028
Bounds: 0
Dump Status: good
Updated by Kristof Provost about 1 year ago
The backtrace there shows us in the TCP output path (from a timer callback), sending a packet out. This requires setting the hop limit via `in6_selecthlim(inp, NULL);`.
gdb tries to mislead us a bit with the line number (in that it points to sys/netinet6/in6_src.c:850, which is the `if (ifp)` case, but we've supplied a NULL ifp), but the disassembly points the third case, where we look up the output interface via fib6_lookup(). That returns an ifp, and in this core dump we see an nhob_object pointing to a struct ifnet with various ifdead_* pointers. This would support the theory that we've somehow wound up getting a route via a destroyed interface.
I would expect such routes to be removed as part of the interface destruction, but clearly that's not happening for some reason.
Updated by Kristof Provost about 1 year ago
Before if_detach_internal() clears the if_afdata array it also removes all of the addresses from the interface, which ought to ensure we can't return this interface in a route lookup any more.
The address removal is passed to the rib code, which ought to result in the fib also clearing it. I wonder if there's an issue synchronising those. I've not yet been able to dump either rib or fib from this core file, so this is currently an unsupported theory.
Updated by Rob A about 1 year ago
Ok, I'll endeavour to get another core dump for you; although it may be as late as the weekend.
Updated by Kristof Provost about 1 year ago
I don't expect another core dump to make any difference in that analysis. I may or may not be able to find the information I need in the one I have, but another one is not likely to help.
Updated by Rob A about 1 year ago
Spending more time on my Ice Lake Xeon box recently but no observable difference to that of my 6100. Today's crash on 23.09-RC (I will retain the files, in case they offer anything new):
Crash report begins. Anonymous machine information:
amd64
14.0-CURRENT
FreeBSD 14.0-CURRENT amd64 1400094 #0 plus-RELENG_23_09-n256160-b373d8da4a4: Fri Oct 27 02:22:22 UTC 2023 root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/ERwoNEbB/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/
Crash report details:
No PHP errors found.
Filename: /var/crash/info.0
Dump header from device: /dev/nda0p3
Architecture: amd64
Architecture Version: 4
Dump Length: 301568
Blocksize: 512
Compression: none
Dumptime: 2023-11-04 16:42:38 +0000
Hostname: Router-7
Magic: FreeBSD Text Dump
Version String: FreeBSD 14.0-CURRENT amd64 1400094 #0 plus-RELENG_23_09-n256160-b373d8da4a4: Fri Oct 27 02:22:22 UTC 2023
root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/
Panic String: page fault
Dump Parity: 911556898
Bounds: 0
Dump Status: good
Filename: /var/crash/textdump.tar.0
ddb.txt���������������������������������������������������������������������������������������������0600����0�������0�������1012407�����14521472176� 7175� �����������������������������������������������������������������������������������������������������ustar���root����������������������������wheel������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������db:0:kdb.enter.default> run pfs
db:1:pfs> bt
Tracing pid 2 tid 100070 td 0xfffffe00e2238000
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00e1f4a840
vpanic() at vpanic+0x163/frame 0xfffffe00e1f4a970
panic() at panic+0x43/frame 0xfffffe00e1f4a9d0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00e1f4aa30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e1f4aa90
calltrap() at calltrap+0x8/frame 0xfffffe00e1f4aa90
--- trap 0xc, rip = 0xffffffff80f4e216, rsp = 0xfffffe00e1f4ab60, rbp = 0xfffffe00e1f4ab90 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00e1f4ab90
tcp_default_output() at tcp_default_output+0x1d97/frame 0xfffffe00e1f4ad70
tcp_timer_rexmt() at tcp_timer_rexmt+0x52f/frame 0xfffffe00e1f4add0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe00e1f4ae10
softclock_call_cc() at softclock_call_cc+0x134/frame 0xfffffe00e1f4aec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00e1f4aef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00e1f4af30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e1f4af30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs> show registers
cs 0x20
ds 0x3b
es 0x3b
fs 0x13
gs 0x1b
ss 0x28
rax 0x12
rcx 0xffffffff81458a29
rdx 0xfffffe00e1f4a480
rbx 0x100
rsp 0xfffffe00e1f4a840
rbp 0xfffffe00e1f4a840
rsi 0xa
rdi 0xffffffff82d4e850 gdb_consdev
r8 0
r9 0
r10 0
r11 0
r12 0
r13 0
r14 0xffffffff813dbf84
r15 0xfffffe00e2238000
rip 0xffffffff80d389c2 kdb_enter+0x32
rflags 0x86
kdb_enter+0x32: movq $0,0x2344e43(%rip)
db:1:pfs> show pcpu
cpuid = 5
dynamic pcpu = 0xfffffe00b7627f00
curthread = 0xfffffe00e2238000: pid 2 tid 100070 critnest 1 "clock (5)"
curpcb = 0xfffffe00e2238520
fpcurthread = none
idlethread = 0xfffffe00e2163000: tid 100008 "idle: cpu5"
self = 0xffffffff84015000
curpmap = 0xffffffff83021ab0
tssp = 0xffffffff84015384
rsp0 = 0xfffffe00e1f4b000
kcr3 = 0xffffffffffffffff
ucr3 = 0xffffffffffffffff
scr3 = 0x0
gs32p = 0xffffffff84015404
ldt = 0xffffffff84015444
tss = 0xffffffff84015434
curvnet = 0xfffff80001306340
db:1:pfs> run lockinfo
db:2:lockinfo> show locks
No such command; use "help" to list available commands
db:2:lockinfo> show alllocks
No such command; use "help" to list available commands
db:2:lockinfo> show lockedvnods
Locked vnodes
db:1:pfs> acttrace
☕️
Updated by Kristof Provost 12 months ago
I've pushed a workaround to both devel-main and plus-devel-main. That should avoid the panic (or at least make it much less likely to occur), but doesn't fix the underlying issue.
Updated by Jim Pingle 10 months ago
- Status changed from New to Feedback
- Assignee changed from Mateusz Guzik to Kristof Provost
- Target version changed from CE-Next to 2.8.0
- % Done changed from 0 to 100
Based on the last messages this should be in snapshots just needs testing/confirmation (or closing if we can't reproduce it locally)