Project

General

Profile

Actions

Regression #14431

closed

Sending IPv6 traffic on a disabled interface can trigger a kernel panic

Added by Steve Wheeler over 1 year ago. Updated 8 months ago.

Status:
Resolved
Priority:
High
Category:
Interfaces
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
Plus Target Version:
24.03
Release Notes:
Default
Affected Version:
Affected Architecture:

Description

This issue was hidden by https://redmine.pfsense.org/issues/14164 but now that is solved in 23.05 is being seen.

db:1:pfs> bt
Tracing pid 93402 tid 103857 td 0xfffffe00cf7cac80
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00cf8a0800
vpanic() at vpanic+0x183/frame 0xfffffe00cf8a0850
panic() at panic+0x43/frame 0xfffffe00cf8a08b0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00cf8a0910
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00cf8a0970
calltrap() at calltrap+0x8/frame 0xfffffe00cf8a0970
--- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00cf8a0a40, rbp = 0xfffffe00cf8a0a70 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00cf8a0a70
tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00cf8a0c60
tcp_output() at tcp_output+0x14/frame 0xfffffe00cf8a0c80
tcp6_usr_connect() at tcp6_usr_connect+0x2f4/frame 0xfffffe00cf8a0d10
soconnectat() at soconnectat+0x9e/frame 0xfffffe00cf8a0d60
kern_connectat() at kern_connectat+0xc9/frame 0xfffffe00cf8a0dc0
sys_connect() at sys_connect+0x75/frame 0xfffffe00cf8a0e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00cf8a0f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00cf8a0f30
--- syscall (98, FreeBSD ELF64, connect), rip = 0x800fddc8a, rsp = 0x7fffdf5f8c98, rbp = 0x7fffdf5f8cd0 ---
db:1:pfs> bt
Tracing pid 68614 tid 100330 td 0xfffffe00cf325720
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00c7d955f0
vpanic() at vpanic+0x183/frame 0xfffffe00c7d95640
panic() at panic+0x43/frame 0xfffffe00c7d956a0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00c7d95700
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00c7d95760
calltrap() at calltrap+0x8/frame 0xfffffe00c7d95760
--- trap 0xc, rip = 0xffffffff80f63aa4, rsp = 0xfffffe00c7d95830, rbp = 0xfffffe00c7d95a50 ---
ip6_output() at ip6_output+0xb74/frame 0xfffffe00c7d95a50
udp6_send() at udp6_send+0x78e/frame 0xfffffe00c7d95c10
sosend_dgram() at sosend_dgram+0x357/frame 0xfffffe00c7d95c70
sousrsend() at sousrsend+0x5f/frame 0xfffffe00c7d95cd0
kern_sendit() at kern_sendit+0x132/frame 0xfffffe00c7d95d60
sendit() at sendit+0xb7/frame 0xfffffe00c7d95db0
sys_sendto() at sys_sendto+0x4d/frame 0xfffffe00c7d95e00
amd64_syscall() at amd64_syscall+0x109/frame 0xfffffe00c7d95f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c7d95f30
--- syscall (133, FreeBSD ELF64, sendto), rip = 0x823f95f2a, rsp = 0x8202cea88, rbp = 0x8202cead0 ---
db:1:pfs> bt
Tracing pid 2 tid 100041 td 0xfffffe0085264560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850ad910
vpanic() at vpanic+0x183/frame 0xfffffe00850ad960
panic() at panic+0x43/frame 0xfffffe00850ad9c0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00850ada20
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850ada80
calltrap() at calltrap+0x8/frame 0xfffffe00850ada80
--- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00850adb50, rbp = 0xfffffe00850adb80 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850adb80
tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00850add70
tcp_timer_rexmt() at tcp_timer_rexmt+0x514/frame 0xfffffe00850addd0
tcp_timer_enter() at tcp_timer_enter+0x102/frame 0xfffffe00850ade10
softclock_call_cc() at softclock_call_cc+0x13c/frame 0xfffffe00850adec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850adef0
fork_exit() at fork_exit+0x7d/frame 0xfffffe00850adf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850adf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs>

Related issues

Related to Bug #14575: Renewing the pppoe WAN cause crash if the Tailscale enabledDuplicate

Actions
Actions #1

Updated by Marcos M over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Rob A over 1 year ago

To add additional context that may aid in diagnostics:

  • The issue presents with any change in WAN interface status - virtual or physical (eg unplug event)
  • This can include a PPPoE tunnel to an ISP being bounced or refreshed by the ISP
  • The example back-traces above were triggered by the PPPoE WAN link being 'disconnected' whilst the physical interface is still up
  • The issue can be replicated by 'disconnecting' the established PPPoE link via the GUI button on Status / Interfaces
  • The issue is intermittent and is triggered >50% of the time when under test

Full logs available, on request.

Actions #3

Updated by Mateusz Guzik over 1 year ago

All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.

Given your description though I have to ask if you are explicitly using ipv6 for anything -- it may be a temporary workaround is to not set ipv6 addresses on any of the interfaces.

Actions #4

Updated by Kristof Provost over 1 year ago

The addresses in both the ip6_output() and in6_selecthlim() panics suggest that fib6_lookup() returned an nhop_object with a struct ifnet where if_afdata[AF_INET6] is NULL.
That shouldn't happen, and it's not clear to me how it can happen.

I've attempted to replicate this setup, with PPPoE carrying IPv6 traffic (iperf3 --bidir -P 4, with GUA and LL addresses) and hitting the 'Disconnect WAN' button on the Status/Interfaces page. Over 10+ attempts the system remained up, returned errors to iperf3 and re-connected without issue (when triggered via the status page).

Actions #5

Updated by Kristof Provost over 1 year ago

I should add that I've been running iperf3 on the pfsense device. The backtraces show locally originated traffic, so that seemed like the setup most likely to trigger this issue.

Actions #6

Updated by Rob A over 1 year ago

Mateusz Guzik wrote in #note-3:

All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.

Given your description though I have to ask if you are explicitly using ipv6 for anything -- it may be a temporary workaround is to not set ipv6 addresses on any of the interfaces.

In my specific use-case yes, I have equipment that I link to which are pure IPv6.

I run an exceptionally common config for the UK market. It is a standard IPv4 and IPv6 WAN connection over PPPoE to the main UK backbone (Openreach). The PPPoE MTU is a little higher than in some counties as it runs a standard 1500 MTU through the tunnel, with the physical link running a bit higher (ie a minimum of 1508 MTU) to absorb the typical PPPoE 8-byte overhead. In my case it is FTTP via an ONT running a 1000/115 Mbit service.

I can run additional tests or diagnostics once outside of normal business hours. Just point me where I need to go.

Actions #7

Updated by Rob A over 1 year ago

This may or may not be irrelevant to the underlying fault but combing through other logs I can multiple WAN PPPoE connection attempts and failures. The ppp.log frag below is a successful PPPoE handshake. Even here I can see some hiccups; if I substitute a different router the PPPoE connection logs are clean, each and every time.

Show

☕️

Actions #8

Updated by Mateusz Guzik over 1 year ago

Kristof Provost wrote in #note-4:

The addresses in both the ip6_output() and in6_selecthlim() panics suggest that fib6_lookup() returned an nhop_object with a struct ifnet where if_afdata[AF_INET6] is NULL.
That shouldn't happen, and it's not clear to me how it can happen.

This is probably a race where you get far enough that any sanity checks already passed, but nothing guaranteed the already checked conditions remain stable for the duration and if you are unlucky enough they did change late enough.

Rob A wrote in #note-6:

Mateusz Guzik wrote in #note-3:

All the above crashes are in ipv6 code, most likely racing against an interface and/or address removal.

Given your description though I have to ask if you are explicitly using ipv6 for anything -- it may be a temporary workaround is to not set ipv6 addresses on any of the interfaces.

In my specific use-case yes, I have equipment that I link to which are pure IPv6.

Thanks, I'll look into it. This is most likely either a 1h job OR several days, no in-between.

Actions #9

Updated by Rob A over 1 year ago

PPPoE reconnection WITHOUT triggering a pfSense Crash

From the 2am time slot this looks like an ISP-triggered reconnection and it did not cause pfSense to crash. I have added the ppp.log from this otherwise 'non-event' as it captures the multiple failed attempts to complete a full PPPoE handshake. I have never experienced anything other than a first-time connection with my other non-Netgate/pfSense router hardware. Perhaps these failed and repeated attempts are part of the issue?

Show

[Is there a code or button to add a spoiler type of truncation to logs such as this, as they break-up the readability of the thread somewhat?]

Found it, the collapse code. I will try harder, just getting used to Redmine.

☕️

Actions #10

Updated by Mateusz Guzik over 1 year ago

After poking around here is my analysis, which confirms my preliminary suspicion:

All of the crash sites are invoking if_getafdata(ifp, AF_INET6), typically through the ND_IFINFO macro. The routine returns a pointer to protocol-specific if_data, which when crashing happens to be NULL.

It stems from a race against interface destruction, which eventually does:

static int
if_detach_internal(struct ifnet *ifp, bool vmove)
{
[...]
        IF_AFDATA_LOCK(ifp);
        i = ifp->if_afdata_initialized;
        ifp->if_afdata_initialized = 0;
        IF_AFDATA_UNLOCK(ifp);
        if (i == 0)
                return (0);
        SLIST_FOREACH(dp, &domains, dom_next) {
                if (dp->dom_ifdetach && ifp->if_afdata[dp->dom_family]) {
                        (*dp->dom_ifdetach)(ifp,
                            ifp->if_afdata[dp->dom_family]); // uaf after this
                        ifp->if_afdata[dp->dom_family] = NULL; // NULL pointer deref after this
                }
        }
[...]

As ipv6 support has a dom_ifdetach, this results in a use-after-free or a NULL pointer dereference, depending on timing of parallel access.

The entire thing is rather seriously misdesigned, but as a damage-controlling measure I think one can split dom_ifdetach into two parts, where the last one only executes when the interface itself is going away. I'm going to have to flame this over with network people.

Actions #11

Updated by Jim Pingle over 1 year ago

  • Target version changed from 2.7.0 to CE-Next
Actions #12

Updated by Jim Pingle over 1 year ago

  • Related to Bug #14575: Renewing the pppoe WAN cause crash if the Tailscale enabled added
Actions #13

Updated by Kristof Provost over 1 year ago

Is Tailscale also in play here? I've trying and failing to reproduce this again. No matter what I try to do, I simply cannot get the system to panic. Clearly I'm missing some factor, but right now I have no idea what that'd be.

Actions #14

Updated by Rob A over 1 year ago

In my case there is no involvement of Tailscale as I do not use it.

Regards.

☕️

Actions #15

Updated by Rob A over 1 year ago

I have switched to 23.09 dev as that is where most of the activity is focused. I will monitor and update if this issue has moved across to 23.09 dev or not.

☕️

Actions #16

Updated by Rob A over 1 year ago

Issue remains 'live' with 23.09 dev. Details of the first crash on this version, triggered this time by taking the WAN interface down and then up via the GUI:

Crash report begins.  Anonymous machine information:

amd64
14.0-ALPHA2
FreeBSD 14.0-ALPHA2 amd64 1400094 #1 plus-devel-main-n256133-bef8dca4536: Tue Sep  5 06:26:19 UTC 2023     root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/obj/amd64/fWgcJpOQ/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/s

Crash report details:

No PHP errors found.

Filename: /var/crash/info.0
Dump header from device: /dev/nda0p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 228864
  Blocksize: 512
  Compression: none
  Dumptime: 2023-09-06 18:00:47 +0100
  Hostname: Router-8.redacted.me
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 14.0-ALPHA2 amd64 1400094 #1 plus-devel-main-n256133-bef8dca4536: Tue Sep  5 06:26:19 UTC 2023
    root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-master-main/obj/amd64/fW
  Panic String: page fault
  Dump Parity: 954902300
  Bounds: 0
  Dump Status: good
db:1:pfs> bt
Tracing pid 2 tid 100041 td 0xfffffe0085272560
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850c5840
vpanic() at vpanic+0x163/frame 0xfffffe00850c5970
panic() at panic+0x43/frame 0xfffffe00850c59d0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00850c5a30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850c5a90
calltrap() at calltrap+0x8/frame 0xfffffe00850c5a90
--- trap 0xc, rip = 0xffffffff80f4d9e6, rsp = 0xfffffe00850c5b60, rbp = 0xfffffe00850c5b90 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850c5b90
tcp_default_output() at tcp_default_output+0x1d97/frame 0xfffffe00850c5d70
tcp_timer_rexmt() at tcp_timer_rexmt+0x52f/frame 0xfffffe00850c5dd0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe00850c5e10
softclock_call_cc() at softclock_call_cc+0x134/frame 0xfffffe00850c5ec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850c5ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00850c5f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850c5f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Full logs available on request.

Actions #17

Updated by Rob A over 1 year ago

Core dump provided to Christian McDonald for the related ndp issue.

☕️

Actions #18

Updated by Jim Pingle over 1 year ago

  • Subject changed from IPv6: Sending traffic on a disabled interface cause a kernel panic. to Sending IPv6 traffic on a disabled interface can trigger a kernel panic
Actions #19

Updated by Jim Pingle about 1 year ago

  • Plus Target Version changed from 23.09 to 24.03

Moving the target ahead for now but if we do manage to solve it before release we can always move it back.

Actions #20

Updated by Rob A about 1 year ago

Understood and thanks for the heads-up that the fix may be 6 months away. I'll have to find a new router solution in the interim, meaning different hardware too as I run pfSense on a Netgate 6100. 🤕

Standing-by as a test-bed in the interim period.

☕️

Actions #21

Updated by Kristof Provost about 1 year ago

Unfortunately both Steve and I have been unable to reproduce this problem.

We could try to see if a full core dumps helps, but I suspect this to be a race condition and even the core file might not help.

Still, if you're game there are instructions on how to get a full core dump here: https://forum.netgate.com/post/1127502
We also need that to be done with a kernel we have debug symbols for. Can you install the latest snapshot kernel-debug (which includes symbols files)? Or I can provide a kernel I have symbols for, if that's easier.

Actions #22

Updated by Rob A about 1 year ago

Thanks Kristof, as it happens I had a crash today:

db:1:pfs> bt
Tracing pid 2 tid 100043 td 0xfffffe0085271720
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850bb840
vpanic() at vpanic+0x163/frame 0xfffffe00850bb970
panic() at panic+0x43/frame 0xfffffe00850bb9d0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00850bba30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850bba90
calltrap() at calltrap+0x8/frame 0xfffffe00850bba90
--- trap 0xc, rip = 0xffffffff80f4e116, rsp = 0xfffffe00850bbb60, rbp = 0xfffffe00850bbb90 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850bbb90
tcp_default_output() at tcp_default_output+0x1d97/frame 0xfffffe00850bbd70
tcp_timer_rexmt() at tcp_timer_rexmt+0x52f/frame 0xfffffe00850bbdd0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe00850bbe10
softclock_call_cc() at softclock_call_cc+0x134/frame 0xfffffe00850bbec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850bbef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00850bbf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850bbf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs>  show registers
cs                        0x20
ds                        0x3b
es                        0x3b
fs                        0x13
gs                        0x1b
ss                        0x28
rax                       0x12
rcx         0xffffffff81457767
rdx                      0x3f8
rbx                      0x100
rsp         0xfffffe00850bb840
rbp         0xfffffe00850bb840
rsi                 0xc75d684a
rdi                        0x4
r8                0xb3c75d684a
r9          0xfffffe0085271720
r10         0xfffffe00850bb720
r11         0xcedfc2df9afff59c
r12                          0
r13                          0
r14         0xffffffff813db3a9
r15         0xfffffe0085271720
rip         0xffffffff80d388c2  kdb_enter+0x32
rflags                    0x86
kdb_enter+0x32: movq    $0,0x2344f43(%rip)
db:1:pfs>......

It does have the feel of a race-condition and with Steve stuck on G.fast he is probably sheltered from it.

Happy to provide a kernel dump, just in case it helps. I didn't quite understand the instructions as part of that thread though. I have managed one previously where Christian provided everything I needed - presumably the 'kernel + symbols' you mentioned. I seemed to follow those instructions ok and emailed the results to him (ndp issue) https://redmine.pfsense.org/issues/14755#change-69465. So if it can be simplified I can run tests over this weekend. I'm running the latest 23.09d but I have a snapshot back to 23.05.1 if required.

Actions #23

Updated by Kristof Provost about 1 year ago

Can you install and run this kernel and try to get a core dump?
https://www.codepro.be/files/pfSense-kernel-debug-pfSense-23.09.a.20231005.1555.pkg

Actions #24

Updated by Rob A about 1 year ago

No problem. Best method to install this in a recoverable way?

Actions #25

Updated by Kristof Provost about 1 year ago

You can just pkg install and pkg remove it later.
As usual, make a config backup just in case, but this ought to be pretty safe. Make a boot environment to be extra super-duper safe and for very convenient revert later.

Actions #26

Updated by Rob A about 1 year ago

No joy. With pkg install I get the error:

pkg: No packages available to install matching 'pfSense-23.09.a.20231005.1555.pkg' have been found in the repositories

Tried with the full url, just the package name and from a local copy of the file only to get a variation of the error above in the CLI.

Using the GUI I loaded it via System Patches... not sure if that is a genuine method or not.

Actions #27

Updated by Rob A about 1 year ago

I ran the system (In whatever state I achieved above) but I was fighting other issues such as Kea and pfBlocker not running. Kea could usually be relaunched but the system was a bit buggy. I rolled back to my snapshot to await whatever I did wrong with the pkg.

Actions #28

Updated by Kristof Provost about 1 year ago

Download the file to your device, and install with `pkg install -U <filename>`, via the device CLI.

Actions #29

Updated by Rob A about 1 year ago

Ok, your job would be easy if it wasn't for these dull customers!

[23.09-DEVELOPMENT][admin@Router-8.redacted.me]/root: pkg install -U /tmp/pfSense-kernel-debug-pfSense-23.09.a.20231005.1555.pkg
Checking integrity... done (0 conflicting)
The following 1 package(s) will be affected (of 0 checked):

New packages to be INSTALLED:
    pfSense-kernel-debug-pfSense: 23.09.a.20231005.1555 [unknown-repository]

Number of packages to be installed: 1

The process will require 305 MiB more space.

Proceed with this action? [y/N]: y
[1/1] Installing pfSense-kernel-debug-pfSense-23.09.a.20231005.1555...
Extracting pfSense-kernel-debug-pfSense-23.09.a.20231005.1555: 100%
[23.09-DEVELOPMENT][admin@Router-8.redacted.me]/root: 

☕️

Actions #30

Updated by Rob A about 1 year ago

With the debug kernel running I triggered a crash and have the regular crash report. I did not see a core dump file in root though and could not find it. Where would the core dump be located?

Actions #31

Updated by Kristof Provost about 1 year ago

It should end up in /var/crash

Actions #32

Updated by Rob A about 1 year ago

Regrettably no:

/root: ls /var/crash
bounds            info.0            info.last         textdump.tar.0    textdump.tar.last
Actions #33

Updated by Kristof Provost about 1 year ago

Did you edit the pfSense-ddb.conf file and add a swap partition for it to dump to?
Christian is working on a shiny webui button to make this all easier, but sadly that’s not ready yet.

Actions #34

Updated by Rob A about 1 year ago

Not sure if I have done so previously. Currently the file looks like this:

# $FreeBSD$
#
#  This file is read when going to multi-user and its contents piped thru
#  ``ddb'' to define debugging scripts.
#
# see ``man 4 ddb'' and ``man 8 ddb'' for details.
#

script lockinfo=show locks; show alllocks; show lockedvnods
script pfs=bt ; show registers ; show pcpu ; run lockinfo ; acttrace ; ps ; alltrace

# kdb.enter.panic       panic(9) was called.
script kdb.enter.default=textdump set; capture on; run pfs ; capture off; textdump dump; reset

# kdb.enter.witness    witness(4) detected a locking error.
script kdb.enter.witness=run lockinfo
Actions #35

Updated by Marcos M about 1 year ago

To clarify, make sure that after installing the kernel-debug package, you reboot and select the debug kernel (option 6) in the boot loader. Then follow the steps
on the link provided previously:

Short version: add a device for a swap partition, ideally at least as large as system RAM. A USB stick should work. (Note you'll lose all data on the stick!)
If the USB (or other) swap device is da0 do:

gpart destroy -F da0
gpart create -s gpt da0
gpart add -t freebsd-swap da0

Add /dev/da0p1 none swap sw 0 0 to /etc/fstab.
Edit /etc/pfSense-ddb.conf and change the script kdb.enter.default to script kdb.enter.default=bt ; show registers ; dump ; reset.

Reboot.

Actions #36

Updated by Rob A about 1 year ago

I'm finding it hard to distinguish 'the steps' in that thread from the normal noise and I don't know how to enter the boot loader for option 6 either.

Perhaps I am in too deep for my level as I am struggling with the 'assumed knowledge' that I don't seem to have when following this issue thread. It felt easier when I was asked for a ndp.core dump last time around.

Actions #37

Updated by Marcos M about 1 year ago

I'll message you on the forum.

Actions #38

Updated by Rob A about 1 year ago

Hopefully I have captured the panic:

<118>Netgate pfSense Plus 23.09-BETA amd64 20231020-0600
<118>Bootup complete
<6>ng0: changing name to 'pppoe0'
pf_test6: kif == NULL, if_xname pppoe0
<6>ng0: changing name to 'pppoe0'

Fatal trap 12: page fault while in kernel mode

cpuid = 3; apic id = 18
fault virtual address    = 0x10
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff80f4e116
stack pointer            = 0x0:0xfffffe00850b6b60
frame pointer            = 0x0:0xfffffe00850b6b90
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 2 (clock (3))
rdi: fffff80203712800 rsi: 000000000000001c rdx: fffff8013760d878
rcx: fffff8013760d878  r8: 00000000ffffffbd  r9: 0000000000000018
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00850b6b90
r10: fffff802033dd8c0 r11: fffff8016f5e5000 r12: 0000000000010300
r13: fffff80203676b98 r14: fffffe00850b6b68 r15: 0000000000000018
trap number        = 12
panic: page fault
cpuid = 3
time = 1697905286
KDB: enter: panic

vmcore dump available on request ~ 800 MB.

Actions #39

Updated by Rob A about 1 year ago

Dump header from device: /dev/nda0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 791597056
  Blocksize: 512
  Compression: none
  Dumptime: 2023-10-21 17:21:26 +0100
  Hostname: Router-8
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 14.0-CURRENT amd64 1400094 #1 plus-RELENG_23_09-n256155-ccf460c6e58: Fri Oct 20 06:38:47 UTC 2023
    root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/
  Panic String: page fault
  Dump Parity: 2695527028
  Bounds: 0
  Dump Status: good

Actions #40

Updated by Kristof Provost about 1 year ago

The backtrace there shows us in the TCP output path (from a timer callback), sending a packet out. This requires setting the hop limit via `in6_selecthlim(inp, NULL);`.
gdb tries to mislead us a bit with the line number (in that it points to sys/netinet6/in6_src.c:850, which is the `if (ifp)` case, but we've supplied a NULL ifp), but the disassembly points the third case, where we look up the output interface via fib6_lookup(). That returns an ifp, and in this core dump we see an nhob_object pointing to a struct ifnet with various ifdead_* pointers. This would support the theory that we've somehow wound up getting a route via a destroyed interface.

I would expect such routes to be removed as part of the interface destruction, but clearly that's not happening for some reason.

Actions #41

Updated by Kristof Provost about 1 year ago

Before if_detach_internal() clears the if_afdata array it also removes all of the addresses from the interface, which ought to ensure we can't return this interface in a route lookup any more.
The address removal is passed to the rib code, which ought to result in the fib also clearing it. I wonder if there's an issue synchronising those. I've not yet been able to dump either rib or fib from this core file, so this is currently an unsupported theory.

Actions #42

Updated by Rob A about 1 year ago

Ok, I'll endeavour to get another core dump for you; although it may be as late as the weekend.

Actions #43

Updated by Kristof Provost about 1 year ago

I don't expect another core dump to make any difference in that analysis. I may or may not be able to find the information I need in the one I have, but another one is not likely to help.

Actions #44

Updated by Rob A about 1 year ago

Understood - thank you.

Actions #45

Updated by Rob A about 1 year ago

Spending more time on my Ice Lake Xeon box recently but no observable difference to that of my 6100. Today's crash on 23.09-RC (I will retain the files, in case they offer anything new):

Crash report begins.  Anonymous machine information:

amd64
14.0-CURRENT
FreeBSD 14.0-CURRENT amd64 1400094 #0 plus-RELENG_23_09-n256160-b373d8da4a4: Fri Oct 27 02:22:22 UTC 2023     root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/ERwoNEbB/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/

Crash report details:

No PHP errors found.

Filename: /var/crash/info.0
Dump header from device: /dev/nda0p3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 301568
  Blocksize: 512
  Compression: none
  Dumptime: 2023-11-04 16:42:38 +0000
  Hostname: Router-7
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 14.0-CURRENT amd64 1400094 #0 plus-RELENG_23_09-n256160-b373d8da4a4: Fri Oct 27 02:22:22 UTC 2023
    root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/
  Panic String: page fault
  Dump Parity: 911556898
  Bounds: 0
  Dump Status: good

Filename: /var/crash/textdump.tar.0
ddb.txt���������������������������������������������������������������������������������������������0600����0�������0�������1012407�����14521472176�  7175� �����������������������������������������������������������������������������������������������������ustar���root����������������������������wheel������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������db:0:kdb.enter.default>  run pfs
db:1:pfs> bt
Tracing pid 2 tid 100070 td 0xfffffe00e2238000
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00e1f4a840
vpanic() at vpanic+0x163/frame 0xfffffe00e1f4a970
panic() at panic+0x43/frame 0xfffffe00e1f4a9d0
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00e1f4aa30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00e1f4aa90
calltrap() at calltrap+0x8/frame 0xfffffe00e1f4aa90
--- trap 0xc, rip = 0xffffffff80f4e216, rsp = 0xfffffe00e1f4ab60, rbp = 0xfffffe00e1f4ab90 ---
in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00e1f4ab90
tcp_default_output() at tcp_default_output+0x1d97/frame 0xfffffe00e1f4ad70
tcp_timer_rexmt() at tcp_timer_rexmt+0x52f/frame 0xfffffe00e1f4add0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe00e1f4ae10
softclock_call_cc() at softclock_call_cc+0x134/frame 0xfffffe00e1f4aec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00e1f4aef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00e1f4af30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e1f4af30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs>  show registers
cs                        0x20
ds                        0x3b
es                        0x3b
fs                        0x13
gs                        0x1b
ss                        0x28
rax                       0x12
rcx         0xffffffff81458a29
rdx         0xfffffe00e1f4a480
rbx                      0x100
rsp         0xfffffe00e1f4a840
rbp         0xfffffe00e1f4a840
rsi                        0xa
rdi         0xffffffff82d4e850  gdb_consdev
r8                           0
r9                           0
r10                          0
r11                          0
r12                          0
r13                          0
r14         0xffffffff813dbf84
r15         0xfffffe00e2238000
rip         0xffffffff80d389c2  kdb_enter+0x32
rflags                    0x86
kdb_enter+0x32: movq    $0,0x2344e43(%rip)
db:1:pfs>  show pcpu
cpuid        = 5
dynamic pcpu = 0xfffffe00b7627f00
curthread    = 0xfffffe00e2238000: pid 2 tid 100070 critnest 1 "clock (5)" 
curpcb       = 0xfffffe00e2238520
fpcurthread  = none
idlethread   = 0xfffffe00e2163000: tid 100008 "idle: cpu5" 
self         = 0xffffffff84015000
curpmap      = 0xffffffff83021ab0
tssp         = 0xffffffff84015384
rsp0         = 0xfffffe00e1f4b000
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff84015404
ldt          = 0xffffffff84015444
tss          = 0xffffffff84015434
curvnet      = 0xfffff80001306340
db:1:pfs>  run lockinfo
db:2:lockinfo> show locks
No such command; use "help" to list available commands
db:2:lockinfo>  show alllocks
No such command; use "help" to list available commands
db:2:lockinfo>  show lockedvnods
Locked vnodes
db:1:pfs>  acttrace

☕️

Actions #46

Updated by Kristof Provost 12 months ago

I've pushed a workaround to both devel-main and plus-devel-main. That should avoid the panic (or at least make it much less likely to occur), but doesn't fix the underlying issue.

Actions #47

Updated by Jim Pingle 10 months ago

  • Status changed from New to Feedback
  • Assignee changed from Mateusz Guzik to Kristof Provost
  • Target version changed from CE-Next to 2.8.0
  • % Done changed from 0 to 100

Based on the last messages this should be in snapshots just needs testing/confirmation (or closing if we can't reproduce it locally)

Actions #48

Updated by Jim Pingle 8 months ago

  • Status changed from Feedback to Resolved
Actions

Also available in: Atom PDF