Bug #14077

closed

Kernel panic from incoming IPv6 connections

Added by Marcos M about 1 year ago. Updated 10 months ago.

Status: Resolved
Priority: Normal
Category: Operating System
Target version:
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 23.05
Release Notes: Default
Affected Version: 2.7.0
Affected Architecture: 6100

Description

After upgrading to 23.01, the system crashes with the following test on a Netgate 6100:
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 04
fault virtual address   = 0x460
fault code      = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80eb8606
stack pointer           = 0x28:0xfffffe00107aa020
frame pointer           = 0x28:0xfffffe00107aa020
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process     = 0 (if_io_tqg_0)
rdi:                0 rsi:                2 rdx:                1
rcx:                0  r8:                0  r9:  100000000000000
rax:                2 rbx:                0 rbp: fffffe00107aa020
r10: fffff8010f7de4f8 r11:                8 r12: fffffe00107aa088
r13: fffff8002ce71478 r14:                0 r15: fffff8002ce71400
trap number     = 12
panic: page fault
cpuid = 0
time = 1677006198
KDB: enter: panic
db:1:pfs> bt
Tracing pid 0 tid 100007 td 0xfffffe0011f46720
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00107a9de0
vpanic() at vpanic+0x182/frame 0xfffffe00107a9e30
panic() at panic+0x43/frame 0xfffffe00107a9e90
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00107a9ef0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00107a9f50
calltrap() at calltrap+0x8/frame 0xfffffe00107a9f50
--- trap 0xc, rip = 0xffffffff80eb8606, rsp = 0xfffffe00107aa020, rbp = 0xfffffe00107aa020 ---
if_inc_counter() at if_inc_counter+0x6/frame 0xfffffe00107aa020
looutput() at looutput+0x4f/frame 0xfffffe00107aa050
ip6_forward() at ip6_forward+0x888/frame 0xfffffe00107aa150
pf_refragment6() at pf_refragment6+0x164/frame 0xfffffe00107aa1a0
pf_test6() at pf_test6+0x1380/frame 0xfffffe00107aa310
pf_check6_out() at pf_check6_out+0x40/frame 0xfffffe00107aa340
pfil_mbuf_out() at pfil_mbuf_out+0x35/frame 0xfffffe00107aa370
ip6_output() at ip6_output+0x1204/frame 0xfffffe00107aa5b0
icmp6_reflect() at icmp6_reflect+0x2dd/frame 0xfffffe00107aa660
icmp6_error() at icmp6_error+0x37c/frame 0xfffffe00107aa6d0
pf_route6() at pf_route6+0x7ff/frame 0xfffffe00107aa7b0
pf_test6() at pf_test6+0xce3/frame 0xfffffe00107aa930
pf_check6_out() at pf_check6_out+0x40/frame 0xfffffe00107aa960
pfil_mbuf_out() at pfil_mbuf_out+0x35/frame 0xfffffe00107aa990
ip6_forward() at ip6_forward+0x3f4/frame 0xfffffe00107aaa90
ip6_input() at ip6_input+0x9a4/frame 0xfffffe00107aab70
netisr_dispatch_src() at netisr_dispatch_src+0x2a6/frame 0xfffffe00107aabc0
ether_demux() at ether_demux+0x144/frame 0xfffffe00107aabf0
ether_nh_input() at ether_nh_input+0x353/frame 0xfffffe00107aac50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00107aaca0
ether_input() at ether_input+0x69/frame 0xfffffe00107aad00
iflib_rxeof() at iflib_rxeof+0xbdb/frame 0xfffffe00107aae00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00107aae40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe00107aaec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe00107aaef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00107aaf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00107aaf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs>  show registers
cs                        0x20
ds                        0x3b
es                        0x3b
fs                        0x13
gs                        0x1b
ss                        0x28
rax                       0x12
rcx                        0x1
rdx                      0x3f8
rbx                      0x100
rsp         0xfffffe00107a9de0
rbp         0xfffffe00107a9de0
rsi                          0
rdi                        0x4
r8          0xfefefefefefefeff
r9          0x8080808080808080
r10         0xfffffe00107a9cc0
r11         0xcedfc2df9afff59c
r12                      0x400
r13         0xfffffe00107a9f60
r14         0xfffffe00107a9e70
r15         0xfffffe0011f46720
rip         0xffffffff80dd82f2  kdb_enter+0x32
rflags                    0x82
kdb_enter+0x32: movq    $0,0x27bd313(%rip)
db:1:pfs>  show pcpu
cpuid        = 0
dynamic pcpu = 0x126d800
curthread    = 0xfffffe0011f46720: pid 0 tid 100007 critnest 1 "if_io_tqg_0" 
curpcb       = 0xfffffe0011f46c40
fpcurthread  = none
idlethread   = 0xfffffe0011f483a0: tid 100003 "idle: cpu0" 
self         = 0xffffffff84610000
curpmap      = 0xffffffff83549750
tssp         = 0xffffffff84610384
rsp0         = 0xfffffe00107ab000
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff84610404
ldt          = 0xffffffff84610444
tss          = 0xffffffff84610434
curvnet      = 0xfffff800011d0900
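
One way to read the dump above: rdi (the first argument register on amd64) is 0 at the fault, and the faulting address is 0x460. That is the pattern produced by a load through a NULL structure pointer, where the address touched is NULL plus the offset of the field being read. The C sketch below only illustrates that arithmetic. The struct, its padding, and the field name are hypothetical and do not reproduce the real struct ifnet layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an interface structure. The padding exists only
 * to place a field at offset 0x460; the real kernel layout is not shown. */
struct ifnet_model {
    char other_fields[0x460];
    void *counters;             /* hypothetical field name */
};

/* Compile-time check that the illustrative field sits at offset 0x460. */
static_assert(offsetof(struct ifnet_model, counters) == 0x460,
    "illustrative field must sit at offset 0x460");

/* Address that a load of p->counters would touch, computed without
 * dereferencing p. */
uintptr_t
field_address(const struct ifnet_model *p)
{
    return (uintptr_t)p + offsetof(struct ifnet_model, counters);
}
```

With a NULL pointer, field_address() yields 0x460 on mainstream platforms, which is in an unmapped page and would match "supervisor read data, page not present".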

Related issues

Related to Bug #14092: Kernel panic when PF passes a large/fragmented ICMP6 packet (Resolved, Kristof Provost)

Actions #1

Updated by Jim Pingle about 1 year ago

There must be some other required component to replicate this. I've not seen a panic like this on the 6100 at my edge and I've been pushing every bit of my edge traffic through it consistently, a decent chunk of it going through IPv6 (with and without NPt).

I was able to download that entire Linux ISO torrent without error.

Actions #2

Updated by Bruno Dambrine about 1 year ago

I reinstalled 23.01 on the 6100 to make sure that the issue is not linked to the upgrade itself.
I got the same result, so I am back on 22.05.

I will create a boot environment in which I upgrade to 23.01 and reset it to factory defaults.
Then I will try to rebuild my configuration step by step and run some tests.
Maybe I will be able to find the settings that trigger the crash.

Actions #3

Updated by Paul Kennedy about 1 year ago

Hi guys.

I think I posted the same issue, seen on non-Netgate hardware, in the main forum: https://forum.netgate.com/topic/178613/23-01-crashing-frequently-ipsec-connections-constantly-dropping-and-respawning-unable-to-access-http-over-vpn-address-constantly-times-out

It looks like the same panic (I uploaded the dump files in that post). Maybe it is related?

Actions #5

Updated by Bruno Dambrine about 1 year ago

Hi.

I have rebuilt the configuration and I may have some useful information.

First of all, some information on how I am connected to the Internet. I have an optical fiber line that ends at the Internet Service Provider box.
Behind the box there are a TV decoder and the Netgate 6100. My whole network is behind the Netgate 6100 (switches, PCs, NAS, ...).

While rebuilding my configuration, I found that the crash appears once the NAT rules are in place.
I download torrents (Linux images, ...) with my QNAP NAS, using QNAP Download Station (software provided by QNAP and installed on the NAS).
Download Station uses ports 6881 to 6889 for incoming TCP connections (to speed up downloading and seeding; the exact ports depend on the torrent client, but this range is common).
So I configured NAT on the ISP box to forward ports 6881 to 6889 to the Netgate 6100.
And I configured NAT on the Netgate 6100 to forward ports 6881 to 6889 to the NAS.
If I disable the two rules (IPv4 and IPv6) on the 6100, I don't get the crash.
If I keep them but disable the rules on the ISP box, I still don't get the crash.
So the crash seems to be linked to the incoming connections, not to the configuration itself.
In my configuration, I also have a NAT rule on port 443 for an HTTPS server in my network, but it does not cause any crash.
The main difference between the 443 NAT and the 6881-6889 NAT is the number of connections.
On the 443 NAT I have only a few connections, but when I download torrents I have a lot of connections on the 6881-6889 NAT.

My guess is that the issue is in the NAT: a few connections are OK, but a lot of connections lead to a crash.
So the downloads just create the conditions for the crash. The issue may appear in any situation where there are many connections through a NAT rule.

I hope this information helps in solving the issue.

@Jim
Maybe during your tests you did not have any incoming connections (firewall, ...). As I said earlier, when I disable the NAT, everything is fine. It is just slower.

Actions #6

Updated by Jim Pingle about 1 year ago

  • Project changed from pfSense Plus to pfSense
  • Subject changed from Kernel panic on 6100 to Kernel panic from incoming IPv6 connections
  • Category changed from Operating System to Operating System
  • Assignee set to Kristof Provost
  • Target version set to 2.7.0
  • Affected Plus Version deleted (23.01)
  • Plus Target Version set to 23.05

This looks similar to another crash we have been able to reproduce, and we're still working on a fix. I suspect it's the same root cause based on the similarity of the backtrace in the crash dumps. It does seem to be tied to incoming packets, but not the type or volume. If it's the same as the other issue, it may be from incoming packets which are larger than the MTU on the link.

Actions #7

Updated by Bruno Dambrine about 1 year ago

Thank you for the information.

I got an unexpected crash, but I had forgotten that I have another NAT rule (the 443 NAT rule)...
So I am switching back to 22.05 while you are working on a fix.

Actions #8

Updated by Kristof Provost about 1 year ago

This issue isn't related to IPv4 NAT, so your NAT rules will not matter.

See #14092 as well, because this is almost certainly that issue. The fix is pending review upstream, and will likely land in snapshot builds later this week.

Actions #9

Updated by Flole Systems about 1 year ago

#14092 is not public, so it's impossible to check what that one is about and what will trigger it.

Actions #10

Updated by Kristof Provost about 1 year ago

Sorry, I missed that.

I believe I understand the issue. Briefly put, pf_refragment6() ends up calling ip6_forward() for traffic in the output (so not forwarding) path, and ip6_forward() assumes that m->m_pkthdr.rcvif is set, which is not the case for output traffic.

This fixes the panic: https://reviews.freebsd.org/D39061 (and subsequent reviews fix link-local functionality and add a test case).
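
The failure mode described above can be sketched in C as follows. This is a minimal model under stated assumptions, not the kernel code and not the change in D39061: the struct names, the function forward_model(), and the result enum are all hypothetical. It only shows why a routine written for the forwarding path, which expects a receive interface on every packet, faults when handed a locally generated (output path) packet, and what an explicit guard looks like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for an interface and a packet. */
struct ifnet_model {
    unsigned long opackets;
};

struct mbuf_model {
    struct ifnet_model *rcvif;  /* receive interface; NULL for output-path packets */
    size_t len;
};

enum fwd_result { FWD_OK, FWD_NO_RCVIF };

/* Models a forwarding routine that relies on the receive interface.
 * Without the NULL check, m->rcvif->opackets would dereference NULL for a
 * packet that never arrived on an interface. */
enum fwd_result
forward_model(struct mbuf_model *m)
{
    assert(m != NULL);
    if (m->rcvif == NULL)
        return FWD_NO_RCVIF;    /* illustrative guard only */
    m->rcvif->opackets++;
    return FWD_OK;
}
```

A packet with rcvif set is counted and forwarded, while one with rcvif == NULL is rejected instead of faulting. How the real fix handles the output path is described in the linked review, not here.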

Actions #11

Updated by Jim Pingle about 1 year ago

  • Status changed from New to Feedback
  • % Done changed from 0 to 100

A fix for this was merged into snapshots around the 17th. If possible, please upgrade to a current dev snapshot and see if you can reproduce the problem now.

I was able to induce a crash before, but not on current snapshots.

Actions #12

Updated by Bruno Dambrine about 1 year ago

Sorry, I have two questions.

1 - Can I install the latest snapshot of pfSense CE on my Netgate 6100, as I do with pfSense Plus?

2 - Will I be able to reload the config from pfSense Plus 22.05, or do I have to rebuild the config?

Actions #13

Updated by Jim Pingle about 1 year ago

Bruno Dambrine wrote in #note-12:

1 - Can I install the latest snapshot of pfSense CE on my Netgate 6100, as I do with pfSense Plus?

It may technically be possible to some extent, but it is not something I'd recommend; we do not test that or make sure it works overall. It would be missing support for various aspects of the system, though basic functionality may be there.

2 - Will I be able to reload the config from pfSense Plus 22.05, or do I have to rebuild the config?

For compatibility between versions, the important piece is the "config revision" -- you can always import a config that is the same or older revision to a newer version, but you can't go backward. See https://docs.netgate.com/pfsense/en/latest/releases/versions.html for a table with which versions have which config revisions.

Eventually we'll have 23.05 snapshots public but at the moment we're still working on things quite heavily so they aren't generally available yet.
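
The import rule above (same or older config revision is accepted, newer is not) can be sketched as a small comparison in C. This is only an illustration: representing a revision as a major/minor pair, and the names config_rev, config_rev_cmp() and config_can_import(), are assumptions made for the example. The actual revision values per release are in the table linked above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical representation of a config revision as major.minor. */
struct config_rev {
    int major;
    int minor;
};

/* Returns <0, 0 or >0 when a is older than, equal to, or newer than b. */
int
config_rev_cmp(struct config_rev a, struct config_rev b)
{
    if (a.major != b.major)
        return (a.major < b.major) ? -1 : 1;
    if (a.minor != b.minor)
        return (a.minor < b.minor) ? -1 : 1;
    return 0;
}

/* A saved config can be imported when its revision is the same as or older
 * than the revision used by the running version. */
bool
config_can_import(struct config_rev saved, struct config_rev running)
{
    assert(saved.major >= 0 && running.major >= 0);
    return config_rev_cmp(saved, running) <= 0;
}
```

Under this model, a config saved on an older release imports into a newer one, while a config saved on a newer release is refused by an older one.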

Actions #14

Updated by David Myers about 1 year ago

I'm not proficient with FreeBSD package management so this is probably a dumb question, but is there any way to drop a kernel with this fix onto an existing 23.01 system?

Actions #15

Updated by Bruno Dambrine about 1 year ago

Thanks for the explanation.

I will wait for a public snapshot of the 23.05.

Actions #16

Updated by Jim Pingle 12 months ago

  • Status changed from Feedback to Resolved

Actions #17

Updated by Bruno Dambrine 12 months ago

This evening I installed the latest beta of 23.05 on my 6100 and ran some tests.
No crash so far.

Thanks.

Actions #18

Updated by Jim Pingle 11 months ago

  • Related to Bug #14092: Kernel panic when PF passes a large/fragmented ICMP6 packet added

Actions #19

Updated by Jim Pingle 11 months ago

There are more details about this issue and specifics of how to easily reproduce it over on #14092 which is now public since we released 23.05 with the fix included.

This is the same root cause, though we didn't close it out as a duplicate since it was generating useful feedback.

Actions #20

Updated by Jim Pingle 10 months ago

  • Affected Version set to 2.7.0