Bug #14077
Closed
Kernel panic from incoming IPv6 connections
100%
Description
- With a default configuration, download the following torrent file https://download.rockylinux.org/pub/rocky/9/isos/x86_64/Rocky-9.1-x86_64-dvd.torrent (QNAP's Download Station was used in this case).
- The crash occurs at seemingly random points during the download; the download speed is 1 Gbps.
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 04
fault virtual address = 0x460
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80eb8606
stack pointer = 0x28:0xfffffe00107aa020
frame pointer = 0x28:0xfffffe00107aa020
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_0)
rdi: 0 rsi: 2 rdx: 1
rcx: 0 r8: 0 r9: 100000000000000
rax: 2 rbx: 0 rbp: fffffe00107aa020
r10: fffff8010f7de4f8 r11: 8 r12: fffffe00107aa088
r13: fffff8002ce71478 r14: 0 r15: fffff8002ce71400
trap number = 12
panic: page fault
cpuid = 0
time = 1677006198
KDB: enter: panic
db:1:pfs> bt
Tracing pid 0 tid 100007 td 0xfffffe0011f46720
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00107a9de0
vpanic() at vpanic+0x182/frame 0xfffffe00107a9e30
panic() at panic+0x43/frame 0xfffffe00107a9e90
trap_fatal() at trap_fatal+0x409/frame 0xfffffe00107a9ef0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00107a9f50
calltrap() at calltrap+0x8/frame 0xfffffe00107a9f50
--- trap 0xc, rip = 0xffffffff80eb8606, rsp = 0xfffffe00107aa020, rbp = 0xfffffe00107aa020 ---
if_inc_counter() at if_inc_counter+0x6/frame 0xfffffe00107aa020
looutput() at looutput+0x4f/frame 0xfffffe00107aa050
ip6_forward() at ip6_forward+0x888/frame 0xfffffe00107aa150
pf_refragment6() at pf_refragment6+0x164/frame 0xfffffe00107aa1a0
pf_test6() at pf_test6+0x1380/frame 0xfffffe00107aa310
pf_check6_out() at pf_check6_out+0x40/frame 0xfffffe00107aa340
pfil_mbuf_out() at pfil_mbuf_out+0x35/frame 0xfffffe00107aa370
ip6_output() at ip6_output+0x1204/frame 0xfffffe00107aa5b0
icmp6_reflect() at icmp6_reflect+0x2dd/frame 0xfffffe00107aa660
icmp6_error() at icmp6_error+0x37c/frame 0xfffffe00107aa6d0
pf_route6() at pf_route6+0x7ff/frame 0xfffffe00107aa7b0
pf_test6() at pf_test6+0xce3/frame 0xfffffe00107aa930
pf_check6_out() at pf_check6_out+0x40/frame 0xfffffe00107aa960
pfil_mbuf_out() at pfil_mbuf_out+0x35/frame 0xfffffe00107aa990
ip6_forward() at ip6_forward+0x3f4/frame 0xfffffe00107aaa90
ip6_input() at ip6_input+0x9a4/frame 0xfffffe00107aab70
netisr_dispatch_src() at netisr_dispatch_src+0x2a6/frame 0xfffffe00107aabc0
ether_demux() at ether_demux+0x144/frame 0xfffffe00107aabf0
ether_nh_input() at ether_nh_input+0x353/frame 0xfffffe00107aac50
netisr_dispatch_src() at netisr_dispatch_src+0xb9/frame 0xfffffe00107aaca0
ether_input() at ether_input+0x69/frame 0xfffffe00107aad00
iflib_rxeof() at iflib_rxeof+0xbdb/frame 0xfffffe00107aae00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe00107aae40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe00107aaec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc3/frame 0xfffffe00107aaef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe00107aaf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00107aaf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs> show registers
cs 0x20
ds 0x3b
es 0x3b
fs 0x13
gs 0x1b
ss 0x28
rax 0x12
rcx 0x1
rdx 0x3f8
rbx 0x100
rsp 0xfffffe00107a9de0
rbp 0xfffffe00107a9de0
rsi 0
rdi 0x4
r8 0xfefefefefefefeff
r9 0x8080808080808080
r10 0xfffffe00107a9cc0
r11 0xcedfc2df9afff59c
r12 0x400
r13 0xfffffe00107a9f60
r14 0xfffffe00107a9e70
r15 0xfffffe0011f46720
rip 0xffffffff80dd82f2 kdb_enter+0x32
rflags 0x82
kdb_enter+0x32: movq $0,0x27bd313(%rip)
db:1:pfs> show pcpu
cpuid = 0
dynamic pcpu = 0x126d800
curthread = 0xfffffe0011f46720: pid 0 tid 100007 critnest 1 "if_io_tqg_0"
curpcb = 0xfffffe0011f46c40
fpcurthread = none
idlethread = 0xfffffe0011f483a0: tid 100003 "idle: cpu0"
self = 0xffffffff84610000
curpmap = 0xffffffff83549750
tssp = 0xffffffff84610384
rsp0 = 0xfffffe00107ab000
kcr3 = 0xffffffffffffffff
ucr3 = 0xffffffffffffffff
scr3 = 0x0
gs32p = 0xffffffff84610404
ldt = 0xffffffff84610444
tss = 0xffffffff84610434
curvnet = 0xfffff800011d0900
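For anyone comparing this dump with their own, the frames between the trap marker and the driver entry point are the interesting part. The following is a minimal sketch, not a Netgate or FreeBSD tool: it assumes the ddb frame layout seen in the dump above ("name() at name+0xoff/frame 0xaddr"), and the helper names `parse_frames` and `frames_after_trap` are made up for illustration.

```python
import re

# One ddb frame, e.g. "ip6_forward() at ip6_forward+0x888/frame 0xfffffe00107aa150".
FRAME_RE = re.compile(r"(\w+)\(\) at \w+\+(0x[0-9a-f]+)/frame (0x[0-9a-f]+)")


def parse_frames(text):
    """Return (function, offset, frame_pointer) tuples in the order they appear."""
    return [
        (m.group(1), int(m.group(2), 16), int(m.group(3), 16))
        for m in FRAME_RE.finditer(text)
    ]


def frames_after_trap(text):
    """Return the function names that follow the page-fault marker (trap 0xc).

    These are the frames that were executing when the fault happened,
    innermost first. Returns an empty list if the marker is absent.
    """
    _, _, tail = text.partition("--- trap 0xc")
    return [name for name, _, _ in parse_frames(tail)]


# A short excerpt copied from the dump in this report.
sample = (
    "calltrap() at calltrap+0x8/frame 0xfffffe00107a9f50 "
    "--- trap 0xc, rip = 0xffffffff80eb8606, rsp = 0xfffffe00107aa020, "
    "rbp = 0xfffffe00107aa020 --- "
    "if_inc_counter() at if_inc_counter+0x6/frame 0xfffffe00107aa020 "
    "looutput() at looutput+0x4f/frame 0xfffffe00107aa050 "
    "ip6_forward() at ip6_forward+0x888/frame 0xfffffe00107aa150"
)

print(frames_after_trap(sample))
```

Because the regular expression scans the whole string, it works the same way whether the dump was pasted with or without line breaks.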
Related issues
Updated by Jim Pingle almost 2 years ago
There must be some other required component to replicate this. I've not seen a panic like this on the 6100 at my edge and I've been pushing every bit of my edge traffic through it consistently, a decent chunk of it going through IPv6 (with and without NPt).
I was able to download that entire Linux ISO torrent without error.
Updated by Bruno Dambrine almost 2 years ago
I have reinstalled the 6100 with 23.01 to make sure that the issue is not linked to the upgrade.
I got the same result, so I am back on 22.05.
I will create a boot environment where I will upgrade to 23.01 and reset it to factory defaults.
Then I will try to rebuild my configuration step by step and run some tests.
Maybe I will be able to find the parameters that cause the crash.
Updated by Paul Kennedy almost 2 years ago
Hi guys.
I think I posted the same issue on NON-Netgate hardware in main forum - https://forum.netgate.com/topic/178613/23-01-crashing-frequently-ipsec-connections-constantly-dropping-and-respawning-unable-to-access-http-over-vpn-address-constantly-times-out
It looks like the same panic error (I uploaded the dump files in that post; maybe it is related?)
Updated by Bruno Dambrine almost 2 years ago
Hi.
I have rebuilt the configuration and I may have some useful information.
First of all, some information on how I am connected to the Internet. I have an optical fiber line that terminates at the Internet Service Provider's box.
Behind the box there are a TV decoder and the Netgate 6100. All of my network is behind the Netgate 6100 (switches, PCs, NAS, ...).
While rebuilding my configuration, I found that the crash appears with the NAT rules.
I download torrents (Linux images, ...) with my QNAP NAS, using QNAP Download Station (software provided by QNAP and installed on the NAS).
Download Station uses ports 6881 to 6889 for incoming TCP connections (to speed up downloading and seeding; the exact ports depend on the torrent client, but this is common practice for torrent clients).
So I configured NAT on the ISP box to forward ports 6881 to 6889 to the Netgate 6100.
And I configured NAT on the Netgate 6100 to forward ports 6881 to 6889 to the NAS.
If I disable the two rules (IPv4 and IPv6) on the 6100, I don't get the crash.
If I keep them but disable the rules on the ISP box, I still don't get the crash.
So it seems that the crash is linked to incoming connections, not to the configuration itself.
In my configuration, I also have a NAT rule on port 443 for an HTTPS server in my network, but it does not cause any crash.
The main difference between the 443 NAT and the 6881-6889 NAT is the number of connections.
On the 443 NAT, I have only a few connections, but when I download torrents, I have a lot of connections on the 6881-6889 NAT.
My guess is that the issue is in the NAT: a few connections are OK, but a lot of connections lead to a crash.
So the downloads just create the conditions for the crash. The issue may appear in any situation where there are many connections through a NAT.
I hope this information helps solve the issue.
@Jim
Maybe during your tests you did not have any incoming connections (firewall, ...). As I said earlier, when I disable the NAT, everything is fine. It is just slower.
Updated by Jim Pingle almost 2 years ago
- Project changed from pfSense Plus to pfSense
- Subject changed from Kernel panic on 6100 to Kernel panic from incoming IPv6 connections
- Category changed from Operating System to Operating System
- Assignee set to Kristof Provost
- Target version set to 2.7.0
- Affected Plus Version deleted (23.01)
- Plus Target Version set to 23.05
This looks similar to another crash we have been able to reproduce, and we're still working on a fix. I suspect it's the same root cause based on the similarity of the backtrace in the crash dumps. It does seem to be tied to incoming packets, but not the type or volume. If it's the same as the other issue, it may be from incoming packets which are larger than the MTU on the link.
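As background on why oversized packets take a special path: IPv6 routers do not fragment packets in transit, so a packet that exceeds the outgoing link MTU is answered with an ICMPv6 "Packet Too Big" error instead of being forwarded. The sketch below only illustrates that general rule; the function name and return strings are hypothetical, and it makes no claim about what pf or the kernel actually does in this crash.

```python
IPV6_MIN_MTU = 1280  # minimum link MTU that IPv6 requires (RFC 8200)


def forward_decision(packet_len: int, out_mtu: int) -> str:
    """Say what a forwarding IPv6 router does with a packet of packet_len bytes.

    Hypothetical helper for illustration only.
    """
    if out_mtu < IPV6_MIN_MTU:
        raise ValueError("IPv6 links must support an MTU of at least 1280 bytes")
    if packet_len > out_mtu:
        # Routers never fragment IPv6 in transit; they report the MTU back
        # to the sender so that the sender can send smaller packets.
        return "icmp6-packet-too-big"
    return "forward"


print(forward_decision(1500, 1500))
print(forward_decision(1501, 1500))
```

The backtrace in the description does show icmp6_error() being called from pf_route6(), which would be consistent with an error reply being generated for a forwarded packet, but that reading is an inference and not something confirmed in this thread.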
Updated by Bruno Dambrine almost 2 years ago
Thank you for the information.
I got an unexpected crash, but I had forgotten that I have another NAT rule (the 443 NAT rule)...
So I am switching back to 22.05 while you work on a fix.
Updated by Kristof Provost almost 2 years ago
This issue isn't related to IPv4 NAT, so your NAT rules will not matter.
See #14092 as well, because this is almost certainly that issue. The fix is pending review upstream, and will likely land in snapshot builds later this week.
Updated by Flole Systems almost 2 years ago
#14092 is not public, so it's impossible to check what that one is about and what will trigger it.
Updated by Kristof Provost almost 2 years ago
Sorry, I missed that.
I believe I understand the issue. Briefly put, pf_refragment6() ends up calling ip6_forward() for traffic in the output (so not forwarding) path, and ip6_forward() assumes that m->m_pkthdr.rcvif is set, which is not the case for output traffic.
This fixes the panic: https://reviews.freebsd.org/D39061 (and subsequent reviews fix link-local functionality and add a test case).
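The failure mode described here can be modelled outside the kernel. The following is a simplified sketch in Python, not the kernel code and not the change in D39061: `Interface`, `Packet`, `forward_unchecked` and `forward_checked` are hypothetical stand-ins, with `rcvif` playing the role of m->m_pkthdr.rcvif, and the guard is shown only to make the bad assumption visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interface:
    name: str
    counter: int = 0  # stand-in for a per-interface statistics counter


@dataclass
class Packet:
    payload: bytes
    # Set when a packet is received on an interface; stays None for
    # locally generated (output path) traffic such as an ICMPv6 error.
    rcvif: Optional[Interface] = None


def forward_unchecked(pkt: Packet) -> None:
    """Model of a forwarding routine that assumes rcvif is always set."""
    # Fails with AttributeError when rcvif is None, the Python analogue
    # of dereferencing a NULL pointer in the kernel.
    pkt.rcvif.counter += 1


def forward_checked(pkt: Packet) -> bool:
    """Same routine with an explicit guard (illustration only)."""
    if pkt.rcvif is None:
        return False  # output-path packet: nothing to account against
    pkt.rcvif.counter += 1
    return True


received = Packet(b"data", rcvif=Interface("ix0"))
generated = Packet(b"icmp6 error")  # no receive interface

print(forward_checked(received))
print(forward_checked(generated))
```

Calling `forward_unchecked(generated)` raises an exception, which corresponds to the page fault at a small virtual address (0x460) with a zero pointer register in the dump.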
Updated by Jim Pingle over 1 year ago
- Status changed from New to Feedback
- % Done changed from 0 to 100
A fix for this was merged into snapshots around the 17th. If possible, please upgrade to a current dev snapshot and see if you can reproduce the problem now.
I was able to induce a crash before, but not on current snapshots.
Updated by Bruno Dambrine over 1 year ago
Sorry, I have two questions.
1 - Can I install the latest snapshot of pfSense CE on my Netgate 6100 as I do with pfSense+?
2 - Will I be able to reload the config from pfSense+ 22.05, or do I have to rebuild it?
Updated by Jim Pingle over 1 year ago
Bruno Dambrine wrote in #note-12:
1 - Can I install the latest snapshot of pfSense CE on my Netgate 6100 as I do with pfSense+?
It may technically be possible to some extent, but it is not something I'd recommend; we do not test that or make sure it works overall. It would be missing support for various aspects of the system, though basic functionality may be there.
2 - Will I be able to reload the config from pfSense+ 22.05, or do I have to rebuild it?
For compatibility between versions, the important piece is the "config revision" -- you can always import a config that is the same or older revision to a newer version, but you can't go backward. See https://docs.netgate.com/pfsense/en/latest/releases/versions.html for a table with which versions have which config revisions.
Eventually we'll have 23.05 snapshots public, but at the moment we're still working on things quite heavily, so they aren't generally available yet.
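The config revision rule above boils down to a single comparison. Here is a minimal sketch, assuming revisions can be compared as decimal numbers; `can_import` is a hypothetical helper, and the revision values in the example are placeholders, not entries from the versions table.

```python
from decimal import Decimal


def can_import(config_revision: str, system_revision: str) -> bool:
    """Return True if a config of config_revision can be imported.

    A config may be the same revision as the running system or older,
    but never newer. Assumes revisions compare as decimal numbers.
    """
    return Decimal(config_revision) <= Decimal(system_revision)


# Placeholder revisions for illustration only.
print(can_import("22.9", "23.3"))  # older config onto a newer system
print(can_import("23.3", "22.9"))  # newer config onto an older system
```

Check the versions table for the real revision of each release before relying on this for a downgrade plan.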
Updated by David Myers over 1 year ago
I'm not proficient with FreeBSD package management so this is probably a dumb question, but is there any way to drop a kernel with this fix onto an existing 23.01 system?
Updated by Bruno Dambrine over 1 year ago
Thanks for the explanation.
I will wait for a public snapshot of the 23.05.
Updated by Bruno Dambrine over 1 year ago
This evening I installed the latest beta of 23.05 on my 6100 and ran some tests.
No crash so far.
Thanks.
Updated by Jim Pingle over 1 year ago
- Related to Bug #14092: Kernel panic when PF passes a large/fragmented ICMP6 packet added
Updated by Jim Pingle over 1 year ago
There are more details about this issue, and specifics of how to easily reproduce it, over on #14092, which is now public since we released 23.05 with the fix included.
This is the same root cause, though we didn't close it out as a duplicate since it was generating useful feedback.