Bug #9058
closedKernel panic during L2TP retransmit
0%
Description
I'm using a Multilink L2TP WAN. After a fresh reinstall of 2.4.4 and completely new config (no import) it crashes regularly. The crashdump always has the same stacktrace showing a page fault while copying a packet for retransmitting after a timeout.
Fatal trap 12: page fault while in kernel mode ... current process = 13 (ng_queue1) ... Tracing pid 13 tid 100022 td 0xfffff80003551620 m_copypacket() at m_copypacket+0x16/frame 0xfffffe00797d3ab0 ng_l2tp_seq_rack_timeout() at ng_l2tp_seq_rack_timeout+0x164/frame 0xfffffe00797d3af0 ng_apply_item() at ng_apply_item+0x8f/frame 0xfffffe00797d3b70 ngthread() at ngthread+0x10a/frame 0xfffffe00797d3bb0 fork_exit() at fork_exit+0x83/frame 0xfffffe00797d3bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00797d3bf0
Files
Updated by Jim Pingle almost 6 years ago
I saw a crash with a backtrace like that once on a test VM with an L2TP WAN but only one time, not repeatedly, so I chalked it up to a random power or other VM-specific event.
Approximately how often is "regularly"? Have you noticed any pattern to when it happens? Is it after a certain amount of traffic, certain amount of time, or randomly?
Updated by Bianco Veigel almost 6 years ago
Right now it happens at least once a day, but at random times. I'll check if the amount of traffic might be related.
Updated by Bianco Veigel almost 6 years ago
After a few more crashes with different error messages, I ran a memory test, which showed errors. RAM is replaced and I will report back if the error occurs again, but right now I think this can be closed.
Updated by Jim Pingle almost 6 years ago
- Status changed from New to Feedback
OK, we'll wait for some more feedback here to see what happens.
Updated by Bianco Veigel almost 6 years ago
- Time (20:00 & 03:10)
- Uptime (56h & 31h)
- Packout count (10/9M & 7/5M)
- Byte count (4.1/2.5GB & 3.1/1.1GB)
The byte/s and packet/s also have much higher peeks between the crashes.
Updated by Jim Pingle almost 6 years ago
- Status changed from Feedback to New
OK, and is the backtrace in the crash report always the same?
I have not seen a recurrence of this on my local setup but it does not transmit much of anything over L2TP.
Updated by Bianco Veigel almost 6 years ago
yes it's always the same (except the hex addresses)
Tracing pid 13 tid 100022 td 0xfffff800039b4000 m_copypacket() at m_copypacket+0x16/frame 0xfffffe01195f7ab0 ng_l2tp_seq_rack_timeout() at ng_l2tp_seq_rack_timeout+0x164/frame 0xfffffe01195f7af0 ng_apply_item() at ng_apply_item+0x8f/frame 0xfffffe01195f7b70 ngthread() at ngthread+0x10a/frame 0xfffffe01195f7bb0 fork_exit() at fork_exit+0x83/frame 0xfffffe01195f7bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01195f7bf0
Updated by Bianco Veigel almost 6 years ago
This seems to be an upstream bug in FreeBSD mpd5 - today I got the same crash on my L2TP Server (FreeBSD 11.2-RELEASE-p2 / mpd5 5.8):
Fatal trap 12: page fault while in kernel mode cpuid = 2; apic id = 02 fault virtual address = 0x1c fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80b79f06 stack pointer = 0x28:0xfffffe0094ab8a60 frame pointer = 0x28:0xfffffe0094ab8aa0 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 619 (ng_queue2) trap number = 12 panic: page fault cpuid = 2 KDB: stack backtrace: #0 0xffffffff80b3d567 at kdb_backtrace+0x67 #1 0xffffffff80af6b07 at vpanic+0x177 #2 0xffffffff80af6983 at panic+0x43 #3 0xffffffff80f77fcf at trap_fatal+0x35f #4 0xffffffff80f78029 at trap_pfault+0x49 #5 0xffffffff80f777f7 at trap+0x2c7 #6 0xffffffff80f57dac at calltrap+0x8 #7 0xffffffff82835bf4 at ng_l2tp_seq_rack_timeout+0x164 #8 0xffffffff8282868d at ng_apply_item+0xcd #9 0xffffffff8282b084 at ngthread+0x1a4 #10 0xffffffff80aba083 at fork_exit+0x83 #11 0xffffffff80f58cce at fork_trampoline+0xe
Updated by Jim Pingle over 5 years ago
- Target version set to 2.5.0
- Affected Version changed from 2.4.4 to All
This is still happening on FreeBSD 12/pfSense 2.5.0. Same backtrace:
db:0:kdb.enter.default> show pcpu cpuid = 1 dynamic pcpu = 0xfffffe007db41c80 curthread = 0xfffff80004166000: pid 13 tid 100023 "ng_queue1" curpcb = 0xfffffe000c798cc0 fpcurthread = none idlethread = 0xfffff8000411d580: tid 100004 "idle: cpu1" curpmap = 0xffffffff82c884c8 tssp = 0xffffffff82db3988 commontssp = 0xffffffff82db3988 rsp0 = 0xfffffe000c798cc0 gs32p = 0xffffffff82dba5c0 ldt = 0xffffffff82dba600 tss = 0xffffffff82dba5f0 curvnet = 0xfffff8000406d540 db:0:kdb.enter.default> bt Tracing pid 13 tid 100023 td 0xfffff80004166000 m_copypacket() at m_copypacket+0x16/frame 0xfffffe000c798aa0 ng_l2tp_seq_rack_timeout() at ng_l2tp_seq_rack_timeout+0x1aa/frame 0xfffffe000c798ae0 ng_apply_item() at ng_apply_item+0x8f/frame 0xfffffe000c798b60 ngthread() at ngthread+0x12a/frame 0xfffffe000c798bb0 fork_exit() at fork_exit+0x83/frame 0xfffffe000c798bf0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000c798bf0
I have a few saved crash reports if needed. Seems to happen once every couple weeks for me, so it's difficult to know if it's solved.
Updated by Bianco Veigel almost 5 years ago
I've opened a bug at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=241133
Updated by Renato Botelho about 4 years ago
- Target version changed from 2.5.0 to Future
Waiting for a fix on FreeBSD side. When it happens we can target a pfSense release to add it
Updated by Renato Botelho about 4 years ago
A possible solution proposed by markj@ - https://reviews.freebsd.org/D26548
If this revision is accepted I'll import it into our tree and we can validate the fix
Updated by Bianco Veigel about 4 years ago
As far as I can tell this has been accepted upstream (https://svnweb.freebsd.org/changeset/base/366167). Can someone point me to a howto or tutorial, which explains, what I need to do to apply this fix to pfsense? I've hit this bug at least twice in the last three days.
I'm currently running pfSense 2.4.5-RELEASE-p1 (amd64).
Updated by Renato Botelho about 4 years ago
- Assignee set to Renato Botelho
- Target version changed from Future to 2.5.0
A fix was pushed on FreeBSD. I'm going to import it to pfSense
Updated by Renato Botelho about 4 years ago
- Status changed from New to Feedback
I've imported the fix to pfSense. Next round of snapshots with timestamp bigger than 20200928.0650 will contain the patch
Updated by Bianco Veigel about 4 years ago
- File info.0 info.0 added
- File textdump.tar.0 textdump.tar.0 added
I've updated to 2.5.0.a.20200928.1250 and got the same crash as before. I've attached the crashdump.
Is there anything else, I can do?
Updated by Renato Botelho about 4 years ago
- Status changed from Feedback to In Progress
Updated by Renato Botelho about 4 years ago
Bianco Veigel wrote:
I've updated to 2.5.0.a.20200928.1250 and got the same crash as before. I've attached the crashdump.
Is there anything else, I can do?
I've added INVARIANTS to kernel in order to collect more useful data from crash. Please try it again on 20200929-1250 image
Updated by Renato Botelho about 4 years ago
https://reviews.freebsd.org/D26586 was also imported to devel-12 branch. Next round of snapshots (1250) will have this applied
Updated by Bianco Veigel about 4 years ago
I'd like to test the new version, but there is still no 20200929-1250.
Updated by Renato Botelho about 4 years ago
Bianco Veigel wrote:
I'd like to test the new version, but there is still no 20200929-1250.
That build failed. It's now fixed and a new image will be available in ~3h
Updated by Renato Botelho about 4 years ago
Bianco Veigel wrote:
I'd like to test the new version, but there is still no 20200929-1250.
A new snapshot is available
Updated by Renato Botelho about 4 years ago
- Status changed from In Progress to Feedback
Updated by Bianco Veigel about 4 years ago
- File info.0 info.0 added
- File textdump.tar.0 textdump.tar.0 added
it crashed again with 2.5.0.a.20200930.0050
Updated by Mark Johnston about 4 years ago
Bianco Veigel wrote:
it crashed again with 2.5.0.a.20200930.0050
Thanks for your patience so far, it's very appreciated. I have another debugging patch that fixes a few other possible races and prints some extra info. I'll try to get it into a snapshot soon.
Updated by Bianco Veigel about 4 years ago
- File info.0 info.0 added
- File textdump.tar.0 textdump.tar.0 added
And I hit it again with 2.5.0.a.20201003.0050
Updated by Anonymous almost 4 years ago
- Status changed from Feedback to In Progress
Updated by Renato Botelho almost 4 years ago
- Assignee changed from Renato Botelho to Mark Johnston
Mark, it crashed again after Luiz pushed 2c7ab6a3c3f on devel-12 branch
Updated by Renato Botelho over 3 years ago
- Target version changed from 2.5.0 to CE-Next
It still happens in rare conditions and we didn't came up with a solution in time for 2.5.0
Updated by Bianco Veigel about 3 years ago
Is there anything I can do, to help you fix this? I'm still hitting this bug regularly (up to multiple times per day).
I'm currently running pfSense 2.5.0-RELEASE as l2tp client against FreeBSD 12.2-RELEASE-p7 as l2tp server.
I understand, that this is an unusual setup, so I'm willing to offer bounty of 500€ (either directly paid to the developer or as a donation to the FreeBSD Foundation).
Updated by Renato Botelho about 3 years ago
- Assignee changed from Mark Johnston to Mateusz Guzik
Bianco Veigel wrote in #note-31:
Is there anything I can do, to help you fix this? I'm still hitting this bug regularly (up to multiple times per day).
I'm currently running pfSense 2.5.0-RELEASE as l2tp client against FreeBSD 12.2-RELEASE-p7 as l2tp server.
I understand, that this is an unusual setup, so I'm willing to offer bounty of 500€ (either directly paid to the developer or as a donation to the FreeBSD Foundation).
I've imported another patch from Mateusz to add extra debug and it will be available on tomorrow's 2.6.0 snapshot. Please collect output in case it crashes again and send us so we can analyze and try to figure out why it's happening.
Updated by Bianco Veigel about 3 years ago
- File info.0 info.0 added
- File textdump.tar.0 textdump.tar.0 added
I've upgraded to 2.6.0-DEVELOPMENT built on Wed Aug 04 01:14:35 EDT 2021 and it crashed again. The crash dumps are attached.
Updated by Renato Botelho about 3 years ago
Bianco Veigel wrote in #note-33:
I've upgraded to 2.6.0-DEVELOPMENT built on Wed Aug 04 01:14:35 EDT 2021 and it crashed again. The crash dumps are attached.
Thanks!
A new patch with an attempt to fix it was pushed and will be available on tomorrow's 2.6.0 snapshot.
Updated by Mateusz Guzik about 3 years ago
Hi Bianco,
did you get the chance to test the fix?
If you check dmesg and see messages like these:
ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff800d22c4878 ng_l2tp_seq_rack_timeout; empty transmit queue on seq 0xfffff800d22c4878: ns 725 rack 725 active 2
then the problematic condition occurred.
Updated by Bianco Veigel about 3 years ago
I've updated to 2.6.0-DEVELOPMENT (amd64) built on Fri Aug 06 01:10:08 EDT 2021 this evening, and am waiting if it crashes again. I will report in the next 7 days, if it crashed again or not.
Updated by Bianco Veigel about 3 years ago
Mateusz Guzik wrote in #note-35:
Hi Bianco,
did you get the chance to test the fix?
If you check dmesg and see messages like these:
[...]then the problematic condition occurred.
Seems like the problemativ condition occured, but this time it didn't crash:
ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 ng_l2tp_seq_rack_timeout; empty transmit queue on seq 0xfffff8008bd22878: ns 1005 rack 1005 active 2ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
Updated by Bianco Veigel about 3 years ago
Now I'm getting the following message every minute:
2021 Aug 7 12:58:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 12:59:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 13:00:07 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 13:01:07 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 13:02:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 13:03:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 13:04:07 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 13:05:07 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 13:06:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878 2021 Aug 7 13:07:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
Updated by Mateusz Guzik about 3 years ago
All the messages will be removed soon. They are only there right now to confirm the problem condition is being exercised.
Updated by Bianco Veigel about 3 years ago
It was clear to me, that those messages are only for debugging - I was concerned, that the timer is still running for an hour.
PS: I've hit the condition multiple times again today, but my pfSense is still running and the L2TP connection was not interrupted.
Updated by Mateusz Guzik about 3 years ago
I removed debug printfs so updating to upcoming snapshot should stop the spam. I'll wait for your final confirmation that the bug indeed is fixed before altering the ticket.
Updated by Bianco Veigel about 3 years ago
It's been running stable for five days (which is twice the max uptime of the last 30 days), so I'll say it's fixed.
Please let me know, were to throw my money at.
Updated by Mateusz Guzik about 3 years ago
Glad to hear it is working for you.
The work was sponsored by Netgate, so I suggest you use the money to buy more products :)
Updated by Renato Botelho about 3 years ago
- Status changed from In Progress to Feedback
- Target version changed from CE-Next to 2.6.0
- Plus Target Version set to 21.09
Setting target version since it's now fixed. Thanks!
Updated by Jim Pingle about 3 years ago
- Subject changed from crash in l2tp retransmit to Kernel panic during L2TP retransmit
Updating subject for release notes.
Updated by Jim Pingle about 3 years ago
- Status changed from Feedback to Resolved
Marking resolved based on current issue feedback. For what it's worth, I have not seen a crash on the system where I usually get this crash as well, though it was very rare in my case.
Updated by Jim Pingle almost 3 years ago
- Plus Target Version changed from 21.09 to 22.01