Project

General

Profile

Actions

Bug #9058

closed

Kernel panic during L2TP retransmit

Added by Bianco Veigel almost 3 years ago. Updated 13 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
L2TP
Target version:
Start date:
10/23/2018
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
21.09
Release Notes:
Affected Version:
All
Affected Architecture:
amd64

Description

I'm using a Multilink L2TP WAN. After a fresh reinstall of 2.4.4 and completely new config (no import) it crashes regularly. The crashdump always has the same stacktrace showing a page fault while copying a packet for retransmitting after a timeout.

Fatal trap 12: page fault while in kernel mode
...
current process        = 13 (ng_queue1)
...
Tracing pid 13 tid 100022 td 0xfffff80003551620
m_copypacket() at m_copypacket+0x16/frame 0xfffffe00797d3ab0
ng_l2tp_seq_rack_timeout() at ng_l2tp_seq_rack_timeout+0x164/frame 0xfffffe00797d3af0
ng_apply_item() at ng_apply_item+0x8f/frame 0xfffffe00797d3b70
ngthread() at ngthread+0x10a/frame 0xfffffe00797d3bb0
fork_exit() at fork_exit+0x83/frame 0xfffffe00797d3bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00797d3bf0

Files

info.0 (539 Bytes) info.0 Bianco Veigel, 10/23/2018 06:38 AM
textdump.tar.0 (101 KB) textdump.tar.0 Bianco Veigel, 10/23/2018 06:38 AM
info.0 (432 Bytes) info.0 Bianco Veigel, 09/29/2020 05:23 AM
textdump.tar.0 (154 KB) textdump.tar.0 Bianco Veigel, 09/29/2020 05:23 AM
info.0 (431 Bytes) info.0 Bianco Veigel, 10/02/2020 03:56 AM
textdump.tar.0 (154 KB) textdump.tar.0 Bianco Veigel, 10/02/2020 03:56 AM
info.0 (427 Bytes) info.0 Bianco Veigel, 10/03/2020 06:03 PM
textdump.tar.0 (154 KB) textdump.tar.0 Bianco Veigel, 10/03/2020 06:03 PM
info.0 (435 Bytes) info.0 Bianco Veigel, 08/05/2021 02:31 AM
textdump.tar.0 (102 KB) textdump.tar.0 Bianco Veigel, 08/05/2021 02:31 AM
Actions #1

Updated by Jim Thompson almost 3 years ago

  • Priority changed from High to Normal
Actions #2

Updated by Jim Pingle almost 3 years ago

I saw a crash with a backtrace like that once on a test VM with an L2TP WAN but only one time, not repeatedly, so I chalked it up to a random power or other VM-specific event.

Approximately how often is "regularly"? Have you noticed any pattern to when it happens? Is it after a certain amount of traffic, certain amount of time, or randomly?

Actions #3

Updated by Bianco Veigel almost 3 years ago

Right now it happens at least once a day, but at random times. I'll check if the amount of traffic might be related.

Actions #4

Updated by Bianco Veigel almost 3 years ago

After a few more crashes with different error messages, I ran a memory test, which showed errors. RAM is replaced and I will report back if the error occurs again, but right now I think this can be closed.

Actions #5

Updated by Jim Pingle almost 3 years ago

  • Status changed from New to Feedback

OK, we'll wait for some more feedback here to see what happens.

Actions #6

Updated by Bianco Veigel almost 3 years ago

Thanks for waiting. My pfsense crashed two times in the last two days. From the monitoring (telegraf, 300s interval) it seems not related to
  • Time (20:00 & 03:10)
  • Uptime (56h & 31h)
  • Packout count (10/9M & 7/5M)
  • Byte count (4.1/2.5GB & 3.1/1.1GB)

The byte/s and packet/s also have much higher peeks between the crashes.

Actions #7

Updated by Jim Pingle almost 3 years ago

  • Status changed from Feedback to New

OK, and is the backtrace in the crash report always the same?

I have not seen a recurrence of this on my local setup but it does not transmit much of anything over L2TP.

Actions #8

Updated by Bianco Veigel almost 3 years ago

yes it's always the same (except the hex addresses)

Tracing pid 13 tid 100022 td 0xfffff800039b4000
m_copypacket() at m_copypacket+0x16/frame 0xfffffe01195f7ab0
ng_l2tp_seq_rack_timeout() at ng_l2tp_seq_rack_timeout+0x164/frame 0xfffffe01195f7af0
ng_apply_item() at ng_apply_item+0x8f/frame 0xfffffe01195f7b70
ngthread() at ngthread+0x10a/frame 0xfffffe01195f7bb0
fork_exit() at fork_exit+0x83/frame 0xfffffe01195f7bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01195f7bf0

Actions #9

Updated by Bianco Veigel almost 3 years ago

This seems to be an upstream bug in FreeBSD mpd5 - today I got the same crash on my L2TP Server (FreeBSD 11.2-RELEASE-p2 / mpd5 5.8):

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address      = 0x1c
fault code         = supervisor read data, page not present
instruction pointer        = 0x20:0xffffffff80b79f06
stack pointer              = 0x28:0xfffffe0094ab8a60
frame pointer              = 0x28:0xfffffe0094ab8aa0
code segment               = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags   = interrupt enabled, resume, IOPL = 0
current process            = 619 (ng_queue2)
trap number                = 12
panic: page fault
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff80b3d567 at kdb_backtrace+0x67
#1 0xffffffff80af6b07 at vpanic+0x177
#2 0xffffffff80af6983 at panic+0x43
#3 0xffffffff80f77fcf at trap_fatal+0x35f
#4 0xffffffff80f78029 at trap_pfault+0x49
#5 0xffffffff80f777f7 at trap+0x2c7
#6 0xffffffff80f57dac at calltrap+0x8
#7 0xffffffff82835bf4 at ng_l2tp_seq_rack_timeout+0x164
#8 0xffffffff8282868d at ng_apply_item+0xcd
#9 0xffffffff8282b084 at ngthread+0x1a4
#10 0xffffffff80aba083 at fork_exit+0x83
#11 0xffffffff80f58cce at fork_trampoline+0xe

Actions #10

Updated by Jim Pingle over 2 years ago

  • Target version set to 2.5.0
  • Affected Version changed from 2.4.4 to All

This is still happening on FreeBSD 12/pfSense 2.5.0. Same backtrace:

db:0:kdb.enter.default>  show pcpu
cpuid        = 1
dynamic pcpu = 0xfffffe007db41c80
curthread    = 0xfffff80004166000: pid 13 tid 100023 "ng_queue1" 
curpcb       = 0xfffffe000c798cc0
fpcurthread  = none
idlethread   = 0xfffff8000411d580: tid 100004 "idle: cpu1" 
curpmap      = 0xffffffff82c884c8
tssp         = 0xffffffff82db3988
commontssp   = 0xffffffff82db3988
rsp0         = 0xfffffe000c798cc0
gs32p        = 0xffffffff82dba5c0
ldt          = 0xffffffff82dba600
tss          = 0xffffffff82dba5f0
curvnet      = 0xfffff8000406d540
db:0:kdb.enter.default>  bt
Tracing pid 13 tid 100023 td 0xfffff80004166000
m_copypacket() at m_copypacket+0x16/frame 0xfffffe000c798aa0
ng_l2tp_seq_rack_timeout() at ng_l2tp_seq_rack_timeout+0x1aa/frame 0xfffffe000c798ae0
ng_apply_item() at ng_apply_item+0x8f/frame 0xfffffe000c798b60
ngthread() at ngthread+0x12a/frame 0xfffffe000c798bb0
fork_exit() at fork_exit+0x83/frame 0xfffffe000c798bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000c798bf0

I have a few saved crash reports if needed. Seems to happen once every couple weeks for me, so it's difficult to know if it's solved.

Actions #12

Updated by Renato Botelho 12 months ago

  • Target version changed from 2.5.0 to Future

Waiting for a fix on FreeBSD side. When it happens we can target a pfSense release to add it

Actions #13

Updated by Renato Botelho 12 months ago

A possible solution proposed by markj@ - https://reviews.freebsd.org/D26548

If this revision is accepted I'll import it into our tree and we can validate the fix

Actions #14

Updated by Bianco Veigel 12 months ago

As far as I can tell this has been accepted upstream (https://svnweb.freebsd.org/changeset/base/366167). Can someone point me to a howto or tutorial, which explains, what I need to do to apply this fix to pfsense? I've hit this bug at least twice in the last three days.

I'm currently running pfSense 2.4.5-RELEASE-p1 (amd64).

Actions #15

Updated by Renato Botelho 12 months ago

  • Assignee set to Renato Botelho
  • Target version changed from Future to 2.5.0

A fix was pushed on FreeBSD. I'm going to import it to pfSense

Actions #16

Updated by Renato Botelho 12 months ago

  • Status changed from New to Feedback

I've imported the fix to pfSense. Next round of snapshots with timestamp bigger than 20200928.0650 will contain the patch

Actions #17

Updated by Bianco Veigel 12 months ago

I've updated to 2.5.0.a.20200928.1250 and got the same crash as before. I've attached the crashdump.

Is there anything else, I can do?

Actions #18

Updated by Renato Botelho 12 months ago

  • Status changed from Feedback to In Progress
Actions #19

Updated by Renato Botelho 12 months ago

Bianco Veigel wrote:

I've updated to 2.5.0.a.20200928.1250 and got the same crash as before. I've attached the crashdump.

Is there anything else, I can do?

I've added INVARIANTS to kernel in order to collect more useful data from crash. Please try it again on 20200929-1250 image

Actions #20

Updated by Renato Botelho 12 months ago

https://reviews.freebsd.org/D26586 was also imported to devel-12 branch. Next round of snapshots (1250) will have this applied

Actions #21

Updated by Bianco Veigel 12 months ago

I'd like to test the new version, but there is still no 20200929-1250.

Actions #22

Updated by Renato Botelho 12 months ago

Bianco Veigel wrote:

I'd like to test the new version, but there is still no 20200929-1250.

That build failed. It's now fixed and a new image will be available in ~3h

Actions #23

Updated by Renato Botelho 12 months ago

Bianco Veigel wrote:

I'd like to test the new version, but there is still no 20200929-1250.

A new snapshot is available

Actions #24

Updated by Renato Botelho 12 months ago

  • Status changed from In Progress to Feedback
Actions #25

Updated by Bianco Veigel 12 months ago

it crashed again with 2.5.0.a.20200930.0050

Actions #26

Updated by Mark Johnston 12 months ago

Bianco Veigel wrote:

it crashed again with 2.5.0.a.20200930.0050

Thanks for your patience so far, it's very appreciated. I have another debugging patch that fixes a few other possible races and prints some extra info. I'll try to get it into a snapshot soon.

Actions #27

Updated by Bianco Veigel 12 months ago

And I hit it again with 2.5.0.a.20201003.0050

Actions #28

Updated by Steve Beaver 12 months ago

  • Status changed from Feedback to In Progress
Actions #29

Updated by Renato Botelho 11 months ago

  • Assignee changed from Renato Botelho to Mark Johnston

Mark, it crashed again after Luiz pushed 2c7ab6a3c3f on devel-12 branch

Actions #30

Updated by Renato Botelho 8 months ago

  • Target version changed from 2.5.0 to CE-Next

It still happens in rare conditions and we didn't came up with a solution in time for 2.5.0

Actions #31

Updated by Bianco Veigel about 2 months ago

Is there anything I can do, to help you fix this? I'm still hitting this bug regularly (up to multiple times per day).

I'm currently running pfSense 2.5.0-RELEASE as l2tp client against FreeBSD 12.2-RELEASE-p7 as l2tp server.

I understand, that this is an unusual setup, so I'm willing to offer bounty of 500€ (either directly paid to the developer or as a donation to the FreeBSD Foundation).

Actions #32

Updated by Renato Botelho about 2 months ago

  • Assignee changed from Mark Johnston to Mateusz Guzik

Bianco Veigel wrote in #note-31:

Is there anything I can do, to help you fix this? I'm still hitting this bug regularly (up to multiple times per day).

I'm currently running pfSense 2.5.0-RELEASE as l2tp client against FreeBSD 12.2-RELEASE-p7 as l2tp server.

I understand, that this is an unusual setup, so I'm willing to offer bounty of 500€ (either directly paid to the developer or as a donation to the FreeBSD Foundation).

I've imported another patch from Mateusz to add extra debug and it will be available on tomorrow's 2.6.0 snapshot. Please collect output in case it crashes again and send us so we can analyze and try to figure out why it's happening.

Actions #33

Updated by Bianco Veigel about 2 months ago

I've upgraded to 2.6.0-DEVELOPMENT built on Wed Aug 04 01:14:35 EDT 2021 and it crashed again. The crash dumps are attached.

Actions #34

Updated by Renato Botelho about 2 months ago

Bianco Veigel wrote in #note-33:

I've upgraded to 2.6.0-DEVELOPMENT built on Wed Aug 04 01:14:35 EDT 2021 and it crashed again. The crash dumps are attached.

Thanks!

A new patch with an attempt to fix it was pushed and will be available on tomorrow's 2.6.0 snapshot.

Actions #35

Updated by Mateusz Guzik about 2 months ago

Hi Bianco,

did you get the chance to test the fix?

If you check dmesg and see messages like these:

ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff800d22c4878
ng_l2tp_seq_rack_timeout; empty transmit queue on seq 0xfffff800d22c4878: ns 725 rack 725 active 2

then the problematic condition occurred.

Actions #36

Updated by Bianco Veigel about 2 months ago

I've updated to 2.6.0-DEVELOPMENT (amd64) built on Fri Aug 06 01:10:08 EDT 2021 this evening, and am waiting if it crashes again. I will report in the next 7 days, if it crashed again or not.

Actions #37

Updated by Mateusz Guzik about 2 months ago

Sounds good, thanks for the update!

Actions #38

Updated by Bianco Veigel about 2 months ago

Mateusz Guzik wrote in #note-35:

Hi Bianco,

did you get the chance to test the fix?

If you check dmesg and see messages like these:
[...]

then the problematic condition occurred.

Seems like the problemativ condition occured, but this time it didn't crash:

ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
ng_l2tp_seq_rack_timeout; empty transmit queue on seq 0xfffff8008bd22878: ns 1005 rack 1005 active 2ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878

Actions #39

Updated by Bianco Veigel about 2 months ago

Now I'm getting the following message every minute:

2021 Aug  7 12:58:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 12:59:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 13:00:07 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 13:01:07 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 13:02:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 13:03:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 13:04:07 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 13:05:07 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 13:06:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878
2021 Aug  7 13:07:06 tony.bln.wg1337.de ng_l2tp_seq_recv_nr: rack_timer still active on seq 0xfffff8008bd22878

Actions #40

Updated by Mateusz Guzik about 2 months ago

All the messages will be removed soon. They are only there right now to confirm the problem condition is being exercised.

Actions #41

Updated by Bianco Veigel about 2 months ago

It was clear to me, that those messages are only for debugging - I was concerned, that the timer is still running for an hour.

PS: I've hit the condition multiple times again today, but my pfSense is still running and the L2TP connection was not interrupted.

Actions #42

Updated by Mateusz Guzik about 1 month ago

I removed debug printfs so updating to upcoming snapshot should stop the spam. I'll wait for your final confirmation that the bug indeed is fixed before altering the ticket.

Actions #43

Updated by Bianco Veigel about 1 month ago

It's been running stable for five days (which is twice the max uptime of the last 30 days), so I'll say it's fixed.
Please let me know, were to throw my money at.

Actions #44

Updated by Mateusz Guzik about 1 month ago

Glad to hear it is working for you.

The work was sponsored by Netgate, so I suggest you use the money to buy more products :)

Actions #45

Updated by Renato Botelho about 1 month ago

  • Status changed from In Progress to Feedback
  • Target version changed from CE-Next to 2.6.0
  • Plus Target Version set to 21.09

Setting target version since it's now fixed. Thanks!

Actions #46

Updated by Jim Pingle 28 days ago

  • Subject changed from crash in l2tp retransmit to Kernel panic during L2TP retransmit

Updating subject for release notes.

Actions #47

Updated by Jim Pingle 13 days ago

  • Status changed from Feedback to Resolved

Marking resolved based on current issue feedback. For what it's worth, I have not seen a crash on the system where I usually get this crash as well, though it was very rare in my case.

Actions

Also available in: Atom PDF