Bug #15684: Panic in ``tcp_m_copym`` with selective ACK enabled - pfSense - pfSense bugtracker

Bug #15684

closed

Panic in ``tcp_m_copym`` with selective ACK enabled

Added by Steve Wheeler almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: Kristof Provost
Category: Operating System
Target version: 2.8.0
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 24.11
Release Notes: Force Exclusion
Affected Version: 2.7.2
Affected Architecture: All

Description

In some situations pfSense panics with:

db:1:pfs> bt
Tracing pid 2 tid 100112 td 0xfffff8000182f000
kdb_enter() at kdb_enter+0x33/frame 0xfffffe0084fe38f0
panic() at panic+0x43/frame 0xfffffe0084fe3950
trap_fatal() at trap_fatal+0x40f/frame 0xfffffe0084fe39b0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0084fe3a10
calltrap() at calltrap+0x8/frame 0xfffffe0084fe3a10
--- trap 0xc, rip = 0xffffffff80f246e2, rsp = 0xfffffe0084fe3ae0, rbp = 0xfffffe0084fe3b70 ---
tcp_m_copym() at tcp_m_copym+0x62/frame 0xfffffe0084fe3b70
tcp_default_output() at tcp_default_output+0x1294/frame 0xfffffe0084fe3d60
tcp_timer_rexmt() at tcp_timer_rexmt+0x53c/frame 0xfffffe0084fe3dc0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe0084fe3e00
softclock_call_cc() at softclock_call_cc+0x12e/frame 0xfffffe0084fe3ec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe0084fe3ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0084fe3f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0084fe3f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

db:1:pfs>  show registers
cs                        0x20
ds                        0x3b
es                        0x3b
fs                        0x13
gs                        0x1b
ss                        0x28
rax                       0x12
rcx         0xffffffff8141f825
rdx                      0x3f8
rbx                      0x100
rsp         0xfffffe0084fe37c8
rbp         0xfffffe0084fe38f0
rsi                        0xa
rdi         0xffffffff82d509d0  gdb_consdev
r8                           0
r9          0xfffffe0084fe3400
r10                       0x64
r11                          0
r12                          0
r13                          0
r14         0xffffffff8142fefb
r15         0xfffff8000182f000
rip         0xffffffff80d3f4c3  kdb_enter+0x33
rflags                    0x82
kdb_enter+0x33: movq    $0,0x235af42(%rip)

db:1:pfs>  show pcpu
cpuid        = 15
dynamic pcpu = 0xfffffe008f09ff40
curthread    = 0xfffff8000182f000: pid 2 tid 100112 critnest 1 "clock (15)" 
curpcb       = 0xfffff8000182f520
fpcurthread  = none
idlethread   = 0xfffff80001798000: tid 100018 "idle: cpu15" 
self         = 0xffffffff8401f000
curpmap      = 0xffffffff8303e6b0
tssp         = 0xffffffff8401f384
rsp0         = 0xfffffe0084fe4000
kcr3         = 0x800000007044b002
ucr3         = 0xffffffffffffffff
scr3         = 0x13e07cc78
gs32p        = 0xffffffff8401f404
ldt          = 0xffffffff8401f444
tss          = 0xffffffff8401f434
curvnet      = 0xfffff800012791c0

Fatal trap 12: page fault while in kernel mode
cpuid = 15; apic id = 0f
fault virtual address    = 0x1c
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff80f246e2
stack pointer            = 0x28:0xfffffe0084fe3ae0
frame pointer            = 0x28:0xfffffe0084fe3b70
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 2 (clock (15))
rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe0084fe3cf8
rcx: 0000000000000000  r8: 00000000000004f4  r9: 0000000000000000
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe0084fe3b70
r10: 0000000000001388 r11: 00000000940e1ad0 r12: 0000000000000000
r13: 00000000000004f4 r14: fffff801fdab9000 r15: 0000000000000028
trap number        = 12
panic: page fault
cpuid = 15
time = 1723446922
KDB: enter: panic

This appears to be something trying to access an mbuf after it has been freed, likely triggered by an interface or routing change.

Updated by Steve Wheeler almost 2 years ago

Hitting this in 24.03

Updated by Lev Prokofev almost 2 years ago

Customer hit this issue, ticket for reference #3053406835

Updated by Steve Wheeler almost 2 years ago

To move forward we need a full core dump from a system hitting the bug. If anyone can set up their system to provide that, please reach out to me.

Updated by Christian Bönning almost 2 years ago

Our Netgate 1537 crashed earlier today. In `/var/crash` however there's only `bounds`, `info.0` as well as `textdump.tar.0` which I provided to you through Nextcloud.

~~If you can give me a hint where a full core dump would be located or what to do to produce them I can get them to you.~~

Found instructions in a forum post (https://forum.netgate.com/topic/188861/24-03-crashing-again/19) and adjusted `/etc/pfSense-ddb.conf`. I cannot currently reboot the instance without causing more outages than "needed", but since it crashed again just a couple of minutes after the first crash, I'm sure it won't take too long until it reboots on its own (though it ran for a couple of days without issues before those 2 occurrences).

If the instance remains stable throughout the day, I'll manually reboot it today at EOB (around 8pm CEST).

Updated by Christian Bönning almost 2 years ago

Minutes after rebooting the secondary unit (another Netgate 1537) to enable "full core dump mode", the primary unit crashed again.

With that, it's now running with an adjusted `pfSense-ddb.conf`.

Updated by Christian Bönning almost 2 years ago

We have a `vmcore` produced with a crash which occurred earlier today. Can you share a Nextcloud Link so I can provide it to you?

Updated by Steve Wheeler almost 2 years ago

Excellent. Here we go:
https://nc.netgate.com/nextcloud/s/k6CLjPKRKKaPt5C

Updated by Christian Bönning almost 2 years ago

Upload completed on the 2nd attempt.

sha1sum of the uploaded file should be the following:
bfe8b2f2cccb7823fcb4b775821fe42104754c34 vmcore.3

Updated by Steve Wheeler almost 2 years ago

Hmm, not seeing it in Nextcloud on this side. How did it fail the first time? What size is it?

#10

Updated by Christian Bönning almost 2 years ago

It failed because the WAN connection I was using switched over during the upload.

I uploaded it again as a gzipped version (179848383 bytes) which uncompresses into 928342016 bytes.

#11

Updated by Steve Wheeler almost 2 years ago

Great, we have that and it looks promising.

#12

Updated by Kristof Provost almost 2 years ago

The core dump confirms what I suspected from the initial report, in that tcp_m_copym() got called with a NULL mbuf. That's returned by sbsndptr_noadv(). It returns NULL because so->so_snd->sb_mb is NULL. That's not supposed to happen, as there's an explicit assertion for that.
That in turn would suggest we're not supposed to be in this specific code path in these circumstances.

It's not yet clear to me how that can happen, but it may have something to do with selective-acks.
While I dig further, it'd be interesting to know if disabling SACK support avoids the crash. Use `sysctl net.inet.tcp.sack.enable=0` (either on the console or via the System Tunables page).
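
Editor's illustration: the trap frame in the description shows `fault virtual address = 0x1c` with `rdi` equal to zero, which is consistent with reading a structure field through a NULL pointer, as diagnosed above. The sketch below only illustrates that arithmetic. `struct toy_mbuf` and `toy_flags_fault_address()` are hypothetical names, and the layout is an assumption chosen so that a field lands at offset 0x1c on an LP64 platform such as amd64; it is not claimed to match the real `struct mbuf`.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified buffer header. This is NOT the real FreeBSD
 * struct mbuf; the fields are chosen only so that one of them sits at
 * offset 0x1c when pointers are 8 bytes wide (LP64, e.g. amd64).
 */
struct toy_mbuf {
    struct toy_mbuf *next;     /* offset 0x00 */
    struct toy_mbuf *nextpkt;  /* offset 0x08 */
    char            *data;     /* offset 0x10 */
    int32_t          len;      /* offset 0x18 */
    uint32_t         flags;    /* offset 0x1c */
};

/*
 * Address the CPU would try to read for `m->flags`: the base pointer plus
 * the field offset. Nothing is dereferenced here, so passing NULL is safe.
 */
uintptr_t
toy_flags_fault_address(const struct toy_mbuf *m)
{
    return (uintptr_t)m + offsetof(struct toy_mbuf, flags);
}
```

Under these assumptions, calling `toy_flags_fault_address(NULL)` yields a small non-zero address equal to the field offset, which is why a NULL dereference often shows up as a fault at an address like 0x1c rather than at 0x0.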

#13

Updated by Christian Bönning almost 2 years ago

I have set `net.inet.tcp.sack.enable=0` through System Tunables on both units and will report back if the crash occurs again (which might take anywhere from hours to days).

#14

Updated by Kristof Provost almost 2 years ago

I think I know what's happening here. I'm only 95% sure, but it matches all observations.

It's an issue that's known upstream (although by a different headline, because it was reported on an INVARIANTS kernel so it ran into an assertion rather than the NULL dereference).

Essentially the code failed to clean up the selective ack state when the socket was closed, so the selective ack information showed there was data to retransmit, but this data had already been freed, leading to the panic.
The fix is in https://cgit.freebsd.org/src/commit/?id=3eeb22cb819409b49296ecb0acbd453671168313 which is already part of 24.08. The bug is https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276761. It shows essentially the same backtrace (although it panics in sbsndptr_noadv(), because assertions are enabled), for the same reason (i.e. so->so_snd->sb_mb is NULL).

I'm also relatively confident that the sysctl listed above will prevent the issue as well, which can serve as a workaround until 24.08 releases.
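
Editor's illustration: the failure mode described in this comment can be modeled in a few lines. This is a toy model under stated assumptions, not the FreeBSD code and not the change made in the linked commit. Every name here (`struct toy_conn`, `toy_open`, `toy_close_buggy`, `toy_close_fixed`, `toy_rexmit_would_fault`) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy connection state. Hypothetical; not FreeBSD's tcpcb or sockbuf. */
struct toy_conn {
    char  *sndbuf;       /* queued send data, NULL once freed */
    size_t sack_rexmit;  /* bytes the SACK bookkeeping still wants resent */
};

/* Create a connection with `len` bytes queued and some SACK retransmit state. */
struct toy_conn
toy_open(size_t len, size_t sack_rexmit)
{
    struct toy_conn c;

    c.sndbuf = calloc(len ? len : 1, 1);
    /* If allocation failed there is nothing to retransmit. */
    c.sack_rexmit = c.sndbuf ? sack_rexmit : 0;
    return c;
}

/* Buggy close: frees the send data but leaves stale SACK state behind. */
void
toy_close_buggy(struct toy_conn *c)
{
    free(c->sndbuf);
    c->sndbuf = NULL;
    /* c->sack_rexmit is deliberately left untouched to model the bug. */
}

/* Fixed close: also resets the SACK state so nothing looks retransmittable. */
void
toy_close_fixed(struct toy_conn *c)
{
    toy_close_buggy(c);
    c->sack_rexmit = 0;
}

/*
 * What a retransmit timer would run into: true means the SACK state asks
 * for data that no longer exists, i.e. a copy from a NULL buffer.
 */
bool
toy_rexmit_would_fault(const struct toy_conn *c)
{
    return c->sack_rexmit > 0 && c->sndbuf == NULL;
}
```

In this model, `toy_rexmit_would_fault()` returns true after `toy_close_buggy()` and false after `toy_close_fixed()`. Disabling SACK corresponds to `sack_rexmit` always being zero, which is why the sysctl is expected to work as a workaround.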

#15

Updated by Jim Pingle almost 2 years ago

Subject changed from tcp_m_copym panic to Panic in ``tcp_m_copym`` with selective ACK enabled
Status changed from Confirmed to Feedback
Assignee set to Kristof Provost
Target version set to 2.8.0
% Done changed from 0 to 100
Plus Target Version set to 24.08