Bug #15684
closed
Panic in ``tcp_m_copym`` with selective ACK enabled
100%
Description
In some situations pfSense panics with:
db:1:pfs> bt
Tracing pid 2 tid 100112 td 0xfffff8000182f000
kdb_enter() at kdb_enter+0x33/frame 0xfffffe0084fe38f0
panic() at panic+0x43/frame 0xfffffe0084fe3950
trap_fatal() at trap_fatal+0x40f/frame 0xfffffe0084fe39b0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe0084fe3a10
calltrap() at calltrap+0x8/frame 0xfffffe0084fe3a10
--- trap 0xc, rip = 0xffffffff80f246e2, rsp = 0xfffffe0084fe3ae0, rbp = 0xfffffe0084fe3b70 ---
tcp_m_copym() at tcp_m_copym+0x62/frame 0xfffffe0084fe3b70
tcp_default_output() at tcp_default_output+0x1294/frame 0xfffffe0084fe3d60
tcp_timer_rexmt() at tcp_timer_rexmt+0x53c/frame 0xfffffe0084fe3dc0
tcp_timer_enter() at tcp_timer_enter+0x101/frame 0xfffffe0084fe3e00
softclock_call_cc() at softclock_call_cc+0x12e/frame 0xfffffe0084fe3ec0
softclock_thread() at softclock_thread+0xe9/frame 0xfffffe0084fe3ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0084fe3f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0084fe3f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
db:1:pfs> show registers
cs 0x20
ds 0x3b
es 0x3b
fs 0x13
gs 0x1b
ss 0x28
rax 0x12
rcx 0xffffffff8141f825
rdx 0x3f8
rbx 0x100
rsp 0xfffffe0084fe37c8
rbp 0xfffffe0084fe38f0
rsi 0xa
rdi 0xffffffff82d509d0 gdb_consdev
r8 0
r9 0xfffffe0084fe3400
r10 0x64
r11 0
r12 0
r13 0
r14 0xffffffff8142fefb
r15 0xfffff8000182f000
rip 0xffffffff80d3f4c3 kdb_enter+0x33
rflags 0x82
kdb_enter+0x33: movq $0,0x235af42(%rip)
db:1:pfs> show pcpu
cpuid = 15
dynamic pcpu = 0xfffffe008f09ff40
curthread = 0xfffff8000182f000: pid 2 tid 100112 critnest 1 "clock (15)"
curpcb = 0xfffff8000182f520
fpcurthread = none
idlethread = 0xfffff80001798000: tid 100018 "idle: cpu15"
self = 0xffffffff8401f000
curpmap = 0xffffffff8303e6b0
tssp = 0xffffffff8401f384
rsp0 = 0xfffffe0084fe4000
kcr3 = 0x800000007044b002
ucr3 = 0xffffffffffffffff
scr3 = 0x13e07cc78
gs32p = 0xffffffff8401f404
ldt = 0xffffffff8401f444
tss = 0xffffffff8401f434
curvnet = 0xfffff800012791c0
Fatal trap 12: page fault while in kernel mode
cpuid = 15; apic id = 0f
fault virtual address = 0x1c
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80f246e2
stack pointer = 0x28:0xfffffe0084fe3ae0
frame pointer = 0x28:0xfffffe0084fe3b70
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2 (clock (15))
rdi: 0000000000000000 rsi: 0000000000000000 rdx: fffffe0084fe3cf8
rcx: 0000000000000000 r8: 00000000000004f4 r9: 0000000000000000
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe0084fe3b70
r10: 0000000000001388 r11: 00000000940e1ad0 r12: 0000000000000000
r13: 00000000000004f4 r14: fffff801fdab9000 r15: 0000000000000028
trap number = 12
panic: page fault
cpuid = 15
time = 1723446922
KDB: enter: panic
This appears to be something trying to access an mbuf after it has been freed, likely as a result of an interface or routing change.
Updated by Lev Prokofev 3 months ago
A customer hit this issue; ticket #3053406835 for reference.
Updated by Steve Wheeler 3 months ago
To move forward we need a full core dump from a system hitting the bug. If anyone can set up their system to provide that, please reach out to me.
Updated by Christian Bönning 3 months ago
Our Netgate 1537 crashed earlier today. In `/var/crash`, however, there are only `bounds`, `info.0`, and `textdump.tar.0`, which I provided to you through Nextcloud.
If you can give me a hint where a full core dump would be located, or what to do to produce one, I can get it to you.
I found instructions in a forum post (https://forum.netgate.com/topic/188861/24-03-crashing-again/19) and adjusted `/etc/pfSense-ddb.conf`. I cannot reboot the instance right now without causing more outages than "needed", but since it crashed again just a couple of minutes after the first crash, I expect it won't take long until it reboots on its own, although it ran for a couple of days without issues before those two occurrences.
If the instance remains stable throughout the day, I'll reboot it manually at end of business today (around 8pm CEST).
Updated by Christian Bönning 3 months ago
Minutes after I rebooted the secondary unit (another Netgate 1537) to enable "full core dump mode", the primary unit crashed again.
As a result, it is now running with the adjusted `pfSense-ddb.conf` as well.
Updated by Christian Bönning 3 months ago
We have a `vmcore` from a crash that occurred earlier today. Can you share a Nextcloud link so I can provide it to you?
Updated by Steve Wheeler 3 months ago
Excellent. Here we go:
https://nc.netgate.com/nextcloud/s/k6CLjPKRKKaPt5C
Updated by Christian Bönning 3 months ago
Upload completed on the second attempt.
The sha1sum of the uploaded file should be the following:
bfe8b2f2cccb7823fcb4b775821fe42104754c34 vmcore.3
Updated by Steve Wheeler 3 months ago
Hmm, not seeing it in Nextcloud on this side. How did it fail the first time? What size is it?
Updated by Christian Bönning 3 months ago
It failed because the WAN connection I was using switched over during the upload.
I uploaded it again as a gzipped version (179848383 bytes) which uncompresses into 928342016 bytes.
Updated by Steve Wheeler 3 months ago
Great, we have that and it looks promising.
Updated by Kristof Provost 3 months ago
The core dump confirms what I suspected from the initial report: `tcp_m_copym()` got called with a NULL mbuf. That mbuf is returned by `sbsndptr_noadv()`, which returns NULL because `so->so_snd->sb_mb` is NULL. That's not supposed to happen, as there's an explicit assertion for that.
That in turn would suggest we're not supposed to be in this specific code path in these circumstances.
It's not yet clear to me how that can happen, but it may have something to do with selective ACKs.
While I dig further, it'd be interesting to know whether disabling SACK support avoids the crash. Use `sysctl net.inet.tcp.sack.enable=0` (either on the console or via the System Tunables page).
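To confirm that the setting has actually taken effect on a unit, the current value can be read back. This is a minimal sketch, assuming Python is available on the host and that `sysctl -n` prints only the value. The function name `sack_enabled` and the injectable `run` parameter are hypothetical and exist only so the logic can be exercised off-box.

```python
import subprocess

# OID named in the comment above.
SACK_OID = "net.inet.tcp.sack.enable"


def sack_enabled(run=subprocess.run):
    """Return True if `sysctl -n net.inet.tcp.sack.enable` reports 1.

    `run` defaults to subprocess.run; a stand-in can be passed for testing
    on a machine that does not have this sysctl.
    """
    result = run(
        ["sysctl", "-n", SACK_OID],
        capture_output=True,
        text=True,
        check=True,  # raise if sysctl exits non-zero (e.g. unknown OID)
    )
    return result.stdout.strip() == "1"
```

With the workaround applied, `sack_enabled()` should return `False` on both units.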
Updated by Christian Bönning 3 months ago
I have set `net.inet.tcp.sack.enable=0` through System Tunables on both units and will report back if the crash occurs again (it might take anywhere between hours and days).
Updated by Kristof Provost 3 months ago
I think I know what's happening here. I'm only 95% sure, but it matches all observations.
It's an issue that's known upstream (although under a different title, because it was reported on an INVARIANTS kernel, so it ran into an assertion rather than the NULL dereference).
Essentially the code failed to clean up the selective ACK state when the socket was closed, so the selective ACK information showed there was data to retransmit, but this data had already been freed, leading to the panic.
The fix is in https://cgit.freebsd.org/src/commit/?id=3eeb22cb819409b49296ecb0acbd453671168313 which is already part of 24.08. The bug is https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276761. It shows essentially the same backtrace (although it panics in sbsndptr_noadv(), because assertions are enabled), for the same reason (i.e. so->so_snd->sb_mb is NULL).
I'm also relatively confident that the sysctl listed above will prevent the issue as well, which can serve as a workaround until 24.08 is released.
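The failure mode described above can be illustrated with a toy model. This is only a sketch of the idea, not the FreeBSD code: the class `ToySocket`, its methods, and the use of a `bytearray` in place of an mbuf chain are all hypothetical. It shows stale retransmit state outliving the data it refers to, and how clearing that state on close avoids the bad access.

```python
class ToySocket:
    """Toy stand-in for a TCP socket: a send buffer plus SACK retransmit state."""

    def __init__(self, data):
        self.send_buffer = bytearray(data)  # plays the role of the send buffer's mbuf chain
        self.sack_holes = []                # (offset, length) ranges still to retransmit

    def record_hole(self, offset, length):
        """Remember a range the peer reported as missing."""
        self.sack_holes.append((offset, length))

    def close_buggy(self):
        # Frees the data but leaves the SACK state behind.
        self.send_buffer = None

    def close_fixed(self):
        # Frees the data and clears the SACK state together.
        self.send_buffer = None
        self.sack_holes.clear()

    def retransmit_timer(self):
        """Return the bytes to resend, or None if nothing is pending."""
        if not self.sack_holes:
            return None
        offset, length = self.sack_holes[0]
        # With stale holes and a freed buffer this subscripts None and raises
        # TypeError, the toy analogue of dereferencing a NULL mbuf.
        return bytes(self.send_buffer[offset:offset + length])
```

After `close_buggy()`, a later `retransmit_timer()` call still sees a pending hole and fails on the freed buffer. After `close_fixed()`, it finds nothing to retransmit and returns `None`.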
Updated by Jim Pingle 3 months ago
- Subject changed from tcp_m_copym panic to Panic in ``tcp_m_copym`` with selective ACK enabled
- Status changed from Confirmed to Feedback
- Assignee set to Kristof Provost
- Target version set to 2.8.0
- % Done changed from 0 to 100
- Plus Target Version set to 24.08
Updated by Jim Pingle about 1 month ago
- Plus Target Version changed from 24.08 to 24.11
Updated by Jim Pingle 15 days ago
- Has duplicate Bug #15752: Montly kernel panic added