Bug #15752

closed

Monthly kernel panic

Added by Sebastian Wagner 3 months ago. Updated about 2 months ago.

Status:
Duplicate
Priority:
Normal
Category:
Operating System
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Default
Affected Version:
Affected Architecture:
7100

Description

At a regular interval, every month, we experience a kernel panic. Because the appliance is connected via a USB console cable, we are luckily able to resolve it remotely.

The console shows the following trace repeatedly, almost flooding the screen.

Tracing command kernel pid 0 tid 309435 td 0xfffff8006e140740
sched_switch() at sched_switch+0x88a/frame 0xfffffe00bc10ee20
mi_switch() at mi_switch+0xba/frame 0xfffffe00bc10ee40
_sleep() at _sleep+0x1be/frame 0xfffffe00bc10eec0
taskqueue_thread_loop() at taskqueue_thread_loop+0xb1/frame 0xfffffe00bc10eef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00bc10ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00bc10ef30
--- trap 0x6e6177, rip = 0x6766637063, rsp = 0x7, rbp = 0x1600000001 ---

Sending Ctrl+C stops that and triggers a reboot. Luckily we found that workaround :)

Netgate 7100
24.03-RELEASE (amd64)

Attached are the output recovered from the console with screen, and the info and textdump files offered by the web interface. Let us know if any other logs are relevant before they are rotated.


Files

info.0 (547 Bytes), Sebastian Wagner, 09/29/2024 10:24 AM
2024-09-29-screen.txt (28.7 KB), output from console, Sebastian Wagner, 09/29/2024 10:24 AM
textdump.tar.0 (546 KB), Sebastian Wagner, 09/29/2024 10:24 AM

Related issues

Is duplicate of Bug #15684: Panic in ``tcp_m_copym`` with selective ACK enabled (Resolved, Kristof Provost)

Actions #1

Updated by Kris Phillips 3 months ago

  • Status changed from New to Incomplete

Have you tested the RAM on your appliance to verify this isn't a memory issue? Page faults are typically an issue with RAM, and if they're happening frequently enough, it could be intermittently failing hardware.

Actions #2

Updated by Sebastian Wagner 3 months ago

Thank you for the response. There doesn't seem to be a memtest included, so the best option is to use the bootable media with USB from https://memtest.org/, I guess?

Actions #3

Updated by Jordan G 2 months ago

Sebastian Wagner wrote in #note-2:

Thank you for the response. There doesn't seem to be a memtest included, so the best option is to use the bootable media with USB from https://memtest.org/, I guess?

Yes, that would work, or any bootable distribution that includes diagnostic memory testing.

Actions #4

Updated by Sebastian Wagner about 2 months ago

We were able to perform a first test now:

      Memtest86+ v7.00      | Intel(R) Atom(TM) CPU C3558 @ 2.20GHz
CLK/Temp: 2200MHz   58/62*C | Pass 30% ############
L1 Cache:   24KB  41.6 GB/s | Test 49% ###################
L2 Cache:    2MB  23.2 GB/s | Test #6  [Moving inversions, 64 bit pattern]
L3 Cache:   N/A             | Testing: 4GB - 5GB [1GB of 7.99GB]
Memory  : 7.99GB  4.28 GB/s | Pattern: 0xefffffffffffffff
--------------------------------------------------------------------------------
CPU: 4 Cores 4 Threads    SMP: 4T (PAR)   | Time:  0:43:20  Status: Pass     /
RAM: 1200MHz (DDR4-2400) CAS 17-17-17-39  | Pass:  1        Errors: 0
--------------------------------------------------------------------------------

That showed no errors in one pass.

Meanwhile, the error keeps happening, with varying frequency. The shortest gap so far was 8 days.

Actions #5

Updated by Reid Linnemann about 2 months ago

  • Status changed from Incomplete to Duplicate
  • Assignee set to Reid Linnemann
  • Parent task set to #15684

This is a known issue in both CE and 24.03. I've reclassified this as a duplicate and linked the parent task. The parent issue is https://redmine.pfsense.org/issues/15684. You can work around this by disabling selective ACK in the system tunables:

net.inet.tcp.sack.enable=0

This will also be fixed in 24.11, which is in BETA at this time.
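
For anyone applying the workaround, here is a minimal Python sketch for checking whether the tunable has actually taken effect. It assumes the usual `name: value` output format of FreeBSD's sysctl(8). The helper names (`parse_sysctl_value`, `sack_disabled`, `read_live_value`) are made up for illustration and are not part of pfSense.

```python
import subprocess
from typing import Optional

# Tunable named in the workaround for Bug #15684.
TUNABLE = "net.inet.tcp.sack.enable"


def parse_sysctl_value(output: str, name: str = TUNABLE) -> Optional[str]:
    """Return the value of `name` from `sysctl` output, or None if absent.

    Assumes one `name: value` pair per line.
    """
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            return value.strip()
    return None


def sack_disabled(output: str) -> bool:
    """True if the output shows selective ACK turned off (value 0)."""
    return parse_sysctl_value(output) == "0"


def read_live_value() -> str:
    """Query the running system; only works where sysctl(8) is available."""
    result = subprocess.run(
        ["sysctl", "-n", TUNABLE],  # -n prints the value without the name
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

On the appliance itself, `read_live_value() == "0"` would indicate the workaround is active. The parsing helpers can also be run against saved `sysctl` output on any machine.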

Actions #6

Updated by Jim Pingle about 2 months ago

  • Parent task deleted (#15684)
Actions #7

Updated by Jim Pingle about 2 months ago

  • Project changed from pfSense Plus to pfSense
  • Category changed from Operating System to Operating System
Actions #8

Updated by Jim Pingle about 2 months ago

  • Is duplicate of Bug #15684: Panic in ``tcp_m_copym`` with selective ACK enabled added
Actions #9

Updated by Sebastian Wagner about 2 months ago

Thank you! We applied the workaround and will wait for the update. In case you don't hear from us anymore, it worked :)
