Bug #15752 (closed)
Monthly kernel panic
Description
At regular intervals, roughly once a month, we experience a kernel panic. As the appliance is connected via a USB console cable, we are luckily able to resolve it remotely.
The console shows this repeatedly, almost flooding the screen.
Tracing command kernel pid 0 tid 309435 td 0xfffff8006e140740
sched_switch() at sched_switch+0x88a/frame 0xfffffe00bc10ee20
mi_switch() at mi_switch+0xba/frame 0xfffffe00bc10ee40
_sleep() at _sleep+0x1be/frame 0xfffffe00bc10eec0
taskqueue_thread_loop() at taskqueue_thread_loop+0xb1/frame 0xfffffe00bc10eef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00bc10ef30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00bc10ef30
--- trap 0x6e6177, rip = 0x6766637063, rsp = 0x7, rbp = 0x1600000001 ---
Sending Ctrl+C stops that and triggers a reboot. Luckily we found that workaround :)
Netgate 7100
24.03-RELEASE (amd64)
Attached are the output recovered from the console with screen, and the info and textdump files offered by the web interface. Let us know if any other logs are relevant before they are rotated.
Updated by Kris Phillips 2 months ago
- Status changed from New to Incomplete
Have you tested the RAM on your appliance to verify this isn't a memory issue? Page faults are typically an issue with RAM and if it's happening frequently enough, it could be intermittently failing hardware.
Updated by Sebastian Wagner 2 months ago
Thank you for the response. There doesn't seem to be a memtest included, so the best option is to use the bootable media with USB from https://memtest.org/, I guess?
Updated by Jordan G about 2 months ago
Sebastian Wagner wrote in #note-2:
Thank you for the response. There doesn't seem to be a memtest included, so the best option is to use the bootable media with USB from https://memtest.org/, I guess?
Yes, that would work, as would any other bootable distribution that includes a memory diagnostic.
Updated by Sebastian Wagner about 1 month ago
We were able to perform a first test now:
Memtest86+ v7.00 | Intel(R) Atom(TM) CPU C3558 @ 2.20GHz
CLK/Temp: 2200MHz 58/62*C | Pass 30% ############
L1 Cache: 24KB 41.6 GB/s | Test 49% ###################
L2 Cache: 2MB 23.2 GB/s | Test #6 [Moving inversions, 64 bit pattern]
L3 Cache: N/A | Testing: 4GB - 5GB [1GB of 7.99GB]
Memory : 7.99GB 4.28 GB/s | Pattern: 0xefffffffffffffff
--------------------------------------------------------------------------------
CPU: 4 Cores 4 Threads SMP: 4T (PAR) | Time: 0:43:20 Status: Pass /
RAM: 1200MHz (DDR4-2400) CAS 17-17-17-39 | Pass: 1 Errors: 0
--------------------------------------------------------------------------------
That showed no errors in one pass.
Meanwhile, the error keeps happening, with varying frequency. The shortest gap so far was 8 days.
Updated by Reid Linnemann about 1 month ago
- Status changed from Incomplete to Duplicate
- Assignee set to Reid Linnemann
- Parent task set to #15684
This is a known issue in both CE and 24.03. I've reclassified this as a duplicate and linked the parent task, https://redmine.pfsense.org/issues/15684. You can work around it by disabling selective ACK in the system tunables:
net.inet.tcp.sack.enable=0
This will also be fixed in 24.11, which is in BETA at this time.
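For anyone applying the workaround from a shell instead of the GUI, a minimal sketch follows. It assumes root access on the firewall's console or SSH shell and uses the standard FreeBSD sysctl utility. A value set this way applies only to the running system, so the entry under System > Advanced > System Tunables is still needed for it to persist across reboots.

```shell
# Show the current value (1 means selective ACK is enabled)
sysctl net.inet.tcp.sack.enable

# Disable selective ACK on the running system (requires root, lost on reboot)
sysctl net.inet.tcp.sack.enable=0

# Confirm the change took effect; -n prints only the value
sysctl -n net.inet.tcp.sack.enable
```

Setting the tunable at runtime avoids waiting for a reboot, which may be convenient when the panic interval is unpredictable.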
Updated by Jim Pingle about 1 month ago
- Project changed from pfSense Plus to pfSense
- Category changed from Operating System to Operating System
Updated by Jim Pingle about 1 month ago
- Is duplicate of Bug #15684: Panic in ``tcp_m_copym`` with selective ACK enabled added
Updated by Sebastian Wagner about 1 month ago
Thank you! We applied the workaround and will wait for the update. In case you don't hear from us anymore, it worked :)