Bug #16737

open

Kernel Panic: spin lock held too long: triggered by LAN link event causing unnecessary LAGG/VLAN rebuild

Added by Marek Hajduczenia 7 days ago. Updated 7 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Interfaces
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Release Notes: Default
Affected Plus Version:
Affected Architecture: amd64

Description

## Summary

A kernel panic occurs when connecting a device to the LAN interface (igc0). The link event on igc0 triggers a full LAGG bond teardown and rebuild on unrelated interfaces (ixl0/ixl1), which causes LACP flapping and all VLAN interfaces to go down. During this process, the `coretemp` driver attempts an SMP rendezvous to read CPU temperature via MSR, but a CPU is blocked handling the interface storm, causing a spin lock timeout and kernel panic.

## Environment

- pfSense Version: 2.8.1-RELEASE
- Kernel: FreeBSD 15.0-CURRENT #21 RELENG_2_8_1-n256095-47c932dcc0e9 (Aug 28, 2025)
- Hardware: Intel Core i7-10710U (6 cores / 12 threads), 32 GB RAM
- NIC (WAN): igc5 (Intel i225/i226)
- NIC (LAN): igc0 (Intel i225/i226)
- NIC (LAGG): ixl0 + ixl1 (Intel X710/XL710 10GbE) in LACP bond as lagg0
- VLANs on lagg0: 10, 11, 150, 151, 153, 154, 1666

## Steps to Reproduce

1. Configure a system with:
   - igc0 as LAN
   - ixl0 + ixl1 as LACP bond (lagg0)
   - Multiple VLANs on lagg0
   - coretemp thermal sensor enabled (default)
2. System is running normally with all interfaces up
3. Connect a laptop (or any device) to the igc0 (LAN) port, causing a link state change
4. System panics

## Expected Behavior

A link event on igc0 (LAN) should only trigger reconfiguration of igc0 itself. It should NOT affect the LAGG bond (lagg0) or any VLAN interfaces on lagg0, as they are completely independent.

## Actual Behavior

The igc0 link event triggers a full reconfiguration:
1. The LAGG bond is torn down (ixl0 and ixl1 are removed as members).
2. All VLAN interfaces on lagg0 are destroyed and recreated.
3. LACP flaps on the bond.
4. All VLAN interfaces go down and come back up.
5. During this process, a coretemp sysctl read triggers an SMP rendezvous that cannot complete because CPUs are busy handling the interface storm.
6. The kernel panics with "spin lock held too long".

## Panic Details

```
panic: spin lock held too long
cpuid = 7
time = 1773101390
```
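The `time` value in the panic output is a count of seconds since the Unix epoch, so it can be converted to a wall-clock timestamp when lining the panic up against other logs. A minimal sketch, assuming the value is plain epoch seconds (the helper name `panic_time_utc` is made up for this example):

```python
from datetime import datetime, timezone

# "time = ..." value copied from the panic output above (epoch seconds).
PANIC_EPOCH = 1773101390

def panic_time_utc(epoch_seconds):
    """Return the given epoch timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()

print(panic_time_utc(PANIC_EPOCH))  # 2026-03-10T00:09:50+00:00
```

The result is consistent with the upload date of the attached textdump.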

## Stack Trace

```
kdb_enter() at kdb_enter+0x33
panic() at panic+0x43
_mtx_lock_indefinite_check() at _mtx_lock_indefinite_check+0x64
_mtx_lock_spin_cookie() at _mtx_lock_spin_cookie+0xd4
smp_rendezvous_cpus() at smp_rendezvous_cpus+0x1e0
x86_msr_op() at x86_msr_op+0x175
coretemp_get_val_sysctl() at coretemp_get_val_sysctl+0x4b
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x91
sysctl_root() at sysctl_root+0x20d
userland_sysctl() at userland_sysctl+0x15e
sys___sysctl() at sys___sysctl+0x65
amd64_syscall() at amd64_syscall+0x115
fast_syscall_common() at fast_syscall_common+0xf8
--- syscall (202, FreeBSD ELF64, __sysctl) ---
```

## Kernel Message Buffer (relevant section)

```
ixl0: link state changed to UP
lagg0: link state changed to UP
lagg0.153: link state changed to UP
lagg0.154: link state changed to UP
lagg0.10: link state changed to UP
lagg0.11: link state changed to UP
lagg0.1666: link state changed to UP
lagg0.151: link state changed to UP
lagg0.150: link state changed to UP
igc0: link state changed to DOWN <-- laptop connected to LAN port
igc0: link state changed to UP
igc0: link state changed to DOWN
lagg0: link state changed to DOWN <-- WHY? igc0 is not part of lagg0
lagg0.153: link state changed to DOWN
lagg0.154: link state changed to DOWN
lagg0.10: link state changed to DOWN
lagg0.11: link state changed to DOWN
lagg0.1666: link state changed to DOWN
lagg0.151: link state changed to DOWN
lagg0.150: link state changed to DOWN
ixl0: link state changed to DOWN
lagg0: IPv6 addresses on ixl0 have been removed before adding it as a member to prevent IPv6 address scope violation.
lagg0: link state changed to DOWN
ixl1: link state changed to DOWN
lagg0: IPv6 addresses on ixl1 have been removed before adding it as a member to prevent IPv6 address scope violation.
vlan2: changing name to 'lagg0.10' <-- VLANs being rebuilt from scratch
vlan3: changing name to 'lagg0.11'
vlan4: changing name to 'lagg0.150'
vlan1: changing name to 'lagg0.151'
vlan6: changing name to 'lagg0.153'
vlan0: changing name to 'lagg0.154'
vlan5: changing name to 'lagg0.1666'
[... promiscuous mode toggling on multiple interfaces ...]
ixl1: Interface stopped DISTRIBUTING, possible flapping
ixl0: Interface stopped DISTRIBUTING, possible flapping
lagg0: link state changed to DOWN <-- second DOWN event from LACP flap
lagg0.153: link state changed to DOWN
lagg0.154: link state changed to DOWN
lagg0.10: link state changed to DOWN
lagg0.11: link state changed to DOWN
lagg0.1666: link state changed to DOWN
lagg0.151: link state changed to DOWN
lagg0.150: link state changed to DOWN
spin lock 0xffffffff82ad1ae0 (smp rendezvous) held by 0xfffff80085cdf000 (tid 100517) too long
panic: spin lock held too long
```
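To check a longer msgbuf.txt for the same pattern, the link-state lines can be filtered mechanically. The following is a rough sketch rather than a diagnostic tool: it lists every interface other than the trigger that went DOWN after the trigger's first link event. The function names and the shortened `SAMPLE` excerpt are made up for illustration, and ordering in the log shows sequence only, not causation.

```python
import re

# Matches lines such as "lagg0.153: link state changed to DOWN".
LINK_RE = re.compile(r"^(?P<ifname>[\w.]+): link state changed to (?P<state>UP|DOWN)")

def link_events(msgbuf):
    """Yield (interface, state) for every link-state line, in log order."""
    for line in msgbuf.splitlines():
        m = LINK_RE.match(line.strip())
        if m:
            yield m.group("ifname"), m.group("state")

def collateral_down(msgbuf, trigger):
    """Interfaces other than `trigger` that went DOWN after its first event."""
    seen_trigger = False
    affected = []
    for ifname, state in link_events(msgbuf):
        if ifname == trigger:
            seen_trigger = True
        elif seen_trigger and state == "DOWN" and ifname not in affected:
            affected.append(ifname)
    return affected

# Shortened excerpt of the message buffer above (illustrative only).
SAMPLE = """\
lagg0.150: link state changed to UP
igc0: link state changed to DOWN
igc0: link state changed to UP
igc0: link state changed to DOWN
lagg0: link state changed to DOWN
lagg0.153: link state changed to DOWN
ixl0: link state changed to DOWN
lagg0: link state changed to DOWN
"""

print(collateral_down(SAMPLE, "igc0"))  # ['lagg0', 'lagg0.153', 'ixl0']
```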

## Two Separate Issues

### Issue 1 (Primary): Link event on igc0 triggers LAGG/VLAN rebuild on unrelated interfaces

The `rc.linkup` handler (or whatever processes the igc0 link event) appears to reconfigure ALL interfaces instead of just the one that changed. This causes:
- Unnecessary LAGG bond teardown/rebuild
- All VLANs destroyed and recreated
- LACP flapping
- Complete network outage on all VLANs for the duration

This should be scoped to only reconfigure the interface that had the link event.
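The intended scoping can be illustrated with a small dependency walk: only the interface that changed, plus anything stacked on top of it, needs attention. This is a hypothetical sketch of the idea, not pfSense's actual link-event handling; the `STACKED_ON` map and `affected_by` function are invented here and simply mirror the topology described in this report.

```python
# Hypothetical stacking map mirroring the report's topology.
# Key: stacked interface; value: the interfaces it is built on.
STACKED_ON = {
    "lagg0": ["ixl0", "ixl1"],
    "lagg0.10": ["lagg0"],
    "lagg0.11": ["lagg0"],
    "lagg0.150": ["lagg0"],
    "lagg0.151": ["lagg0"],
    "lagg0.153": ["lagg0"],
    "lagg0.154": ["lagg0"],
    "lagg0.1666": ["lagg0"],
}

def affected_by(changed, stacked_on):
    """Return `changed` plus every interface stacked, directly or not, on it."""
    affected = [changed]
    grew = True
    while grew:  # repeat until a full pass adds nothing new
        grew = False
        for child, bases in stacked_on.items():
            if child not in affected and any(b in affected for b in bases):
                affected.append(child)
                grew = True
    return affected

print(affected_by("igc0", STACKED_ON))       # ['igc0']
print(len(affected_by("ixl0", STACKED_ON)))  # 9
```

Under this model a link event on igc0 touches nothing but igc0, whereas an event on ixl0 would legitimately involve lagg0 and its seven VLANs.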

### Issue 2 (Secondary): coretemp + SMP rendezvous panic under interface load

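The `coretemp` sysctl handler reads the temperature MSR through `x86_msr_op()`, which calls `smp_rendezvous_cpus()` and must first take the global `smp rendezvous` spin lock. The message buffer shows that lock held by another thread (tid 100517) while CPUs were busy handling the burst of interface events caused by Issue 1. The waiting thread exceeded the spin lock timeout and the kernel panicked.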

This appears to belong to a known class of FreeBSD panics involving coretemp under heavy interrupt load, but the trigger here is entirely avoidable if Issue 1 is fixed.

## Workaround

Disable the coretemp driver to prevent the panic (does not fix the unnecessary interface rebuild):

```
echo 'coretemp_load="NO"' >> /boot/loader.conf.local
```

Or via the GUI: System > Advanced > Miscellaneous > Thermal Sensors > None
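Note that the `echo ... >>` form appends a new line every time it is run. If the workaround is scripted, an append-only-if-missing check avoids duplicates. A minimal sketch operating on the file's text (the `ensure_line` helper is made up for this example, and it does not handle a conflicting `coretemp_load="YES"` line):

```python
def ensure_line(conf_text, line):
    """Return conf_text unchanged if `line` is already present, else append it."""
    if line in (existing.strip() for existing in conf_text.splitlines()):
        return conf_text
    # Make sure the new entry starts on its own line.
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + line + "\n"

TUNABLE = 'coretemp_load="NO"'

once = ensure_line("", TUNABLE)
twice = ensure_line(once, TUNABLE)
print(once == twice)  # True
```

To apply it, read `/boot/loader.conf.local`, pass its contents through `ensure_line`, and write the result back.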

## Attachments

The full textdump (panic.txt, ddb.txt, msgbuf.txt, config.txt, version.txt) is available and can be uploaded to the bug report.


Files

textdump.tar.0 (88.5 KB), Marek Hajduczenia, 03/10/2026 12:26 AM
#1

Updated by Marek Hajduczenia 7 days ago

Opened by accident against the Plus version. Sorry, please close.
