Bug #12079
closed
Kernel panic when running IGMP Proxy: Sleeping thread owns a non-sleepable lock
Added by Steve Wheeler over 3 years ago.
Updated 9 months ago.
Plus Target Version:
23.09
Affected Architecture:
All
Description
IGMPProxy can trigger a kernel panic in 2.5.2-RC.
db:0:kdb.enter.default> show pcpu
cpuid = 1
dynamic pcpu = 0xfffffe007dbe5380
curthread = 0xfffff8005ca13000: pid 289 tid 100198 "igmpproxy"
curpcb = 0xfffff8005ca135a0
fpcurthread = 0xfffff8005ca13000: pid 289 "igmpproxy"
idlethread = 0xfffff80004185740: tid 100004 "idle: cpu1"
curpmap = 0xfffff80077342138
tssp = 0xffffffff83717688
commontssp = 0xffffffff83717688
rsp0 = 0xfffffe001e3b0dc0
kcr3 = 0xffffffffffffffff
ucr3 = 0xffffffffffffffff
scr3 = 0x0
gs32p = 0xffffffff8371dea0
ldt = 0xffffffff8371dee0
tss = 0xffffffff8371ded0
tlb gen = 40718
curvnet = 0xfffff8000406db40
db:0:kdb.enter.default> bt
Tracing pid 289 tid 100198 td 0xfffff8005ca13000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe001e3b0820
vpanic() at vpanic+0x197/frame 0xfffffe001e3b0870
panic() at panic+0x43/frame 0xfffffe001e3b08d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe001e3b0900
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe001e3b0950
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe001e3b09e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe001e3b0ab0
rip_ctloutput() at rip_ctloutput+0xf3/frame 0xfffffe001e3b0ae0
sosetopt() at sosetopt+0xe7/frame 0xfffffe001e3b0b40
kern_setsockopt() at kern_setsockopt+0xb0/frame 0xfffffe001e3b0ba0
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe001e3b0bc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe001e3b0cf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe001e3b0cf0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x8003b57ea, rsp = 0x7fffffffeba8, rbp
Booting with a debug kernel shows:
lock order reversal: (sleepable after non-sleepable)
1st 0xffffffff83795300 IPv4 multicast interfaces (IPv4 multicast interfaces) @ /usr/home/mjg/git/netgate/FreeBSD-src/sys/netinet/ip_mroute.c:845
2nd 0xfffff80004445178 iflib ctx lock (iflib ctx lock) @ /usr/home/mjg/git/netgate/FreeBSD-src/sys/net/iflib.c:4190
stack backtrace:
#0 0xffffffff80dd7021 at witness_debugger+0x71
#1 0xffffffff80d77387 at _sx_xlock+0x67
#2 0xffffffff80eac72f at iflib_if_ioctl+0x2df
#3 0xffffffff80e81107 at if_setflag+0xd7
#4 0xffffffff80f4cf02 at X_ip_mrouter_set+0x1642
#5 0xffffffff80f55783 at rip_ctloutput+0xf3
#6 0xffffffff80e0a75f at sosetopt+0xff
#7 0xffffffff80e0fa50 at kern_setsockopt+0xb0
#8 0xffffffff80e0f994 at sys_setsockopt+0x24
#9 0xffffffff8134a32e at amd64_syscall+0x2be
#10 0xffffffff81320c7e at fast_syscall_common+0xf8
Tested:
2.5.2-RC (amd64)
built on Fri Jun 25 03:01:13 EDT 2021
FreeBSD 12.2-STABLE
- Assignee set to Mateusz Guzik
First, a note: to my understanding the bug is not easy to run into. However, booting a kernel with debug options readily reproduces the warning that shows the bug exists.
I think the most sensible thing for now is to put the bug on a back burner (reasons below).
The code was rewritten upstream in https://cgit.freebsd.org/src/commit/?id=d40cd26a86a79342d175296b74768dd7183fc02b. The rewrite replaced the lock at hand with a read-write lock, which suffers from the same problem. So far I don't know how feasible it is to fix.
Thus, in order to fix it in pfSense, one of the following has to be done:
- fix the issue in the rewrite and backport both -- this is not really feasible in my opinion
- fix the code as found in pfSense -- given the impending rebase to new FreeBSD this would be writing code to be thrown away soon and rebase would be an immediate regression
Consequently I think the best course of action is to wait for the rebase.
- Target version changed from 2.5.2 to 2.6.0
- Plus Target Version set to 21.09
Re-targeting this to 2.6.0/21.09
- Plus Target Version changed from 21.09 to 22.01
Per Mateusz, this is still unresolved upstream in FreeBSD, even on HEAD. Moving target ahead.
- Target version changed from 2.6.0 to CE-Next
- Plus Target Version changed from 22.01 to 22.05
- Plus Target Version changed from 22.05 to 22.09
- Plus Target Version changed from 22.09 to 22.11
- Plus Target Version changed from 22.11 to 23.01
The rebase to main happened and, as predicted in the previous comment, the bug is still there.
The most desirable scenario would be to convert all potential sx locks into rw locks, which would not conflict with multicast locking. This does not seem very feasible though.
I note that multicast locking as implemented right now is kind of dodgy and rather slow. From a quick glance, I think it can be reworked so that the paths which can't afford an unbounded sleep don't even take the global multicast lock. Then it can be converted into an sx lock and the problem goes away, all while performance is improved.
This will be worked on later.
This is still broken in HEAD and on snapshots, moving forward to 23.05. The attached textdump has a bit more debug info/detail than the backtrace in the original description.
I may have hit this same issue. My pfSense box has crashed three times in the last few months.
Due to my almost zero knowledge of FreeBSD, I could not pinpoint the cause.
Can someone with more insight confirm that this is actually the same issue?
Here is the relevant part from the crash report:
db:0:kdb.enter.default> show pcpu
cpuid = 3
dynamic pcpu = 0xfffffe007f0f5200
curthread = 0xfffff80007949000: pid 37532 tid 100171 "igmpproxy"
curpcb = 0xfffff800079495a0
fpcurthread = 0xfffff80007949000: pid 37532 "igmpproxy"
idlethread = 0xfffff8000533e740: tid 100006 "idle: cpu3"
curpmap = 0xfffff80096b09138
tssp = 0xffffffff837198d8
commontssp = 0xffffffff837198d8
rsp0 = 0xfffffe00004b3bc0
kcr3 = 0x8000000070984453
ucr3 = 0x8000000070985c53
scr3 = 0x70985c53
gs32p = 0xffffffff837200f0
ldt = 0xffffffff83720130
tss = 0xffffffff83720120
tlb gen = 17828449
curvnet = 0xfffff8000508fa00
db:0:kdb.enter.default> bt
Tracing pid 37532 tid 100171 td 0xfffff80007949000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00004b3620
vpanic() at vpanic+0x197/frame 0xfffffe00004b3670
panic() at panic+0x43/frame 0xfffffe00004b36d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe00004b3700
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe00004b3750
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe00004b37e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe00004b38b0
rip_ctloutput() at rip_ctloutput+0xf3/frame 0xfffffe00004b38e0
sosetopt() at sosetopt+0xe7/frame 0xfffffe00004b3940
kern_setsockopt() at kern_setsockopt+0xb0/frame 0xfffffe00004b39a0
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe00004b39c0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00004b3af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00004b3af0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x8003b7dfa, rsp = 0x7fffffffebb8, rbp = 0x7fffffffebe0 ---
Assuming there is currently no workaround for this, is the only option to disable IGMP proxy (and say goodbye to live TV) until this is solved?
Cheers.
Hello. This is the same issue. I can't make promises, but it is possibly going to get fixed some time next month.
Just had another hard crash. Had to reboot the system manually. Any news on this issue?
For the moment I have disabled IGMP proxy, hoping to stop these crashes.
Hey Arturo,
thank you for your patience.
I wrote a highly experimental patch to sort it out. I don't know yet if it fixes the whole issue, and it has not received any testing beyond compilation.
That said, if everything goes right, the issue will be fixed this week. If not, probably the next one. :)
- Plus Target Version changed from 23.05 to 23.09
- Target version changed from 2.7.0 to CE-Next
There seems to be little progress and a possible fix is being postponed.
I can't imagine that I'm the only one bumping into this.
Also, I do not understand how the igmp proxy package is still in the repo when it is clearly bugged.
The only fix as of now is disabling the igmp proxy, as the system has been stable ever since.
If there is some experimental patch available, I would be more than willing to test it.
I believe this should also mitigate the problem: https://reviews.freebsd.org/D41209
The LOR occurs only, at least as far as I can see, if we hold the mroute lock while calling if_ioctl() on a network driver (via if_allmulti()). If we ensure we've released the mroute lock before we make that call, we should not run into this problem.
I've only been able to reproduce the LOR warning, not the actual panic though.
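The ordering rule discussed in the comment above can be illustrated with a small userspace toy model. This is only a sketch of the general pattern, not kernel code and not the content of D41209: the `Witness` and `ToyLock` classes, the function names, and the returned string are all hypothetical, and the checker merely mimics the "sleepable after non-sleepable" rule reported by the debug kernel.

```python
class LockOrderError(RuntimeError):
    """Raised when a sleepable lock is taken while a non-sleepable one is held."""


class Witness:
    """Hypothetical, single-threaded stand-in for a lock-order checker."""

    def __init__(self):
        self.held = []  # names of non-sleepable locks currently held

    def acquire(self, name, sleepable):
        if sleepable and self.held:
            raise LockOrderError(
                f"sleepable '{name}' after non-sleepable '{self.held[-1]}'"
            )
        if not sleepable:
            self.held.append(name)

    def release(self, name, sleepable):
        if not sleepable:
            self.held.remove(name)


class ToyLock:
    """Context manager that reports acquire/release to the toy checker."""

    def __init__(self, name, sleepable, witness):
        self.name, self.sleepable, self.witness = name, sleepable, witness

    def __enter__(self):
        self.witness.acquire(self.name, self.sleepable)
        return self

    def __exit__(self, *exc):
        self.witness.release(self.name, self.sleepable)
        return False


def driver_ioctl(witness):
    # Models a driver ioctl that takes a sleepable lock internally.
    with ToyLock("driver ctx lock", sleepable=True, witness=witness):
        return "allmulti set"


def add_vif_buggy(witness):
    # Calls into the driver while still holding the non-sleepable lock.
    with ToyLock("mroute lock", sleepable=False, witness=witness):
        return driver_ioctl(witness)


def add_vif_fixed(witness):
    # Decide what to do under the lock, then drop it before the driver call.
    with ToyLock("mroute lock", sleepable=False, witness=witness):
        need_allmulti = True  # assumption: the table update decided this
    return driver_ioctl(witness) if need_allmulti else None
```

In this model `add_vif_buggy()` raises `LockOrderError`, while `add_vif_fixed()` completes, because the non-sleepable lock is no longer held when the sleepable one is taken. In real code, any state read under the first lock may be stale by the time the second call runs, so the trade-off needs to be checked against the actual data structures.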
- Status changed from New to Feedback
I've committed that patch and picked it to our branches. It'll be part of the next snapshot build.
Awesome Kristof, I'll be happy to test it.
Could you briefly explain how to apply the patch?
I'm on CE 2.7.0 and the list of recommended patches remains empty.
I also looked in the main pfSense repo on GitHub but couldn't find the patch.
So close to watching live television again, thanks!
- Subject changed from IGMPProxy: kernel panic, Sleeping thread owns a non-sleepable lock to Kernel panic when running IGMP Proxy: Sleeping thread owns a non-sleepable lock
Updating subject for release notes.
- Has duplicate Bug #14681: IGMP proxy cause crash on 23.05.1 added
- Status changed from Feedback to Closed
Closing for lack of feedback. If it's still an issue in this release we can reopen and re-target the issue at the next release.
- Target version deleted (CE-Next)