Bug #12079

closed

Kernel panic when running IGMP Proxy: Sleeping thread owns a non-sleepable lock

Added by Steve Wheeler almost 3 years ago. Updated about 2 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: IGMP Proxy
Target version: -
Start date: 06/25/2021
Due date:
% Done: 0%
Estimated time:
Plus Target Version: 23.09
Release Notes: Default
Affected Version:
Affected Architecture: All

Description

IGMPProxy can trigger a kernel panic in 2.5.2-RC.

db:0:kdb.enter.default>  show pcpu
cpuid        = 1
dynamic pcpu = 0xfffffe007dbe5380
curthread    = 0xfffff8005ca13000: pid 289 tid 100198 "igmpproxy" 
curpcb       = 0xfffff8005ca135a0
fpcurthread  = 0xfffff8005ca13000: pid 289 "igmpproxy" 
idlethread   = 0xfffff80004185740: tid 100004 "idle: cpu1" 
curpmap      = 0xfffff80077342138
tssp         = 0xffffffff83717688
commontssp   = 0xffffffff83717688
rsp0         = 0xfffffe001e3b0dc0
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff8371dea0
ldt          = 0xffffffff8371dee0
tss          = 0xffffffff8371ded0
tlb gen      = 40718
curvnet      = 0xfffff8000406db40
db:0:kdb.enter.default>  bt
Tracing pid 289 tid 100198 td 0xfffff8005ca13000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe001e3b0820
vpanic() at vpanic+0x197/frame 0xfffffe001e3b0870
panic() at panic+0x43/frame 0xfffffe001e3b08d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe001e3b0900
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe001e3b0950
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe001e3b09e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe001e3b0ab0
rip_ctloutput() at rip_ctloutput+0xf3/frame 0xfffffe001e3b0ae0
sosetopt() at sosetopt+0xe7/frame 0xfffffe001e3b0b40
kern_setsockopt() at kern_setsockopt+0xb0/frame 0xfffffe001e3b0ba0
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe001e3b0bc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe001e3b0cf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe001e3b0cf0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x8003b57ea, rsp = 0x7fffffffeba8, rbp

Booting with a debug kernel shows:

lock order reversal: (sleepable after non-sleepable)
 1st 0xffffffff83795300 IPv4 multicast interfaces (IPv4 multicast interfaces) @ /usr/home/mjg/git/netgate/FreeBSD-src/sys/netinet/ip_mroute.c:845
 2nd 0xfffff80004445178 iflib ctx lock (iflib ctx lock) @ /usr/home/mjg/git/netgate/FreeBSD-src/sys/net/iflib.c:4190
stack backtrace:
#0 0xffffffff80dd7021 at witness_debugger+0x71
#1 0xffffffff80d77387 at _sx_xlock+0x67
#2 0xffffffff80eac72f at iflib_if_ioctl+0x2df
#3 0xffffffff80e81107 at if_setflag+0xd7
#4 0xffffffff80f4cf02 at X_ip_mrouter_set+0x1642
#5 0xffffffff80f55783 at rip_ctloutput+0xf3
#6 0xffffffff80e0a75f at sosetopt+0xff
#7 0xffffffff80e0fa50 at kern_setsockopt+0xb0
#8 0xffffffff80e0f994 at sys_setsockopt+0x24
#9 0xffffffff8134a32e at amd64_syscall+0x2be
#10 0xffffffff81320c7e at fast_syscall_common+0xf8
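
To illustrate what the "sleepable after non-sleepable" warning above means, here is a minimal Python sketch of the rule being checked. It is a toy model only, not how WITNESS is implemented: the `Lock` and `ThreadLockState` names are hypothetical, and the only details taken from the report are the two lock names and the fact that the second lock is acquired through `_sx_xlock` (a sleepable lock) while the first is reported as non-sleepable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lock:
    """Hypothetical stand-in for a kernel lock; not a real FreeBSD type."""
    name: str
    sleepable: bool  # True for sx-style locks, False for mutex/rw-style locks


@dataclass
class ThreadLockState:
    """Tracks the locks one thread currently holds, in acquisition order."""
    held: list = field(default_factory=list)

    def acquire(self, lock):
        """Record the acquisition and return any ordering warnings it triggers."""
        warnings = []
        if lock.sleepable:
            # Acquiring a sleepable lock may block for an unbounded time,
            # which is not allowed while a non-sleepable lock is held.
            for prior in self.held:
                if not prior.sleepable:
                    warnings.append(
                        f"sleepable after non-sleepable: {prior.name} -> {lock.name}"
                    )
        self.held.append(lock)
        return warnings

    def release(self, lock):
        self.held.remove(lock)


# Lock names as they appear in the debug-kernel output above.
mroute = Lock("IPv4 multicast interfaces", sleepable=False)
iflib_ctx = Lock("iflib ctx lock", sleepable=True)

state = ThreadLockState()
state.acquire(mroute)            # 1st lock: nothing held yet, no warning
print(state.acquire(iflib_ctx))  # 2nd lock: sleepable while non-sleepable is held
```

Taking the two locks in the opposite order (sleepable first, non-sleepable second) produces no warning in this model, which matches the direction named in the message.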

Tested:

2.5.2-RC (amd64)
built on Fri Jun 25 03:01:13 EDT 2021
FreeBSD 12.2-STABLE


Files

1671705688783-textdump.tar (86 KB), Jim Pingle, 12/22/2022 01:10 PM

Related issues

Has duplicate Bug #14681: IGMP proxy cause crash on 23.05.1 (Duplicate)

Actions #1

Updated by Steve Wheeler almost 3 years ago

  • Assignee set to Mateusz Guzik
Actions #2

Updated by Mateusz Guzik almost 3 years ago

First, a note: to my understanding the bug is not easy to run into in practice. However, booting a kernel with debug options readily reproduces the warning that shows the bug exists.

I think the most sensible thing for now is to put the bug on the back burner (reasons below).

The code got rewritten upstream in https://cgit.freebsd.org/src/commit/?id=d40cd26a86a79342d175296b74768dd7183fc02b . The rewrite replaced the lock at hand with a read-write lock, which suffers from the same problem; so far I don't know how feasible it is to fix.

Thus, fixing it in pfSense would require one of the following:
- fix the issue in the rewrite and backport both -- this is not really feasible in my opinion
- fix the code as found in pfSense -- given the impending rebase to a new FreeBSD, this would mean writing code that is thrown away soon, and the rebase would be an immediate regression

Consequently, I think the best course of action is to wait for the rebase.

Actions #3

Updated by Jim Pingle almost 3 years ago

  • Target version changed from 2.5.2 to 2.6.0
  • Plus Target Version set to 21.09

Re-targeting this to 2.6.0/21.09

Actions #4

Updated by Jim Pingle over 2 years ago

  • Plus Target Version changed from 21.09 to 22.01

Per Mateusz, this is still unresolved upstream in FreeBSD, even on HEAD. Moving target ahead.

Actions #5

Updated by Jim Pingle over 2 years ago

  • Target version changed from 2.6.0 to CE-Next
  • Plus Target Version changed from 22.01 to 22.05
Actions #6

Updated by Jim Pingle almost 2 years ago

  • Plus Target Version changed from 22.05 to 22.09
Actions #7

Updated by Jim Pingle almost 2 years ago

  • Plus Target Version changed from 22.09 to 22.11
Actions #8

Updated by Jim Pingle over 1 year ago

  • Plus Target Version changed from 22.11 to 23.01
Actions #9

Updated by Mateusz Guzik over 1 year ago

The rebase to main happened and, as predicted in the previous comment, the bug is still there.

The most desirable scenario would be to convert all the potentially involved sx locks into rw locks, which would not conflict with multicast locking. This does not seem very feasible, though.

I note that multicast locking as implemented right now is kind of dodgy and rather slow. From a quick glance, I think it can be reworked so that the paths which can't afford an unbounded sleep don't even take the global multicast lock. Then it can be converted into an sx lock and the problem goes away, all while performance is improved.

This will be worked on later.

Actions #10

Updated by Jim Pingle over 1 year ago

This is still broken in HEAD and on snapshots, moving forward to 23.05. The attached textdump has a bit more debug info/detail than the backtrace in the original description.

Actions #11

Updated by Arturo de Vries about 1 year ago

I may have hit this same issue. My pfSense box has crashed three times in the last few months.
Due to my almost zero knowledge of FreeBSD, I could not pinpoint the cause.
Can someone with more insight confirm that this is actually the same issue?
Here is the relevant part from the crash report:
db:0:kdb.enter.default> show pcpu
cpuid = 3
dynamic pcpu = 0xfffffe007f0f5200
curthread = 0xfffff80007949000: pid 37532 tid 100171 "igmpproxy"
curpcb = 0xfffff800079495a0
fpcurthread = 0xfffff80007949000: pid 37532 "igmpproxy"
idlethread = 0xfffff8000533e740: tid 100006 "idle: cpu3"
curpmap = 0xfffff80096b09138
tssp = 0xffffffff837198d8
commontssp = 0xffffffff837198d8
rsp0 = 0xfffffe00004b3bc0
kcr3 = 0x8000000070984453
ucr3 = 0x8000000070985c53
scr3 = 0x70985c53
gs32p = 0xffffffff837200f0
ldt = 0xffffffff83720130
tss = 0xffffffff83720120
tlb gen = 17828449
curvnet = 0xfffff8000508fa00
db:0:kdb.enter.default> bt
Tracing pid 37532 tid 100171 td 0xfffff80007949000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00004b3620
vpanic() at vpanic+0x197/frame 0xfffffe00004b3670
panic() at panic+0x43/frame 0xfffffe00004b36d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe00004b3700
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe00004b3750
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe00004b37e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe00004b38b0
rip_ctloutput() at rip_ctloutput+0xf3/frame 0xfffffe00004b38e0
sosetopt() at sosetopt+0xe7/frame 0xfffffe00004b3940
kern_setsockopt() at kern_setsockopt+0xb0/frame 0xfffffe00004b39a0
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe00004b39c0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00004b3af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00004b3af0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x8003b7dfa, rsp = 0x7fffffffebb8, rbp = 0x7fffffffebe0 ---

Assuming there is currently no workaround for this, the only option would be to disable IGMP proxy (and say goodbye to live TV) until this is solved?

Cheers.

Actions #12

Updated by Mateusz Guzik about 1 year ago

Hello. This is the same issue. I can't make promises, but it is possibly going to get fixed some time next month.
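
For anyone else wondering whether their crash matches this one, one rough way to check is to compare the chain of function names in the two `bt` outputs, ignoring the frame addresses that differ between machines. The Python sketch below is only an illustration of that idea: `call_chain` is a hypothetical helper, and the sample lines are copied from the two backtraces in this issue. A matching chain suggests the same code path but is not proof of the same root cause.

```python
import re

# Matches ddb frame lines such as:
#   turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe001e3b0950
FRAME_RE = re.compile(r"^(\w+)\(\) at \1\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")


def call_chain(bt_text):
    """Return the function names from a ddb backtrace, innermost frame first."""
    chain = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            chain.append(match.group(1))
    return chain


# Excerpt from the backtrace in the original description.
report_original = """\
panic() at panic+0x43/frame 0xfffffe001e3b08d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe001e3b0900
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe001e3b0950
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe001e3b09e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe001e3b0ab0
"""

# The same frames from the crash report in comment #11.
report_followup = """\
panic() at panic+0x43/frame 0xfffffe00004b36d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe00004b3700
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe00004b3750
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe00004b37e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe00004b38b0
"""

same_path = call_chain(report_original) == call_chain(report_followup)
print("same call chain:", same_path)
```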

Actions #13

Updated by Arturo de Vries about 1 year ago

Just had another hard crash. Had to reboot the system manually. Any news on this issue?
For the moment I have disabled IGMP proxy, hoping to stop these crashes.

Actions #14

Updated by Mateusz Guzik about 1 year ago

Hey Arturo,

Thank you for your patience.

I wrote a highly experimental patch to sort it out. I don't know yet if it fixes the whole issue, and it has not received any testing past compilation.

That said, if everything goes right, the issue will be fixed this week. If not, probably the next one. :)

Actions #15

Updated by Jim Pingle 12 months ago

  • Plus Target Version changed from 23.05 to 23.09
Actions #16

Updated by Jim Pingle 10 months ago

  • Target version changed from 2.7.0 to CE-Next
Actions #17

Updated by Arturo de Vries 10 months ago

There seems to be little progress and a possible fix is being postponed.
I can't imagine that I'm the only one bumping into this.
Also, I do not understand how the igmp proxy package is still in the repo when it is clearly bugged.
The only fix as of now is disabling the igmp proxy, as the system has been stable ever since.
If there is some experimental patch available, I would be more than willing to test it.

Actions #18

Updated by Kristof Provost 9 months ago

I believe this should also mitigate the problem: https://reviews.freebsd.org/D41209

The LOR (lock order reversal) occurs only, at least as far as I can see, if we hold the mroute lock while calling if_ioctl() on a network driver (via if_allmulti()). If we ensure we've released the mroute lock before we make that call, we should not run into this problem.
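
The general pattern described here (do the bookkeeping under the lock, release it, and only then call into code that may sleep) can be sketched in userland Python as follows. This is an illustration of the idea only, not the contents of D41209: `mroute_lock`, `fake_if_allmulti`, `add_vif_locked_call` and `add_vif_unlocked_call` are hypothetical stand-ins, and a `threading.Lock` does not have the kernel's sleepable/non-sleepable distinction.

```python
import threading

# Hypothetical stand-in for the non-sleepable mroute lock.
mroute_lock = threading.Lock()


def fake_if_allmulti(log):
    """Stand-in for the driver call that may take a sleepable lock.

    It records whether the mroute lock was held when it was called.
    """
    log.append(mroute_lock.locked())


def add_vif_locked_call(vifs, name, log):
    """Problematic shape: the driver call happens with the lock still held."""
    with mroute_lock:
        vifs.append(name)
        fake_if_allmulti(log)


def add_vif_unlocked_call(vifs, name, log):
    """Mitigated shape: update shared state under the lock, then drop it
    before calling into the driver."""
    with mroute_lock:
        vifs.append(name)
    fake_if_allmulti(log)


vifs, log = [], []
add_vif_locked_call(vifs, "em0", log)
add_vif_unlocked_call(vifs, "em1", log)
print(log)  # whether the lock was held during each driver call
```

Dropping the lock opens a window in which other threads can change the shared state, so real code following this pattern has to cope with that, for example by re-validating after the call.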

I've only been able to reproduce the LOR warning, not the actual panic though.

Actions #19

Updated by Kristof Provost 9 months ago

  • Status changed from New to Feedback

I've committed that patch and picked it to our branches. It'll be part of the next snapshot build.

Actions #20

Updated by Arturo de Vries 9 months ago

Awesome Kristof, I'll be happy to test it.
Could you briefly explain how to apply the patch?
I'm on CE 2.7.0 and the list of recommended patches remains empty.
I also looked in the main pfSense repo on GitHub but couldn't find the patch.
So close to watching live television again, thanks!

Actions #21

Updated by Kristof Provost 9 months ago

This is the relevant commit: https://github.com/pfsense/FreeBSD-src/commit/f10efe9d5708cf2f385f17f6ed13909d84cea737

It's a kernel change, so you'll want a snapshot build. You can't just apply it yourself unless you're comfortable building and installing your own kernel.

Actions #22

Updated by Jim Pingle 8 months ago

  • Subject changed from IGMPProxy: kernel panic, Sleeping thread owns a non-sleepable lock to Kernel panic when running IGMP Proxy: Sleeping thread owns a non-sleepable lock

Updating subject for release notes.

Actions #23

Updated by Jim Pingle 8 months ago

  • Has duplicate Bug #14681: IGMP proxy cause crash on 23.05.1 added
Actions #24

Updated by Jim Pingle 6 months ago

  • Status changed from Feedback to Closed

Closing for lack of feedback. If it's still an issue in this release we can reopen and re-target the issue at the next release.

Actions #25

Updated by Jim Pingle about 2 months ago

  • Target version deleted (CE-Next)