Bug #12079: Kernel panic when running IGMP Proxy: Sleeping thread owns a non-sleepable lock - pfSense - pfSense bugtracker

Bug #12079

closed

Kernel panic when running IGMP Proxy: Sleeping thread owns a non-sleepable lock

Added by Steve Wheeler about 4 years ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Assignee: Mateusz Guzik
Category: IGMP Proxy
Target version:
Start date: 06/25/2021
Due date:
% Done:
Estimated time:
Plus Target Version: 23.09
Release Notes: Default
Affected Version:
Affected Architecture: All

Description

IGMPProxy can trigger a kernel panic in 2.5.2-RC.

db:0:kdb.enter.default>  show pcpu
cpuid        = 1
dynamic pcpu = 0xfffffe007dbe5380
curthread    = 0xfffff8005ca13000: pid 289 tid 100198 "igmpproxy" 
curpcb       = 0xfffff8005ca135a0
fpcurthread  = 0xfffff8005ca13000: pid 289 "igmpproxy" 
idlethread   = 0xfffff80004185740: tid 100004 "idle: cpu1" 
curpmap      = 0xfffff80077342138
tssp         = 0xffffffff83717688
commontssp   = 0xffffffff83717688
rsp0         = 0xfffffe001e3b0dc0
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff8371dea0
ldt          = 0xffffffff8371dee0
tss          = 0xffffffff8371ded0
tlb gen      = 40718
curvnet      = 0xfffff8000406db40
db:0:kdb.enter.default>  bt
Tracing pid 289 tid 100198 td 0xfffff8005ca13000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe001e3b0820
vpanic() at vpanic+0x197/frame 0xfffffe001e3b0870
panic() at panic+0x43/frame 0xfffffe001e3b08d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe001e3b0900
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe001e3b0950
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe001e3b09e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe001e3b0ab0
rip_ctloutput() at rip_ctloutput+0xf3/frame 0xfffffe001e3b0ae0
sosetopt() at sosetopt+0xe7/frame 0xfffffe001e3b0b40
kern_setsockopt() at kern_setsockopt+0xb0/frame 0xfffffe001e3b0ba0
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe001e3b0bc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe001e3b0cf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe001e3b0cf0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x8003b57ea, rsp = 0x7fffffffeba8, rbp

Booting with a debug kernel shows:

lock order reversal: (sleepable after non-sleepable)
 1st 0xffffffff83795300 IPv4 multicast interfaces (IPv4 multicast interfaces) @ /usr/home/mjg/git/netgate/FreeBSD-src/sys/netinet/ip_mroute.c:845
 2nd 0xfffff80004445178 iflib ctx lock (iflib ctx lock) @ /usr/home/mjg/git/netgate/FreeBSD-src/sys/net/iflib.c:4190
stack backtrace:
#0 0xffffffff80dd7021 at witness_debugger+0x71
#1 0xffffffff80d77387 at _sx_xlock+0x67
#2 0xffffffff80eac72f at iflib_if_ioctl+0x2df
#3 0xffffffff80e81107 at if_setflag+0xd7
#4 0xffffffff80f4cf02 at X_ip_mrouter_set+0x1642
#5 0xffffffff80f55783 at rip_ctloutput+0xf3
#6 0xffffffff80e0a75f at sosetopt+0xff
#7 0xffffffff80e0fa50 at kern_setsockopt+0xb0
#8 0xffffffff80e0f994 at sys_setsockopt+0x24
#9 0xffffffff8134a32e at amd64_syscall+0x2be
#10 0xffffffff81320c7e at fast_syscall_common+0xf8
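
For readers unfamiliar with the warning above, here is a minimal userspace sketch in Python of the rule it reports: a sleepable lock (the sx-style "iflib ctx lock") must not be acquired while a non-sleepable lock (the mutex-style "IPv4 multicast interfaces" lock) is held. This is only a model, not kernel code and not how WITNESS is implemented; `ModelLock`, `LockOrderError`, and `buggy_order` are hypothetical names introduced for illustration.

```python
import threading


class LockOrderError(RuntimeError):
    """Raised when a sleepable lock is requested while a non-sleepable one is held."""


# Per-thread bookkeeping of the locks currently held (hypothetical, for the model only).
_state = threading.local()


def held_locks():
    if not hasattr(_state, "locks"):
        _state.locks = []
    return _state.locks


class ModelLock:
    """Toy lock with a 'sleepable' flag: False models a mutex, True models an sx lock."""

    def __init__(self, name, sleepable):
        self.name = name
        self.sleepable = sleepable
        self._lock = threading.Lock()

    def acquire(self):
        # The check is made before blocking, so the violation is reported
        # even when the lock happens to be uncontended.
        if self.sleepable:
            for other in held_locks():
                if not other.sleepable:
                    raise LockOrderError(
                        f"sleepable after non-sleepable: {other.name} -> {self.name}"
                    )
        self._lock.acquire()
        held_locks().append(self)

    def release(self):
        held_locks().remove(self)
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()
        return False


# Names taken from the warning above; the flags reflect the lock types in the traces.
mroute = ModelLock("IPv4 multicast interfaces", sleepable=False)
ctx = ModelLock("iflib ctx lock", sleepable=True)


def buggy_order():
    # Same order as the reported reversal: non-sleepable first, then sleepable.
    with mroute:
        with ctx:
            pass
```

In this model `buggy_order()` raises `LockOrderError`, while taking `ctx` first and `mroute` second is accepted, which matches the direction named in the warning.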

Tested:

2.5.2-RC (amd64)
built on Fri Jun 25 03:01:13 EDT 2021
FreeBSD 12.2-STABLE

Files

1671705688783-textdump.tar (86 KB)

Jim Pingle, 12/22/2022 01:10 PM

Updated by Steve Wheeler about 4 years ago

Assignee set to Mateusz Guzik

Updated by Mateusz Guzik about 4 years ago

First, a note: to my understanding the bug is not easy to run into. However, booting a kernel with debug options easily reproduces the warning showing that the bug exists.

I think the most sensible thing for now is to put the bug on a back burner (reasons below).

The code got rewritten upstream in https://cgit.freebsd.org/src/commit/?id=d40cd26a86a79342d175296b74768dd7183fc02b . The rewrite replaced the lock at hand with a read-write lock, which suffers from the same problem. So far I don't know how feasible it is to fix.

Thus, in order to fix it in pfSense, one of the following has to be performed:
- fix the issue in the rewrite and backport both -- this is not really feasible in my opinion
- fix the code as found in pfSense -- given the impending rebase to a new FreeBSD, this would be writing code to be thrown away soon, and the rebase would be an immediate regression

Consequently I think the best course of action is to wait for the rebase.

Updated by Jim Pingle about 4 years ago

Target version changed from 2.5.2 to 2.6.0
Plus Target Version set to 21.09

Re-targeting this to 2.6.0/21.09

Updated by Jim Pingle almost 4 years ago

Plus Target Version changed from 21.09 to 22.01

Per Mateusz, this is still unresolved upstream in FreeBSD, even on HEAD. Moving target ahead.

Updated by Jim Pingle over 3 years ago

Target version changed from 2.6.0 to CE-Next
Plus Target Version changed from 22.01 to 22.05

Updated by Jim Pingle about 3 years ago

Plus Target Version changed from 22.05 to 22.09

Updated by Jim Pingle about 3 years ago

Plus Target Version changed from 22.09 to 22.11

Updated by Jim Pingle almost 3 years ago

Plus Target Version changed from 22.11 to 23.01

Updated by Mateusz Guzik over 2 years ago

The rebase to main happened and, as predicted in the previous comment, the bug is still there.

The most desirable scenario would be to convert all potential sx locks into rw locks, which would not conflict with multicast locking. This does not seem very feasible though.

I note that multicast locking as implemented right now is kind of dodgy and rather slow. From a quick glance, I think it can be reworked so that the paths which can't afford unbounded sleep don't even take the global multicast lock. Then it can be converted into an sx lock and the problem goes away, all while performance is improved.

This will be worked on later.

#10

Updated by Jim Pingle over 2 years ago

File 1671705688783-textdump.tar added
Target version changed from CE-Next to 2.7.0
Plus Target Version changed from 23.01 to 23.05

This is still broken in HEAD and on snapshots, moving forward to 23.05. The attached textdump has a bit more debug info/detail than the backtrace in the original description.

#11

Updated by Arturo de Vries over 2 years ago

I may have hit this same issue. My pfSense box has crashed three times in the last few months.
Due to my almost zero knowledge of FreeBSD, I could not pinpoint the cause.
Can someone with more insight confirm that this is actually the same issue?
Here is the relevant part from the crash report:
db:0:kdb.enter.default> show pcpu
cpuid        = 3
dynamic pcpu = 0xfffffe007f0f5200
curthread    = 0xfffff80007949000: pid 37532 tid 100171 "igmpproxy"
curpcb       = 0xfffff800079495a0
fpcurthread  = 0xfffff80007949000: pid 37532 "igmpproxy"
idlethread   = 0xfffff8000533e740: tid 100006 "idle: cpu3"
curpmap      = 0xfffff80096b09138
tssp         = 0xffffffff837198d8
commontssp   = 0xffffffff837198d8
rsp0         = 0xfffffe00004b3bc0
kcr3         = 0x8000000070984453
ucr3         = 0x8000000070985c53
scr3         = 0x70985c53
gs32p        = 0xffffffff837200f0
ldt          = 0xffffffff83720130
tss          = 0xffffffff83720120
tlb gen      = 17828449
curvnet      = 0xfffff8000508fa00
db:0:kdb.enter.default> bt
Tracing pid 37532 tid 100171 td 0xfffff80007949000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe00004b3620
vpanic() at vpanic+0x197/frame 0xfffffe00004b3670
panic() at panic+0x43/frame 0xfffffe00004b36d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe00004b3700
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe00004b3750
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe00004b37e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe00004b38b0
rip_ctloutput() at rip_ctloutput+0xf3/frame 0xfffffe00004b38e0
sosetopt() at sosetopt+0xe7/frame 0xfffffe00004b3940
kern_setsockopt() at kern_setsockopt+0xb0/frame 0xfffffe00004b39a0
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe00004b39c0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe00004b3af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00004b3af0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x8003b7dfa, rsp = 0x7fffffffebb8, rbp = 0x7fffffffebe0 ---
Assuming there is currently no workaround for this, is the only option to disable IGMP proxy (and say goodbye to live TV) until this is solved?

Cheers.

#12

Updated by Mateusz Guzik over 2 years ago

Hello. This is the same issue. I can't make promises, but it is possibly going to get fixed some time next month.

#13

Updated by Arturo de Vries about 2 years ago

Just had another hard crash. Had to reboot the system manually. Any news on this issue?
For the moment I have disabled IGMP proxy, hoping to stop these crashes.

#14

Updated by Mateusz Guzik about 2 years ago

Hey Arturo,

Thank you for your patience.

I wrote a highly experimental patch to sort it out. I don't know yet if it fixes the whole issue, and it has not received any testing past compilation.

That said, if everything goes right, the issue will be fixed this week. If not, probably the next one. :)

#15

Updated by Jim Pingle about 2 years ago

Plus Target Version changed from 23.05 to 23.09

#16

Updated by Jim Pingle about 2 years ago

Target version changed from 2.7.0 to CE-Next

#17

Updated by Arturo de Vries about 2 years ago

There seems to be little progress and a possible fix is being postponed.
I can't imagine that I'm the only one bumping into this.
Also, I do not understand how the igmp proxy package is still in the repo when it is clearly bugged.
The only fix as of now is disabling the igmp proxy, as the system has been stable ever since.
If there is some experimental patch available, I would be more than willing to test it.

#18

Updated by Kristof Provost almost 2 years ago

I believe this should also mitigate the problem: https://reviews.freebsd.org/D41209

The LOR occurs only, at least as far as I can see, if we hold the mroute lock while calling if_ioctl() on a network driver (via if_allmulti()). If we ensure we've released the mroute lock before we make that call, we should not run into this problem.

I've only been able to reproduce the LOR warning, not the actual panic though.
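
The idea described above, making the driver call only when the mroute lock is not held, can be sketched as a small userspace model in Python. This is not the D41209 change and not kernel code; `mroute_lock`, `vif_table`, `driver_ioctl`, `add_vif_before`, and `add_vif_after` are hypothetical names that only illustrate the two call shapes.

```python
import threading

# Stands in for the non-sleepable mroute lock (assumption: a plain mutex model).
mroute_lock = threading.Lock()

# Hypothetical table protected by mroute_lock.
vif_table = {}


def driver_ioctl(ifname, log):
    """Stand-in for if_ioctl(); records whether the mroute lock was held during the call."""
    log.append((ifname, mroute_lock.locked()))


def add_vif_before(vifi, ifname, log):
    # Problematic shape: the driver is called while the mroute lock is held,
    # so any sleepable lock the driver takes is acquired under it.
    with mroute_lock:
        driver_ioctl(ifname, log)
        vif_table[vifi] = ifname


def add_vif_after(vifi, ifname, log):
    # Shape described in the comment: make the driver call without the
    # mroute lock, then take the lock only for the table update.
    driver_ioctl(ifname, log)
    with mroute_lock:
        vif_table[vifi] = ifname
```

A possible usage: call both functions with a shared `log` list and compare the recorded lock state for each interface. One design caveat of this shape is that state may change between the driver call and the locked table update, so real code would likely need to re-validate after taking the lock; the sketch does not model that.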

#19

Updated by Kristof Provost almost 2 years ago

Status changed from New to Feedback

I've committed that patch and picked it to our branches. It'll be part of the next snapshot build.

#20

Updated by Arturo de Vries almost 2 years ago

Awesome Kristof, I'll be happy to test it.
Could you briefly explain how to apply the patch?
I'm on CE 2.7.0 and the list of recommended patches remains empty.
I also looked in the main pfSense repo on GitHub but couldn't find the patch.
So close to watching live television again, thanks!

#21

Updated by Kristof Provost almost 2 years ago

This is the relevant commit: https://github.com/pfsense/FreeBSD-src/commit/f10efe9d5708cf2f385f17f6ed13909d84cea737

It's a kernel change, so you'll want a snapshot build. You can't just apply it yourself unless you're comfortable building and installing your own kernel.

#22

Updated by Jim Pingle almost 2 years ago

Subject changed from IGMPProxy: kernel panic, Sleeping thread owns a non-sleepable lock to Kernel panic when running IGMP Proxy: Sleeping thread owns a non-sleepable lock

Updating subject for release notes.

#23

Updated by Jim Pingle almost 2 years ago

Has duplicate Bug #14681: IGMP proxy cause crash on 23.05.1 added

#24

Updated by Jim Pingle over 1 year ago

Status changed from Feedback to Closed

Closing for lack of feedback. If it's still an issue in this release we can reopen and re-target the issue at the next release.

#25

Updated by Jim Pingle over 1 year ago

Target version deleted (~~CE-Next~~)
