Bug #12079

open

IGMPProxy: kernel panic, Sleeping thread owns a non-sleepable lock

Added by Steve Wheeler over 1 year ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: IGMP Proxy
Target version:
Start date: 06/25/2021
Due date:
% Done: 0%
Estimated time:
Plus Target Version: 23.05
Release Notes: Default
Affected Version:
Affected Architecture: All

Description

IGMPProxy can trigger a kernel panic in 2.5.2-RC.

db:0:kdb.enter.default>  show pcpu
cpuid        = 1
dynamic pcpu = 0xfffffe007dbe5380
curthread    = 0xfffff8005ca13000: pid 289 tid 100198 "igmpproxy" 
curpcb       = 0xfffff8005ca135a0
fpcurthread  = 0xfffff8005ca13000: pid 289 "igmpproxy" 
idlethread   = 0xfffff80004185740: tid 100004 "idle: cpu1" 
curpmap      = 0xfffff80077342138
tssp         = 0xffffffff83717688
commontssp   = 0xffffffff83717688
rsp0         = 0xfffffe001e3b0dc0
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff8371dea0
ldt          = 0xffffffff8371dee0
tss          = 0xffffffff8371ded0
tlb gen      = 40718
curvnet      = 0xfffff8000406db40
db:0:kdb.enter.default>  bt
Tracing pid 289 tid 100198 td 0xfffff8005ca13000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe001e3b0820
vpanic() at vpanic+0x197/frame 0xfffffe001e3b0870
panic() at panic+0x43/frame 0xfffffe001e3b08d0
propagate_priority() at propagate_priority+0x282/frame 0xfffffe001e3b0900
turnstile_wait() at turnstile_wait+0x30c/frame 0xfffffe001e3b0950
__mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame 0xfffffe001e3b09e0
X_ip_mrouter_set() at X_ip_mrouter_set+0x13a4/frame 0xfffffe001e3b0ab0
rip_ctloutput() at rip_ctloutput+0xf3/frame 0xfffffe001e3b0ae0
sosetopt() at sosetopt+0xe7/frame 0xfffffe001e3b0b40
kern_setsockopt() at kern_setsockopt+0xb0/frame 0xfffffe001e3b0ba0
sys_setsockopt() at sys_setsockopt+0x24/frame 0xfffffe001e3b0bc0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe001e3b0cf0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe001e3b0cf0
--- syscall (105, FreeBSD ELF64, sys_setsockopt), rip = 0x8003b57ea, rsp = 0x7fffffffeba8, rbp

Booting with a debug kernel shows:

lock order reversal: (sleepable after non-sleepable)
 1st 0xffffffff83795300 IPv4 multicast interfaces (IPv4 multicast interfaces) @ /usr/home/mjg/git/netgate/FreeBSD-src/sys/netinet/ip_mroute.c:845
 2nd 0xfffff80004445178 iflib ctx lock (iflib ctx lock) @ /usr/home/mjg/git/netgate/FreeBSD-src/sys/net/iflib.c:4190
stack backtrace:
#0 0xffffffff80dd7021 at witness_debugger+0x71
#1 0xffffffff80d77387 at _sx_xlock+0x67
#2 0xffffffff80eac72f at iflib_if_ioctl+0x2df
#3 0xffffffff80e81107 at if_setflag+0xd7
#4 0xffffffff80f4cf02 at X_ip_mrouter_set+0x1642
#5 0xffffffff80f55783 at rip_ctloutput+0xf3
#6 0xffffffff80e0a75f at sosetopt+0xff
#7 0xffffffff80e0fa50 at kern_setsockopt+0xb0
#8 0xffffffff80e0f994 at sys_setsockopt+0x24
#9 0xffffffff8134a32e at amd64_syscall+0x2be
#10 0xffffffff81320c7e at fast_syscall_common+0xf8
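
The reversal reported above comes down to one rule: a thread must not acquire a sleepable lock (here the iflib ctx lock, taken through _sx_xlock) while it still holds a non-sleepable one (here the IPv4 multicast interfaces lock). The following is a toy Python model of that rule, not FreeBSD's WITNESS implementation. The `Lock` and `Thread` classes and `replay_mrouter_set` are hypothetical names made up for illustration, and which lock is sleepable is inferred from the message and backtrace above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Lock:
    name: str
    sleepable: bool  # assumption: True for sx-style locks, False for mutex-style locks


@dataclass
class Thread:
    held: list = field(default_factory=list)

    def acquire(self, lock):
        """Record `lock` as held; return a warning string if the order is illegal."""
        warning = None
        if lock.sleepable:
            # A sleepable lock may block indefinitely, so no non-sleepable
            # lock may be held at this point.
            blockers = [h for h in self.held if not h.sleepable]
            if blockers:
                warning = (
                    "lock order reversal: (sleepable after non-sleepable) "
                    f"1st {blockers[-1].name}, 2nd {lock.name}"
                )
        self.held.append(lock)
        return warning

    def release(self, lock):
        self.held.remove(lock)


def replay_mrouter_set():
    """Replay the two acquisitions seen in the debug-kernel backtrace."""
    vif = Lock("IPv4 multicast interfaces", sleepable=False)
    ctx = Lock("iflib ctx lock", sleepable=True)
    td = Thread()
    first = td.acquire(vif)   # taken inside X_ip_mrouter_set
    second = td.acquire(ctx)  # if_setflag -> iflib_if_ioctl -> _sx_xlock
    return first, second
```

Calling `replay_mrouter_set()` returns no warning for the first acquisition and a reversal warning for the second, mirroring the order shown in the report.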

Tested:

2.5.2-RC (amd64)
built on Fri Jun 25 03:01:13 EDT 2021
FreeBSD 12.2-STABLE


Files

1671705688783-textdump.tar (86 KB), Jim Pingle, 12/22/2022 01:10 PM
Actions #1

Updated by Steve Wheeler over 1 year ago

  • Assignee set to Mateusz Guzik
Actions #2

Updated by Mateusz Guzik over 1 year ago

First, a note: to my understanding the bug is not easy to run into. However, booting a kernel with debug options readily reproduces the warning that shows the bug exists.

I think the most sensible thing for now is to put the bug on a back burner (reasons below).

The code was rewritten upstream in https://cgit.freebsd.org/src/commit/?id=d40cd26a86a79342d175296b74768dd7183fc02b . The rewrite replaced the lock at hand with a read-write lock, which suffers from the same problem. So far I don't know how feasible it is to fix.

Thus, fixing it in pfSense would require one of the following:
- fixing the issue in the rewrite and backporting both, which is not really feasible in my opinion
- fixing the code as found in pfSense; given the impending rebase to a new FreeBSD, this would mean writing code that will soon be thrown away, and the rebase would be an immediate regression

Consequently I think the best course of action is to wait for the rebase.

Actions #3

Updated by Jim Pingle over 1 year ago

  • Target version changed from 2.5.2 to 2.6.0
  • Plus Target Version set to 21.09

Re-targeting this to 2.6.0/21.09

Actions #4

Updated by Jim Pingle over 1 year ago

  • Plus Target Version changed from 21.09 to 22.01

Per Mateusz, this is still unresolved upstream in FreeBSD, even on HEAD. Moving target ahead.

Actions #5

Updated by Jim Pingle over 1 year ago

  • Target version changed from 2.6.0 to CE-Next
  • Plus Target Version changed from 22.01 to 22.05
Actions #6

Updated by Jim Pingle 9 months ago

  • Plus Target Version changed from 22.05 to 22.09
Actions #7

Updated by Jim Pingle 7 months ago

  • Plus Target Version changed from 22.09 to 22.11
Actions #8

Updated by Jim Pingle 4 months ago

  • Plus Target Version changed from 22.11 to 23.01
Actions #9

Updated by Mateusz Guzik about 2 months ago

The rebase to main happened and, as predicted in the previous comment, the bug is still there.

The most desirable scenario would be to convert all potential sx locks into rw locks, which would not conflict with multicast locking. This does not seem very feasible, though.

I note that multicast locking as implemented right now is kind of dodgy and rather slow. From a quick glance, I think it can be reworked so that the paths which can't afford an unbounded sleep don't even take the global multicast lock. The lock can then be converted into an sx lock and the problem goes away, all while performance is improved.

This will be worked on later.
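
The options discussed in these comments can be compared with a small toy model of the sleepable/non-sleepable rule. This is a sketch, not kernel code: the `SLEEPABLE` table, `order_ok`, `report`, and the scenario labels are hypothetical, and it assumes the usual FreeBSD classification in which mutexes and rw locks are non-sleepable while sx locks are sleepable.

```python
# Assumed classification: may a thread sleep while waiting for this kind of lock?
SLEEPABLE = {"mtx": False, "rw": False, "sx": True}


def order_ok(held_kind, wanted_kind):
    """True if a `wanted_kind` lock may be acquired while holding `held_kind`."""
    # The only forbidden combination is sleepable after non-sleepable.
    return SLEEPABLE[held_kind] or not SLEEPABLE[wanted_kind]


# (kind of the multicast lock held first, kind of the driver lock taken second)
SCENARIOS = {
    "original: mtx multicast lock, sx driver lock": ("mtx", "sx"),
    "upstream rewrite: rw multicast lock, sx driver lock": ("rw", "sx"),
    "driver sx locks converted to rw": ("rw", "rw"),
    "multicast lock converted to sx": ("sx", "sx"),
}


def report():
    """Map each scenario label to whether its lock order is allowed."""
    return {name: order_ok(*kinds) for name, kinds in SCENARIOS.items()}
```

Under these assumptions only the last two scenarios pass the check. Swapping the mutex for a rw lock changes nothing because both are non-sleepable, which is consistent with the rewrite suffering from the same problem.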

Actions #10

Updated by Jim Pingle about 2 months ago

This is still broken in HEAD and on snapshots, so moving the target forward to 23.05. The attached textdump has a bit more debug info/detail than the backtrace in the original description.
