Bug #4685

closed

Crash/panic "Sleeping thread owns a non-sleepable lock"

Added by Jim Pingle almost 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Operating System
Target version:
Start date: 05/07/2015
Due date:
% Done: 100%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: 2.2.x
Affected Architecture:

Description

Several users have reported similar panics. There appears to be an intermittent issue with BPF/ARP resolution that leads to a locking problem.

It has happened on both bare metal and virtual machines, so far only with em(4) and igb(4) NICs, though other NICs may be affected as well.

Sample backtrace:

Sleeping thread (tid 100067, pid 12) owns a non-sleepable lock
KDB: stack backtrace of thread 100067:
sched_switch() at sched_switch+0x2b3/frame 0xfffffe001c2d4120
mi_switch() at mi_switch+0xe1/frame 0xfffffe001c2d4160
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe001c2d4190
_sleep() at _sleep+0x287/frame 0xfffffe001c2d4210
filt_bpfread() at filt_bpfread+0x94/frame 0xfffffe001c2d4250
knote() at knote+0xdb/frame 0xfffffe001c2d42b0
catchpacket() at catchpacket+0x67b/frame 0xfffffe001c2d43a0
bpf_mtap() at bpf_mtap+0x1d0/frame 0xfffffe001c2d4410
igb_mq_start_locked() at igb_mq_start_locked+0xe4/frame 0xfffffe001c2d4470
igb_mq_start() at igb_mq_start+0x224/frame 0xfffffe001c2d44e0
ether_output() at ether_output+0x58d/frame 0xfffffe001c2d4550
arprequest() at arprequest+0x23a/frame 0xfffffe001c2d45d0
arpresolve() at arpresolve+0x3c2/frame 0xfffffe001c2d4640
ether_output() at ether_output+0x1e0/frame 0xfffffe001c2d46b0
ip_output() at ip_output+0x115b/frame 0xfffffe001c2d47b0
ip_forward() at ip_forward+0x347/frame 0xfffffe001c2d4860
ip_input() at ip_input+0x714/frame 0xfffffe001c2d48b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe001c2d4920
ether_demux() at ether_demux+0x149/frame 0xfffffe001c2d4950
ether_nh_input() at ether_nh_input+0x349/frame 0xfffffe001c2d49b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe001c2d4a20
igb_rxeof() at igb_rxeof+0x698/frame 0xfffffe001c2d4ad0
igb_msix_que() at igb_msix_que+0x16d/frame 0xfffffe001c2d4b20
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe001c2d4b60
ithread_loop() at ithread_loop+0x96/frame 0xfffffe001c2d4bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe001c2d4bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001c2d4bf0
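
For triaging the other reports, the pattern in the backtrace above can be checked mechanically. The following is a minimal sketch, not an official tool: it assumes the frame-line format shown in this ticket, and the names `FRAME_RE`, `parse_frames`, and `sleeps_under_ithread` are hypothetical helpers introduced only for illustration. It extracts the function names and reports whether `_sleep()` is reached from `ithread_loop()`.

```python
import re
from typing import List

# Matches frame lines in the format seen in this ticket, e.g.
#   _sleep() at _sleep+0x287/frame 0xfffffe001c2d4210
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")

# Excerpt of the backtrace from this ticket (frames omitted for brevity).
SAMPLE = """\
Sleeping thread (tid 100067, pid 12) owns a non-sleepable lock
KDB: stack backtrace of thread 100067:
_sleep() at _sleep+0x287/frame 0xfffffe001c2d4210
filt_bpfread() at filt_bpfread+0x94/frame 0xfffffe001c2d4250
knote() at knote+0xdb/frame 0xfffffe001c2d42b0
catchpacket() at catchpacket+0x67b/frame 0xfffffe001c2d43a0
bpf_mtap() at bpf_mtap+0x1d0/frame 0xfffffe001c2d4410
ithread_loop() at ithread_loop+0x96/frame 0xfffffe001c2d4bb0
"""


def parse_frames(backtrace: str) -> List[str]:
    """Return function names, innermost frame first; non-frame lines are skipped."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(1))
    return frames


def sleeps_under_ithread(frames: List[str]) -> bool:
    """True if _sleep() appears deeper in the stack than ithread_loop()."""
    if "_sleep" not in frames or "ithread_loop" not in frames:
        return False
    return frames.index("_sleep") < frames.index("ithread_loop")


if __name__ == "__main__":
    frames = parse_frames(SAMPLE)
    print(" <- ".join(frames))
    print("sleep reached from interrupt thread:", sleeps_under_ithread(frames))
```

Running the same check against the backtraces in the related tickets would show whether they share the `filt_bpfread()` to `_sleep()` path or only the panic message.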

More details are in tickets VCQ-32476, TRD-65473, EYM-15868, and AIA-46143.

The suggested workarounds are effective in some cases but not others:

net.bpf.zerocopy_enable=0
net.inet.ipsec.directdispatch=0
net.isr.dispatch=deferred
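
For testing, the three settings above can be turned into `sysctl` invocations. This is only a sketch: the `WORKAROUNDS` dict and the `sysctl_commands` and `apply_settings` helpers are hypothetical names, and it assumes each OID exists and is writable at runtime on the affected build, which this ticket does not confirm. By default it only prints the commands.

```python
import subprocess
from typing import Dict, List

# The three workaround values listed in this ticket.
WORKAROUNDS = {
    "net.bpf.zerocopy_enable": "0",
    "net.inet.ipsec.directdispatch": "0",
    "net.isr.dispatch": "deferred",
}


def sysctl_commands(settings: Dict[str, str]) -> List[List[str]]:
    """Build one `sysctl name=value` argument list per setting."""
    return [["sysctl", "{}={}".format(name, value)] for name, value in settings.items()]


def apply_settings(settings: Dict[str, str], dry_run: bool = True) -> None:
    """Print the commands, or run them as root when dry_run is False."""
    for cmd in sysctl_commands(settings):
        if dry_run:
            print(" ".join(cmd))
        else:
            # Assumption: the OID is runtime-writable; sysctl exits non-zero otherwise.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    apply_settings(WORKAROUNDS)  # dry run: prints the commands only
```

Values set this way do not survive a reboot, so a setting that helps would still need to be made persistent through the system's normal tunables mechanism.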

Still waiting on confirmation of whether all of the above help on one particularly problematic system.

Paraphrasing Ermal's comments from a day or two ago: "That seems [to be] happening when an ARP entry is expired and needs a requery. He probably can solve that by putting net.isr to always queue (deferred). Since the driver on input is running on ISR context, you are grabbing a sleeping lock in that context which should not be allowed and the sleeping lock is probably the ARP code or the TX handler of the NIC"
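
To make the direct-versus-deferred distinction in that comment concrete, here is a toy model in Python. It is not kernel code and does not model locks. The `ToyNetisr` class and the context labels are hypothetical, and it only illustrates the assumption behind the workaround: with direct dispatch the protocol handler runs in the caller's (interrupt) context, while with deferred dispatch the packet is queued and handled later by a separate thread.

```python
from collections import deque


class ToyNetisr:
    """Toy model of netisr dispatch policy; records which context ran the handler."""

    def __init__(self, policy: str) -> None:
        if policy not in ("direct", "deferred"):
            raise ValueError("policy must be 'direct' or 'deferred'")
        self.policy = policy
        self.queue = deque()
        self.log = []  # (packet, context) pairs, in handling order

    def dispatch(self, packet: str, context: str) -> None:
        if self.policy == "direct":
            # Handler runs immediately, in whatever context called dispatch().
            self._handle(packet, context)
        else:
            # Handler is postponed; the caller returns right away.
            self.queue.append(packet)

    def run_queue(self) -> None:
        """Drain queued packets from the (toy) netisr thread."""
        while self.queue:
            self._handle(self.queue.popleft(), "netisr thread")

    def _handle(self, packet: str, context: str) -> None:
        self.log.append((packet, context))


if __name__ == "__main__":
    direct = ToyNetisr("direct")
    direct.dispatch("pkt1", "interrupt thread")
    print(direct.log)

    deferred = ToyNetisr("deferred")
    deferred.dispatch("pkt1", "interrupt thread")
    deferred.run_queue()
    print(deferred.log)
```

In the direct case the whole `ip_input()` to `arpresolve()` to `bpf_mtap()` chain from the backtrace would run inside the NIC's interrupt thread, and the deferred setting is meant to avoid exactly that. Whether the deferred path is itself free of the locking problem is not something this sketch can show.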

Complete crash dumps are in the projects repo under "nonsleepablelock", with file names that begin with the relevant ticket numbers.
