Bug #4685
closed
Crash/panic "Sleeping thread owns a non-sleepable lock"
100%
Description
Several users have reported similar panics. There appears to be an issue with BPF/ARP resolution that at times leads to a locking problem.
It has happened on both bare metal and virtual machines, so far only with em(4) and igb(4) NICs, though other NICs may be affected.
Sample backtrace:
Sleeping thread (tid 100067, pid 12) owns a non-sleepable lock
KDB: stack backtrace of thread 100067:
sched_switch() at sched_switch+0x2b3/frame 0xfffffe001c2d4120
mi_switch() at mi_switch+0xe1/frame 0xfffffe001c2d4160
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe001c2d4190
_sleep() at _sleep+0x287/frame 0xfffffe001c2d4210
filt_bpfread() at filt_bpfread+0x94/frame 0xfffffe001c2d4250
knote() at knote+0xdb/frame 0xfffffe001c2d42b0
catchpacket() at catchpacket+0x67b/frame 0xfffffe001c2d43a0
bpf_mtap() at bpf_mtap+0x1d0/frame 0xfffffe001c2d4410
igb_mq_start_locked() at igb_mq_start_locked+0xe4/frame 0xfffffe001c2d4470
igb_mq_start() at igb_mq_start+0x224/frame 0xfffffe001c2d44e0
ether_output() at ether_output+0x58d/frame 0xfffffe001c2d4550
arprequest() at arprequest+0x23a/frame 0xfffffe001c2d45d0
arpresolve() at arpresolve+0x3c2/frame 0xfffffe001c2d4640
ether_output() at ether_output+0x1e0/frame 0xfffffe001c2d46b0
ip_output() at ip_output+0x115b/frame 0xfffffe001c2d47b0
ip_forward() at ip_forward+0x347/frame 0xfffffe001c2d4860
ip_input() at ip_input+0x714/frame 0xfffffe001c2d48b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe001c2d4920
ether_demux() at ether_demux+0x149/frame 0xfffffe001c2d4950
ether_nh_input() at ether_nh_input+0x349/frame 0xfffffe001c2d49b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe001c2d4a20
igb_rxeof() at igb_rxeof+0x698/frame 0xfffffe001c2d4ad0
igb_msix_que() at igb_msix_que+0x16d/frame 0xfffffe001c2d4b20
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe001c2d4b60
ithread_loop() at ithread_loop+0x96/frame 0xfffffe001c2d4bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe001c2d4bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001c2d4bf0
More details are in tickets VCQ-32476, TRD-65473, EYM-15868, and AIA-46143.
The suggested workarounds are effective in some cases but not others:
net.bpf.zerocopy_enable=0
net.inet.ipsec.directdispatch=0
net.isr.dispatch=deferred
Still waiting on confirmation of whether all of the above help on one particularly problematic system.
Paraphrasing Ermal's comments from a day or two ago: "That seems [to be] happening when an ARP entry is expired and needs a requery. He probably can solve that by putting net.isr to always queue (deferred). Since the driver on input is running on ISR context, you are grabbing a sleeping lock in that context which should not be allowed and the sleeping lock is probably the ARP code or the TX handler of the NIC"
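To illustrate why net.isr.dispatch=deferred might help, here is a toy Python model of the two dispatch policies. It is only a sketch, not kernel code: ToyNetisr, simulate, and the thread names "nic-ithread" and "netisr-worker" are hypothetical names invented for this example. It shows only which thread ends up running the protocol handler. In the backtrace above, everything from igb_rxeof() down to _sleep() runs on the interrupt thread, which matches the "direct" case here.

```python
import queue
import threading


class ToyNetisr:
    """Toy model of direct vs. deferred dispatch (hypothetical, not the kernel API)."""

    def __init__(self, policy):
        if policy not in ("direct", "deferred"):
            raise ValueError("policy must be 'direct' or 'deferred'")
        self.policy = policy
        self.work = queue.Queue()
        self.ran_on = []  # names of the threads that ran the handler
        self.worker = threading.Thread(
            target=self._drain, name="netisr-worker", daemon=True
        )
        self.worker.start()

    def _handler(self, packet):
        # Stand-in for protocol processing (ip_input and everything below it).
        self.ran_on.append(threading.current_thread().name)

    def _drain(self):
        # Worker loop: run queued packets until the None sentinel arrives.
        while True:
            packet = self.work.get()
            if packet is None:
                break
            self._handler(packet)

    def dispatch(self, packet):
        if self.policy == "direct":
            # Handler runs on the caller's stack, i.e. in the caller's context.
            self._handler(packet)
        else:
            # Handler is queued and runs later on the worker thread.
            self.work.put(packet)

    def shutdown(self):
        self.work.put(None)
        self.worker.join()


def simulate(policy):
    """Deliver one packet from a fake interrupt thread; return who handled it."""
    isr = ToyNetisr(policy)
    ithread = threading.Thread(
        target=isr.dispatch, args=("pkt",), name="nic-ithread"
    )
    ithread.start()
    ithread.join()
    isr.shutdown()
    return isr.ran_on


if __name__ == "__main__":
    print(simulate("direct"))    # ['nic-ithread']
    print(simulate("deferred"))  # ['netisr-worker']
```

In the deferred case the handler no longer runs on the interrupt thread's stack, so in this model any sleep during ARP resolution or the BPF tap would happen in the worker's context instead. Whether that fully avoids the panic on real systems is exactly what is still awaiting confirmation.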
There are complete crash dumps in the projects repo under "nonsleepablelock" named starting with their relevant ticket numbers.