Bug #4685

Crash/panic "Sleeping thread owns a non-sleepable lock"

Added by Jim Pingle over 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Operating System
Target version:
Start date: 05/07/2015
Due date:
% Done: 100%
Estimated time:
Affected Version: 2.2.x
Affected Architecture:

Description

Users have reported several similar panics. There appears to be an issue with BPF/ARP resolution at times that leads to a locking problem.

It has happened on both bare metal and virtual machines, so far only with em(4) and igb(4) NICs, though other NICs may be affected.

Sample backtrace:

Sleeping thread (tid 100067, pid 12) owns a non-sleepable lock
KDB: stack backtrace of thread 100067:
sched_switch() at sched_switch+0x2b3/frame 0xfffffe001c2d4120
mi_switch() at mi_switch+0xe1/frame 0xfffffe001c2d4160
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe001c2d4190
_sleep() at _sleep+0x287/frame 0xfffffe001c2d4210
filt_bpfread() at filt_bpfread+0x94/frame 0xfffffe001c2d4250
knote() at knote+0xdb/frame 0xfffffe001c2d42b0
catchpacket() at catchpacket+0x67b/frame 0xfffffe001c2d43a0
bpf_mtap() at bpf_mtap+0x1d0/frame 0xfffffe001c2d4410
igb_mq_start_locked() at igb_mq_start_locked+0xe4/frame 0xfffffe001c2d4470
igb_mq_start() at igb_mq_start+0x224/frame 0xfffffe001c2d44e0
ether_output() at ether_output+0x58d/frame 0xfffffe001c2d4550
arprequest() at arprequest+0x23a/frame 0xfffffe001c2d45d0
arpresolve() at arpresolve+0x3c2/frame 0xfffffe001c2d4640
ether_output() at ether_output+0x1e0/frame 0xfffffe001c2d46b0
ip_output() at ip_output+0x115b/frame 0xfffffe001c2d47b0
ip_forward() at ip_forward+0x347/frame 0xfffffe001c2d4860
ip_input() at ip_input+0x714/frame 0xfffffe001c2d48b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe001c2d4920
ether_demux() at ether_demux+0x149/frame 0xfffffe001c2d4950
ether_nh_input() at ether_nh_input+0x349/frame 0xfffffe001c2d49b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe001c2d4a20
igb_rxeof() at igb_rxeof+0x698/frame 0xfffffe001c2d4ad0
igb_msix_que() at igb_msix_que+0x16d/frame 0xfffffe001c2d4b20
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe001c2d4b60
ithread_loop() at ithread_loop+0x96/frame 0xfffffe001c2d4bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe001c2d4bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001c2d4bf0

More details in tickets VCQ-32476, TRD-65473, EYM-15868, and AIA-46143

The suggested workarounds are effective in some cases but not others:

net.bpf.zerocopy_enable=0
net.inet.ipsec.directdispatch=0
net.isr.dispatch=deferred

Still waiting on confirmation if all of the above help one particularly problematic system.
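For anyone who wants to try the three tunables listed above from a shell before making them permanent, here is a minimal sketch. The helper name apply_tunables and the DRY_RUN switch are made up for this example and are not part of pfSense. The sketch assumes all three OIDs can be changed at runtime with sysctl on the affected version; if one cannot, it would have to be set as a boot-time tunable instead.

```shell
#!/bin/sh
# Workaround tunables exactly as listed in this ticket.
TUNABLES="net.bpf.zerocopy_enable=0
net.inet.ipsec.directdispatch=0
net.isr.dispatch=deferred"

# apply_tunables is a hypothetical helper for illustration only.
# With DRY_RUN=1 (the default here) it just prints the commands;
# with DRY_RUN=0 it runs sysctl for each name=value pair (needs root).
apply_tunables() {
    for kv in $TUNABLES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sysctl $kv"
        else
            sysctl "$kv"
        fi
    done
}

apply_tunables
```

Run as-is, this only prints the three sysctl commands. Values set this way do not survive a reboot, so this is only useful for a quick test on an affected box.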

Paraphrasing Ermal's comments from a day or two ago: "That seems [to be] happening when an ARP entry is expired and needs a requery. He probably can solve that by putting net.isr to always queue (deferred). Since the driver on input is running on ISR context, you are grabbing a sleeping lock in that context which should not be allowed and the sleeping lock is probably the ARP code or the TX handler of the NIC"

Complete crash dumps are in the projects repo under "nonsleepablelock", with file names that start with the relevant ticket numbers.

History

#1 Updated by Jim Pingle over 4 years ago

Reports from customers indicate that crashes still occur even with net.bpf.zerocopy_enable=0 and net.isr.dispatch=deferred set.

#2 Updated by Ermal Luçi over 4 years ago

  • Status changed from New to Feedback

choparp was blocking on the BPF mutex and filling the BPF buffers, which led to a panic because of the ISR context in which the driver routines run.

#3 Updated by Jim Pingle over 4 years ago

For those who would like to test a version of choparp including Ermal's fixes, follow this procedure:

1. Stop the existing choparp process (will interrupt Proxy ARP VIP connectivity):

killall -9 choparp

2. Ensure the pkg utility is bootstrapped properly:

env ASSUME_ALWAYS_YES=yes pkg bootstrap -f

3. Install the updated utility:

pkg add http://files.atx.pfsense.org/jimp/pkg/`uname -m`/choparp-20021107_5.txz

4. Restart the choparp daemon by editing a Proxy ARP VIP, clicking Save and then Apply Changes.

#4 Updated by Jim Pingle over 4 years ago

  • Status changed from Feedback to Confirmed

One user still reports crashes with the new daemon. Updated crash dump is in the projects repo.

#6 Updated by Luke Hamburg over 4 years ago

I have experienced a lot of crashes (hard crash that triggers the box to reboot) on 2 different RCC-VE 2440 units (igb NIC) that are running 2.2.2 64bit - the crashes specifically happen when enabling/disabling the Captive Portal. It is very reproducible and happens soon after switching on the CP. If this is unrelated then please feel free to remove this comment otherwise could those possibly be related to this bug? I submitted the crash reports when prompted to do so by the WebConfigurator but unfortunately they get automatically deleted so I can't look at them anymore to compare to the above.

#7 Updated by Jim Pingle over 4 years ago

Luke Hamburg wrote:

I have experienced a lot of crashes (hard crash that triggers the box to reboot) on 2 different RCC-VE 2440 units (igb NIC) that are running 2.2.2 64bit - the crashes specifically happen when enabling/disabling the Captive Portal. It is very reproducible and happens soon after switching on the CP. If this is unrelated then please feel free to remove this comment otherwise could those possibly be related to this bug? I submitted the crash reports when prompted to do so by the WebConfigurator but unfortunately they get automatically deleted so I can't look at them anymore to compare to the above.

Without seeing the full crash report it's impossible to say if it's related. If you haven't already, please start a new forum thread and post the crash report there. Or if you have purchased support you can open a ticket with us and send in the crash report that way.

#8 Updated by Jim Pingle over 4 years ago

We are still seeing a steady stream of crashes on certain systems, and I've added more crash reports to the repo. Two of them have attempted all of the proposed workarounds with no relief.

#9 Updated by Ermal Luçi over 4 years ago

  • Status changed from Confirmed to Feedback

A patch has been committed to the tree.
Those who want to test it need to update to the next snapshot when it comes out.

#10 Updated by Luke Hamburg over 4 years ago

Jim P wrote:

Without seeing the full crash report it's impossible to say if it's related. If you haven't already, please start a new forum thread and post the crash report there. Or if you have purchased support you can open a ticket with us and send in the crash report that way.

Thanks Jim. Sorry to pollute this thread. But when the unit reboots, it immediately asks if I want to submit the bug report, and if I click YES the core dump/stack trace seems to get deleted. What's the proper way to preserve these and submit them to the developers?

#11 Updated by Chris Buechler over 4 years ago

  • Target version changed from 2.2.3 to 2.3
  • Affected Version changed from 2.2.2 to 2.2.x

Ermal suggested replicating with very low bpf buffers and high ARP traffic. I've had an arp-scan across one /16 and one /24 running in a loop for hours against 2.2.2 in that scenario, well into millions of ARP requests and replies, without being able to replicate. I'm leaving it running.

This may be resolved, though I haven't been able to replicate the issue to confirm or deny that.
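For anyone who wants to attempt a similar replication, the scenario can be sketched roughly as below. This is not the exact procedure used above. The interface name, target networks, buffer size, and the helper name print_repro_plan are all placeholders chosen for this example. It assumes arp-scan is installed on the traffic generator and that lowering net.bpf.bufsize is an acceptable way to get small BPF buffers; programs that set their own buffer length are not affected by that sysctl.

```shell
#!/bin/sh
# Hypothetical reproduction sketch; every value below is a placeholder.
IFACE="${IFACE:-em0}"
TARGETS="${TARGETS:-10.0.0.0/16 192.168.1.0/24}"
BPF_BUFSIZE="${BPF_BUFSIZE:-1024}"

# print_repro_plan only echoes the commands, so it is safe to run anywhere.
# The sysctl line is meant for the firewall under test, and the arp-scan
# lines are meant for a separate host on the same network segment.
print_repro_plan() {
    echo "sysctl net.bpf.bufsize=$BPF_BUFSIZE"
    for net in $TARGETS; do
        echo "arp-scan --interface=$IFACE $net"
    done
}

print_repro_plan
```

To keep ARP traffic flowing for hours, the printed arp-scan commands would be wrapped in a while loop on the generating host.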

#12 Updated by Jim Pingle over 4 years ago

  • Status changed from Feedback to New
  • Assignee changed from Ermal Luçi to George Neville-Neil

Customers are still reporting panics on 2.2.3 with all of the fixes thus far applied. The crash dump looks virtually identical; the new report is in the projects repo under nonsleepablelock/TRD-65473-error2.txt

#13 Updated by Renato Botelho about 4 years ago

  • Assignee changed from George Neville-Neil to Luiz Souza

#14 Updated by Renato Botelho about 4 years ago

  • Target version changed from 2.3 to 2.2.5

#15 Updated by Robert Olofsson about 4 years ago

I upgraded a pfSense 2.0.x machine yesterday to 2.2.4 and came across what I think was this issue. Unfortunately I didn't have any swap on the machine, so I was unable to capture the dump. The machine in question uses em NICs and has a high number of virtual IPs.

#16 Updated by Luiz Souza about 4 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

The real issue was tracked down and fixed in FreeBSD and pfSense. The fix will be included in the next release(s).
