Bug #4685
Closed
Crash/panic "Sleeping thread owns a non-sleepable lock"
100%
Description
Several users have reported similar panics. There appears to be an issue with BPF/ARP resolution that at times leads to a locking problem.
It has happened on both bare metal and virtual machines, so far only with em(4) and igb(4) NICs, but possibly with others.
Sample backtrace:
Sleeping thread (tid 100067, pid 12) owns a non-sleepable lock
KDB: stack backtrace of thread 100067:
sched_switch() at sched_switch+0x2b3/frame 0xfffffe001c2d4120
mi_switch() at mi_switch+0xe1/frame 0xfffffe001c2d4160
sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe001c2d4190
_sleep() at _sleep+0x287/frame 0xfffffe001c2d4210
filt_bpfread() at filt_bpfread+0x94/frame 0xfffffe001c2d4250
knote() at knote+0xdb/frame 0xfffffe001c2d42b0
catchpacket() at catchpacket+0x67b/frame 0xfffffe001c2d43a0
bpf_mtap() at bpf_mtap+0x1d0/frame 0xfffffe001c2d4410
igb_mq_start_locked() at igb_mq_start_locked+0xe4/frame 0xfffffe001c2d4470
igb_mq_start() at igb_mq_start+0x224/frame 0xfffffe001c2d44e0
ether_output() at ether_output+0x58d/frame 0xfffffe001c2d4550
arprequest() at arprequest+0x23a/frame 0xfffffe001c2d45d0
arpresolve() at arpresolve+0x3c2/frame 0xfffffe001c2d4640
ether_output() at ether_output+0x1e0/frame 0xfffffe001c2d46b0
ip_output() at ip_output+0x115b/frame 0xfffffe001c2d47b0
ip_forward() at ip_forward+0x347/frame 0xfffffe001c2d4860
ip_input() at ip_input+0x714/frame 0xfffffe001c2d48b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe001c2d4920
ether_demux() at ether_demux+0x149/frame 0xfffffe001c2d4950
ether_nh_input() at ether_nh_input+0x349/frame 0xfffffe001c2d49b0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe001c2d4a20
igb_rxeof() at igb_rxeof+0x698/frame 0xfffffe001c2d4ad0
igb_msix_que() at igb_msix_que+0x16d/frame 0xfffffe001c2d4b20
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe001c2d4b60
ithread_loop() at ithread_loop+0x96/frame 0xfffffe001c2d4bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe001c2d4bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe001c2d4bf0
More details in tickets VCQ-32476, TRD-65473, EYM-15868, and AIA-46143
Suggested workarounds are effective in some cases but not others:
net.bpf.zerocopy_enable=0
net.inet.ipsec.directdispatch=0
net.isr.dispatch=deferred
Still waiting on confirmation of whether all of the above help on one particularly problematic system.
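For anyone trying the workarounds from a shell, the following is a minimal sketch of applying the three tunables listed above at runtime. It assumes all three are writable with sysctl(8) on the affected version, which has not been verified here. The `DRY_RUN` variable and the `apply_workarounds` function are illustrative names, not part of pfSense. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: apply the three suggested workaround tunables at runtime.
# DRY_RUN is a hypothetical switch; it defaults to 1, so the commands
# are printed rather than executed. Set DRY_RUN=0 to really apply them.
DRY_RUN="${DRY_RUN:-1}"

apply_workarounds() {
    for tunable in \
        net.bpf.zerocopy_enable=0 \
        net.inet.ipsec.directdispatch=0 \
        net.isr.dispatch=deferred
    do
        if [ "$DRY_RUN" = "1" ]; then
            # Dry run: show the command only.
            echo "sysctl $tunable"
        else
            # Assumption: the tunable is writable at runtime.
            sysctl "$tunable"
        fi
    done
}

apply_workarounds
```

Values set this way do not survive a reboot, so a persistent setting would have to be configured separately.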
Paraphrasing Ermal's comments from a day or two ago: "That seems [to be] happening when an ARP entry is expired and needs a requery. He probably can solve that by putting net.isr to always queue (deferred). Since the driver on input is running on ISR context, you are grabbing a sleeping lock in that context which should not be allowed and the sleeping lock is probably the ARP code or the TX handler of the NIC"
There are complete crash dumps in the projects repo under "nonsleepablelock" named starting with their relevant ticket numbers.
Updated by Jim Pingle over 9 years ago
Reports from customers indicate that crashes still occur even with net.bpf.zerocopy_enable=0
and net.isr.dispatch=deferred set.
Updated by Ermal Luçi over 9 years ago
- Status changed from New to Feedback
choparp was blocking on the bpf mutex, filling the BPF buffers, and causing panics because of the ISR context in which the driver routines run.
Updated by Jim Pingle over 9 years ago
For those who would like to test a version of choparp including Ermal's fixes, follow this procedure:
1. Stop the existing choparp process (will interrupt Proxy ARP VIP connectivity):
killall -9 choparp
2. Ensure the pkg utility is bootstrapped properly:
env ASSUME_ALWAYS_YES=yes pkg bootstrap -f
3. Install the updated utility:
pkg add http://files.atx.pfsense.org/jimp/pkg/`uname -m`/choparp-20021107_5.txz
4. Restart the choparp daemon by editing a Proxy ARP VIP, clicking Save and then Apply Changes.
Updated by Jim Pingle over 9 years ago
- Status changed from Feedback to Confirmed
One user still reports crashes with the new daemon. Updated crash dump is in the projects repo.
Updated by Chris Buechler over 9 years ago
FreeBSD PR is https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200323
Updated by luckman212 over 9 years ago
I have experienced a lot of crashes (hard crash that triggers the box to reboot) on 2 different RCC-VE 2440 units (igb NIC) that are running 2.2.2 64bit - the crashes specifically happen when enabling/disabling the Captive Portal. It is very reproducible and happens soon after switching on the CP. If this is unrelated then please feel free to remove this comment otherwise could those possibly be related to this bug? I submitted the crash reports when prompted to do so by the WebConfigurator but unfortunately they get automatically deleted so I can't look at them anymore to compare to the above.
Updated by Jim Pingle over 9 years ago
Luke Hamburg wrote:
I have experienced a lot of crashes (hard crash that triggers the box to reboot) on 2 different RCC-VE 2440 units (igb NIC) that are running 2.2.2 64bit - the crashes specifically happen when enabling/disabling the Captive Portal. It is very reproducible and happens soon after switching on the CP. If this is unrelated then please feel free to remove this comment otherwise could those possibly be related to this bug? I submitted the crash reports when prompted to do so by the WebConfigurator but unfortunately they get automatically deleted so I can't look at them anymore to compare to the above.
Without seeing the full crash report it's impossible to say if it's related. If you haven't already, please start a new forum thread and post the crash report there. Or if you have purchased support you can open a ticket with us and send in the crash report that way.
Updated by Jim Pingle over 9 years ago
Still seeing a steady stream of crashes on certain systems. I've added more crash reports to the repo. Two of them have attempted all of the proposed workarounds with no relief.
Updated by Ermal Luçi over 9 years ago
- Status changed from Confirmed to Feedback
Patch put on the tree.
Those who want to test need to update to the next snapshot.
Updated by luckman212 over 9 years ago
Jim P wrote:
Without seeing the full crash report it's impossible to say if it's related. If you haven't already, please start a new forum thread and post the crash report there. Or if you have purchased support you can open a ticket with us and send in the crash report that way.
Thanks Jim. Sorry to pollute this thread. But when the unit reboots, it immediately asks if I want to submit the bugreport and if I click YES the coredump/stack trace seems to get deleted. What's the proper way to preserve these and submit them to the developers?
Updated by Chris Buechler over 9 years ago
- Target version changed from 2.2.3 to 2.3
- Affected Version changed from 2.2.2 to 2.2.x
Ermal suggested replicating with very low bpf buffers and high ARP traffic. I've had an arp-scan across one /16 and one /24 running in a loop for hours against 2.2.2 in that scenario, well into millions of ARP requests and replies, without being able to replicate. I'm leaving it running.
This may be resolved, though I haven't been able to replicate the issue to confirm or deny that.
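The ARP load described above could be generated with a loop along these lines. This is only a sketch: the interface name, the two networks, the round count, and the `DRY_RUN` switch are placeholder assumptions, not the values used in the actual test. It covers only the arp-scan side, and lowering the BPF buffer size on the firewall under test is not shown.

```shell
#!/bin/sh
# Sketch: repeatedly ARP-scan one /16 and one /24 to generate ARP traffic.
# All values below are hypothetical placeholders.
IFACE="${IFACE:-em0}"
NETWORKS="${NETWORKS:-10.0.0.0/16 192.168.1.0/24}"
ROUNDS="${ROUNDS:-2}"
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 runs arp-scan.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

arp_flood() {
    round=0
    while [ "$round" -lt "$ROUNDS" ]; do
        # NETWORKS is intentionally unquoted so it splits on spaces.
        for net in $NETWORKS; do
            run arp-scan --interface="$IFACE" "$net"
        done
        round=$((round + 1))
    done
}

arp_flood
```

For a long soak test, `ROUNDS` would be raised to a large number, or the `while` condition replaced with an unconditional loop.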
Updated by Jim Pingle over 9 years ago
- Status changed from Feedback to New
- Assignee changed from Ermal Luçi to George Neville-Neil
Customers are still reporting panics on 2.2.3 with all of the fixes thus far applied. The crash dump looks virtually identical; a new report is in the projects repo under nonsleepablelock/TRD-65473-error2.txt
Updated by Renato Botelho over 9 years ago
- Assignee changed from George Neville-Neil to Luiz Souza
Updated by Renato Botelho over 9 years ago
- Target version changed from 2.3 to 2.2.5
Updated by Robert Olofsson over 9 years ago
I upgraded a pfSense 2.0.x machine yesterday to 2.2.4 and came across what I think was this issue. Unfortunately I didn't have any swap on the machine, so I was unable to capture the dump. The machine in question uses em NICs and has a high number of virtual IPs.
Updated by Luiz Souza over 9 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
The real issue was tracked down and fixed in FreeBSD and pfSense. It will be included in the next release(s).