Regression #11839
Panic on 21.05/2.6.0 snapshots when memory usage is high
Status: Closed
% Done: 100%
Description
On several systems (hardware and VMs) running Plus 21.05 and CE 2.6.0 snapshots I am seeing panics when the systems are experiencing high memory usage. Though memory usage alone is not always sufficient to induce a panic, the lower the memory on the system the easier it appears to be to trigger the condition.
On the system where I can reproduce it most reliably, the panic is easily triggered by an apparent bug in ospf6d which causes it to eat all available RAM after an interface event (see #11838 for details). On these systems all I need to do is save/apply on an assigned VTI interface taking part in OSPF6 or stop/start IPsec (not restart), and when IPsec reconnects it panics every time.
Another way to induce a panic in a system which is in a state where it's susceptible to panic is to run tail /dev/zero from an SSH or console shell prompt. That does not reliably induce a panic every time, however, even with multiple instances run in parallel. Thus I suspect there is some other compounding factor besides memory pressure which we haven't yet identified.
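For anyone who wants a more controlled form of memory pressure than tail /dev/zero, the following is a minimal sketch in C. It assumes that simply allocating and touching memory is an adequate stand-in for what tail does, which has not been verified here. The function name consume_memory and its parameters are hypothetical and do not come from any existing tool.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any existing tool): allocate memory in
 * chunks of chunk_bytes and write to every byte so the pages become
 * resident, stopping once limit_bytes would be exceeded or malloc() fails.
 * All memory is released before returning.
 * Returns the number of bytes that were allocated and touched.
 */
size_t consume_memory(size_t limit_bytes, size_t chunk_bytes)
{
    struct chunk { struct chunk *next; } *head = NULL;
    size_t total = 0;

    /* Each chunk must be able to hold the list link. */
    if (chunk_bytes < sizeof(struct chunk))
        chunk_bytes = sizeof(struct chunk);

    while (limit_bytes - total >= chunk_bytes) {
        struct chunk *c = malloc(chunk_bytes);
        if (c == NULL)
            break;                      /* allocator refused: stop here */
        memset(c, 0xA5, chunk_bytes);   /* touch the memory */
        c->next = head;                 /* keep it reachable until the end */
        head = c;
        total += chunk_bytes;
    }

    /* Release everything that was allocated. */
    while (head != NULL) {
        struct chunk *next = head->next;
        free(head);
        head = next;
    }
    return total;
}
```

Calling consume_memory with a limit slightly below the free RAM of the test VM, possibly from several processes at once, would let the amount of pressure be varied between runs instead of being all-or-nothing.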
Textdumps from the most easily reproducible system are attached. Most of the panic backtraces, though not all, are in pf, but that may just happen to be what it was busy doing at the time.
Updated by Jim Pingle over 3 years ago
- Subject changed from Panic on 21.05/2.6.0 snapshots when VM memory usage is high to Panic on 21.05/2.6.0 snapshots when memory usage is high
Updated by Jim Pingle over 3 years ago
Attaching another crash with a potentially more interesting backtrace.
Updated by Jim Pingle over 3 years ago
- File textdump-ESX-2.6.0-1.tar added
- File textdump-ESX-2.6.0-0.tar added
This continues to be simple to hit and quite annoying. Installs that worked fine for years all of a sudden can't run much beyond the base OS and remain stable.
Updated by Jim Pingle over 3 years ago
- File textdump-ESX-2.6.0-2.tar added
- File textdump-KVM-2.6.0-3.tar added
A couple more. I have additional ones I haven't posted as well... Not sure how helpful they might be at this point since they all seem fairly random.
Updated by Jim Pingle over 3 years ago
- Target version changed from 21.05 to 2.6.0
- Plus Target Version set to 21.05
Updated by Jim Pingle over 3 years ago
The attached configuration when loaded on a VM with 512MB of RAM can reproduce the panic reliably but with some variations in behavior. It leverages the OSPF6 bug to run the system out of RAM quickly. On some attempts ospf6d dies on its own (which is what should happen) but on other attempts it triggers a panic (no bueno).
Load the config on a fresh install and make sure FRR is installed and running (the config has it included). I would load the same config on a second unit as well so it will have at least one active OSPF6 neighbor. If you do that, make sure to adjust any system-specific parameters like the router ID in FRR OSPF6.
Once it's up and running:
- Navigate to Interfaces > WAN, click save and then click apply changes
- Wait about 20-30 seconds after applying.
- If it doesn't panic, check Status > Services and see if ospf6d is running. If not, restart it, then try again.
In most of my trials it panics on the second attempt. Occasionally I have to restart ospf6d after applying and then test again, resulting in it taking 3-4 attempts at most.
The process used to create the config was:
* Create VM with 512MB RAM
* Install pfSense Plus 21.05 RC (latest snap) or CE 2.6.0
* pfSsh.php playback enableallowallwan
* Enable SSH
* Update to current build (if available)
* Interfaces > Assignments, GIF tab, create a new GIF on WAN, doesn't need to work, just exist (e.g. WAN, 198.51.100.101, 10.103.111.1, 10.103.111.2, 30), save
* Interfaces > Assignments, assign the GIF, should be OPT1
* Interfaces > OPT1, Enable, Save/Apply
* Install FRR
* Services > FRR > OSPF6, Interfaces tab. Add WAN interface w/Area 0.0.0.0, save
* Services > FRR > OSPF6, Interfaces tab. Add LAN, save
* Services > FRR > OSPF6, Interfaces tab. Add OPT1, save
* OSPF6 tab, enable, set router ID to something sane, set Area to 0.0.0.0, save
* FRR Global/Zebra tab, enable, set a master password (e.g. "abc123"), save
Updated by Peter Grehan over 3 years ago
There are 3 signatures in the panics: I'd be interested in seeing more.
The KVM one is possibly fixed in FreeBSD-current (with 4174e45fb4320dc2), but it's more a symptom of low memory resulting in a rare allocation failure in pmap code.
2 of the ESX ones are the same: it seems to be a race in VM code between 2 threads. The code path has long been removed in FreeBSD-current, so this is perhaps another side effect of low memory. The 7100 crash has the same signature.
Thanks for the repro case: I'll give that a try.
Updated by Jim Pingle over 3 years ago
- File textdump-KVM-21.05-4.tar added
- File textdump-KVM-21.05-3.tar added
- File textdump-KVM-21.05-2.tar added
- File textdump-KVM-21.05-1.tar added
- File textdump-KVM-21.05-0.tar added
- File textdump-KVM-2.6.0-8.tar added
- File textdump-ESX-2.6.0-3.tar added
- File textdump-KVM-2.6.0-7.tar added
- File textdump-KVM-2.6.0-6.tar added
- File textdump-APU-2.6.0-0.tar added
Adding a few more I collected from a few misc installs during testing (some were deliberate crashes, others happened "naturally")
Updated by Peter Grehan over 3 years ago
Thanks. The majority of these are associated with the pf counter_u64 issue (anything with pf in the traceback).
However, some others may not be: the pmap backtraces are possibly associated with the fix in FreeBSD (4174e45fb4320dc2), and the uma_reclaim() ones are still unexplained.
Updated by Kristof Provost over 3 years ago
I believe these crashes all share the same root cause, which is that we (in certain places) mis-use the rule/state counters (we increment them directly rather than using the counter_u64 functions). Fixes have been pushed and are being tested.
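To illustrate the class of mistake described above, here is a small userspace sketch in C. It is a simulation only: the sim_* names and the fixed CPU count are hypothetical and are not the kernel's counter(9) API, though sim_counter_add and sim_counter_fetch are loosely modeled on counter_u64_add() and counter_u64_fetch(). Whether the actual pf defect looked like the misuse shown in the comment is an assumption.

```c
#include <stdint.h>
#include <stdlib.h>

#define SIM_NCPU 4   /* assumption: pretend the machine has four CPUs */

/* Simulated per-CPU counter. The handle is a pointer, not a number. */
struct sim_counter {
    uint64_t slot[SIM_NCPU];   /* one private slot per CPU */
};
typedef struct sim_counter *sim_counter_t;

/* Allocate a zeroed counter; returns NULL on allocation failure. */
sim_counter_t sim_counter_alloc(void)
{
    return calloc(1, sizeof(struct sim_counter));
}

void sim_counter_free(sim_counter_t c)
{
    free(c);
}

/* Correct update path: add v to the slot belonging to the given CPU. */
void sim_counter_add(sim_counter_t c, unsigned cpu, int64_t v)
{
    c->slot[cpu % SIM_NCPU] += (uint64_t)v;
}

/* Correct read path: sum the slots of all CPUs. */
uint64_t sim_counter_fetch(sim_counter_t c)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < SIM_NCPU; i++)
        sum += c->slot[i];
    return sum;
}

/*
 * The misuse, shown as a comment so the sketch stays safe to run:
 *
 *     sim_counter_t packets = sim_counter_alloc();
 *     packets++;    // compiles, but moves the handle instead of counting
 *
 * After that, every later add, fetch, or free goes through a pointer
 * that no longer refers to the allocated counter.
 */
```

In this simulation a direct increment compiles cleanly because the handle is a pointer type, which is one reason such a bug can survive review and then only surface under particular conditions.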
Updated by Jim Pingle over 3 years ago
- Status changed from New to Closed
- Assignee set to Kristof Provost
- % Done changed from 0 to 100
I've been aggressively attempting to crash the latest builds of 21.05 and 2.6.0 which include the fixes for this problem and thus far have had no success in triggering a panic. This is looking good to me. I could trigger it at-will a couple different ways before and now none of those methods lead to failures on any hardware or VM I try.
I'm willing to call this solved for the time being. If anything comes up I can reopen it.
Updated by Jim Pingle over 3 years ago
- Release Notes changed from Default to Force Exclusion
Excluding from release notes since it was a problem introduced by changes after the last release.
Updated by Jim Pingle over 3 years ago
- Target version changed from 2.6.0 to 2.5.2