Regression #11839

Panic on 21.05/2.6.0 snapshots when memory usage is high

Added by Jim Pingle about 2 months ago. Updated 19 days ago.

Status: Closed
Priority: Very High
Category: Operating System
Target version:
Start date: 04/22/2021
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 21.05
Release Notes: Force Exclusion
Affected Version: 2.6.0
Affected Architecture:

Description

On several systems (hardware and VMs) running Plus 21.05 and CE 2.6.0 snapshots I am seeing panics when the systems are experiencing high memory usage. Though memory usage alone is not always sufficient to induce a panic, the less memory a system has, the easier it appears to be to trigger the condition.

On the system where I can reproduce it most reliably, the panic is easily triggered by an apparent bug in ospf6d which causes it to eat all available RAM after an interface event (see #11838 for details). On these systems all I need to do is save/apply on an assigned VTI interface taking part in OSPF6, or stop/start IPsec (not restart); when IPsec reconnects, it panics every time.

Another way to induce a panic on a system which is already susceptible to it is to run tail /dev/zero from an SSH or console shell prompt. That does not induce a panic every time, however, even with multiple instances run in parallel. Thus I suspect there is some other compounding factor besides memory pressure which we haven't yet identified.
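As a more controllable alternative to tail /dev/zero, memory pressure can be applied with an explicit cap. The following is only a sketch: the function name, the chunk and cap sizes, and the 4096-byte page size are assumptions for illustration and were not part of the original testing.

```python
import time

PAGE_SIZE = 4096  # assumed page size; adjust for the platform under test


def apply_memory_pressure(chunk_mib=16, max_mib=256, hold_seconds=0.0):
    """Allocate memory in chunks up to max_mib, touch every page, then hold it.

    Returns the number of MiB that were actually allocated.
    """
    chunks = []  # keep references so nothing is freed while holding
    allocated = 0
    try:
        while allocated + chunk_mib <= max_mib:
            buf = bytearray(chunk_mib * 1024 * 1024)
            # Write one byte per page so the memory is really in use,
            # not just reserved.
            for offset in range(0, len(buf), PAGE_SIZE):
                buf[offset] = 1
            chunks.append(buf)
            allocated += chunk_mib
    except MemoryError:
        # The allocator refused; keep what was obtained so pressure persists.
        pass
    time.sleep(hold_seconds)
    return allocated
```

For example, apply_memory_pressure(max_mib=384, hold_seconds=60) would hold roughly 384 MiB for a minute. Unlike tail /dev/zero it stops at the cap instead of growing until the kernel intervenes, so it may be less likely to exhaust memory completely.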

Textdumps from the most easily reproducible system are attached. Almost all of the panic backtraces, though not every one, are in pf, but that may just happen to be what the system was busy doing at the time.

textdump-7551-21.05-3.tar (73.5 KB) Jim Pingle, 04/22/2021 08:45 AM
textdump-7551-21.05-4.tar (111 KB) Jim Pingle, 04/22/2021 08:45 AM
textdump-7551-21.05-1.tar (73.5 KB) Jim Pingle, 04/22/2021 08:45 AM
textdump-7551-21.05-0.tar (90 KB) Jim Pingle, 04/22/2021 08:45 AM
textdump-7551-21.05-2.tar (95.5 KB) Jim Pingle, 04/22/2021 08:45 AM
textdump-7100-21.05-0.tar (138 KB) Jim Pingle, 04/22/2021 10:11 AM
textdump-ESX-2.6.0-1.tar (95 KB) Jim Pingle, 04/27/2021 09:27 AM
textdump-ESX-2.6.0-0.tar (77.5 KB) Jim Pingle, 04/27/2021 09:27 AM
textdump-ESX-2.6.0-2.tar (142 KB) Jim Pingle, 05/07/2021 09:38 AM
textdump-KVM-2.6.0-3.tar (101 KB) Jim Pingle, 05/07/2021 09:38 AM
config-pfSense.home.arpa-20210518194823.xml (21.1 KB) Jim Pingle, 05/18/2021 02:57 PM
textdump-KVM-21.05-4.tar (142 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-KVM-21.05-3.tar (128 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-KVM-21.05-1.tar (100 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-KVM-21.05-0.tar (72 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-KVM-21.05-2.tar (114 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-KVM-2.6.0-8.tar (154 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-ESX-2.6.0-3.tar (154 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-KVM-2.6.0-6.tar (90.5 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-KVM-2.6.0-7.tar (126 KB) Jim Pingle, 05/19/2021 10:03 AM
textdump-APU-2.6.0-0.tar (84 KB) Jim Pingle, 05/19/2021 10:03 AM

History

#1 Updated by Jim Pingle about 2 months ago

  • Subject changed from Panic on 21.05/2.6.0 snapshots when VM memory usage is high to Panic on 21.05/2.6.0 snapshots when memory usage is high

#2 Updated by Jim Pingle about 2 months ago

Attaching another crash with a potentially more interesting backtrace.

#3 Updated by Jim Pingle about 2 months ago

This continues to be simple to hit and quite annoying. Installs that worked fine for years all of a sudden can't run much beyond the base OS and remain stable.

#4 Updated by Jim Pingle about 1 month ago

A couple more. I have additional ones I haven't posted as well... Not sure how helpful they might be at this point since they all seem fairly random.

#5 Updated by Jim Pingle about 1 month ago

  • Target version changed from 21.05 to 2.6.0
  • Plus Target Version set to 21.05

#6 Updated by Jim Pingle 27 days ago

The attached configuration, when loaded on a VM with 512MB of RAM, can reproduce the panic reliably, though with some variation in behavior. It leverages the OSPF6 bug to run the system out of RAM quickly. On some attempts ospf6d dies on its own (which is what should happen), but on other attempts it triggers a panic (no bueno).

Load the config on a fresh install and make sure FRR is installed and running (the config has it included). I would load the same config on a second unit as well so it will have at least one active OSPF6 neighbor. If you do that, make sure to adjust any system-specific parameters like the router ID in FRR OSPF6.

Once it's up and running:

  • Navigate to Interfaces > WAN, click Save, then click Apply Changes.
  • Wait about 20-30 seconds after applying.
  • If it doesn't panic, check Status > Services and see if ospf6d is running. If not, restart it, then try again.

In most of my trials it panics on the second attempt. Occasionally I have to restart ospf6d after applying and then test again, so it takes 3-4 attempts at most.
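The retry procedure above can be sketched as a small loop. This is only an outline of the manual steps: run_repro and all of its hook parameters are hypothetical names, and how each hook is implemented (driving the GUI, checking the service, detecting the panic) is left to the tester.

```python
def run_repro(apply_wan, panicked, ospf6d_running, restart_ospf6d,
              wait=lambda seconds: None, max_attempts=4, settle_seconds=30):
    """Repeat the save/apply step until a panic is seen or attempts run out.

    All callables are hypothetical hooks supplied by the tester:
      apply_wan()       save and apply changes on the WAN interface
      panicked()        return True if the system under test panicked
      ospf6d_running()  return True if ospf6d is still running
      restart_ospf6d()  restart ospf6d before the next attempt
      wait(seconds)     pause after applying, e.g. time.sleep

    Returns the attempt number on which the panic occurred, or None.
    """
    for attempt in range(1, max_attempts + 1):
        apply_wan()
        wait(settle_seconds)
        if panicked():
            return attempt
        if not ospf6d_running():
            # ospf6d died on its own instead of taking the system down.
            restart_ospf6d()
    return None
```

Passing time.sleep as wait gives the 20-30 second pause described above. The default of four attempts simply mirrors the "3-4 attempts at most" observation and is not a guarantee.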

The process used to create the config was:

  • Create VM with 512MB RAM
  • Install pfSense Plus 21.05 RC (latest snap) or CE 2.6.0
  • pfSsh.php playback enableallowallwan
  • Enable SSH
  • Update to current build (if available)
  • Interfaces > Assignments, GIF tab, create a new GIF on WAN. It doesn't need to work, just exist (e.g. WAN, 198.51.100.101, 10.103.111.1, 10.103.111.2, 30), save
  • Interfaces > Assignments, assign the GIF, should be OPT1
  • Interfaces > OPT1, Enable, Save/Apply
  • Install FRR
  • Services > FRR > OSPF6, Interfaces tab. Add WAN interface w/Area 0.0.0.0, save
  • Services > FRR > OSPF6, Interfaces tab. Add LAN, save
  • Services > FRR > OSPF6, Interfaces tab. Add OPT1, save
  • OSPF6 tab, enable, set router ID to something sane, set Area to 0.0.0.0, save
  • FRR Global/Zebra tab, enable, set a master password (e.g. "abc123"), save

#7 Updated by Peter Grehan 27 days ago

There are 3 signatures in the panics: I'd be interested in seeing more.

The KVM one is possibly fixed in FreeBSD-current (with 4174e45fb4320dc2), but it's more a symptom of low memory resulting in a rare allocation failure in pmap code.

Two of the ESX ones are the same: it seems to be a race in VM code between two threads. That code path was removed from FreeBSD-current long ago, so this is perhaps another side effect of low memory. The 7100 crash has the same signature.

Thanks for the repro case: I'll give that a try.

#9 Updated by Peter Grehan 26 days ago

Thanks. The majority of these are associated with the pf counter_u64 issue (anything with pf in the traceback).

However, some others may not be: the pmap backtraces are possibly associated with the fix in FreeBSD (4174e45fb4320dc2), and the uma_reclaim() ones are still unexplained.
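For sorting a larger pile of textdumps into these groups, a rough triage along the lines described above could look like the following. It is a sketch only: the function names and regular expressions are assumptions chosen for illustration, and the strings in the usage note are made-up frames, not lines from the attached textdumps.

```python
import re
from collections import Counter

# Checked in order. pf comes first because anything with pf in the
# traceback is attributed to the pf counter issue.
# These patterns are illustrative assumptions, not an official taxonomy.
SIGNATURES = [
    ("pf", re.compile(r"\bpf_\w+")),            # \b keeps bpf_* from matching
    ("pmap", re.compile(r"\bpmap_\w+")),
    ("uma_reclaim", re.compile(r"\buma_reclaim")),
]


def classify_backtrace(text):
    """Return the first matching signature name, or 'unclassified'."""
    for name, pattern in SIGNATURES:
        if pattern.search(text):
            return name
    return "unclassified"


def summarize(backtraces):
    """Count how many backtraces fall into each signature group."""
    return Counter(classify_backtrace(text) for text in backtraces)
```

With made-up input such as summarize(["pf_test()", "pmap_enter()", "uma_reclaim()"]), each group gets a count of one. Anything that lands in "unclassified" would still need a manual look.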

#10 Updated by Kristof Provost 26 days ago

I believe these crashes all share the same root cause, which is that, in certain places, we misuse the rule/state counters: we increment them directly rather than using the counter_u64 functions. Fixes have been pushed and are being tested.

#11 Updated by Jim Pingle 25 days ago

  • Status changed from New to Closed
  • Assignee set to Kristof Provost
  • % Done changed from 0 to 100

I've been aggressively attempting to crash the latest builds of 21.05 and 2.6.0, which include the fixes for this problem, and thus far have had no success in triggering a panic. This is looking good to me. I could trigger it at will a couple of different ways before, and now none of those methods lead to failures on any hardware or VM I try.

I'm willing to call this solved for the time being. If anything comes up I can reopen it.

#12 Updated by Jim Pingle 21 days ago

  • Release Notes changed from Default to Force Exclusion

Excluding from release notes since it was a problem introduced by changes after the last release.

#13 Updated by Jim Pingle 19 days ago

  • Target version changed from 2.6.0 to 2.5.2
