Bug #14758

closed

``status_carp.php`` and ``diag_dump_states.php`` unresponsive with large state tables

Added by Kris Phillips about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Category:
Web Interface
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
23.09
Release Notes:
Default
Affected Version:
2.7.0
Affected Architecture:
All

Description

When attempting to load the CARP Status page or the States Diagnostics page in pfSense Plus while 2-3 million state table entries are present, the firewall fails to load either page and returns a 504 Gateway Timed Out error. This also happens when clicking the filtered state view link from a firewall rule to jump to the state table, regardless of how many states are in the filtered result (tested with just 1 state on a rule, and it still resulted in a timeout).

Each attempt also spawns a single pfctl process with the flags -vvss that consumes 100% of one CPU core. Every time you try to load one of these pages, a new process spawns and consumes another core at 100%. If the end user keeps trying to load these pages, this continues until all cores are consumed and the webConfigurator crashes. These processes do not exit on their own; they persist until they are killed with the "kill [pid]" command or the firewall is rebooted.

An example of this:
USER   PID  %CPU %MEM     VSZ     RSS TT STAT STARTED       TIME COMMAND
root    11 177.3  0.0       0     192  - RNL    14:59 5611:03.85 [idle]
root 25046 100.0  8.0 3167392 2630076  - R      23:22   55:35.18 /sbin/pfctl -vvss
root 28261 100.0  8.0 3167392 2629760  - R      23:52   26:02.90 /sbin/pfctl -vvss
root 41707 100.0  8.0 3167392 2629668  - R      00:02   15:47.69 /sbin/pfctl -vvss
root 42569 100.0  8.0 3167392 2630168  - R      23:14   63:15.87 /sbin/pfctl -vvss
root 60190 100.0  8.0 3167392 2630340  - R      23:02   76:06.92 /sbin/pfctl -vvss
root 66730 100.0  8.0 3167392 2629956  - R      23:32   45:36.44 /sbin/pfctl -vvss
root 70156 100.0  8.0 3167392 2629536  - R      00:16    1:13.20 /sbin/pfctl -vvss
root 93064 100.0  8.0 3167392 2630284  - R      23:06   71:31.45 /sbin/pfctl -vvss
root 98465 100.0  8.0 3173024 2634816  - RN     17:05  432:55.05 /sbin/pfctl -ss
root 45574  99.7  8.0 3167392 2630224  - R      23:10   67:21.82 /sbin/pfctl -vvss
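
A listing like the one above can be reduced to just the PIDs of the lingering state dumps. The following is a minimal sketch, not something taken from pfSense: `stuck_pfctl_pids` is a hypothetical helper name, and it assumes `ps aux`-style output where the PID is the second field and the command and its first argument are fields 11 and 12.

```shell
#!/bin/sh
# Hypothetical helper: read `ps aux`-style text on stdin and print the PID
# of every /sbin/pfctl process that was started to dump states
# (-vvss or -ss). Assumes COMMAND starts at field 11, as in the listing.
stuck_pfctl_pids() {
    awk '$11 == "/sbin/pfctl" && ($12 == "-vvss" || $12 == "-ss") { print $2 }'
}

# Usage (prints one PID per line, nothing if no such process exists):
#   ps auxww | stuck_pfctl_pids
```

Each printed PID can then be passed to `kill` by hand, which keeps the destructive step explicit.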

At the very least these processes should "stop trying" after the PHP code fails to complete, but we should also optimize the PHP code so that it does not load the entire state table every time these pages are loaded.
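
One way to make a command "stop trying" is to bound its runtime with `timeout(1)`. This is only a stopgap sketch and not the fix that was eventually merged for this issue; `run_bounded` is a hypothetical wrapper name, and it assumes a `timeout` utility is available on the system.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with an upper bound on its runtime.
# $1 is the limit in seconds; the remaining arguments are the command.
# timeout(1) exits with status 124 when it had to terminate the command.
run_bounded() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Usage (the 30 second limit is an arbitrary illustration):
#   run_bounded 30 /sbin/pfctl -vvss
```

A caller can check for exit status 124 and show an error instead of leaving a process spinning in the background.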

Actions #1

Updated by Steve Wheeler about 1 year ago

  • Project changed from pfSense Plus to pfSense
  • Category changed from Web Interface to Web Interface
  • Target version set to 2.8.0
  • Affected Plus Version deleted (23.05.1)
  • Plus Target Version set to 23.09
  • Affected Version set to 2.7.0

The command run on the CARP status page shows the list of creator IDs for all sync'd states:

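<?php
    // Local host ID, lowercased with leading zeros stripped.
    $my_id = strtolower(ltrim(filter_get_host_id(), '0'));
    // Dump every state verbosely, then reduce the output to the unique creator IDs.
    exec("/sbin/pfctl -vvss | /usr/bin/awk '/creatorid:/ {print $4;}' | /usr/bin/sort -u", $hostids);
    if (!is_array($hostids)) {
        $hostids = array();
    }
?>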
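
To show what that pipeline keeps and what it throws away, here is a small sketch that runs the same awk and sort steps on canned input. The sample text in `sample_states` is illustrative only and merely approximates the shape of verbose state output; both function names are hypothetical.

```shell
#!/bin/sh
# Illustrative input only: three made-up states, two sharing a creator ID.
sample_states() {
    cat <<'EOF'
all tcp 10.0.0.5:51234 -> 192.0.2.10:443       ESTABLISHED:ESTABLISHED
   age 00:01:00, expires in 23:59:00, 10:12 pkts, 900:4000 bytes, rule 80
   id: 0000000064e5a1b2 creatorid: 5e6f7a8b
all udp 10.0.0.6:5353 -> 192.0.2.53:53       MULTIPLE:SINGLE
   age 00:00:05, expires in 00:00:25, 1:1 pkts, 60:120 bytes, rule 81
   id: 0000000064e5a1b3 creatorid: 1a2b3c4d
all tcp 10.0.0.7:40000 -> 192.0.2.10:80       ESTABLISHED:ESTABLISHED
   age 00:02:00, expires in 23:58:00, 4:4 pkts, 300:300 bytes, rule 80
   id: 0000000064e5a1b4 creatorid: 5e6f7a8b
EOF
}

# Same reduction as the status page command: take the fourth field of each
# line containing "creatorid:" and keep one copy of each value.
unique_creator_ids() {
    awk '/creatorid:/ { print $4 }' | sort -u
}

# Usage:
#   sample_states | unique_creator_ids
```

Every state line has to be produced and scanned even though only a couple of distinct values survive, which is why the cost grows with the size of the state table.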

Actions #2

Updated by Kristof Provost about 1 year ago

Replicating what I said in Slack: it'd be good to attach truss to one of the pfctl processes, to see what it's doing. A `procstat -k <pid>` to get the kernel stack would also be interesting.

Aside from that we should also consider introducing a new ioctl for this, because retrieving all states to just get the hostids (likely just two different ones!) is a very slow operation, and we could make that a lot faster by doing the loop in the kernel instead.
To be clear: there's absolutely a bug here. These processes should return, and we're going to try to fix that, but there's also an opportunity to let systems avoid a whole lot of pointless work.

Actions #3

Updated by Kris Phillips about 1 year ago

Kristof Provost wrote in #note-2:

Replicating what I said in Slack: it'd be good to attach truss to one of the pfctl processes, to see what it's doing. A `procstat -k <pid>` to get the kernel stack would also be interesting.

Aside from that we should also consider introducing a new ioctl for this, because retrieving all states to just get the hostids (likely just two different ones!) is a very slow operation, and we could make that a lot faster by doing the loop in the kernel instead.
To be clear: there's absolutely a bug here. These processes should return, and we're going to try to fix that, but there's also an opportunity to let systems avoid a whole lot of pointless work.

Shell Output - procstat -k 45380

  PID    TID COMM                TDNAME              KSTACK
45380 100666 pfctl               -                   <running>
Actions #4

Updated by Kristof Provost about 1 year ago

So the lack of kernel stack as well as the lack of truss output (reported on Slack) would point in the direction of this being a userspace problem.
It's not immediately clear to me where we could be ending up in this loop in pfctl, but that's a useful clue at least.

Actions #5

Updated by Kristof Provost about 1 year ago

I believe the problem is that we're overflowing the size field in the DIOCGETSTATESV2 call, and that's causing confusion between the kernel and userspace, which results in userspace looping forever.

I'm going to see if we can change the signed int to an unsigned one, which will buy us a bit of breathing room. Once that's done I'm also going to see about replacing this awful pfctl -ss | awk | sort construct with something that summarises the relevant information directly in the kernel.
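
The effect of storing a large byte count in a signed 32-bit field can be illustrated with plain integer arithmetic. This is a sketch of the general overflow only, not of the actual pfctl or kernel code; `wrap_int32` is a hypothetical helper and it assumes the shell evaluates arithmetic with at least 64 bits.

```shell
#!/bin/sh
# Hypothetical helper: print the value a signed 32-bit integer would hold
# if the given non-negative byte count were stored in it.
wrap_int32() {
    n=$(( $1 & 0xFFFFFFFF ))          # keep the low 32 bits
    if [ "$n" -ge 2147483648 ]; then  # top bit set: value reads as negative
        n=$(( n - 4294967296 ))
    fi
    echo "$n"
}

# Usage:
#   wrap_int32 2147483647   # largest value that still fits
#   wrap_int32 2147483648   # one byte more reads back as negative
```

Here `wrap_int32 2147483647` prints `2147483647`, while `wrap_int32 2147483648` prints `-2147483648`. An unsigned 32-bit field holds values up to 4294967295, which is the extra breathing room mentioned above.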

Actions #6

Updated by Jim Pingle about 1 year ago

  • Subject changed from Status --> CARP and Diagnostics --> States Unresponsive with Large State Table to ``status_carp.php`` and ``diag_dump_states.php`` unresponsive with large state tables
  • Assignee set to Kristof Provost
  • Plus Target Version changed from 23.09 to 24.01

Bumping this ahead. It would be nice to fix but I don't think it's a release blocker.

Actions #7

Updated by Kristof Provost about 1 year ago

I have a fix for the infinite pfctl loop, and in-progress patches for the improved code to retrieve creator ids. It ought to still be possible for 23.09.

Actions #8

Updated by Kristof Provost about 1 year ago

I've merged the fix for the pfctl loop, as well as the new 'list creator ids' command.
https://gitlab.netgate.com/pfSense/pfSense/-/merge_requests/1075 is still required to actually make use of that.

Actions #9

Updated by Kris Phillips about 1 year ago

Kristof Provost wrote in #note-8:

I've merged the fix for the pfctl loop, as well as the new 'list creator ids' command.
https://gitlab.netgate.com/pfSense/pfSense/-/merge_requests/1075 is still required to actually make use of that.

Running the latest development build of 23.09 I applied the diff here as a patch. The State Creator Host IDs field is broken and empty. Reverting the patch restored functionality. Are there any other patches needed to test?

Actions #10

Updated by Kristof Provost about 1 year ago

You do also need the kernel and pfctl changes. I'm not sure if there's been a successful build since those landed.
The easiest way to verify is to see if `pfctl -sc` produces the creator id list.

Actions #11

Updated by Jim Pingle about 1 year ago

On a current snapshot the `pfctl -sc` changes are present and working on status_carp.php and the CLI. I pushed a small correction to ensure it's using the full path to tail in status_carp.php and re-added a sort to be sure the output is in the expected order. I didn't see anything in the pfctl code that sorted the output, but if it's there and I just missed it, then that sort command could come back out.

Actions #12

Updated by Jim Pingle about 1 year ago

  • Status changed from New to Feedback
Actions #13

Updated by Kristof Provost about 1 year ago

The kernel does not sort the list (and neither does pfctl). I had assumed that the sort was only there to ensure we had unique entries, so adding it again was the correct thing to do.
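
Since neither the kernel nor pfctl orders the list, the consumer has to do it. Below is a minimal sketch of that post-processing, with the loud assumption that the `pfctl -sc` output consists of one header line followed by one creator ID per line; that layout is inferred from the mention of `tail` above and is not confirmed in this thread. `creator_ids_from_sc` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read `pfctl -sc`-style text on stdin, drop the first
# line (ASSUMED to be a header), and return the remaining IDs sorted and
# de-duplicated.
creator_ids_from_sc() {
    tail -n +2 | sort -u
}

# Usage:
#   /sbin/pfctl -sc | creator_ids_from_sc
```

Using `sort -u` rather than plain `sort` costs nothing on a list this short and keeps the result unique even if the input ever repeats an ID.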

Actions #14

Updated by Jim Pingle about 1 year ago

  • Plus Target Version changed from 24.01 to 24.03
Actions #15

Updated by Jim Pingle about 1 year ago

  • Plus Target Version changed from 24.03 to 23.09
Actions #16

Updated by Jim Pingle about 1 year ago

  • Status changed from Feedback to Resolved

This has been working well since it went in.

Actions #17

Updated by Jim Pingle about 1 year ago

  • Target version changed from 2.8.0 to 2.7.1