Project

General

Profile

Actions

Regression #12827

closed

High latency and packet loss during a filter reload

Added by Michael Novotny almost 3 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Rules / NAT
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
Plus Target Version:
22.05
Release Notes:
Default
Affected Version:
2.6.0
Affected Architecture:
All

Description

Every 15 minutes I am seeing 2 seconds latency that disrupts VPN, VoIP between sites, video conferencing, etc.

I have narrowed it down to the cronjob "/etc/rc.filter_configure_sync" that runs every 15 minutes.

The latency also occurs on demand when manually ran and also occurs with both of our HA appliances.


Files

rc_filter_configure-latency.png (157 KB) rc_filter_configure-latency.png Michael Novotny, 02/18/2022 02:28 PM
Actions #1

Updated by Steve Wheeler almost 3 years ago

  • Tracker changed from Bug to Regression
  • Subject changed from Major latency when /etc/rc.filter_configure_sync runs to High latency when reloading the ruleset
  • Priority changed from Normal to High
  • Target version set to Plus-Next
  • Affected Architecture All added
  • Affected Architecture deleted (7100)

I have replicated this with a generated ruleset between 21.05.2 and 22.01:

[22.01-RELEASE][root@5100.stevew.lan]/root: pfctl -sr | wc -l
    1121
[22.01-RELEASE][root@5100.stevew.lan]/root: time pfctl -f /tmp/rules.debug
0.520u 2.295s 0:02.81 100.0%    203+182k 1+0io 0pf+0w

[21.05.2-RELEASE][root@5100.stevew.lan]/root: pfctl -sr | wc -l
    1116
[21.05.2-RELEASE][root@5100.stevew.lan]/root: time pfctl -f /tmp/rules.debug
0.302u 0.270s 0:00.57 100.0%    202+176k 0+0io 0pf+0w

Actions #2

Updated by Steve Wheeler almost 3 years ago

  • Project changed from pfSense Plus to pfSense
  • Category changed from Unknown to Unknown
  • Target version changed from Plus-Next to CE-Next
  • Affected Plus Version deleted (22.01)
  • Plus Target Version set to Plus-Next
  • Affected Version set to 2.6.0
Actions #3

Updated by Steve Wheeler almost 3 years ago

  • Target version changed from CE-Next to 2.7.0
  • Plus Target Version changed from Plus-Next to 22.05
Actions #4

Updated by Kristof Provost almost 3 years ago

  • Assignee set to Kristof Provost

I strongly suspect https://github.com/pfsense/FreeBSD-src/commit/a5a03901798c76f1f7c77535a2282a60f54b0ec2 is the main culprit here.
The easiest improvement will be to move this work out from the PF_RULE_WLOCK. That may affect the how accurately we copy over the counters, but a small inaccuracy there is far less of an issue than stalling traffic.
While we're looking at this we should also see if there's more work that can be moved out from the locked section.

Actions #5

Updated by Michael Novotny almost 3 years ago

FYI. This latency also occurs when any rules, traffic shaper, etc. (anything that reloads the rules) are applied/modified, not just the scheduled cron.

Actions #6

Updated by Fabio Giudici almost 3 years ago

I can confirm that any rules roload introduces high latency. Even the shutdown of the sync interface (that as far as I can see fires a rule reload on the active member) gets this.
The version affected are both 2.6.0 and 22.01 plus.
We have an ha cluster who has 3960 rules that introduces like 20 seconds of latency for that...

Actions #7

Updated by Flole Systems almost 3 years ago

The current approach of the code mentioned by Kristof is bad in so many ways: There is a lock and within that lock there is a loop within a loop, so the complexity is exponential. These loops and comparison should be moved outside of the lock, then a new list could be prepared which contains old and new rules so within the lock a fast copy operation could be done with no need to iterate multiple times over the entire ruleset while holding a lock.

That's of course assuming that it is actually intended and necessary to copy while holding a lock.

Actions #8

Updated by Flole Systems almost 3 years ago

To add to this: Removing the "set keepcounters" option from /etc/inc/filter.inc seems to fix it. So if someone doesn't care about the counters, just remove that line for now.

Actions #9

Updated by Michael Novotny almost 3 years ago

Flole Systems wrote in #note-8:

To add to this: Removing the "set keepcounters" option from /etc/inc/filter.inc seems to fix it. So if someone doesn't care about the counters, just remove that line for now.

I can confirm that commenting out $rules .= "set keepcounters\n"; on line 433 of /etc/inc/filter.inc does indeed stop the high latency from occurring. According to hrping, my latency went from over 2700ms to between 50-275ms when the filter applied, then back to normal single digit latency once script was finished.

If you are running HA, it would need to be done on all firewalls as a fix in the mean time.

Actions #10

Updated by Mateusz Guzik almost 3 years ago

  • Assignee changed from Kristof Provost to Mateusz Guzik

I had a look at the issue with a profiler. While the loop you are mentioning is a problem to some extent, the real issue is what the loop is doing inside (md5 hashing of all rules).

%SAMP IMAGE      FUNCTION                       CALLERS
 23.6 kernel     lock_delay                     __mtx_lock_sleep
 16.6 kernel     MD5Transform                   MD5Update:11.2 MD5Final:5.4
 12.4 kernel     MD5Update                      pf_hash_rule:9.1 pf_hash_rule_addr:3.1
  9.3 kernel     sched_idletd                   fork_exit
  6.2 kernel     memcpy_erms                    pf_hash_rule:4.0 pf_hash_rule_addr:1.5
  1.7 kernel     strlen                         pf_hash_rule
  1.6 kernel     rn_addmask                     rn_lookup
  1.5 kernel     rn_match                       rn_lookup
  1.4 kernel     pf_hash_rule                   pf_commit_rules
  1.1 libc.so.7  __free                         SHA512_224_Update
  0.8 kernel     smp_rendezvous_action
  0.7 kernel     rn_insert                      rn_addroute
  0.7 kernel     rn_walktree                    pfr_ina_commit
  0.6 kernel     strcmp
  0.6 kernel     memset_erms                    uma_zalloc_arg
  0.6 kernel     cpu_idle                       sched_idletd
  0.6 kernel     MD5Final                       pf_commit_rules
  0.5 libc.so.7  inet_ntoa

The lock_delay thing is an attempt to get any traffic.

This particular issue can be mostly solved by pre-computing the hash so that a comparison is no longer so expensive. But even then, as you noted, that's a lot of work.

Complete fix would rework the entire ruleset loading mechanism to do it in one go, as currently it is performed a series of syscalls, each of which temporarily takes the rules lock for writing.

I expect to have some patches next week.

Taking the ticket as discussed with Kristof.

Actions #11

Updated by Flole Systems almost 3 years ago

I don't even fully understand why there's hashing going on instead of comparing directly, that doesn't really make any sense and as you noted wastes CPU time. Rewriting that compare function so it compares and returns 0 as soon as there's a mismatch, otherwise 1 would accelerate this a lot already while also eliminating the (low) risk of a hash collision.

All we really need though is a map of old/new rules though so we can iterate over the new rules and copy the old ones, right? That could get prepared outside of the lock and then within the lock that list gets iterated over and the counters get copied over. It just needs to be made sure that the ruleset is not modified in the meantime.

But maybe it's not even necessary to stop traffic passing through, there's an atomic_add function in FreeBSD so you could make the new ruleset active and then add the old count while there's already traffic passing through again. You'd just need to block the statistic retrieving until the counter adding is done so there is no chance invalid values are retrieved.

The entire freeing of the old rules can probably be kicked out of the lock aswell.
I'd really appreciate it if you find other ways to further optimize the time in the lock by looking at other parts of the code and reduce reload latency spikes even further.

Actions #12

Updated by Mateusz Guzik almost 3 years ago

As a status update I added a red-black tree so that rules can be looked up cheaper. Pre-computed md5 hash is used as the key.

With this in place on my test box with ~3100 rules:
before: 0.81s user 8.47s system 99% cpu 9.284 total
after: 0.89s user 0.44s system 99% cpu 1.332 total

At the same time amount of work done with the lock held was reduced.

I'm going to be able to ship you a test kernel early next week. As is patches have rough edges which may end up crashing in production environment.

Actions #13

Updated by Flole Systems almost 3 years ago

Why is there any need for hashing? You want to compare rules if I understand that correctly, there's no need to hash anything....

Also as I said the atomic add operation could be used here, that would lower the time that lock is held aswell.

Also have you compared to a case where the keepcounters-option isn't set? We're looking at a negative performance impact of that addition, so even if it got better now it might still be a lot slower than previously.

Actions #14

Updated by Kris Phillips almost 3 years ago

Have run into this bug twice with customers, once with a standalone firewall that had 200+ interfaces and another with an HA pair with a large ruleset.

Symptoms include:
1. VGA/Serial output delays/lag in input
2. High Load Averages on one or (if applicable) both firewalls
3. webConfigurator not responding or slow

Applying the recommended workaround patch in System Patches makes this issue go away.

Actions #15

Updated by Kevin Bentlage over 2 years ago

Have the same issues on our PFSense 2.6.0 cluster (2 members) after upgrading from 2.5.2.

Firewalls have 75 interfaces and somewhere around 1500 rules loaded.

Symptoms:

1. Console delay/lag in input
2. High load spikes (caused by pfctl)
3. Temporary disconnects of the vlans / interfaces. Results in downtime on our networks.
4. OpenVPN connections are dropped / disconnected
5. WebConfigurator not responding or slow (Nginx 503 errors). After restarting the WebConfigurator and PHP-FPM services the issues are resolved (for a short period of time)
6. (re)loading the firewall rules are taking really long, on 2.6.0 this takes around 5 minutes. On 2.5.2 this took 10 - 20 seconds.
7. CARP failover is flapping between the 2 member nodes.

I've applied the workaround patch in "System Patches" and this seem to solve most of the issues. The WebConfigurator is still slow/unstable and the (re)loading the firewall rules ("Configuring firewall...") on boot is taking a few minutes.

Actions #16

Updated by Kris Phillips over 2 years ago

Wanted to add additional observations from situations I've seen this issue crop up:

1. pfBlockerNG causes this with large rulesets as it reloads the filter on a cron schedule
2. HA configurations with XMLRPC Sync run into this because the primary and secondary reload filters as well

Actions #17

Updated by Mateusz Guzik over 2 years ago

Apologies for late reply, other things got in the way.

Flole Systems wrote in #note-13:

Why is there any need for hashing? You want to compare rules if I understand that correctly, there's no need to hash anything....

Pre-hashing allows for cheaper comparison -- only the hash, not the entire rule, which is hundreds of bytes.

Also as I said the atomic add operation could be used here, that would lower the time that lock is held aswell.

Atomic operations are of no use here I'm afraid. The counter is distributed per-CPU (that is each CPU has its own). Currently the code holds the rules lock for writing in part to be able to obtain a stable snapshot without losing any actions as they happen to pass through while configuration change is taking place. As counters are specifically allocated to work this way, the real fix would simply avoid allocating another set for the new config and instead reusing them when a matching rule is found. This unfortunately requires some surgery which I elected not to do at this time.

Also have you compared to a case where the keepcounters-option isn't set? We're looking at a negative performance impact of that addition, so even if it got better now it might still be a lot slower than previously.

I did not benchmark this specifically. I managed to straight up remove some of the work previously done, so I would expect the net single-threaded reload time to be about the same despite hashing (if not slightly faster). What really matters here is that the total time the rules lock is held is reduced significantly, meaning it stalls other cores to a much smaller extent and it may be perfectly tolerable now.

If things turn out to still be too slow, I can continue the rework to make loading happen with virtually no stall for other cores. That is however time consuming to implement.

Actions #18

Updated by Mateusz Guzik over 2 years ago

Kevin Bentlage wrote in #note-15:

Have the same issues on our PFSense 2.6.0 cluster (2 members) after upgrading from 2.5.2.

Firewalls have 75 interfaces and somewhere around 1500 rules loaded.

Symptoms:

1. Console delay/lag in input
2. High load spikes (caused by pfctl)
3. Temporary disconnects of the vlans / interfaces. Results in downtime on our networks.
4. OpenVPN connections are dropped / disconnected
5. WebConfigurator not responding or slow (Nginx 503 errors). After restarting the WebConfigurator and PHP-FPM services the issues are resolved (for a short period of time)
6. (re)loading the firewall rules are taking really long, on 2.6.0 this takes around 5 minutes. On 2.5.2 this took 10 - 20 seconds.
7. CARP failover is flapping between the 2 member nodes.

I've applied the workaround patch in "System Patches" and this seem to solve most of the issues. The WebConfigurator is still slow/unstable and the (re)loading the firewall rules ("Configuring firewall...") on boot is taking a few minutes.

Hi Kevin,

can tell me what are the hardware spec if the problematic machines? A likely fix will be available soon for testing, if you are interested.

Actions #19

Updated by Kevin Bentlage over 2 years ago

Mateusz Guzik wrote in #note-18:

Hi Kevin,

can tell me what are the hardware spec if the problematic machines? A likely fix will be available soon for testing, if you are interested.

2x Dell PE R620
- 2x Xeon E5-2650v2 (2.60GHz)
- 64GB RAM
- 4x 1 Gbit NIC
- 2x 10 Gbit NIC
- 4x 120GB SSD (RAID-5)

Actions #20

Updated by Kevin Bentlage over 2 years ago

Any updates on this?

Actions #21

Updated by Mateusz Guzik over 2 years ago

Huh, apologies for lack of updates.

The issue is largely fixed for over 3 weeks now in the snapshots. If you can't install a snapshot, I can provide you with a patched kernel for whatever version you are running right now, just give me uname -a.

The patchset has one remaining bug which I expect to fix today, but you are not going to run into it if you don't use miniupnp.

Actions #22

Updated by Kevin Bentlage over 2 years ago

Mateusz Guzik wrote in #note-21:

Huh, apologies for lack of updates.

The issue is largely fixed for over 3 weeks now in the snapshots. If you can't install a snapshot, I can provide you with a patched kernel for whatever version you are running right now, just give me uname -a.

The patchset has one remaining bug which I expect to fix today, but you are not going to run into it if you don't use miniupnp.

I cannot (want to) run the snapshot version in our production environment. If you can supply the patch that would be nice.

FreeBSD fw01.xxxxx.io 12.3-STABLE FreeBSD 12.3-STABLE RELENG_2_6_0-n226742-1285d6d205f pfSense  amd64
FreeBSD fw02.xxxxx.io 12.3-STABLE FreeBSD 12.3-STABLE RELENG_2_6_0-n226742-1285d6d205f pfSense  amd64

Any ideas on when this patch will be released on the stable branch?

Actions #23

Updated by Marcos M over 2 years ago

Mateusz Guzik wrote in #note-21:

Huh, apologies for lack of updates.

The issue is largely fixed for over 3 weeks now in the snapshots. If you can't install a snapshot, I can provide you with a patched kernel for whatever version you are running right now, just give me uname -a.

The patchset has one remaining bug which I expect to fix today, but you are not going to run into it if you don't use miniupnp.

Does this mean the patch which removes $rules .= "set keepcounters\n"; is no longer necessary on current snapshots?

Edit: The miniupnp issue referenced earlier was resolved by:
Actions #24

Updated by Kris Phillips over 2 years ago

Mateusz Guzik wrote in #note-21:

Huh, apologies for lack of updates.

The issue is largely fixed for over 3 weeks now in the snapshots. If you can't install a snapshot, I can provide you with a patched kernel for whatever version you are running right now, just give me uname -a.

The patchset has one remaining bug which I expect to fix today, but you are not going to run into it if you don't use miniupnp.

Mateusz,

What miniupnp issue should we be looking out for in testing and how is it reproduced? Is there another redmine for it? I'd be happy to spin up a large rule set on a 22.05 image and test.

Actions #25

Updated by Kris Phillips over 2 years ago

Hello,

Can we please get an update on this and what needs to be tested before release?

Actions #26

Updated by Steve Wheeler over 2 years ago

Ruleset load times in 22.05 look like:

[22.05-BETA][admin@5100.stevew.lan]/root: pfctl -sr | wc -l
    1136
[22.05-BETA][admin@5100.stevew.lan]/root: time pfctl -f /tmp/rules.debug
0.524u 0.492s 0:01.08 93.5%    211+180k 1+0io 0pf+0w

Tested in: 22.05.b.20220523.0600

Actions #27

Updated by Michael Novotny over 2 years ago

Steve Wheeler wrote in #note-26:

Ruleset load times in 22.05 look like:
[...]

Tested in: 22.05.b.20220523.0600

from note 1
[22.01-RELEASE][]/root: pfctl -sr | wc -l
1121
[22.01-RELEASE][]/root: time pfctl -f /tmp/rules.debug
0.520u 2.295s 0:02.81 100.0% 203+182k 1+0io 0pf+0w

[21.05.2-RELEASE][]/root: pfctl -sr | wc -l
1116
[21.05.2-RELEASE][]/root: time pfctl -f /tmp/rules.debug
0.302u 0.270s 0:00.57 100.0% 202+176k 0+0io 0pf+0w

Ouch. Still nearly twice a long as version 21.05.2 but at least 4-5 times quicker than 22.01.

Actions #28

Updated by Marcos M over 2 years ago

Here are some additional results between current and previous versions.

====21.05====
[21.05-RELEASE][root@site-ab-router.lab.arpa]/root: pfctl -sr | wc -l
    1100
[21.05-RELEASE][root@site-ab-router.lab.arpa]/root: time pfctl -f /tmp/rules.debug
0.030u 0.022s 0:00.05 100.0%    224+196k 1+0io 0pf+0w
[21.05-RELEASE][root@site-ab-router.lab.arpa]/root: time /etc/rc.filter_configure_sync
0.490u 0.189s 0:00.69 97.1%     2687+279k 2148+46io 3pf+0w

====22.05====
[22.05-BETA][root@site-ab-router.lab.arpa]/root: pfctl -sr | wc -l
    1100
[22.05-BETA][root@site-ab-router.lab.arpa]/root: time pfctl -f /tmp/rules.debug
0.240u 0.155s 0:00.40 97.5%     214+183k 1+0io 0pf+0w
[22.05-BETA][root@site-ab-router.lab.arpa]/root: time /etc/rc.filter_configure_sync
0.621u 0.395s 0:01.05 96.1%     2015+250k 2172+44io 3pf+0w

On a different VM for which I only have 22.05 results, but significantly larger configuration (/tmp/rules.debug is 2527.43 KiB vs 439.36 KiB), I did notice much higher latency. Compared to 22.01 where there was packet loss, this time there's no packet loss:

[22.05-BETA][root@dut]/root: pfctl -sr | wc -l
     898
[22.05-BETA][root@dut]/root: time pfctl -f /tmp/rules.debug
0.777u 0.808s 0:01.59 98.7%     211+180k 1+0io 0pf+0w

[22.05-BETA][root@site-ab-router.lab.arpa]/root: ping -i 0.1 10.0.50.1
PING 10.0.50.1 (10.0.50.1): 56 data bytes
64 bytes from 10.0.50.1: icmp_seq=0 ttl=64 time=0.345 ms
[...]
64 bytes from 10.0.50.1: icmp_seq=38 ttl=64 time=0.345 ms
^C
--- 10.0.50.1 ping statistics ---
39 packets transmitted, 39 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.188/19.600/311.000/61.209 ms

Actions #29

Updated by Jim Pingle over 2 years ago

  • Category changed from Unknown to Rules / NAT
  • Status changed from New to Resolved
  • % Done changed from 0 to 100

As there is no packet loss and the impact is significantly better than the last release we can call this solved for now.

We can create a new issue to revisit the performance in the next version if further refinements are possible.

Actions #30

Updated by Jim Pingle over 2 years ago

  • Subject changed from High latency when reloading the ruleset to High latency and packet loss during a filter reload

Updating subject for release notes.

Actions #31

Updated by Flole Systems over 2 years ago

There is still packet loss by the way and latency spikes up to 300ms on 22.05.... It becomes super obvious when the tinc-VPN-Bug reloads the filter every few seconds. Of course the packet loss is not there every time, the packet has to arrive at the right time to be dropped but since that bug is causing filter reloads so often chances are quite good that some packets are lost.

To sum up what happened here: A regression what introduced which was super bad, then that regression was a little improved and now it's still worse than before but it's fixed since it's an improvement since the regression was introduced. Seriously? You break something, then IĆ½ou fix it partially but it's still broken and you just call it fixed since I improved it a little bit?

Actions #32

Updated by Chris Collins about 2 years ago

Hi guys feedback from myself.

I had this enabled when I first updated to 2.6.0. Had noticed no issues.

But yesterday I decided to revert the patch, and then for hours since I have been trying to figure out what was causing massive latency spikes on the device, LAN and WAN both affected, eventually realised it was happening on every firewall reload then it triggered in my head it was this issue.

But the thing is my firewall ruleset isnt that big, maybe 100 rules or so in total, so if this is still only a "maybe" to put in base code, I support it been patched in base code as even with not many rules it causes latency spikes upwards of 300ms.

/tmp/rules.debug is 69KiB.

Actions #33

Updated by Flole Systems almost 2 years ago

Yeah unfortunately this is still an issue. As I said, it's still worse than before, even though it was improved. Increased latency has been confirmed above but it was never addressed and this was just called "solved".

Actions #34

Updated by Mateusz Guzik almost 2 years ago

Hello.

To reiterate, there are 2 distinct issues remaining.

What was patched, was one change which resulted in O(n^2) processing with pf ruleset lock held for writing, which blocks packet processing. This bit was converted into O(nlogn) which provides a marked improvement.

Here is what was not patched:
1. single-threaded execution uses a new library which replaces an ABI-sensitive interface with a key-value format, which comes at a significant cost. While it uses time on one cpu, it does not affect packet processing on any other cpus. however, if the machine is already struggling to keep up then it will indeed cause a slowdown.
2. loading a new ruleset is performed with a series of calls, several of which cause a temporary pause in packet processing and it happens in quick succession. Some groundwork was performed to make it feasible to only do one short pause, but it's not finished yet. Going all the way to never pause would require quite a bit more work and I don't know if it is feasible.

All that said, I expect to get back to this in coming weeks.

Actions

Also available in: Atom PDF