Regression #12660

High CPU usage due to incorrect gateway on some policy routed states

Added by Jim Pingle 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Category: Rules / NAT
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Plus Target Version: 22.01
Release Notes: Force Exclusion
Affected Version:
Affected Architecture:

Description

It's not clear why, but sometimes pf puts the interface address on a state as the gateway instead of the actual next-hop gateway. When this happens, the packet gets stuck in a loop inside pf: the packet counters increase rapidly, and CPU usage on the netisr queues hits 100%. The more states this affects, the more queues end up at 100% and the higher the load on the firewall. Eventually the firewall locks up, though that could be due to heat from most or all of the cores running at 100%.

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   12 root        -72    -     0B   576K CPU5     5  18:49  99.69% [intr{swi1: netisr 2}]
   12 root        -72    -     0B   576K CPU6     6  38:01  99.66% [intr{swi1: netisr 5}]
   12 root        -72    -     0B   576K CPU3     3  37:51  99.65% [intr{swi1: netisr 0}]
   12 root        -72    -     0B   576K CPU4     4  32:17  99.65% [intr{swi1: netisr 1}]
   12 root        -72    -     0B   576K CPU2     2  38:03  99.59% [intr{swi1: netisr 6}]
   11 root        155 ki31     0B   128K RUN      7  14:14  99.20% [idle{idle: cpu7}]
   12 root        -72    -     0B   576K CPU0     0  34:25  98.88% [intr{swi1: netisr 4}]
   11 root        155 ki31     0B   128K CPU1     1  18:11  92.38% [idle{idle: cpu1}]

In most cases the states are fine, but every once in a while a state ends up with the wrong gateway without any apparent cause.

The following sets of states were all present in the state table at the same time:

all tcp 142.250.190.14:443 <- a.a.a.58:39980       CLOSED:SYN_SENT
   [0 + 4294901760]  [2831073638 + 16777216]
   age 00:01:20, expires in 00:00:30, 11979109:0 pkts, 718746540:0 bytes, rule 268
   id: d9cbd66100000000 creatorid: 6a679c92 gateway: w.w.w.2
   origif: igb0
all tcp a.a.a.58:39980 -> 142.250.190.14:443       SYN_SENT:CLOSED
   [2831073638 + 16777216]  [0 + 4294901760]
   age 00:01:20, expires in 00:00:30, 11979109:0 pkts, 718746540:0 bytes, rule 164
   id: dacbd66100000000 creatorid: 6a679c92 gateway: 0.0.0.0
   origif: igb5

As the above state pair shows, only the first TCP SYN of the connection is present, with no response. The packet and byte counters are exceptionally large. The gateway is w.w.w.2, which is actually the interface address of the WAN the packet should exit, igb5.

Packet captures on the client and on the firewall interfaces show that these packets are never actually on the wire. Thus, they appear to be looping internally in pf.

This next state pair is for the exact same source and destination, but the connection completed successfully. Note that it has the correct gateway, w.w.w.1, on the first state.

all tcp 142.250.190.14:443 <- a.a.a.58:39978       FIN_WAIT_2:FIN_WAIT_2
   [1673491087 + 4287496704] wscale 8  [3482141425 + 327936] wscale 8
   age 00:01:20, expires in 00:00:11, 12:10 pkts, 1237:7154 bytes, rule 268
   id: 1bf3d66100000001 creatorid: 6a679c92 gateway: w.w.w.1
   origif: igb0
all tcp a.a.a.58:39978 -> 142.250.190.14:443       FIN_WAIT_2:FIN_WAIT_2
   [3482141425 + 327936] wscale 8  [1673491087 + 4287496704] wscale 8
   age 00:01:20, expires in 00:00:11, 12:10 pkts, 1237:7154 bytes, rule 164
   id: 1cf3d66100000001 creatorid: 6a679c92 gateway: 0.0.0.0
   origif: igb5

There was no filter reload between when those states were created; as the output shows, they were created at almost exactly the same time.

The rule which created those states is set to policy route that connection out via w.w.w.1, which should be the only gateway for states from the rule:

@268(0) pass in quick on igb0 route-to { (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1) } round-robin inet from a.a.a.0/24 to any flags S/SA keep state label "USER_RULE: Default LAN to any" ridentifier 1418235528
  [ Evaluations: 8399      Packets: 1718677448  Bytes: 2975065697  States: 57    ]
  [ Inserted: pid 53819 State Creations: 232   ]

The gateway was repeated so many times due to the gateway weight setting (it was set to 10).
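
For illustration, here is a minimal, hypothetical sketch of how a configured weight could expand into repeated route-to pool entries so that pf's plain round-robin selection yields weighted balancing. pfSense's actual rule generator is PHP; the structure and names below are invented for illustration only:

    /* Hypothetical sketch: expand gateway weights into repeated
     * route-to pool entries (not pfSense's actual generator). */
    #include <stdio.h>

    struct gw {
        const char *iface;
        const char *addr;
        int         weight;   /* the "Weight" advanced gateway setting */
    };

    int main(void)
    {
        /* One gateway with weight 10, as in the rule above. */
        struct gw pool[] = { { "igb5", "w.w.w.1", 10 } };
        size_t n = sizeof(pool) / sizeof(pool[0]);

        printf("route-to {");
        for (size_t i = 0; i < n; i++)
            for (int w = 0; w < pool[i].weight; w++)
                /* Each unit of weight emits one more copy of the entry. */
                printf(" (%s %s)%s", pool[i].iface, pool[i].addr,
                    (i == n - 1 && w == pool[i].weight - 1) ? "" : ",");
        printf(" } round-robin\n");
        return 0;
    }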

There is only one IP address on the igb5 interface, w.w.w.2:

igb5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8120b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWFILTER>
    ether 00:xx:xx:xx:xx:b4
    inet6 fe80::208:a2ff:fe09:95b4%igb5 prefixlen 64 scopeid 0x6
    inet w.w.w.2 netmask 0xffffff00 broadcast w.w.w.255
    media: Ethernet autoselect (1000baseT <full-duplex>)
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

The upstream gateway on the igb5 interface is w.w.w.1, and it is also the default route when this happens:

default            w.w.w.1        UGS        igb5
[...]
w.w.w.2        link#6             UHS         lo0

This exact same configuration ran fine on the 22.01.a.20211128.0600 snapshot, but failed after upgrading to 22.01.b.20220103.0600.

I cannot reproduce this reliably, unfortunately. It has occurred at boot time on three separate occasions so far, and if I reset the states, it stabilizes. After that, it happens at unpredictable times on different interfaces. If I monitor the system closely I can catch single states doing it (as above) and kill just the specific problem states before they get too far out of control.

Also worth noting: in this setup, that interface does not need outbound NAT, so it does not perform it; NAT happens at the next hop. If I disable the gateways on that WAN and let another WAN (one that does NAT) take over, the problem does not occur on the other WAN, as far as I've seen. It's possible I didn't wait long enough (~2 hrs), but I'll keep experimenting.


Actions #1

Updated by Jim Pingle 5 months ago

  • Description updated (diff)
Actions #2

Updated by Jim Pingle 5 months ago

This appears to be related to the gateway "weight" advanced parameter. When the problem happens, the gateway in question has a weight of 10, which is why it is repeated multiple times in that rule.

I set my gateways to a weight of 1, reloaded the filter, and reset the states, and the problem did not return. If I reverse that (put the weight back to 10, reload the filter, and wait), the problem returns.

With a weight of 1, the same rule doesn't have the repeated gateway. Compare this to the rule in the original description:

@269(0) pass in quick on igb0 route-to (igb5 w.w.w.1) inet from a.a.a.0/24 to any flags S/SA keep state label "USER_RULE: Default LAN to any" ridentifier 1418235528
  [ Evaluations: 886       Packets: 8994      Bytes: 3882366     States: 10    ]
  [ Inserted: pid 4355 State Creations: 44    ]

The rule in pfSense is set to a failover gateway which includes the gateway above and a gateway on a second WAN (with a weight of 3 before, now 1).

I still need to run more tests, but so far I've been unable to replicate this in a VM. I expect it would show up with a load balanced group as well.

Not necessarily related to this bug, but with a rule using a failover group like this, the gateway shouldn't be repeated in the rule based on weight anyway, since it's just multiple entries all for the same gateway. That should only happen for rules using load balanced groups. Correcting that behavior would limit this problem to load balanced rules, however; it's not a proper fix for the underlying problem.

Actions #3

Updated by Jim Pingle 5 months ago

I made a commit which should make this a non-issue for failover groups (one gateway per tier), but it does still happen with load balance groups.

I used a group with two gateways on the same tier, one with a weight of 10, the other a weight of 3. After maybe 10 minutes I already had one of the problematic state pairs show up.

udp       In  198.51.100.13:123                      209.115.181.106:123                     NO_TRAFFIC:SINGLE       00:09:09  00:00:30  131929K    9791M
udp       Out 198.51.100.13:123                      209.115.181.106:123                         SINGLE:NO_TRAFFIC   00:09:09  00:00:30  131929K    9791M

Thus far I have been unable to replicate this anywhere but my edge, despite trying it on several other systems (Plus and CE, hardware and VMs), so there is some other compounding factor here, which may be as simple as the total volume of connections being handled. My edge is obviously handling a lot more connections/churn than the lab systems I'm testing on.

Continuing to test now with different weight combinations to see if there is a specific point at which it becomes a problem.

Actions #4

Updated by Jim Pingle 5 months ago

Further testing with load balancing showed the problem happens at a 3:2 and even a 1:1 weight ratio. The only configuration in which it doesn't happen is failover with a gateway weight of 1.

Not sure if it's related, but another interesting data point: thus far, every problematic state I have seen is for traffic coming in on a local interface and exiting the same WAN each time, the WAN that does not perform NAT on most outgoing traffic. I have yet to see any problem states on the other WAN.

Actions #5

Updated by Jim Pingle 5 months ago

To confirm the NAT hypothesis, I added a NAT rule on that interface and the problem disappeared. So this appears to affect load balancing only when at least one of the exit interfaces does not have outbound NAT. I removed the NAT rule and the problem came back. I put the NAT rule back in place and it ran problem-free for 3 hours, where before it wouldn't make it more than 10-15 minutes.

So to recap, this appears to require:

  • Multiple gateways to be in use on a rule (route-to, round-robin style) even if it's the same gateway repeated multiple times
  • Traffic must exit toward that gateway without having NAT applied

I still haven't replicated it on a lab system.

Impact is likely lower than it would otherwise be, since most people using load balancing in this way will be using NAT, but there are some common scenarios where people could still hit it.

Actions #6

Updated by Kristof Provost 4 months ago

Would you be able to test if the problem occurs if the rule doesn't use round-robin but uses random or source-hash instead? That'd help to narrow down the responsible code.

I'd also recommend using multicore systems to attempt reproduction. I currently hypothesise that the 'case PF_POOL_ROUNDROBIN' branch in pf_map_addr() may have a race condition that results in an invalid address being selected. (Hence the request to attempt reproduction with a different option: if it can be reproduced with random or source-hash, we'll know that hypothesis is incorrect.)

Actions #7

Updated by Kristof Provost 4 months ago

I think I see where the problem happens, and it's indeed in the PF_POOL_ROUNDROBIN case of pf_map_addr(). At get_addr: it copies rpool->counter to naddr (naddr is the result that ends up being copied into the state), and later does PF_AINC(&rpool->counter), which increments the address by one so the code can track its position in the pool.

That whole section of code runs unlocked (or, to be precise, with only the rules read lock), so multiple instances of that function can be running at the same time. If we get unlucky enough that thread A is doing PF_AINC(&rpool->counter) just before thread B does PF_ACPY(naddr, &rpool->counter) (but after B has done all of the counter validation), we'd see exactly what you describe.
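
To make the interleaving concrete, here is a self-contained userland sketch of the same pattern: an unsynchronized validate/copy/increment on a shared round-robin counter. This is a toy model based on the description above, not pf's actual code; the data race is deliberate, and given enough iterations it will occasionally "pick" an index past the end of the pool, just as described:

    /* Toy model of the suspected race; names are illustrative, not pf's.
     * Build with: cc -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>

    #define POOL_SIZE 10      /* e.g. one gateway repeated 10x by weight */
    static unsigned counter;  /* stands in for rpool->counter, unlocked  */

    static void *select_addr(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            if (counter >= POOL_SIZE)  /* "counter validation" pass */
                counter = 0;
            unsigned picked = counter; /* PF_ACPY analogue */
            counter = counter + 1;     /* PF_AINC analogue */
            if (picked >= POOL_SIZE)   /* the other thread's increment landed
                                          between validation and the copy:
                                          an invalid address was selected */
                printf("bad pick: %u\n", picked);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, select_addr, NULL);
        pthread_create(&b, NULL, select_addr, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }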

It's less obvious how this should be fixed. Running this code under the rules write lock would prevent the issue, but would have a fairly large performance impact. I'm currently thinking of adding a dedicated mutex to struct pf_kpool for use in pf_map_addr(). That would still have some performance cost, but far less so: because the locks are per-pool, rules would not affect each other, and we also only run pf_map_addr() when establishing new connections.
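
Under the same toy model, the fix described here would look roughly like this: a dedicated per-pool mutex (analogous to adding one to struct pf_kpool) makes the validate/copy/increment sequence atomic, so no thread can observe a counter that another thread has already advanced past the pool boundary. This is a sketch of the idea, not the actual kernel patch, which would use the kernel's mutex primitives rather than pthreads:

    /* Toy model of the proposed fix; names are illustrative, not pf's. */
    #include <pthread.h>

    static pthread_mutex_t pool_mtx = PTHREAD_MUTEX_INITIALIZER;

    /* Same selection logic as before, but now atomic with respect to
     * other threads using this pool, so the result is always valid. */
    static unsigned select_addr_locked(unsigned *counter, unsigned pool_size)
    {
        pthread_mutex_lock(&pool_mtx);   /* per-pool lock: rules using
                                            other pools are unaffected */
        if (*counter >= pool_size)
            *counter = 0;
        unsigned picked = *counter;      /* PF_ACPY analogue */
        (*counter)++;                    /* PF_AINC analogue */
        pthread_mutex_unlock(&pool_mtx);
        return picked;                   /* always < pool_size */
    }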

Actions #8

Updated by Jim Pingle 4 months ago

I already talked with Kristof about this on Slack, but so it's also on Redmine:

    Would you be able to test if the problem occurs if the rule doesn't use round-robin but uses random or source-hash instead?

The way we use multiple gateways in route-to for load balancing between different WANs, the only pool type pf will allow is round-robin. Attempting to use other types results in an error:

: pfctl -f /tmp/rules.debug
/tmp/rules.debug:498: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:504: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:560: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:595: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:596: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:616: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:623: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:630: r.rpool.opts must be PF_POOL_ROUNDROBIN
pfctl: Syntax error in config file: pf rules not loaded

Actions #9

Updated by Kristof Provost 4 months ago

Can you test this patch? It should prevent simultaneous access in pf_map_addr(), hopefully (and, I expect) without major performance effects.

Actions #10

Updated by Jim Pingle 4 months ago

  • Assignee set to Kristof Provost

I booted a kernel with the patch and ran it for over 5 hours without any problems. It looks like that took care of the problem.

Actions #13

Updated by Jim Pingle 4 months ago

  • Status changed from New to Feedback

This has been merged back to the RC branch now; the next RC build will include it.

Actions #14

Updated by Jim Pingle 4 months ago

  • Release Notes changed from Default to Force Exclusion

This was never a problem in a release, so it is excluded from the release notes.

Actions #15

Updated by Jim Pingle 4 months ago

  • Status changed from Feedback to Resolved

I can't replicate it on a current RC image with the fix built in, and I've noticed no negative side effects thus far.
