Regression #12660
High CPU usage due to incorrect gateway on some policy routed states (Closed)
Description
It's not clear why, but sometimes pf is putting the interface address as the gateway on a state instead of the actual next hop gateway. When this happens, the packet gets stuck in a loop in pf. The packet counters rapidly increase, and CPU usage on netisr queues hits 100%. The more states this happens to, the more queues end up at 100% and the larger the load on the firewall. Eventually, the firewall locks up, but that could be due to heat from most/all the cores running at 100%.
  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   12 root        -72    -     0B   576K CPU5     5  18:49  99.69% [intr{swi1: netisr 2}]
   12 root        -72    -     0B   576K CPU6     6  38:01  99.66% [intr{swi1: netisr 5}]
   12 root        -72    -     0B   576K CPU3     3  37:51  99.65% [intr{swi1: netisr 0}]
   12 root        -72    -     0B   576K CPU4     4  32:17  99.65% [intr{swi1: netisr 1}]
   12 root        -72    -     0B   576K CPU2     2  38:03  99.59% [intr{swi1: netisr 6}]
   11 root        155 ki31     0B   128K RUN      7  14:14  99.20% [idle{idle: cpu7}]
   12 root        -72    -     0B   576K CPU0     0  34:25  98.88% [intr{swi1: netisr 4}]
   11 root        155 ki31     0B   128K CPU1     1  18:11  92.38% [idle{idle: cpu1}]
In most cases the states are fine, but every once in a while a state ends up with the wrong gateway without any apparent cause.
The following sets of states were all present in the state table at the same time:
all tcp 142.250.190.14:443 <- a.a.a.58:39980       CLOSED:SYN_SENT
   [0 + 4294901760] [2831073638 + 16777216]
   age 00:01:20, expires in 00:00:30, 11979109:0 pkts, 718746540:0 bytes, rule 268
   id: d9cbd66100000000 creatorid: 6a679c92 gateway: w.w.w.2 origif: igb0
all tcp a.a.a.58:39980 -> 142.250.190.14:443       SYN_SENT:CLOSED
   [2831073638 + 16777216] [0 + 4294901760]
   age 00:01:20, expires in 00:00:30, 11979109:0 pkts, 718746540:0 bytes, rule 164
   id: dacbd66100000000 creatorid: 6a679c92 gateway: 0.0.0.0 origif: igb5
As you can see in the above state pair, only the first TCP SYN packet of the connection was seen, with no response. The packet and byte counters are exceptionally large. The gateway is w.w.w.2, which is actually the interface address of the WAN the packet should exit, igb5.
Packet captures on the client and on the firewall interfaces show that these packets are not actually on the wire. Thus, they appear to be looping internally in pf.
This next state pair is for the exact same source and destination, but the connection completed successfully. Note that it has the correct gateway on the first state, w.w.w.1.
all tcp 142.250.190.14:443 <- a.a.a.58:39978       FIN_WAIT_2:FIN_WAIT_2
   [1673491087 + 4287496704] wscale 8 [3482141425 + 327936] wscale 8
   age 00:01:20, expires in 00:00:11, 12:10 pkts, 1237:7154 bytes, rule 268
   id: 1bf3d66100000001 creatorid: 6a679c92 gateway: w.w.w.1 origif: igb0
all tcp a.a.a.58:39978 -> 142.250.190.14:443       FIN_WAIT_2:FIN_WAIT_2
   [3482141425 + 327936] wscale 8 [1673491087 + 4287496704] wscale 8
   age 00:01:20, expires in 00:00:11, 12:10 pkts, 1237:7154 bytes, rule 164
   id: 1cf3d66100000001 creatorid: 6a679c92 gateway: 0.0.0.0 origif: igb5
There was no filter reload between when those states were created; as the output shows, they were created at almost exactly the same time.
The rule which created those states is set to policy route that connection out via w.w.w.1, which should be the only gateway for states from the rule:
@268(0) pass in quick on igb0 route-to { (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1), (igb5 w.w.w.1) } round-robin inet from a.a.a.0/24 to any flags S/SA keep state label "USER_RULE: Default LAN to any" ridentifier 1418235528
  [ Evaluations: 8399      Packets: 1718677448  Bytes: 2975065697  States: 57    ]
  [ Inserted: pid 53819 State Creations: 232 ]
The gateway was repeated so many times due to the gateway weight setting (it was set to 10).
There is only one IP address on the igb5 interface, w.w.w.2:
igb5: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=8120b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWFILTER>
	ether 00:xx:xx:xx:xx:b4
	inet6 fe80::208:a2ff:fe09:95b4%igb5 prefixlen 64 scopeid 0x6
	inet w.w.w.2 netmask 0xffffff00 broadcast w.w.w.255
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
	nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
The upstream gateway on the igb5 interface is w.w.w.1, and it is also the default route when this happens:
default            w.w.w.1            UGS         igb5
[...]
w.w.w.2            link#6             UHS         lo0
This exact same configuration was running fine on the 22.01.a.20211128.0600 snapshot, but failed when upgraded to 22.01.b.20220103.0600.
I cannot reproduce this reliably, unfortunately. It has occurred at boot time on three separate occasions so far, and if I reset the states it stabilizes. After that, it happens randomly at unpredictable times on different interfaces. If I monitor the system closely I can catch single states doing it (as above) and kill just the specific problem states before they get too out of control.
Also worth noting: in this setup, that interface does not need outbound NAT and so does not perform it, as NAT happens on the next hop. If I disable the gateways on that WAN and let another WAN (one that does NAT) take over, the problem does not occur on the other WAN, as far as I've seen. It's possible I didn't wait long enough (~2 hrs), but I'll keep experimenting.
Updated by Jim Pingle almost 3 years ago
This appears to be related to the gateway "weight" advanced parameter. When the problem happens, the gateway in question above has a weight of 10, which is why it is repeated multiple times on that rule.
I set my gateways to a weight of 1, reloaded the filter, reset the states, and the problem did not return. If I reverse that (set the weight back to 10, reload the filter, and wait), the problem returns.
With a weight of 1, the same rule doesn't have the repeated gateway. Compare this to the rule in the original description:
@269(0) pass in quick on igb0 route-to (igb5 w.w.w.1) inet from a.a.a.0/24 to any flags S/SA keep state label "USER_RULE: Default LAN to any" ridentifier 1418235528
  [ Evaluations: 886       Packets: 8994        Bytes: 3882366     States: 10    ]
  [ Inserted: pid 4355 State Creations: 44 ]
The rule in pfSense is set to a failover gateway group which includes the gateway above and a gateway on a second WAN (with a weight of 3 before, now 1).
I still need to run more tests, but so far I've been unable to replicate this in a VM. I expect it would show up with a load balanced group as well.
Not necessarily related to this bug, but with a rule using a failover group like this, the gateway shouldn't be repeated in the rule based on weight anyhow, since it's just multiple entries for the same gateway. That should only happen for rules using load balanced groups. Correcting that behavior would limit this problem to load balanced rules, however; it's not a proper fix for the underlying problem.
Updated by Jim Pingle almost 3 years ago
I made a commit which should make this a non-issue for failover groups (one gateway per tier) but it does still happen with load balance groups.
I used a group with two gateways on the same tier, one with a weight of 10, the other a weight of 3. After maybe 10 minutes I already had one of the problematic state pairs show up.
udp   In    198.51.100.13:123   209.115.181.106:123   NO_TRAFFIC:SINGLE   00:09:09   00:00:30   131929K   9791M
udp   Out   198.51.100.13:123   209.115.181.106:123   SINGLE:NO_TRAFFIC   00:09:09   00:00:30   131929K   9791M
Thus far I have been unable to replicate this anywhere but my edge, despite trying it on several other systems (Plus and CE, hardware and VMs), so there is some other compounding factor here, which may be as simple as total volume of connections being handled. My edge is obviously handling a lot more connections/churn than the lab systems I'm testing.
Continuing to test now with different weight combinations to see if there is a specific point at which it becomes a problem.
Updated by Jim Pingle almost 3 years ago
Further testing with load balancing showed the problem happens at a 3:2 and even a 1:1 weight ratio. The only configuration where it doesn't happen is failover with the gateway at a weight of 1.
Not sure if it's related, but another interesting data point: every time I have seen a problematic state so far, it's for traffic coming in on a local interface and exiting the same WAN, the WAN that does not perform NAT on most outgoing traffic. I have yet to see any problem states on the other WAN.
Updated by Jim Pingle almost 3 years ago
To confirm the NAT hypothesis I added a NAT rule on that interface and the problem disappeared. So this appears to only affect load balancing when at least one of the exit interfaces does not have outbound NAT. I removed the NAT rule and the problem came back. I put the NAT rule back in place again and it ran problem free for 3 hours when before it wouldn't make it more than 10-15 minutes.
So to recap, this appears to require:
- Multiple gateways to be in use on a rule (route-to, round-robin style), even if it's the same gateway repeated multiple times
- Traffic must exit toward that gateway without having NAT applied
I still haven't replicated it on a lab system.
Impact is likely lower than it would be otherwise because most people using load balancing in this way will also be using NAT, but there are some common scenarios where people could still hit it.
Updated by Kristof Provost almost 3 years ago
Would you be able to test if the problem occurs if the rule doesn't use round-robin but uses random or source-hash instead? That'd help to narrow down the responsible code.
I'd also recommend using multicore systems to attempt reproduction. I currently hypothesise the 'case PF_POOL_ROUNDROBIN' case in pf_map_addr() may have a race condition that results in an invalid address being selected. (Hence the request to attempt reproduction with a different option. If it can be reproduced with random or source-hash we'll know that hypothesis is incorrect.)
Updated by Kristof Provost almost 3 years ago
I think I see where the problem happens, and it's indeed in the PF_POOL_ROUNDROBIN case of pf_map_addr().
In the get_addr: section it copies rpool->counter to naddr (naddr is the result it ends up copying into the state), and later does PF_AINC(&rpool->counter) (which increments the address by one, allowing the code to track round-robin state for the pool).
That whole section of code runs unlocked (or to be precise, with only the rules read lock), so multiple instances of that function can be running at the same time. If we get unlucky enough that thread A is doing PF_AINC(&rpool->counter) just before thread B does PF_ACPY(naddr, &rpool->counter) (but after it's done all of the counter validation) we'd see exactly what you describe.
It's less obvious how this should be fixed. Running this code under the rules write lock would prevent the issue, but would have a fairly large performance impact. I'm currently thinking of adding a dedicated mutex to struct pf_kpool for use in pf_map_addr(). That'd still have some performance effect, but far less so (because the locks are per-pool, rules would not affect each other, and we also only run pf_map_addr() when we're establishing new connections).
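As a side note, the race is easy to model outside the kernel. The following is a minimal userspace sketch only, not the pf code: struct pool, map_addr(), worker(), and the other names are made up for illustration, a pthread mutex stands in for the proposed per-pool kernel lock, and the unlocked pass deliberately contains the data race described above.

/*
 * Illustrative sketch only, not the actual pf code. It mimics the racy
 * sequence described above: validate the round-robin counter, copy it
 * into the result, then increment it, all without a lock. A second
 * thread incrementing the counter between another thread's validation
 * and copy can hand back a value that is outside the pool.
 *
 * Build (assumed typical invocation): cc -O2 -pthread poolrace.c -o poolrace
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define POOL_SIZE   2         /* two "gateway addresses": 0 and 1 */
#define ITERATIONS  1000000
#define NTHREADS    4

struct pool {
	uint32_t        counter;  /* stands in for rpool->counter */
	pthread_mutex_t mtx;      /* stands in for the proposed per-pool mutex */
	bool            use_lock;
};

/* Simplified stand-in for the round-robin selection in pf_map_addr(). */
static uint32_t
map_addr(struct pool *p)
{
	uint32_t naddr;

	if (p->use_lock)
		pthread_mutex_lock(&p->mtx);

	if (p->counter >= POOL_SIZE)  /* "validation": wrap back into the pool */
		p->counter = 0;
	naddr = p->counter;           /* analogous to PF_ACPY(naddr, &rpool->counter) */
	p->counter++;                 /* analogous to PF_AINC(&rpool->counter) */

	if (p->use_lock)
		pthread_mutex_unlock(&p->mtx);

	return (naddr);
}

static void *
worker(void *arg)
{
	struct pool *p = arg;
	unsigned long bad = 0;

	for (int i = 0; i < ITERATIONS; i++)
		if (map_addr(p) >= POOL_SIZE)  /* result is not a pool member */
			bad++;
	return ((void *)bad);
}

int
main(void)
{
	struct pool p = { .counter = 0 };
	pthread_t t[NTHREADS];
	void *r;

	pthread_mutex_init(&p.mtx, NULL);

	/* Pass 0: unlocked (racy). Pass 1: serialized by the per-pool mutex. */
	for (int pass = 0; pass < 2; pass++) {
		unsigned long bad = 0;

		p.use_lock = (pass == 1);
		p.counter = 0;
		for (int i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, worker, &p);
		for (int i = 0; i < NTHREADS; i++) {
			pthread_join(t[i], &r);
			bad += (unsigned long)r;
		}
		printf("%s: %lu out-of-pool results\n",
		    p.use_lock ? "with lock" : "no lock", bad);
	}
	return (0);
}

On a multicore machine the "no lock" pass would be expected to report a non-zero count of out-of-pool results now and then, the analogue of a state being created with an address that was never a valid pool member, while the "with lock" pass should report zero, which is the behaviour the dedicated per-pool mutex aims for.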
Updated by Jim Pingle almost 3 years ago
I already talked with Kristof about this on Slack, but so it's also on Redmine:
Would you be able to test if the problem occurs if the rule doesn't use round-robin but uses random or source-hash instead?
The way we use multiple gateways in route-to for load balancing between different WANs, the only type pf will allow is round-robin. Attempting to use other types results in an error:
: pfctl -f /tmp/rules.debug
/tmp/rules.debug:498: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:504: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:560: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:595: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:596: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:616: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:623: r.rpool.opts must be PF_POOL_ROUNDROBIN
/tmp/rules.debug:630: r.rpool.opts must be PF_POOL_ROUNDROBIN
pfctl: Syntax error in config file: pf rules not loaded
Updated by Kristof Provost almost 3 years ago
Can you test this patch? It should prevent simultaneous access in pf_map_addr(), hopefully (and expected) without major performance effects.
Updated by Jim Pingle almost 3 years ago
- Assignee set to Kristof Provost
I booted a kernel with the patch and ran for over 5 hours without any problems. Looks like that took care of the problem to me.
Updated by Kristof Provost almost 3 years ago
MR (devel-12): https://gitlab.netgate.com/pfSense/FreeBSD-src/-/merge_requests/46
MR (RELENG_2_6_0): https://gitlab.netgate.com/pfSense/FreeBSD-src/-/merge_requests/47
Updated by Kristof Provost almost 3 years ago
Also merged upstream: https://cgit.freebsd.org/src/commit/?id=5f5e32f1b3945087a687c5962071d3f46e34e1ff
Updated by Jim Pingle almost 3 years ago
- Status changed from New to Feedback
This has been merged back to the RC branch now; the next RC build will include it.
Updated by Jim Pingle almost 3 years ago
- Release Notes changed from Default to Force Exclusion
Not a problem in a release, excluding from release notes.
Updated by Jim Pingle almost 3 years ago
- Status changed from Feedback to Resolved
I can't replicate it on a current RC image with the fix built in, and I've noticed no negative side effects thus far.