Bug #10513
State issues with policy routing and HA failover
Status: open
0%
Description
Seeing some odd behavior on HA pairs which have multiple WANs and use policy routing. In some cases, the states for a client disappear when failing over. In others, the state is present but the traffic may be egressing the wrong interface.
Consider this scenario:
WAN1 is the default gateway; some clients are policy routed out WAN2. In this example, the client is 10.11.0.12.
Start a TCP connection from 10.11.0.12 to an Internet host. The states on both nodes and a packet capture on the primary show the traffic entering LAN and exiting WAN2 (OK).
Put the primary node into CARP maintenance mode. State is OK on primary. The state, which was there moments ago, is no longer in the state table on the secondary. Traffic from the client stops entirely.
Take the primary node out of CARP maintenance mode. States and packet capture on primary still show the traffic entering LAN, exiting WAN2 (OK).
Wait a bit and the state eventually re-syncs to the secondary node.
Now put the primary node into CARP maintenance mode again. States on the secondary still show the traffic entering LAN, exiting WAN2 (OK), but the packet capture shows the packets actually leaving WAN1 with the WAN2 address as their source.
Note that if this is tested with ICMP, the second step will be different, as ICMP will result in a new state created to replace the missing state. That case appears to show the problem on the first fail back instead of taking a second turn.
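The two failure modes in the steps above (state missing on the secondary, or state present but packets leaving the wrong WAN) can be told apart mechanically when repeating the test. The following is only a rough sketch and not part of the original report: it assumes state lines shaped like typical `pfctl -ss` output, and the interface names (igb0 = WAN1, igb1 = LAN, igb2 = WAN2), the NAT address, and the remote host are made up for illustration. The observed egress interface has to come from a packet capture on the node being checked.

```python
import re

# Assumed shape of one `pfctl -ss` line:
#   IFACE PROTO SRC [(ORIG_SRC)] -> DST  STATE
STATE_RE = re.compile(
    r"^(?P<iface>\S+)\s+(?P<proto>\S+)\s+"
    r"(?P<src>\S+)(?:\s+\((?P<orig>\S+)\))?\s+(?:->|<-)\s+(?P<dst>\S+)"
)


def state_interfaces(pfctl_output, client_ip):
    """Return the set of interfaces that hold a state involving client_ip."""
    found = set()
    for line in pfctl_output.splitlines():
        m = STATE_RE.match(line.strip())
        if not m:
            continue
        endpoints = [m.group("src"), m.group("orig") or "", m.group("dst")]
        # Strip the trailing ":port" (IPv4 only in this sketch).
        if any(ep.rsplit(":", 1)[0] == client_ip for ep in endpoints):
            found.add(m.group("iface"))
    return found


def classify(pfctl_output, client_ip, expected_wan, observed_egress):
    """Label the situation on one node after a failover."""
    if expected_wan not in state_interfaces(pfctl_output, client_ip):
        return "state missing"
    if observed_egress != expected_wan:
        return "wrong egress"
    return "ok"


# Hypothetical state table from the secondary node.
SECONDARY_STATES = """\
igb1 tcp 10.11.0.12:51234 -> 203.0.113.5:443       ESTABLISHED:ESTABLISHED
igb2 tcp 198.51.100.2:40123 (10.11.0.12:51234) -> 203.0.113.5:443       ESTABLISHED:ESTABLISHED
"""

if __name__ == "__main__":
    # First failover in the report: no state on the secondary at all.
    print(classify("", "10.11.0.12", "igb2", None))
    # Second failover: state points at WAN2, capture shows WAN1.
    print(classify(SECONDARY_STATES, "10.11.0.12", "igb2", "igb0"))
```

The first call corresponds to the first failover described above, the second to the repeat failover where the state table looks correct but the capture disagrees with it.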
Tested on 2.5.0.a.20200430.0741 (12.1-STABLE), but we have a report from a customer who is seeing this happen on 2.4.5-RELEASE.
Updated by Anonymous about 4 years ago
- Target version changed from 2.5.0 to CE-Next
Updated by Jose Duarte about 3 years ago
Tested in 2.5.2. This seems to still be a big issue.
pfSync is basically useless on a Multi-WAN setup: all states on WANs that are not the default gateway are killed on failover.
I'm happy to help with testing if you have any suggestions on how to fix it.
Updated by Christian Ullrich almost 3 years ago
Jose Duarte wrote: "Tested in 2.5.2. This seems to still be a big issue."
In 2.6.0, too. I'm not sure about the lost states, but the traffic going out the wrong WAN is definitely still there. See also https://forum.netgate.com/topic/170501/, but that is two pages of what fits in one sentence above.
Updated by Jose Duarte 27 days ago
Was anyone able to test this on newer versions?
I'll try to get a quick lab going to test on pfSense+.