
Bug #10513

open

State issues with policy routing and HA failover

Added by Jim Pingle almost 4 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Rules / NAT
Target version:
Start date: 04/30/2020
Due date:
% Done: 0%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: All
Affected Architecture:

Description

Seeing some odd behavior on HA pairs which have multiple WANs and use policy routing. In some cases, the states for a client disappear when failing over. In others, the state is present but the traffic may be egressing the wrong interface.

Consider this scenario:

WAN1 is the default gateway; some clients are policy routed out WAN2 (in this example, 10.11.0.12).
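To make the scenario concrete, here is a minimal Python model of first-match policy routing, assuming a single rule that sends 10.11.0.12 out WAN2 while everything else follows the WAN1 default. `PolicyRule` and `select_gateway` are hypothetical names; this is not how pfSense or pf evaluate rules internally (pf expresses policy routing with `route-to` on a pass rule).

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """Hypothetical, simplified policy-routing rule: match a source network, use a gateway."""
    source: ipaddress.IPv4Network
    gateway: str


def select_gateway(src: str, rules: list[PolicyRule], default_gateway: str) -> str:
    """Return the gateway for a new connection from `src`.

    Rules are checked in order and the first match wins; if nothing
    matches, the connection follows the default gateway.
    """
    addr = ipaddress.ip_address(src)
    for rule in rules:
        if addr in rule.source:
            return rule.gateway
    return default_gateway


# Only 10.11.0.12 is policy routed out WAN2; other clients use WAN1.
rules = [PolicyRule(ipaddress.ip_network("10.11.0.12/32"), "WAN2")]
print(select_gateway("10.11.0.12", rules, "WAN1"))  # WAN2
print(select_gateway("10.11.0.50", rules, "WAN1"))  # WAN1
```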

Start a TCP connection from 10.11.0.12 to an Internet host. The states on both nodes and a packet capture on the primary show the traffic entering LAN and exiting WAN2 (OK).

Put the primary node into CARP maintenance mode. The state is still OK on the primary, but the state that was on the secondary moments ago is no longer in its state table. Traffic from the client stops entirely.

Take the primary node out of CARP maintenance mode. The states and a packet capture on the primary still show the traffic entering LAN and exiting WAN2 (OK).

Wait a bit and the state eventually re-syncs to the secondary node.

Now put the primary node into CARP maintenance mode again. The states on the secondary still show the traffic entering LAN and exiting WAN2 (OK), but a packet capture shows the packets actually leaving WAN1 with WAN2's address as their source.
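One explanation consistent with this symptom, offered only as a hypothesis and not confirmed in this report: the secondary's copy of the state keeps the NAT translation (hence the WAN2 source address) but not the policy-routing (`route-to`) binding, so forwarding falls back to the routing table and the default WAN. The toy model below shows that outcome; `State`, `egress`, and the 198.51.100.2 address are invented for illustration and do not reflect pf's data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class State:
    """Toy state entry: NAT source address plus an optional route-to binding."""
    client: str                 # LAN client address
    nat_source: str             # translated source address (WAN2's address here)
    route_to_if: Optional[str]  # interface chosen by policy routing, if recorded


def egress(state: State, default_if: str) -> tuple[str, str]:
    """Return (egress interface, source address) for a packet matching `state`.

    Without a route-to binding, the packet follows the routing table,
    modelled here as a single default interface.
    """
    iface = state.route_to_if if state.route_to_if is not None else default_if
    return iface, state.nat_source


WAN2_ADDR = "198.51.100.2"  # hypothetical WAN2 address (documentation range)

on_primary = State("10.11.0.12", WAN2_ADDR, "WAN2")
# Hypothesis under test: the synced copy keeps NAT but loses the route-to binding.
on_secondary = State("10.11.0.12", WAN2_ADDR, None)

print(egress(on_primary, "WAN1"))    # ('WAN2', '198.51.100.2')
print(egress(on_secondary, "WAN1"))  # ('WAN1', '198.51.100.2')
```

The second result matches the capture described above: packets leave WAN1 while still carrying WAN2's address.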

Note that if this is tested with ICMP, the second step will be different: ICMP creates a new state to replace the missing one. In that case the problem appears on the first fail back rather than requiring a second round.
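The TCP/ICMP difference fits pf's default behaviour for TCP pass rules, which only create state from packets matching `flags S/SA` (SYN set, ACK clear). A mid-stream TCP packet arriving after its state has vanished cannot create a new one and is blocked, while the next ICMP echo request simply opens a fresh state. A minimal sketch, assuming the pass rule keeps that default; the function names are illustrative only.

```python
# TCP header flag bits.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def matches_flags(packet_flags: int, check: int, mask: int) -> bool:
    """pf-style `flags check/mask`: of the bits in `mask`, exactly those in `check` must be set."""
    return (packet_flags & mask) == check


def can_create_state(proto: str, tcp_flags: int = 0) -> bool:
    """Can a packet with no existing state create one under a pass rule
    using the default TCP `flags S/SA`? Non-TCP traffic has no flag check."""
    if proto == "tcp":
        return matches_flags(tcp_flags, SYN, SYN | ACK)
    return True


print(can_create_state("tcp", SYN))        # True: a new connection
print(can_create_state("tcp", ACK | PSH))  # False: mid-stream packet, so traffic stops
print(can_create_state("icmp"))            # True: the next echo request opens a new state
```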

Tested on 2.5.0.a.20200430.0741 (12.1-STABLE), but we have a report from a customer who is seeing this happen on 2.4.5-RELEASE.

Actions #1

Updated by Anonymous over 3 years ago

  • Assignee set to Renato Botelho
Actions #2

Updated by Anonymous over 3 years ago

  • Target version changed from 2.5.0 to CE-Next
Actions #3

Updated by Jose Duarte over 2 years ago

Tested in 2.5.2. This seems to still be a big issue.
pfsync is basically useless on a multi-WAN setup: all states on WANs that are not the default gateway are killed on failover.

I'm happy to help with testing if you have any suggestions on how to fix it.
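For anyone reproducing this, a rough sketch of a comparison helper: save the state table from each node during the test (for example the output of `pfctl -ss`) and list the states present on the primary but missing on the secondary. The per-line format assumed here (whitespace-separated fields ending in a connection-state column) is an approximation, and the sample lines are made up; adjust the parsing to the output you actually get.

```python
def state_keys(lines: list[str]) -> set[tuple[str, ...]]:
    """Reduce state-table lines to comparable keys.

    Assumes one state per line: interface, protocol, addresses, then a
    trailing connection-state field. That last field is dropped because
    TCP state names can briefly differ between nodes.
    """
    keys = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:  # skip blank or truncated lines
            keys.add(tuple(fields[:-1]))
    return keys


def missing_on_secondary(primary: list[str], secondary: list[str]) -> set[tuple[str, ...]]:
    """States present on the primary but absent from the secondary."""
    return state_keys(primary) - state_keys(secondary)


# Made-up sample lines for illustration only.
primary = [
    "all tcp 203.0.113.10:443 <- 10.11.0.12:51515       ESTABLISHED:ESTABLISHED",
    "all tcp 198.51.100.2:40000 (10.11.0.12:51515) -> 203.0.113.10:443       ESTABLISHED:ESTABLISHED",
]
secondary = [
    "all tcp 203.0.113.10:443 <- 10.11.0.12:51515       ESTABLISHED:ESTABLISHED",
]

for key in sorted(missing_on_secondary(primary, secondary)):
    print(" ".join(key))
```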

Actions #4

Updated by Viktor Gurov about 2 years ago

Possibly related: #8100

Actions #5

Updated by Christian Ullrich almost 2 years ago

> Tested in 2.5.2. This seems to still be a big issue.

In 2.6.0, too. I'm not sure about the lost states, but the traffic going out the wrong WAN is definitely still there. See also https://forum.netgate.com/topic/170501/, though that thread is two pages covering what fits in one sentence above.

Actions #6

Updated by Renato Botelho almost 2 years ago

  • Assignee deleted (Renato Botelho)