Bug #8465

closed

Lost default gateway after recover from failover with CARP VIP and HA

Added by Tom DL7BJ over 6 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Category: High Availability
Target version:
Start date: 04/17/2018
Due date:
% Done: 100%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: 2.4.3
Affected Architecture: amd64

Description

Both boxes use SuperMicro boards with two onboard interfaces and an additional 4-port i350 network card. HA runs on dedicated interfaces, directly connected without a switch. All other interfaces are connected to a switch with an untagged VLAN for every interface.

WAN Master and Slave - Switch VLAN WAN - ISP
LAN Master and Slave - Switch VLAN LAN - Internal net
DMZ Master and Slave - Switch VLAN DMZ - DMZ
GUEST Master and Slave - Switch VLAN Guest - Guest network
OPT Master and Slave - Switch VLAN OPT - currently not used

Master

WAN Interface: Static IPv4 10.10.75.251/24
Gateway: x.x.x.17

Slave

WAN Interface: Static IPv4 10.10.75.252/24
Gateway: x.x.x.17

The gateway is a public IP address, 62.x.x.17 and "use non local gateway" is set. Outbound NAT is also set (This firewall, WAN Interface, CARP VIP).

External IP

Currently there are 4 static external IPs configured as CARP VIP.

The "master" IP for outgoing traffic is x.x.x.20/29, with VHID group 20 on both nodes. The advertising frequency is Base = 1 and Skew = 0 on the master, and Base = 1 and Skew = 100 on the slave.
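
To illustrate what these Base and Skew values mean: per carp(4), the advertisement interval is the base in seconds plus the skew in 1/256 of a second, so the node with the larger skew advertises more slowly and loses the election. A minimal sketch follows; the function name `carp_adv_interval_ms` is made up for this example, and the result is truncated to whole milliseconds.

```shell
#!/bin/sh
# Advertisement interval of a CARP vhid in milliseconds:
# advbase (seconds) + advskew (1/256 of a second), truncated.
carp_adv_interval_ms() {
    echo $(( $1 * 1000 + $2 * 1000 / 256 ))
}

carp_adv_interval_ms 1 0     # master settings from the report, prints 1000
carp_adv_interval_ms 1 100   # slave settings from the report, prints 1390
```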

The other IPs are for incoming traffic to some webservers and the mailrelay in DMZ.

NAT

On both machines there is an Outbound NAT rule: This Firewall, any source port, any destination, any destination port, with NAT Address x.x.x.20.

Additional Outbound NAT rules are configured for some machines, ports and the other CARP VIPs, e.g. outgoing mail uses the IP of the MX record, and so on.

There is no problem if I switch from master to slave. But when switching back from slave to master, the default gateway on the master is missing. If I set it in the console, or simply save the master's WAN interface or System / Routing / Gateways / Edit in the GUI without changing anything, the default gateway is immediately set.

I have also done some debugging on console:

a) console on master

- enter persistent CARP maintenance mode on MASTER
- failover to slave, all connections established
- default gw lost on master (netstat -r)
- leave persistent CARP maintenance mode on MASTER
- all interfaces and services "green"
- only default gw lost
- route add default 62.x.x.17
- all is up

b) console on master

- ifconfig igb4 down (WAN interface)
- failover to slave, all connections established
- default gw present on master
- ifconfig igb4 up
- go back to master as active
- all interfaces and services "green"
- only default gw lost
- route add default 62.x.x.17
- all is up

c) console on master

- sysctl net.inet.carp.demotion=250
- failover to slave, all connections established
- default gw present on master
- sysctl net.inet.carp.demotion=-250
- go back to master as active
- all interfaces and services "green"
- default gw present on master!!!
- all is up

I tried c) several times and pf always switches perfectly between master and slave
without losing any connection.
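
For anyone who wants to repeat test c) from a script, here is a rough sketch. It relies on the behaviour documented in carp(4) that a signed value written to `net.inet.carp.demotion` is added to the current demotion factor, which is why 250 followed by -250 returns the node to its normal state. The function names and the 30-second default pause are assumptions made for this example only.

```shell
#!/bin/sh
# Sketch of test c): demote this node, wait, then promote it again.
# The value written to net.inet.carp.demotion is added to the current
# demotion factor, so the two calls below cancel each other out.
carp_demote() {
    sysctl net.inet.carp.demotion="$1"
}

failover_roundtrip() {
    hold_seconds="${1:-30}"        # hypothetical pause while the peer is active
    carp_demote 250 || return 1    # the peer should take over as MASTER
    sleep "$hold_seconds"
    carp_demote -250               # this node should become MASTER again
}

# Example (run as root on the master): failover_roundtrip 60
```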

If I simulate a lost WAN interface with b), the default gw stays present. The default
gw is not lost during failover, but when the Master takes over again.

If I set the Master in maintenance mode as in a), the default gw is lost immediately.

Why is the default gateway only restored with c), but not with a) or b)?
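
The "default gw lost" checks in the tests above were done by eye with netstat. A small sketch of the same check in script form, assuming the usual FreeBSD `netstat -rn` layout in which the IPv4 default route appears as a line starting with `default`; the function names are hypothetical.

```shell
#!/bin/sh
# Print the IPv4 default gateway from the routing table, or nothing if absent.
default_gw() {
    netstat -rn -f inet | awk '$1 == "default" { print $2; exit }'
}

# Succeed only when a default route exists.
has_default_gw() {
    [ -n "$(default_gw)" ]
}

# Example: has_default_gw || echo "default gw lost"
```

Running this after each step of a), b) and c) would show at which point the route disappears.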


Files

fixgw.sh.txt (173 Bytes) - fixgw - Milad Soltanian, 10/05/2020 03:36 PM
fixgw-pf.png (14.9 KB) - poc - Milad Soltanian, 10/05/2020 03:36 PM
Actions #1

Updated by Adam Sweet about 6 years ago

Can I ask if any investigation has been done on this or whether anyone else has been able to replicate it? This could bite me after upgrading to 2.4.3-p1 which is planned shortly for a production environment. I note the ticket is still unassigned after 3 months.

I see that this is reported in an environment using a 'non-local gateway', which is not something my environment has, but it's not clear whether this issue is specific to using a non-local gateway or not. Given the wide usage of CARP, I'd expect this issue would have been reported far more often if it were not.

I think it might have been reported in the forums separately here:

https://forum.netgate.com/topic/131367/route-lost-by-carp-change

Actions #2

Updated by Tom Huerlimann about 6 years ago

I was able to reproduce exactly the same issue with 2.4.3-p1-x64 and with 2.4.4.a.20180803.0952 as well.

Setup on Box 1

- WAN: 10.4.0.1/29
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/29
- CARP Address 2: xxx.xxx.84.235/29
- CARP Address 3: xxx.xxx.84.236/29
- CARP Address 4: xxx.xxx.84.237/29

Setup on Box 2

- WAN: 10.4.0.2/29
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/29
- CARP Address 2: xxx.xxx.84.235/29
- CARP Address 3: xxx.xxx.84.236/29
- CARP Address 4: xxx.xxx.84.237/29

Modifications I made for testing

- I changed WAN on Box 1 to xxx.xxx.84.225/28
- I changed WAN on Box 2 to xxx.xxx.84.226/28
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/28
- CARP Address 2: xxx.xxx.84.235/28
- CARP Address 3: xxx.xxx.84.236/28
- CARP Address 4: xxx.xxx.84.237/28

After the modifications above I was no longer able to reproduce the issue. However, I cannot leave this config in production, because my ISP did not assign a /28 subnet to me. As suggested around the web, I could technically use CARP with 3 IPs, as I have a /29 subnet with 4 usable addresses, but I prefer not to do this because, from my point of view, it is a waste of IP addresses. Additionally, if this can be solved, everyone with only one public IP could use CARP and benefit from HA (I have tested inbound and outbound NAT, port forwarding and IPsec, probably the features most people use in such setups).
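
The difference between the two setups is whether the WAN interface address sits in the same subnet as the gateway and the CARP VIPs. A sketch for checking that, using addresses from the 203.0.113.0/24 documentation range as stand-ins for the masked xxx.xxx.84.x addresses; these addresses and the helper names `ip_to_int` and `ip_in_cidr` are assumptions for illustration.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1                      # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if address $1 lies inside network $2 (CIDR notation).
ip_in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Original setup: private WAN address, public /29 for gateway and VIPs.
ip_in_cidr 10.4.0.1 203.0.113.232/29 || echo "WAN address is outside the VIP subnet"
# Test setup: WAN address moved into a /28 that also contains the gateway.
ip_in_cidr 203.0.113.225 203.0.113.224/28 && echo "WAN address is inside the VIP subnet"
```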

Actions #3

Updated by Anonymous about 6 years ago

  • Assignee set to Renato Botelho
Actions #4

Updated by Anonymous about 6 years ago

  • Target version changed from 2.4.4 to 2.4.4-GS
Actions #5

Updated by Anonymous about 6 years ago

  • Target version changed from 2.4.4-GS to 2.4.4-p1
Actions #6

Updated by John K almost 6 years ago

I'm having the exact same issue with 2.4.4. Using IPs outside the WAN-VIP subnet on the WAN interfaces causes the default gateway route to be lost when returning to the master after a fail-over. I simply can't sacrifice 3 public IPv4 addresses on the altar of pfSense HA.

Please increase the priority of this issue. Please stop pushing back the target version!

Actions #7

Updated by Renato Botelho almost 6 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Renato Botelho almost 6 years ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 100
Actions #9

Updated by Chris Linstruth almost 6 years ago

Tested on CE build from Friday November 16th. Duplicated missing default gateway on primary node after failover and failback.

Upgraded both nodes to Nov 20. Default gateway was present through carp maintenance and back on the primary. Looks good.

Actions #10

Updated by Renato Botelho almost 6 years ago

  • Status changed from Feedback to Resolved
Actions #11

Updated by Christian Grunfeld over 5 years ago

The same issue is back in 2.4.4-RELEASE-p2 (amd64) built on Wed Dec 12 07:40:18 EST 2018. Tested with one WAN IP (/30) and "gateway in non local net" is set.

Node A:
wan: 10.0.0.1/30
lan: 16X.XXX.100.251/24

Node B:
wan: 10.0.0.2/30
lan: 16X.XXX.100.252/24

Carp:
wan vip: 16X.XXX.198.154/30
lan vip: 16X.XXX.100.254/24

The default gateway of the nodes, 16X.XXX.198.153/30, is lost on "temporarily disable carp" and on "persistent carp maintenance mode".

Actions #12

Updated by Tom Huerlimann over 5 years ago

Hi all

The problem is still (or again) reproducible.

Best regards
Tom

Actions #13

Updated by Milad Soltanian about 4 years ago

We solved the problem this way: first create a script that checks whether the default route still exists, and adds it if it does not :)

I added a cron job for this as well.

fixgw.sh:

#!/bin/sh
# Re-add the default route if it no longer points at the expected gateway.
GATEWAY="10.10.10.1"
CURRENT="$(route -n show 0.0.0.0 | grep gateway | cut -d ":" -f 2 | cut -d " " -f 2)"
if [ "$CURRENT" != "$GATEWAY" ]; then route add -net 0.0.0.0/0 "$GATEWAY"; fi
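
A possible variant of the same workaround that also records each repair in syslog, so you can see how often the route actually goes missing. This is only a sketch: the function name `fix_default_route` and the `fixgw` log tag are made up for this example, and 10.10.10.1 is the placeholder gateway from the script above.

```shell
#!/bin/sh
# Hypothetical variant of fixgw.sh: restore the default route and log the repair.
# GATEWAY is a placeholder; replace it with the real upstream gateway.
GATEWAY="${GATEWAY:-10.10.10.1}"

fix_default_route() {
    # "route -n get default" prints a "gateway: <addr>" line when a default route exists.
    current=$(route -n get default 2>/dev/null | awk '$1 == "gateway:" { print $2 }')
    if [ "$current" != "$GATEWAY" ]; then
        route add default "$GATEWAY" &&
            logger -t fixgw "default route was '${current:-missing}', restored to $GATEWAY"
    fi
}

# fix_default_route   # uncomment when running this from cron
```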
