Bug #8465

Lost default gateway after recover from failover with CARP VIP and HA

Added by Tom DL7BJ about 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
High Availability
Target version:
Start date:
04/17/2018
Due date:
% Done:

100%

Estimated time:
Affected Version:
2.4.3
Affected Architecture:
amd64

Description

Both boxes use SuperMicro boards with two onboard interfaces and an additional 4-port i350 network card. HA runs on dedicated interfaces, directly connected without a switch. All other interfaces are connected to a switch with an untagged VLAN for each interface.

WAN Master and Slave - Switch VLAN WAN - ISP
LAN Master and Slave - Switch VLAN LAN - Internal net
DMZ Master and Slave - Switch VLAN DMZ - DMZ
GUEST Master and Slave - Switch VLAN Guest - Guest network
OPT Master and Slave - Switch VLAN OPT - currently not used

Master

WAN Interface: Static IPv4 10.10.75.251/24
Gateway: x.x.x.17

Slave

WAN Interface: Static IPv4 10.10.75.252/24
Gateway: x.x.x.17

The gateway is a public IP address, 62.x.x.17, and "use non local gateway" is set. Outbound NAT is also set (This Firewall, WAN interface, CARP VIP).
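As a minimal sketch of what "non-local" means here, the check below tests whether a gateway address falls inside the WAN interface's own subnet. The function name is hypothetical, and because the real public gateway is redacted in this report, a documentation address (203.0.113.17, RFC 5737) and a made-up on-link address (10.10.75.1) stand in for it.

```python
import ipaddress


def gateway_is_on_link(interface_cidr: str, gateway: str) -> bool:
    """Return True if the gateway lies inside the interface's own subnet."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(gateway) in network


# Master WAN address from the report; both gateways are placeholders.
print(gateway_is_on_link("10.10.75.251/24", "203.0.113.17"))  # False
print(gateway_is_on_link("10.10.75.251/24", "10.10.75.1"))    # True
```

A result of False corresponds to the situation described in this report, where the gateway can only be reached with the non-local gateway option enabled.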

External IP

Currently there are 4 static external IPs configured as CARP VIP.

The "master" IP for outgoing traffic is x.x.x.20/29, with VHID group 20 on both nodes. The advertising frequency is Base = 1, Skew = 0 on the master and Base = 1, Skew = 100 on the slave.

The other IPs are for incoming traffic to some webservers and the mail relay in the DMZ.
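For reference, FreeBSD's CARP adds the skew, measured in 1/256 of a second, to the base interval in seconds, and the node that advertises more frequently is preferred as master. The sketch below only works through that arithmetic for the values in this report; the function name is hypothetical and the range checks follow the limits documented for `advbase` (1 to 255) and `advskew` (0 to 254).

```python
def carp_advertisement_interval(advbase: int, advskew: int) -> float:
    """Advertisement interval in seconds: advbase + advskew / 256."""
    if not 1 <= advbase <= 255:
        raise ValueError("advbase must be between 1 and 255")
    if not 0 <= advskew <= 254:
        raise ValueError("advskew must be between 0 and 254")
    return advbase + advskew / 256


print(carp_advertisement_interval(1, 0))    # master: 1.0
print(carp_advertisement_interval(1, 100))  # slave:  1.390625
```

With these settings the master advertises roughly every second and the slave roughly every 1.39 seconds, so the master wins whenever both nodes are healthy.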

NAT

Both machines have an Outbound NAT rule: This Firewall, any source port, any destination, any destination port, with NAT address x.x.x.20.

Additional Outbound NAT is configured for some machines, ports and the other CARP VIPs, e.g. outgoing mail uses the IP of the MX record and so on.

There is no problem if I switch from master to slave. But when switching back from slave to master, the default gateway on the master is missing. If I set it in the console, or simply save the master's WAN interface or System / Routing / Gateways / Edit in the GUI without changing anything, the default gateway is immediately set.

I have also done some debugging on console:

a) console on master

- enter persistent CARP maintenance mode on MASTER
- failover to slave, all connections established
- default gw lost on master (netstat -r)
- leave persistent CARP maintenance mode on MASTER
- all interfaces and services "green"
- only default gw lost
- route add default 62.x.x.17
- all is up

b) console on master

- ifconfig igb4 down (WAN interface)
- failover to slave, all connections established
- default gw present on master
- ifconfig igb4 up
- go back to master as active
- all interfaces and services "green"
- only default gw lost
- route add default 62.x.x.17
- all is up

c) console on master

- sysctl net.inet.carp.demotion=250
- failover to slave, all connections established
- default gw present on master
- sysctl net.inet.carp.demotion=-250
- go back to master as active
- all interfaces and services "green"
- default gw present on master!!!
- all is up

I tried c) several times and pf always switches perfectly between master and slave
without losing any connection.

If I simulate a lost WAN interface with b), the default gw remains present. The default
gw is not lost during failover, but when the master takes over again.

If I set the master in maintenance mode as in a), the default gw is lost immediately.

Why is the default gateway only preserved with c) but not with a) or b)?
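When repeating tests a) to c), checking for the default route by eye in the routing table is easy to get wrong. The sketch below scans routing-table text in the FreeBSD `netstat -rn` layout for a `default` entry; the function name is hypothetical, and the sample table is made up for illustration (documentation gateway address, arbitrary interface name), not captured from the affected systems.

```python
from typing import Optional


def default_gateway(netstat_output: str) -> Optional[str]:
    """Return the gateway of the first 'default' route, or None if absent."""
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "default":
            return fields[1]
    return None


# Illustrative sample only; addresses and interface name are placeholders.
SAMPLE_WITH_DEFAULT = """\
Destination        Gateway            Flags     Netif Expire
default            203.0.113.17       UGS        igb0
10.10.75.0/24      link#5             U          igb0
"""

SAMPLE_WITHOUT_DEFAULT = """\
Destination        Gateway            Flags     Netif Expire
10.10.75.0/24      link#5             U          igb0
"""

print(default_gateway(SAMPLE_WITH_DEFAULT))     # 203.0.113.17
print(default_gateway(SAMPLE_WITHOUT_DEFAULT))  # None
```

Note that the function returns the first `default` line it sees, so on a dual-stack system the IPv4 and IPv6 sections of the output would need to be checked separately.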

Associated revisions

Revision 8bffe226 (diff)
Added by Renato Botelho over 1 year ago

Fix #8465: Preserve default gw when switch to BACKUP

interfaces_carp_set_maintenancemode() calls interface_carp_configure()
to each configured CARP and it ends up reconfiguring completely the
interface when it's not necessary.

Add a new parameter $maintenancemode_only to interface_carp_configure()
and use it to only change advskew to 254 when going to forced
maintenance mode and move it back to configured value when leaving

Revision 31e18c7b (diff)
Added by Renato Botelho over 1 year ago

Fix #8465: Preserve default gw when switch to BACKUP

interfaces_carp_set_maintenancemode() calls interface_carp_configure()
to each configured CARP and it ends up reconfiguring completely the
interface when it's not necessary.

Add a new parameter $maintenancemode_only to interface_carp_configure()
and use it to only change advskew to 254 when going to forced
maintenance mode and move it back to configured value when leaving
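The idea in the commit messages above can be illustrated with a small model: entering maintenance mode only raises advskew to 254, and leaving it restores the configured value, with nothing else on the interface being touched. This is a Python sketch of the described behaviour, not pfSense's PHP code; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MAINTENANCE_ADVSKEW = 254  # value named in the commit message


@dataclass
class CarpVip:
    """Hypothetical stand-in for one configured CARP VIP."""
    vhid: int
    configured_advskew: int
    active_advskew: int


def set_maintenance_mode(vips: List[CarpVip], enabled: bool) -> None:
    """Change only advskew; addresses and routes are left alone."""
    for vip in vips:
        vip.active_advskew = (
            MAINTENANCE_ADVSKEW if enabled else vip.configured_advskew
        )


vips = [CarpVip(vhid=20, configured_advskew=0, active_advskew=0)]
set_maintenance_mode(vips, enabled=True)
print(vips[0].active_advskew)  # 254
set_maintenance_mode(vips, enabled=False)
print(vips[0].active_advskew)  # 0
```

Because the interface is not reconfigured in this model, there is no step at which a default route could be removed, which matches the intent stated in the commit subject.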

History

#1 Updated by Adam Sweet about 2 years ago

Can I ask if any investigation has been done on this or whether anyone else has been able to replicate it? This could bite me after upgrading to 2.4.3-p1 which is planned shortly for a production environment. I note the ticket is still unassigned after 3 months.

I see that this is reported in an environment using a 'non-local gateway', which is not something my environment has, but it's not clear whether this issue is specific to using a non-local gateway or not. Given the wide usage of CARP, I'd expect this issue would have been reported far more often if it were not.

I think it might have been reported in the forums separately here:

https://forum.netgate.com/topic/131367/route-lost-by-carp-change

#2 Updated by Tom Huerlimann almost 2 years ago

I was able to reproduce exactly the same issue with 2.4.3-p1-x64 and with 2.4.4.a.20180803.0952 as well.

Setup on Box 1

- WAN: 10.4.0.1/29
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/29
- CARP Address 2: xxx.xxx.84.235/29
- CARP Address 3: xxx.xxx.84.236/29
- CARP Address 4: xxx.xxx.84.237/29

Setup on Box 2

- WAN: 10.4.0.2/29
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/29
- CARP Address 2: xxx.xxx.84.235/29
- CARP Address 3: xxx.xxx.84.236/29
- CARP Address 4: xxx.xxx.84.237/29

Modifications I made for testing

- I changed WAN on Box 1 to xxx.xxx.84.225/28
- I changed WAN on Box 2 to xxx.xxx.84.226/28
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/28
- CARP Address 2: xxx.xxx.84.235/28
- CARP Address 3: xxx.xxx.84.236/28
- CARP Address 4: xxx.xxx.84.237/28

After the modifications above I was no longer able to reproduce the issue. However, I cannot leave this config in production, because my ISP did not assign a /28 subnet to me. As suggested around the web, I could technically use CARP with 3 IPs, as I have a /29 subnet with 4 usable addresses, but I prefer not to do this because from my point of view it's a waste of IP addresses. Additionally, if this can be solved, all those people with only one public IP could use CARP and benefit from HA (I've tested inbound & outbound NAT, port forwarding and IPsec, probably the things most people use in such setups).

#3 Updated by Steve Beaver almost 2 years ago

  • Assignee set to Renato Botelho

#4 Updated by Steve Beaver almost 2 years ago

  • Target version changed from 2.4.4 to 2.4.4-GS

#5 Updated by Steve Beaver almost 2 years ago

  • Target version changed from 2.4.4-GS to 2.4.4-p1

#6 Updated by John K over 1 year ago

I'm having the exact same issue with 2.4.4. Using IPs outside the WAN VIP subnet on the WAN interfaces causes the default gateway route to be lost when returning to the master after a failover. I simply can't sacrifice 3 public IPv4 addresses on the altar of pfSense HA.

Please increase the priority of this issue. Please stop pushing back the target version!

#7 Updated by Renato Botelho over 1 year ago

  • Status changed from New to In Progress

#8 Updated by Renato Botelho over 1 year ago

  • Status changed from In Progress to Feedback
  • % Done changed from 0 to 100

#9 Updated by Chris Linstruth over 1 year ago

Tested on CE build from Friday November 16th. Duplicated missing default gateway on primary node after failover and failback.

Upgraded both nodes to Nov 20. Default gateway was present through carp maintenance and back on the primary. Looks good.

#10 Updated by Renato Botelho over 1 year ago

  • Status changed from Feedback to Resolved

#11 Updated by Christian Grunfeld over 1 year ago

The same issue is back in 2.4.4-RELEASE-p2 (amd64) built on Wed Dec 12 07:40:18 EST 2018. Tested with one WAN IP (/30) and "gateway in non local net" is set.

Node A:
wan: 10.0.0.1/30
lan: 16X.XXX.100.251/24

Node B:
wan: 10.0.0.2/30
lan: 16X.XXX.100.252/24

Carp:
wan vip: 16X.XXX.198.154/30
lan vip: 16X.XXX.100.254/24

The default gateway of the nodes, 16X.XXX.198.153/30, is lost on "temporarily disable CARP" and "persistent CARP maintenance mode".

#12 Updated by Tom Huerlimann over 1 year ago

Hi all

The problem is still (or again) reproducible.

Best regards
Tom
