Bug #8465
closed
Lost default gateway after recover from failover with CARP VIP and HA
100%
Description
Both boxes run on SuperMicro boards, which have two onboard interfaces and an additional 4-port i350 network card. HA uses dedicated interfaces, directly connected without a switch. All other interfaces are connected to a switch with an untagged VLAN for every interface.
WAN Master and Slave - Switch VLAN WAN - ISP
LAN Master and Slave - Switch VLAN LAN - Internal net
DMZ Master and Slave - Switch VLAN DMZ - DMZ
GUEST Master and Slave - Switch VLAN Guest - Guest network
OPT Master and Slave - Switch VLAN OPT - currently not used
Master
WAN Interface: Static IPv4 10.10.75.251/24
Gateway: x.x.x.17
Slave
WAN Interface: Static IPv4 10.10.75.252/24
Gateway: x.x.x.17
The gateway is a public IP address, 62.x.x.17 and "use non local gateway" is set. Outbound NAT is also set (This firewall, WAN Interface, CARP VIP).
External IP
Currently there are 4 static external IPs configured as CARP VIP.
The "master" IP for outgoing traffic is x.x.x.20/29, with VHID group 20 on both nodes. The advertising frequency is Base = 1 and Skew = 0 on the master, and Base = 1 and Skew = 100 on the slave.
The other IPs are for incoming traffic to some webservers and the mailrelay in DMZ.
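To make the Base/Skew values above concrete: as far as I know, CARP's skew is counted in 1/256 of a second and added to the base, so the node with the larger skew advertises less often and loses the master election. A minimal sketch of that arithmetic, assuming this formula applies unchanged here (the function name carp_interval is made up for illustration):

```shell
#!/bin/sh
# Advertisement interval in seconds for a given base and skew.
# Assumption: interval = base + skew / 256 (skew is in 1/256 s units).
carp_interval() {
    awk -v base="$1" -v skew="$2" 'BEGIN { printf "%.2f\n", base + skew / 256 }'
}

carp_interval 1 0     # master: Base = 1, Skew = 0
carp_interval 1 100   # slave:  Base = 1, Skew = 100
```

With these settings the master advertises about every second and the slave slightly more slowly, which is why the master reclaims the VIP once it is healthy again.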
NAT
Both machines have an Outbound NAT rule: This Firewall, any source port, any destination, any destination port, with NAT address x.x.x.20.
Additional Outbound NAT rules are configured for some machines, ports and the other CARP VIPs; for example, outgoing mail uses the IP of the MX record.
There is no problem if I switch from master to slave. But when switching back from slave to master, the default gateway on the master is missing. If I set it in the console, or simply save the master's WAN interface or System / Routing / Gateways / Edit in the GUI without changing anything, the default gateway is immediately set.
I have also done some debugging on console:
a) console on master
- enter persistent CARP maintenance mode on MASTER
- failover to slave, all connections established
- default gw lost on master (netstat -r)
- leave persistent CARP maintenance mode on MASTER
- all interfaces and services "green"
- only default gw lost
- route add default 62.x.x.17
- all is up
b) console on master
- ifconfig igb4 down (WAN interface)
- failover to slave, all connections established
- default gw present on master
- ifconfig igb4 up
- go back to master as active
- all interfaces and services "green"
- only default gw lost
- route add default 62.x.x.17
- all is up
c) console on master
- sysctl net.inet.carp.demotion=250
- failover to slave, all connections established
- default gw present on master
- sysctl net.inet.carp.demotion=-250
- go back to master as active
- all interfaces and services "green"
- default gw present on master!!!
- all is up
I tried c) several times and pf always switches perfectly between master and slave without losing any connection.
If I simulate a lost WAN interface with b), the default gw stays present during failover. It is not lost during failover, but when the master takes over again.
If I put the master into maintenance mode with a), the default gw is lost immediately.
Why is the default gateway only restored with c), but not with a) or b)?
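The "default gw lost / present" checks in a), b) and c) can be repeated consistently with a small helper. This is only a sketch: it assumes FreeBSD's `route -n get default` prints a `gateway:` line when a default route exists, and the function names are made up for illustration.

```shell
#!/bin/sh
# Print the gateway field from `route -n get default` style output on stdin.
# Prints nothing when there is no gateway line (default route missing).
parse_default_gw() {
    awk '$1 == "gateway:" { print $2; exit }'
}

# Compare the current default gateway with the expected one passed as $1.
# Returns 0 when they match, 1 when the route is missing or different.
check_default_gw() {
    expected="$1"
    current="$(route -n get default 2>/dev/null | parse_default_gw)"
    if [ "$current" = "$expected" ]; then
        echo "default gw present: $current"
    else
        echo "default gw missing or wrong (found: '$current')"
        return 1
    fi
}
```

Run `check_default_gw <gateway>` on the master after each step, with the real gateway address in place of the masked one from this report.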
Updated by Adam Sweet over 6 years ago
Can I ask if any investigation has been done on this or whether anyone else has been able to replicate it? This could bite me after upgrading to 2.4.3-p1 which is planned shortly for a production environment. I note the ticket is still unassigned after 3 months.
I see that this is reported in an environment using a 'non-local gateway', which is not something my environment has, but it's not clear whether this issue is specific to using a non-local gateway or not. Given the wide usage of CARP, I'd expect this issue would have been reported far more often if it were not.
I think it might have been reported in the forums separately here:
https://forum.netgate.com/topic/131367/route-lost-by-carp-change
Updated by Tom Huerlimann over 6 years ago
I was able to reproduce exactly the same issue with 2.4.3-p1-x64 and with 2.4.4.a.20180803.0952 as well.
Setup on Box 1
- WAN: 10.4.0.1/29
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/29
- CARP Address 2: xxx.xxx.84.235/29
- CARP Address 3: xxx.xxx.84.236/29
- CARP Address 4: xxx.xxx.84.237/29
Setup on Box 2
- WAN: 10.4.0.2/29
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/29
- CARP Address 2: xxx.xxx.84.235/29
- CARP Address 3: xxx.xxx.84.236/29
- CARP Address 4: xxx.xxx.84.237/29
Modifications I made for testing
- I changed WAN on Box 1 to xxx.xxx.84.225/28
- I changed WAN on Box 2 to xxx.xxx.84.226/28
- GW: xxx.xxx.84.233
- CARP Address 1: xxx.xxx.84.234/28
- CARP Address 2: xxx.xxx.84.235/28
- CARP Address 3: xxx.xxx.84.236/28
- CARP Address 4: xxx.xxx.84.237/28
After the modifications above I was no longer able to reproduce the issue. However, I cannot leave this config in production, because my ISP did not assign a /28 subnet to me. As suggested around the web, technically I could use CARP with 3 IPs, as I have a /29 subnet with 4 usable addresses, but I prefer not to do this, because from my point of view it's a waste of IP addresses. Additionally, if this can be solved, it would allow everyone with only one public IP to use CARP and benefit from HA (I have tested inbound & outbound NAT, port forwarding and IPsec, probably the things most people use in such setups).
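The difference between the two setups above is whether the gateway lies inside the WAN interface's own subnet. The sketch below checks exactly that. Because the real addresses are masked, it uses documentation-range placeholders (198.51.100.x) with the same last octets; the function names are invented for illustration.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs="$IFS"; IFS=.
    set -- $1              # split the address into its four octets
    IFS="$old_ifs"
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed when gateway $2 is inside the subnet of interface address $1 (CIDR).
gw_is_local() {
    addr="${1%/*}"
    bits="${1#*/}"
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
    [ $(( $(ip_to_int "$addr") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# Original setup: private WAN address, public gateway (placeholder address).
gw_is_local 10.4.0.1/29 198.51.100.233 || echo "non-local gateway"
# Test setup: WAN address in a /28 that also contains the gateway.
gw_is_local 198.51.100.225/28 198.51.100.233 && echo "local gateway"
```

Only the first case needs the "use non local gateway" option, and that is the configuration in which the route was lost.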
Updated by Anonymous about 6 years ago
- Target version changed from 2.4.4 to 2.4.4-GS
Updated by Anonymous about 6 years ago
- Target version changed from 2.4.4-GS to 2.4.4-p1
Updated by John K about 6 years ago
I'm having the exact same issue with 2.4.4. Using IPs outside the WAN-VIP subnet on the WAN interfaces causes the default gateway route to be lost when returning to the master after a failover. I simply can't sacrifice 3 public IPv4 addresses on the altar of pfSense HA.
Please increase the priority of this issue. Please stop pushing back the target version!
Updated by Renato Botelho about 6 years ago
- Status changed from New to In Progress
Updated by Renato Botelho about 6 years ago
- Status changed from In Progress to Feedback
- % Done changed from 0 to 100
Applied in changeset 8bffe226d5183dda310dde2a89c78f2d8d79789c.
Updated by Chris Linstruth about 6 years ago
Tested on CE build from Friday November 16th. Duplicated missing default gateway on primary node after failover and failback.
Upgraded both nodes to Nov 20. Default gateway was present through carp maintenance and back on the primary. Looks good.
Updated by Renato Botelho about 6 years ago
- Status changed from Feedback to Resolved
Updated by Christian Grunfeld over 5 years ago
The same issue is back in 2.4.4-RELEASE-p2 (amd64) built on Wed Dec 12 07:40:18 EST 2018. Tested with one WAN IP (/30) and "gateway in non local net" is set.
Node A:
wan: 10.0.0.1/30
lan: 16X.XXX.100.251/24
Node B:
wan: 10.0.0.2/30
lan: 16X.XXX.100.252/24
Carp:
wan vip: 16X.XXX.198.154/30
lan vip: 16X.XXX.100.254/24
The default gateway of the nodes, 16X.XXX.198.153/30, is lost on "temporarily disable CARP" and "persistent CARP maintenance mode".
Updated by Tom Huerlimann over 5 years ago
Hi all
The problem is still (or again) reproducible.
Best regards
Tom
Updated by Milad Soltanian about 4 years ago
- File fixgw.sh.txt added
- File fixgw-pf.png added
We worked around the problem this way: first create a script that checks whether the default route still exists and, if it does not, adds it :)
I added a cron job to run it.
fixgw.sh:
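# Re-add the default route when it is missing or points elsewhere (10.10.10.1 is the expected gateway).
GW="10.10.10.1"
CURRENT_GW="$(route -n show 0.0.0.0 | grep gateway | cut -d ":" -f 2 | cut -d " " -f 2)"
if [ "$CURRENT_GW" != "$GW" ]; then route add -net 0.0.0.0/0 "$GW"; fi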