Regression #11570

closed

Gateway monitoring services is not always restarted on interface events, which may prevent a WAN from recovering back to an online state

Added by M L about 3 years ago. Updated 4 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Gateway Monitoring
Target version: -
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version:
Release Notes: Force Exclusion
Affected Version: 2.5.x
Affected Architecture: All

Description

Good evening. This seems to be a new bug in 2.5, and was not a problem in 2.4. In a gateway group configured for main/failover (tier 1 and tier 2), the switch from main to failover worked perfectly. But when the main is restored, it fails to even notice and doesn't fail back. This has been reported by numerous users in the subreddit. My post on reddit: https://www.reddit.com/r/PFSENSE/comments/lnuolf/failover_back_to_main_wan_not_switching_without/

This is actually a very expensive and troubling bug. Many people use an LTE modem with metered data, paying by the MB or GB for data. This bug keeps racking up dollars until you go in to manually change it back.

Main to failover switching:
  1. Unplug WAN1
  2. WAN1 interface status shows link down. Check.
  3. Gateway monitor detects loss and marks as offline. Check.
  4. Default gateway changes to WAN2. Check.
  5. Traffic begins flowing properly on WAN2 (only 30 seconds downtime). Check.
  6. Dynamic DNS clients (5) all get updated. Check.
  7. OpenVPN clients (3) all go down and come back up on WAN2. Check.
  8. All systems normal, no meltdowns, smoke contained in devices.
Failover back to main, not so great:
  1. Plug in WAN1
  2. WAN1 interface status shows link up with the IP. Check.
  3. Gateway monitor shows pending/unknown.
  4. The end. Default gateway fails to switch back to main, and obviously nothing else after that happens either.
I can go into System > Routing > Click Save/Apply (no changes), and that seems to kick the gateway monitor. The default gateway switches back to main.
  1. Traffic begins flowing on the main virtually uninterrupted. Check.
  2. Dynamic DNS clients all update back to the main. Check.
  3. OpenVPN clients fail to change back to the main. The OpenVPN clients all remain on WAN2. I have to restart the OpenVPN service for each client, and then they come back up on the main.
  4. All systems back to normal. Yay.

I understand the OpenVPN not cycling back may be an existing issue for many years that people solve with a cron job. But the rest of this problem is new with 2.5.
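
The failback symptom above can be illustrated with a small model. This is a hedged sketch, not pfSense's actual code: the `Gateway` class, the status strings, and the `pick_default` helper are all hypothetical, and the selection rule (use the lowest tier whose monitor status is online) is an assumption about how a main/failover group behaves. It only shows why a tier 1 gateway stuck at pending would never be selected again.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gateway:
    name: str
    tier: int
    status: str  # hypothetical values: "online", "down", "pending"


def pick_default(group: List[Gateway]) -> Optional[Gateway]:
    """Return the online gateway with the lowest tier number, or None."""
    online = [gw for gw in group if gw.status == "online"]
    return min(online, key=lambda gw: gw.tier) if online else None


# WAN1 is plugged back in, but its monitor never leaves "pending".
group = [Gateway("WAN1", 1, "pending"), Gateway("WAN2", 2, "online")]
stuck = pick_default(group)

# Once the monitor reports WAN1 as online, the same rule fails back.
group[0].status = "online"
recovered = pick_default(group)
```

Under this assumed rule, the failback depends entirely on the monitor status changing, which matches the observation that kicking the gateway monitor restores the main WAN.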


Files

clipboard-202206071059-njc9h.png (324 KB), → luckman212, 06/07/2022 09:59 AM
11570test.diff (1.36 KB), Marcos M, 06/07/2022 08:54 PM

Related issues

Related to Bug #11142: rc.newwanip restarts VPN services when the IP matches (Resolved, Viktor Gurov, 12/08/2020)
Related to Regression #12215: OpenVPN does not resync when running on a gateway group (Closed)
Related to Bug #12771: Automatic filter reload with OpenVPN client gateway uplink happens too soon or not at all (Resolved, Viktor Gurov)
Related to Bug #12613: DNS Resolver does not restart during link up/down events on a static IP address interface (Resolved, Viktor Gurov)
Related to Bug #12811: Services are not restarted when PPP interfaces connect (Resolved, Jim Pingle)
Related to Regression #14616: dpinger does not start after renewing DHCP (Resolved, Marcos M)
Related to Bug #12920: Gateway behavior differs when the gateway does not exist in the configuration (Feedback, Marcos M)
Related to Bug #14725: Primary IPv6 interface address may be incorrect when a ULA is set (Resolved, Marcos M)
Related to Bug #12947: DHCP6 client does not take any action if the interface IPv6 address changes during renewal (Feedback)

Actions #1

Updated by M L about 3 years ago

I forgot to mention... this problem only seems to occur when you fail the main by unplugging the WAN interface or powering off the modem, so that the link goes down. If you fail the main in some other way, for example by unplugging the coax to the cable modem, or the ISP goes down, everything works fine in both directions.

Actions #2

Updated by Viktor Gurov almost 3 years ago

related to #10716 and #11298 (?)

Actions #3

Updated by Viktor Gurov almost 3 years ago

M L wrote:

Failover back to main, not so great:
  1. Plug in WAN1
  2. WAN1 interface status shows link up with the IP. Check.
  3. Gateway monitor shows pending/unknown.
  4. The end. Default gateway fails to switch back to main, and obviously nothing else after that happens either.

Unable to reproduce this part - after a while the Gateway monitor shows "Online" and successfully restarts the filter/ovpn/ipsec on WAN1.

Maybe there is some kind of race condition

Actions #4

Updated by James Blanton almost 3 years ago

Viktor Gurov wrote:

M L wrote:

Failover back to main, not so great:
  1. Plug in WAN1
  2. WAN1 interface status shows link up with the IP. Check.
  3. Gateway monitor shows pending/unknown.
  4. The end. Default gateway fails to switch back to main, and obviously nothing else after that happens either.

Unable to reproduce this part - after a while the Gateway monitor shows "Online" and successfully restarts the filter/ovpn/ipsec on WAN1.

Maybe there is some kind of race condition

This sounds similar to my issue on Bug #11630.

Actions #5

Updated by Fred Latke almost 3 years ago

I can reproduce exactly the same behavior. If I lose connectivity to the ISP or disconnect the coaxial cable from my modem, the main WAN gateway gets placed as default just fine after the outage. If I disconnect the UTP cable or turn off the router, after everything's back up the interface status will show as up, but the gateways widget will show the interface as "offline, packet loss".

Going into System > Routing and clicking save/apply without any changes fixes everything.

Actions #6

Updated by Marcos M almost 3 years ago

It would seem this is fixed on 2.5.1/2.6 according to the comment on #11805

Hi, just want to report its working fine now for me using the latest dev CE version 2.6.0.a.20210524.0100
More details: Running in Hyper-V, Gateway group Load balancing with 3 Tier 1 Openvpn Gateways.
For me, 2.5.0-dev broke the Gateway Group. 2.5.1 broke Port forward and fixed Gateway Groups, 2.6.0.a fixed them both.

If you were/are having this issue, please test on either of these versions.

Actions #7

Updated by Jim Pingle almost 3 years ago

  • Status changed from New to Feedback
Actions #8

Updated by Lars Möller over 2 years ago

We are having the same problem on SG-3100, XG-7100, SG-5100. It occurs on 21.* up to 21.05.1. On 2.4.5 everything was fine.

The problem occurs if the main WAN is DHCP. In another setup where the main WAN is PPPoE, everything is working fine.

Here are 2 example setups:

Not working, it never switches back to main:
Main WAN: DHCP (LTE-Hybrid Router) (Interface is not going down, but has packet loss)
Backup WAN: DHCP (DSL-Router, very slow)
Gateway Group: "Packet Loss" or "Packet Loss or low latency"

Working fine in case of main WAN down (could not test packet loss case, main WAN is very reliable):
Main WAN: PPPOE (Fiber-Modem)
Backup WAN: fixed IPv4 (VDSL Lancom Router)
Gateway Group: "Packet Loss" or "Packet Loss or low latency"

The only workaround we could find is to manually switch WANs. Our customers are getting more and more frustrated. When can we expect a solution?

Actions #9

Updated by Chris B over 2 years ago

I'm seeing this on 21.05.2-RELEASE too. Once failover from WAN to WAN2 happens it will never fail back. The WAN gets a DHCP address but the gateway stays Pending. Even pulling out WAN2 completely just causes the default to go away and you end up with nothing. WAN never comes out of Pending until you bounce WAN.
WAN is Tier1 and WAN2 is Tier2.

Actions #10

Updated by Marcos M over 2 years ago

Tested this on 22.01.a.20211013.0500 - it worked correctly (as in the default gateway did change under Diagnostics / Routes). The logging is somewhat inconsistent however:

Statically assigned:

Nov 2 20:47:24     rc.gateway_alarm     62185     >>> Gateway alarm: WAN1GW (Addr:192.0.2.1 Alarm:1 RTT:.383ms RTTsd:.133ms Loss:22%)
Nov 2 20:47:24     check_reload_status     384     updating dyndns WAN1GW
Nov 2 20:47:24     check_reload_status     384     Restarting IPsec tunnels
Nov 2 20:47:24     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 20:47:24     check_reload_status     384     Reloading filter
Nov 2 20:47:25     php-fpm     40189     /rc.dyndns.update: MONITOR: WAN1GW has packet loss, omitting from routing group WANGWGROUP
Nov 2 20:47:25     php-fpm     40189     192.0.2.1|192.0.2.2|WAN1GW|0.385ms|0.134ms|24%|down|highloss
Nov 2 20:47:25     php-fpm     40189     /rc.dyndns.update: Gateway, switch to: WAN2GW
Nov 2 20:47:25     php-fpm     40189     /rc.dyndns.update: Default gateway setting WAN2GW as default.
Nov 2 20:47:25     php-fpm     14272     /rc.openvpn: Gateway, switch to: WAN2GW
Nov 2 20:47:25     php-fpm     14272     /rc.openvpn: Default gateway setting WAN2GW as default.
Nov 2 20:47:25     php-fpm     14272     /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 20:47:26     php-fpm     40189     /rc.dyndns.update: phpDynDNS: updating cache file /conf/dyndns_WANGWGROUP_rfc2136_'sitea.dyndns.lab.arpa'_ns1.lab.arpa.cache: 192.0.2.244
Nov 2 20:47:40     php-fpm     97321     /rc.ipsec: IPSEC: One or more IPsec tunnel gateways have changed. Refreshing.
Nov 2 20:47:40     check_reload_status     384     Reloading filter
Nov 2 20:47:41     php-fpm     97321     /rc.ipsec: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 20:49:26     rc.gateway_alarm     4482     >>> Gateway alarm: WAN1GW (Addr:192.0.2.1 Alarm:0 RTT:.394ms RTTsd:.196ms Loss:5%)
Nov 2 20:49:26     check_reload_status     384     updating dyndns WAN1GW
Nov 2 20:49:26     check_reload_status     384     Restarting IPsec tunnels
Nov 2 20:49:26     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 20:49:26     check_reload_status     384     Reloading filter
Nov 2 20:49:27     php-fpm     13321     /rc.dyndns.update: MONITOR: WAN1GW is available now, adding to routing group WANGWGROUP
Nov 2 20:49:27     php-fpm     13321     192.0.2.1|192.0.2.2|WAN1GW|0.394ms|0.195ms|4%|online|none
Nov 2 20:49:27     php-fpm     13321     /rc.dyndns.update: Gateway, switch to: WAN1GW
Nov 2 20:49:27     php-fpm     13321     /rc.dyndns.update: Default gateway setting WAN1GW as default.
Nov 2 20:49:27     php-fpm     38053     /rc.openvpn: Gateway, switch to: WAN1GW
Nov 2 20:49:27     php-fpm     38053     /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 20:49:28     php-fpm     13321     /rc.dyndns.update: phpDynDNS: updating cache file /conf/dyndns_WANGWGROUP_rfc2136_'sitea.dyndns.lab.arpa'_ns1.lab.arpa.cache: 192.0.2.4
Nov 2 20:49:42     check_reload_status     384     Reloading filter 

DHCP:

Nov 2 21:37:09     rc.gateway_alarm     82217     >>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:1 RTT:.855ms RTTsd:4.492ms Loss:21%)
Nov 2 21:37:09     check_reload_status     384     updating dyndns WAN1_DHCP
Nov 2 21:37:09     check_reload_status     384     Restarting IPsec tunnels
Nov 2 21:37:09     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 21:37:09     check_reload_status     384     Reloading filter
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: MONITOR: WAN1_DHCP has packet loss, omitting from routing group WANGWGROUP
Nov 2 21:37:10     php-fpm     45785     192.0.2.1|192.0.2.2|WAN1_DHCP|0.875ms|4.566ms|23%|down|highloss
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: Gateway, switch to: WAN2_DHCP
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: Default gateway setting Interface WAN2_DHCP Gateway as default.
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: route_add_or_change: Invalid gateway and/or network interface ipsec1
Nov 2 21:37:25     check_reload_status     384     Reloading filter
Nov 2 21:39:15     rc.gateway_alarm     94172     >>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:0 RTT:.408ms RTTsd:.142ms Loss:5%)
Nov 2 21:39:15     check_reload_status     384     updating dyndns WAN1_DHCP
Nov 2 21:39:15     check_reload_status     384     Restarting IPsec tunnels
Nov 2 21:39:15     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 21:39:15     check_reload_status     384     Reloading filter
Nov 2 21:39:31     php-fpm     19377     /rc.ipsec: IPSEC: One or more IPsec tunnel gateways have changed. Refreshing.
Nov 2 21:39:31     check_reload_status     384     Reloading filter
Nov 2 21:39:32     php-fpm     19377     /rc.ipsec: Gateway, none 'available' for inet6, use the first one configured. '' 

Another try using DHCP:

Nov 2 21:58:51     rc.gateway_alarm     2969     >>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:1 RTT:.447ms RTTsd:.242ms Loss:22%)
Nov 2 21:58:51     check_reload_status     384     updating dyndns WAN1_DHCP
Nov 2 21:58:51     check_reload_status     384     Restarting IPsec tunnels
Nov 2 21:58:51     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 21:58:51     check_reload_status     384     Reloading filter
Nov 2 21:58:53     php-fpm     45785     /rc.dyndns.update: phpDynDNS: updating cache file /conf/dyndns_WANGWGROUP_rfc2136_'sitea.dyndns.lab.arpa'_ns1.lab.arpa.cache: 192.0.2.242
Nov 2 21:58:54     php-fpm     45785     /rc.dyndns.update: phpDynDNS: Not updating sitea.dyndns.lab.arpa A record because the IP address has not changed.
Nov 2 21:59:07     check_reload_status     384     Reloading filter
Nov 2 22:00:13     rc.gateway_alarm     16897     >>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:0 RTT:.699ms RTTsd:3.371ms Loss:6%)
Nov 2 22:00:13     check_reload_status     384     updating dyndns WAN1_DHCP
Nov 2 22:00:13     check_reload_status     384     Restarting IPsec tunnels
Nov 2 22:00:13     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 22:00:13     check_reload_status     384     Reloading filter
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: MONITOR: WAN1_DHCP is available now, adding to routing group WANGWGROUP
Nov 2 22:00:15     php-fpm     19377     192.0.2.1|192.0.2.2|WAN1_DHCP|0.688ms|3.327ms|4%|online|none
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: Gateway, switch to: WAN1_DHCP
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: Default gateway setting Interface WAN1_DHCP Gateway as default.
Nov 2 22:00:15     php-fpm     45785     /rc.dyndns.update: Gateway, switch to: WAN1_DHCP
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: route_add_or_change: Invalid gateway and/or network interface ipsec1
Nov 2 22:00:15     php-fpm     45785     /rc.dyndns.update: phpDynDNS: updating cache file /conf/dyndns_WANGWGROUP_rfc2136_'sitea.dyndns.lab.arpa'_ns1.lab.arpa.cache: 192.0.2.2
Nov 2 22:00:16     php-fpm     45785     /rc.dyndns.update: phpDynDNS: Not updating sitea.dyndns.lab.arpa A record because the IP address has not changed. 
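
To compare runs like the ones above without reading every line, the `Gateway alarm` entries can be pulled out and turned into structured records. The following is a minimal sketch that assumes the log format is exactly as pasted in this comment; the function name `parse_alarm` and the field names in the returned dictionary are made up for illustration.

```python
import re
from typing import Dict, Optional

# Matches e.g.:
# ">>> Gateway alarm: WAN1GW (Addr:192.0.2.1 Alarm:1 RTT:.383ms RTTsd:.133ms Loss:22%)"
ALARM_RE = re.compile(
    r"Gateway alarm: (?P<name>\S+) "
    r"\(Addr:(?P<addr>\S+) Alarm:(?P<alarm>[01]) "
    r"RTT:(?P<rtt>[\d.]+)ms RTTsd:(?P<rttsd>[\d.]+)ms Loss:(?P<loss>\d+)%\)"
)


def parse_alarm(line: str) -> Optional[Dict[str, object]]:
    """Return the fields of a gateway alarm log line, or None for other lines."""
    m = ALARM_RE.search(line)
    if m is None:
        return None
    return {
        "gateway": m["name"],
        "addr": m["addr"],
        "alarm": m["alarm"] == "1",  # True when the alarm is raised
        "rtt_ms": float(m["rtt"]),
        "rttsd_ms": float(m["rttsd"]),
        "loss_pct": int(m["loss"]),
    }


sample = (
    "Nov 2 20:47:24     rc.gateway_alarm     62185     >>> Gateway alarm: "
    "WAN1GW (Addr:192.0.2.1 Alarm:1 RTT:.383ms RTTsd:.133ms Loss:22%)"
)
record = parse_alarm(sample)
```

Filtering a full log through `parse_alarm` gives the sequence of raised and cleared alarms per gateway, which can then be lined up against the "Default gateway setting ..." messages to see which runs logged the switch and which did not.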

Actions #11

Updated by Viktor Gurov over 2 years ago

  • Status changed from Feedback to New

same issue on 22.01.a.20211029.0500 - once failover from WAN to LTE happens it will never fail back until I manually click 'apply' on the System / Routing / Gateways page.

Actions #12

Updated by dave wilson about 2 years ago

Does anyone have a good automated workaround? I have Starlink (DHCP) as primary WAN and LTE modem w/ethernet as backup. Should I try assigning static IPs for primary? The manual 'click apply' isn't ideal if I'm not available to execute it.

Actions #13

Updated by Scott Silver about 2 years ago

I think I may have tracked down one of the problems here. It seems that pfSense is forgetting to reset the gateway monitor when the WAN interface comes back up in certain cases. In my case, the WAN IP comes back up as the same IP address as it was previously. So rc.newwanip, the script that runs when a WAN gets a new IP, seems to not reset the gateway monitor (because it checks for this case, possibly as an optimization, possibly for other reasons I don't understand).

Here are the details:

  • One of my interfaces goes away, so pfSense loses one of its WANs.
  • When it comes back pfSense requests a new IP via DHCP.
  • Subsequently there is the script rc.newwanip that is supposed to run when a WAN interface gets a new IP.
  • rc.newwanip guards this code with "isSameAsLastWANAddress()" and since my ISP issues the same address, pfSense does not run this code.
  • This code, in particular, would reset the gateway monitor. Since pfSense does not reset it, the old instance of the gateway monitor (dpinger) will continue to run. However, it can never send out any new ICMP/ping messages because the socket refers to a dead interface and not the new one so no pings come back.
  • Thus, dpinger never thinks the interface comes back.
  • So why does running dpinger from the command line work, even when the gateway monitor instance doesn't? When we run dpinger from the command line, dpinger gets a working socket for the new interface.
  • The "quick but probably wrong" fix is to make this code on line 204 always run. See that I OR'd in 1 into the conditional below.
if (/*added so we do this all the time*/ 1 || !is_ipaddr($oldip) || ($curwanip != $oldip) ||
    (!is_ipaddrv4($config['interfaces'][$interface]['ipaddr']) && ($config['interfaces'][$interface]['ipaddr'] != 'dhcp'))) {
    /*
     * Some services (e.g. dyndns, see ticket #4066) depend on
     * filter_configure() to be called before, otherwise pass out
     * route-to rules have the old ip set in 'from' and connections
     * do not go through the correct link
     */
    filter_configure_sync();

    /* reconfigure our gateway monitor, dpinger results need to be 
     * available when configuring the default gateway */
    setup_gateways_monitor();
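
The effect of that guard can be modeled in a few lines. This is a simplified sketch in Python rather than the PHP of rc.newwanip: it ignores the static-address part of the condition, the action names are just labels borrowed from the excerpt above, and `restart_services` is a placeholder for whatever else the guarded block does. The second function is a hypothetical narrower alternative in which only the monitor reset is unconditional; whether the eventual fix takes that form is not established here.

```python
from typing import List, Optional


def actions_guarded(old_ip: Optional[str], cur_ip: str) -> List[str]:
    """Model of the guarded block: nothing runs if the address is unchanged."""
    if old_ip is None or cur_ip != old_ip:
        return ["filter_configure_sync", "setup_gateways_monitor", "restart_services"]
    return []


def actions_monitor_always(old_ip: Optional[str], cur_ip: str) -> List[str]:
    """Hypothetical alternative: always reset the monitor, do the rest only on a change."""
    actions = ["setup_gateways_monitor"]
    if old_ip is None or cur_ip != old_ip:
        actions = ["filter_configure_sync"] + actions + ["restart_services"]
    return actions


# The ISP hands back the same lease after the link returns.
same_lease = actions_guarded("198.51.100.7", "198.51.100.7")
same_lease_alt = actions_monitor_always("198.51.100.7", "198.51.100.7")
```

With an unchanged address, the guarded model performs no actions at all, so the stale dpinger instance is left running; the alternative still resets the monitor without restarting anything else.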
Actions #14

Updated by Scott Silver about 2 years ago

Note that https://redmine.pfsense.org/issues/11142 is the bug whose fix was meant to solve a different problem.

I suspect the correct fix will not touch the VPN and will only reset gateway_monitor.

Actions #15

Updated by Viktor Gurov about 2 years ago

  • Related to Bug #11142: rc.newwanip restarts VPN services when the IP matches added
Actions #16

Updated by Viktor Gurov about 2 years ago

  • Tracker changed from Bug to Regression
Actions #18

Updated by Jim Pingle about 2 years ago

  • Assignee set to Viktor Gurov
  • Priority changed from High to Normal
  • Target version set to CE-Next
  • Plus Target Version set to 22.05
Actions #19

Updated by Jim Pingle about 2 years ago

  • Status changed from New to Pull Request Review
Actions #20

Updated by Viktor Gurov about 2 years ago

  • Related to Regression #12215: OpenVPN does not resync when running on a gateway group added
Actions #21

Updated by Viktor Gurov about 2 years ago

  • Related to Bug #12771: Automatic filter reload with OpenVPN client gateway uplink happens too soon or not at all added
Actions #22

Updated by Viktor Gurov about 2 years ago

  • Status changed from Pull Request Review to Feedback
  • % Done changed from 0 to 100
Actions #23

Updated by Viktor Gurov about 2 years ago

  • Related to Bug #12613: DNS Resolver does not restart during link up/down events on a static IP address interface added
Actions #24

Updated by → luckman212 about 2 years ago

Did this make it into 2.6 / 22.01 or do we need to use System Patches to get it? - edit nevermind, I see it's targeted at 22.05

Actions #25

Updated by Viktor Gurov about 2 years ago

  • Related to Bug #12811: Services are not restarted when PPP interfaces connect added
Actions #26

Updated by Jim Pingle about 2 years ago

  • Target version changed from CE-Next to 2.7.0
Actions #27

Updated by Wayne Sherman almost 2 years ago

Setup:
2.6.0-RELEASE (amd64), dual WAN with both WANs on DHCP, and failover via Gateway groups. (default gateway = PreferWAN1)

Test:
Unplugging one of the WAN network cables, wait for a few minutes, and then plug back in

Problems:
1) dpinger does not monitor a WAN port after the port comes back up
2) If I manually restart dpinger, both gateways show as online, but the default gateway does not switch back to WAN1.

Fixed by patch:
After applying the patch, both problems above are fixed.
( https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/ec73bb89489d830ec21c4e04ffa3ec401791b55d )

New problem after patching:
After applying the patch referenced above, a new problem shows up in the logs with an error trying to restart unbound:
pfSense php-fpm[373]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1648663263] unbound[14890:0] error: bind: address already in use [1648663263] unbound[14890:0] fatal error: could not open ports'

Unbound error in context:
Mar 30 11:00:53 pfSense php-fpm[372]: /rc.linkup: DEVD Ethernet attached event for opt1
Mar 30 11:00:53 pfSense php-fpm[372]: /rc.linkup: HOTPLUG: Configuring interface opt1
Mar 30 11:01:00 pfSense check_reload_status[411]: rc.newwanip starting igb1
Mar 30 11:01:00 pfSense check_reload_status[411]: Restarting IPsec tunnels
Mar 30 11:01:01 pfSense php-fpm[373]: /rc.newwanip: rc.newwanip: Info: starting on igb1.
Mar 30 11:01:01 pfSense php-fpm[373]: /rc.newwanip: rc.newwanip: on (IP address: 192.168.12.150) (interface: WAN2[opt1]) (real interface: igb1).
Mar 30 11:01:03 pfSense php-fpm[373]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1648663263] unbound[14890:0] error: bind: address already in use [1648663263] unbound[14890:0] fatal error: could not open ports'
Mar 30 11:01:03 pfSense check_reload_status[411]: updating dyndns opt1
Mar 30 11:01:04 pfSense php-fpm[373]: /rc.newwanip: Resyncing OpenVPN instances for interface WAN2.
Mar 30 11:01:04 pfSense php-fpm[373]: /rc.newwanip: Creating rrd update script
Mar 30 11:01:07 pfSense php-fpm[373]: /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 192.168.12.150 -> 192.168.12.150 - Restarting packages.
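
The last log line above shows the address before and after the event, and here they are identical. A short sketch for spotting such same-address reconnects in a log follows; it assumes the message format is exactly as pasted above, and the helper names `ip_transition` and `is_same_ip_reconnect` are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Matches the tail of: "... dynamic WAN reconnection - A -> B - Restarting packages."
CHANGE_RE = re.compile(r"- (?P<old>\S+) -> (?P<new>\S+) - Restarting packages")


def ip_transition(line: str) -> Optional[Tuple[str, str]]:
    """Return (old, new) addresses from a package-restart log line, else None."""
    m = CHANGE_RE.search(line)
    return (m["old"], m["new"]) if m else None


def is_same_ip_reconnect(line: str) -> bool:
    """True when the line reports a reconnect that kept the same address."""
    transition = ip_transition(line)
    return transition is not None and transition[0] == transition[1]


sample = (
    "Mar 30 11:01:07 pfSense php-fpm[373]: /rc.newwanip: pfSense package system "
    "has detected an IP change or dynamic WAN reconnection - "
    "192.168.12.150 -> 192.168.12.150 - Restarting packages."
)
same = is_same_ip_reconnect(sample)
```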

Actions #28

Updated by Jim Pingle almost 2 years ago

  • Subject changed from Gateway group doesn't failback from tier 2 to tier 1, worked properly in 2.4 to Gateway monitoring services is not always restarted on interface events, which may prevent a WAN from recovering back to an online state
Actions #29

Updated by Viktor Gurov almost 2 years ago

Wayne Sherman wrote in #note-27:

Setup:
2.6.0-RELEASE (amd64), dual WAN with both WANs on DHCP, and failover via Gateway groups. (default gateway = PreferWAN1)

Test:
Unplugging one of the WAN network cables, wait for a few minutes, and then plug back in

Problems:
1) dpinger does not monitor a WAN port after the port comes back up
2) If I manually restart dpinger, both gateways show as online, but the default gateway does not switch back to WAN1.

Fixed by patch:
After applying the patch, both problems above are fixed.
( https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/ec73bb89489d830ec21c4e04ffa3ec401791b55d )

New problem after patching:
After applying the patch referenced above, a new problem shows up in the logs with an error trying to restart unbound:
pfSense php-fpm[373]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1648663263] unbound[14890:0] error: bind: address already in use [1648663263] unbound[14890:0] fatal error: could not open ports'

Unable to reproduce on pfSense-22.05.a.20220407.0600 - everything works fine, without unbound errors.
Please test on the latest snapshots, and if it happens again, provide unbound configuration.

Actions #30

Updated by Jürgen Echter almost 2 years ago

Viktor Gurov wrote in #note-29:

Wayne Sherman wrote in #note-27:

Setup:
2.6.0-RELEASE (amd64), dual WAN with both WANs on DHCP, and failover via Gateway groups. (default gateway = PreferWAN1)

Test:
Unplugging one of the WAN network cables, wait for a few minutes, and then plug back in

Problems:
1) dpinger does not monitor a WAN port after the port comes back up
2) If I manually restart dpinger, both gateways show as online, but the default gateway does not switch back to WAN1.

Fixed by patch:
After applying the patch, both problems above are fixed.
( https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/ec73bb89489d830ec21c4e04ffa3ec401791b55d )

New problem after patching:
After applying the patch referenced above, a new problem shows up in the logs with an error trying to restart unbound:
pfSense php-fpm[373]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1648663263] unbound[14890:0] error: bind: address already in use [1648663263] unbound[14890:0] fatal error: could not open ports'

Unable to reproduce on pfSense-22.05.a.20220407.0600 - everything works fine, without unbound errors.
Please test on the latest snapshots, and if it happens again, provide unbound configuration.

I also applied the patch, but I still have the same problem. If I disable monitoring in the routing tab and re-enable it, it works again; otherwise it stays on pending on the dashboard and doesn't switch back to online.

If you need any information just tell me. pfSense 2.6.0

Actions #31

Updated by Marcos M almost 2 years ago

What interface(s) does unbound have assigned? Is this a VM?

Actions #32

Updated by Sage Badolato almost 2 years ago

I can also confirm that I can replicate this exact issue on my pfSense, both as a VM and on bare metal.

Using an HP DL360p Gen6, previously as a VM under Windows-based Hyper-V, and currently running on the same machine on bare metal. The machine has 2 on-board NICs, used for WAN and LAN, and a PCI-e Intel Pro Gigabit card for the failover WAN. All hardware is healthy and functioning. The primary ISP is a local cable provider (Gigabit/50mbps) and I have my own SB8200 for this. The failover ISP is a Verizon-powered CradlePoint MBR1400v2 with MC200LE-VZ (Verizon LTE USB modem add-on) on a very limited data plan.

I can reproduce this issue by simply unplugging the Ethernet cable or power cycling either modem. The end result is that the gateway just shows Pending under Gateway Status for the gateway in question. Worth noting: if I take down my primary WAN via power, wait for pfSense to fail over to the secondary WAN, and reconnect my primary WAN (knowing that it's a working connection, while still reported incorrectly by pfSense), and then take down the failover, the primary WAN will return to service with no issue and report properly. I can replicate this vice versa, but then the failover gateway will sit at pending status.

One other item worth noting that I haven't seen anyone else mention, and it may be why it's hard to replicate: I've only had this issue on pfSense installs where a gateway group has existed for more than 24 hours. If I spin up a fresh pfSense (bare metal or VM), it will fail over and fall back properly, every time. However, after about 24 hours pass, the fallback stops working, and we see the Pending status issue. It doesn't matter how old the pfSense install is. My current bare metal setup has been running for roughly 2 months with no failover setup whatsoever. I just configured this again last week as I just got the new CradlePoint (the previous Jetpack was trash).

I hope this makes sense.

Actions #33

Updated by Marcos M almost 2 years ago

I suggest testing on 22.05 BETA if possible. If the issue persists there, it may be related to https://redmine.pfsense.org/issues/12920.

Actions #34

Updated by Sage Badolato almost 2 years ago

I cannot test 22.05, I'm on community edition.

Actions #35

Updated by Jim Pingle almost 2 years ago

  • Plus Target Version changed from 22.05 to 22.09

Sage Badolato wrote in #note-34:

I cannot test 22.05, I'm on community edition.

You can try a recent 2.7.0 snapshot as well.

I'm re-targeting this at 22.09. There were no changes here and if it is related to the other linked issue then it'll be solved then.

Actions #36

Updated by → luckman212 over 1 year ago

I experienced this this morning, on 22.05.b.20220531.0600

- dpinger showed my DHCP6 gateway as "down"
- I ran pgrep -lf dpinger and confirmed dpinger was running on the right interface
- but, it was bound to a local IPv6 (bogon) and thus could not send outbound pings
- ping6 2001:4860:4860::8888 worked normally, both from the WAN modem itself and from pfSense console
- stopping and restarting the dpinger service did NOT restore the WAN6 to online state
- I had to edit the WAN6 interface (no changes) and hit Save -- then it was green again
- I ran ifconfig ix2 before and after, and noticed that the IP addresses below had swapped positions. Not sure if this is related or just a side effect.

edit: not sure what I was thinking when I labeled that screenshot but the left side (before) should be "not working" and the right side (after) should be "working"

Not sure if there is a better/separate issue to report this on? does it need a new Issue since in my case it's specific to DHCP6 + dpinger?
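
A quick way to tell whether the address dpinger is bound to can reach an external monitor at all is to classify it by scope. The following is a sketch using Python's standard `ipaddress` module; the function name `describe_source` and the category labels are illustrative, and the sample addresses used with it are examples rather than values taken from this report.

```python
import ipaddress

# Unique local addresses (ULA) live in fc00::/7.
ULA_NET = ipaddress.IPv6Network("fc00::/7")


def describe_source(addr: str) -> str:
    """Classify an IPv6 source address; a %scope suffix such as %ix2 is ignored."""
    ip = ipaddress.IPv6Address(addr.split("%")[0])
    if ip.is_link_local:
        return "link-local"
    if ip in ULA_NET:
        return "unique-local"
    if ip.is_global:
        return "global"
    return "other"


link_local = describe_source("fe80::1%ix2")
ula = describe_source("fd12:3456:789a::1")
public = describe_source("2001:4860:4860::8888")
```

Only a source classified as global would normally be expected to get replies from a public monitor address, so a bind address in either of the first two categories would be consistent with the gateway showing as down while a manual ping6 still works.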

Actions #37

Updated by Marcos M over 1 year ago

Tested on 22.05 RC.

I was not able to replicate this initially with WAN1 as DHCP and WAN2 as static. After testing a combination of DHCP/static on both, I was able to replicate the issue by doing the following:
  1. Release WAN DHCP
    • gateway status is pending (or missing if no gateway entry exists in config.xml - see #12920)
  2. Renew WAN DHCP
    • gateway status is pending

I then ran a diff between the previously working config and the broken config, and the difference was that a gateway entry existed in config.xml when it was working:

        <gateway_item>
            <interface>wan</interface>
            <gateway>dynamic</gateway>
            <name>WAN1_DHCP</name>
            <weight>1</weight>
            <ipprotocol>inet</ipprotocol>
            <descr><![CDATA[Interface WAN1_DHCP Gateway]]></descr>
        </gateway_item>

I was able to break/fix the issue multiple times by removing/adding that entry from config.xml. After many runs of testing however, I could no longer reproduce the issue even with the gateway entry missing. I don't know what the root cause is, but at the very least, it does seem like the missing gateway entry plays a part.

Attached is a test patch I'm using to work around this issue, though it seems to me both rc.newwanip and rc.newwanip6 need refactoring.
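
For anyone wanting to check their own configuration for the missing entry described above, here is a minimal sketch using Python's standard XML parser. The function name `has_gateway_entry` is made up, and the `<pfsense>` and `<gateways>` wrapper elements in the sample are assumptions about the surrounding file; the search itself does not depend on where `<gateway_item>` is nested.

```python
import xml.etree.ElementTree as ET


def has_gateway_entry(config_xml: str, interface: str, ipprotocol: str = "inet") -> bool:
    """Return True if any <gateway_item> matches the interface and protocol."""
    root = ET.fromstring(config_xml)
    for item in root.iter("gateway_item"):
        if (item.findtext("interface") == interface
                and item.findtext("ipprotocol") == ipprotocol):
            return True
    return False


# Sample built from the entry quoted above; the wrapper elements are assumed.
SAMPLE = """<pfsense><gateways>
  <gateway_item>
    <interface>wan</interface>
    <gateway>dynamic</gateway>
    <name>WAN1_DHCP</name>
    <weight>1</weight>
    <ipprotocol>inet</ipprotocol>
    <descr><![CDATA[Interface WAN1_DHCP Gateway]]></descr>
  </gateway_item>
</gateways></pfsense>"""

wan_present = has_gateway_entry(SAMPLE, "wan")
opt1_present = has_gateway_entry(SAMPLE, "opt1")
```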

Actions #38

Updated by → luckman212 over 1 year ago

I submitted a PR: https://github.com/pfsense/pfsense/pull/4595 that may help some of the cases being hit here.

Actions #39

Updated by Jim Pingle over 1 year ago

  • Status changed from Feedback to Pull Request Review
Actions #40

Updated by → luckman212 over 1 year ago

I've been running with the PR above for 2 days now, it's survived multiple reboots, and unplug/replug of the secondary WAN connection that provides my DHCPv6. So far so good. Just datapoint 1 of 1 but hopefully others can test and report.

Actions #41

Updated by → luckman212 over 1 year ago

Pushed more updates to my PR #4595 (see over there for details).

I had a down V6 gateway this morning and upon investigation, noticed the IP that was being returned by the get_usable_interface_ipv6() function had the "detached" flag in ifconfig. Researching this, it seems it might be related to a FreeBSD bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263986

I created a few helper functions to clean those up and hooked them into rc.newwanipv6.
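
For anyone who wants to check for the same symptom by hand, the sketch below scans `ifconfig` output for IPv6 addresses carrying the "detached" flag. It is not the code from the PR (those helpers live in pfSense itself); the function name and the sample output are illustrative assumptions, and the parser assumes the flag appears as a separate word after the address on an `inet6` line.

```python
from typing import List


def detached_inet6_addresses(ifconfig_output: str) -> List[str]:
    """Return IPv6 addresses whose ifconfig line carries the 'detached' flag."""
    detached = []
    for line in ifconfig_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "inet6" and "detached" in fields[2:]:
            # Drop a scope suffix such as %em1 if present.
            detached.append(fields[1].split("%")[0])
    return detached


# Illustrative output only; addresses are from the documentation prefix.
SAMPLE_IFCONFIG = """em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tinet6 fe80::1%em1 prefixlen 64 scopeid 0x2
\tinet6 2001:db8:1::10 prefixlen 64 detached autoconf
\tinet6 2001:db8:2::10 prefixlen 64 autoconf
"""

print(detached_inet6_addresses(SAMPLE_IFCONFIG))
```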

Tests indicate so far that this is working.

Actions #42

Updated by Alefe Ortiz over 1 year ago

Hello guys,

Configuration (scope):
Interfaces: WAN (DHCPv4) | WAN2 (DHCPv4)
Gateway Group: Failover (WAN_DHCP Tier1 Gateway: 192.168.10.1 Monitor IP: 208.67.220.220 | WAN2_DHCP2 Tier2 Gateway: 192.168.5.1 Monitor IP: 1.1.1.1)

Diagnostic 1:
1- WAN: cable connected. Status > Interfaces (up | DHCP up)
2- WAN2: cable disconnected. Status > Interfaces (no carrier | DHCP down)
3- System > Routing > Gateways (WAN_DHCP Tier1 Gateway: 192.168.10.1 Monitor IP: 208.67.220.220 | WAN2_DHCP Tier2 Gateway: "dynamic" Monitor IP: 1.1.1.1)
4- Status > Gateways (WAN_DHCP Gateway: 192.168.10.1 Monitor: 208.67.220.220 Status: Online | WAN2_DHCP Gateway: dynamic Monitor: "Empty" Status: Pending)
5- Action: Status > Interfaces (WAN): release DHCP or disconnect the interface cable
6- Status > Gateways (WAN_DHCP Gateway: dynamic Monitor: "Empty" Status: Pending | same for WAN2 | note: at this point the dpinger service is stopped!)
7- Action: Status > Interfaces (WAN): renew DHCP or reconnect the interface cable, or connect the WAN2 cable
8- Status > Gateways (Pending | note: the dpinger service is still stopped!)
--------------------------------------------------------------------------
Diagnostic 2:
1- WAN: cable connected. Status > Interfaces (up | DHCP up)
2- WAN2: cable connected. Status > Interfaces (up | DHCP up)
3- System > Routing > Gateways (WAN_DHCP Tier1 Gateway: 192.168.10.1 Monitor IP: 208.67.220.220 | WAN2_DHCP Tier2 Gateway: 192.168.5.1 Monitor IP: 1.1.1.1)
4- Status > Gateways (WAN_DHCP Gateway: 192.168.10.1 Monitor: 208.67.220.220 Status: Online | WAN2_DHCP Gateway: 192.168.5.1 Monitor: 1.1.1.1 Status: Online)
5- Action: Status > Interfaces (WAN): release DHCP or disconnect the interface cable
6- Status > Gateways (WAN_DHCP Gateway: dynamic Monitor: "Empty" Status: Pending | WAN2_DHCP Gateway: 192.168.5.1 Monitor: 1.1.1.1 Status: Online | note: the dpinger service is running and healthy)
7- Action: Status > Interfaces (WAN): renew DHCP or reconnect the interface cable
8- Status > Gateways (WAN_DHCP Gateway: dynamic Monitor: "Empty" Status: Pending, does not switch to Online | WAN2_DHCP Gateway: 192.168.5.1 Monitor: 1.1.1.1 Status: Online | note: the dpinger service is running and healthy)

Notes:
The problem does not occur with static IP interfaces.
The problem also occurs with PPPoE interfaces (action: disconnect or reconnect PPPoE). Note: everything seems to start the moment the interface goes down (no carrier) and loses its IP address.

Questions:

1- What can I do to temporarily resolve the issue?
2- Is this a bug in this version, and will it be fixed in version 2.7.0?

Firmware version: 2.6.0-RELEASE (amd64)

Actions #43

Updated by Jim Pingle over 1 year ago

  • Plus Target Version changed from 22.09 to 22.11
Actions #44

Updated by Jim Pingle over 1 year ago

  • Plus Target Version changed from 22.11 to 23.01
Actions #45

Updated by Jim Pingle over 1 year ago

  • Assignee deleted (Viktor Gurov)
  • Start date deleted (02/27/2021)
  • Plus Target Version changed from 23.01 to 23.05
Actions #46

Updated by robi robi about 1 year ago

Ran into this on my 2.6.0-RELEASE (amd64), which has two WANs, one PPPoE and one DHCP. On the DHCP one, the gateway occasionally had to be refreshed manually.

Applying the patch from note https://redmine.pfsense.org/issues/11570#note-27 fixed the issue.

Actions #47

Updated by Jim Pingle 11 months ago

  • Category changed from Gateways to Gateway Monitoring
Actions #48

Updated by Jim Pingle 10 months ago

  • Plus Target Version changed from 23.05 to 23.09
Actions #49

Updated by LTC Tech 9 months ago

We have an office that uses Starlink (CGNAT DHCP IP) and a slow FWA (Public Static IP) connection as backup. If the office loses power, Starlink takes a while to connect. When Starlink finally does connect, dpinger is either running with a stale bind address or missing from the process list altogether. Saving the gateway brings up dpinger with the correct source address and everything starts working through Starlink.

Rejecting leases from Starlink 192.168.100.1 DHCP doesn't seem to help. The Starlink router is in bypass mode but it appears to announce 192.168.100.0/24 via DHCP when it has no internet connection. In normal operation, both the host and gateway IP should be within the CGNAT range 100.64.0.0/10.
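
To tell the two situations apart in a script, the leased address can be tested against the two ranges mentioned above. A minimal sketch using Python's standard `ipaddress` module; the function name and the labels it returns are illustrative, not anything pfSense provides.

```python
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")              # normal operation
STARLINK_LOCAL = ipaddress.ip_network("192.168.100.0/24")  # announced while offline


def classify_lease(addr: str) -> str:
    """Label a leased IPv4 address by the range it falls in."""
    ip = ipaddress.ip_address(addr)
    if ip in CGNAT:
        return "cgnat"
    if ip in STARLINK_LOCAL:
        return "offline-fallback"
    return "other"


print(classify_lease("100.72.3.4"))       # inside 100.64.0.0/10
print(classify_lease("192.168.100.100"))  # inside 192.168.100.0/24
```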

Might be worthwhile to write a watchdog for dpinger...
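
A rough idea of what the decision logic of such a watchdog could look like. This only covers the check, not the restart: it takes process command lines (for example collected from `ps`) and reports gateways whose dpinger is missing or bound to the wrong address. Assumptions to verify on a real system: that dpinger is started with `-i <gateway name>` and `-B <bind address>`, and that the expected addresses are gathered elsewhere. The function name is hypothetical.

```python
import shlex
from typing import Dict, List


def dpinger_problems(cmdlines: List[str], expected: Dict[str, str]) -> Dict[str, str]:
    """Map gateway name -> problem for gateways without a correctly bound dpinger."""
    running = {}
    for cmd in cmdlines:
        args = shlex.split(cmd)
        if not args or not args[0].endswith("dpinger"):
            continue
        opts = {}
        # Assumed flags: -i <identifier>, -B <bind address>.
        for i in range(1, len(args) - 1):
            if args[i] in ("-i", "-B"):
                opts[args[i]] = args[i + 1]
        if "-i" in opts:
            running[opts["-i"]] = opts.get("-B", "")

    problems = {}
    for name, addr in expected.items():
        if name not in running:
            problems[name] = "missing"
        elif running[name] != addr:
            problems[name] = "stale bind address"
    return problems


# Illustrative input: one dpinger still bound to the offline Starlink address.
cmds = ["/usr/local/bin/dpinger -i WAN_DHCP -B 192.168.100.100 1.1.1.1"]
want = {"WAN_DHCP": "100.72.3.4", "WAN2": "203.0.113.5"}
print(dpinger_problems(cmds, want))
```

Any gateway reported here would be a candidate for the same action as manually saving the gateway.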

Actions #50

Updated by Jim Pingle 9 months ago

  • Target version changed from 2.7.0 to CE-Next
Actions #51

Updated by Darius ITGuys.net 9 months ago

I might have something to add. While inspecting my downloaded config.xml (CE 2.6.0) I noticed this:

        <gateways>
            <defaultgw4>Spectrum_Static</defaultgw4>
            <defaultgw6>-</defaultgw6>
        </gateways>

It references a WAN/gateway I no longer have, even though in the GUI, System > Routing > Gateways had "Default gateway IPv4" set to "Automatic".
This left pfSense with no default route at all in the Diagnostics > Routes > IPv4 Routes table.
Leaving it on Automatic and saving, and also re-saving the gateway (which may have fixed this for me in the past), neither solved it nor changed the incorrect value in the backup config.xml.

Manually changing the Default gateway IPv4 dropdown box to my actual gateway, "WAN_DHCP" solves the issue for me and fixes the config.xml.
Afterwards, switching it back to Automatic continues to work. (I haven't yet tested whether "Automatic" works after reboots or WAN down scenarios.)
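
A quick way to spot this kind of leftover in a config.xml backup is to compare the default gateway values against the gateway names that are actually defined. The sketch below makes a few assumptions: the function name is made up, an empty value or "-" is treated as not naming a specific gateway, and the `<pfsense>` wrapper and `WAN_DHCP` entry in the sample are illustrative. A name that is not found is only a hint, since a dynamic gateway can exist without an entry in config.xml (see #12920).

```python
import xml.etree.ElementTree as ET
from typing import Dict


def stale_default_gateways(config_xml: str) -> Dict[str, str]:
    """Return default gateway tags whose value names no defined <gateway_item>."""
    root = ET.fromstring(config_xml)
    names = {item.findtext("name") for item in root.iter("gateway_item")}
    stale = {}
    for tag in ("defaultgw4", "defaultgw6"):
        for node in root.iter(tag):
            value = (node.text or "").strip()
            # Assumption: empty or "-" does not refer to a named gateway.
            if value and value != "-" and value not in names:
                stale[tag] = value
    return stale


SAMPLE_CONFIG = """<pfsense>
  <gateways>
    <gateway_item>
      <interface>wan</interface>
      <gateway>dynamic</gateway>
      <name>WAN_DHCP</name>
    </gateway_item>
    <defaultgw4>Spectrum_Static</defaultgw4>
    <defaultgw6>-</defaultgw6>
  </gateways>
</pfsense>"""

print(stale_default_gateways(SAMPLE_CONFIG))
```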

Actions #52

Updated by Jim Pingle 6 months ago

  • Plus Target Version changed from 23.09 to 24.01

PR has conflicts and needs work/testing still

Actions #53

Updated by Marcos M 6 months ago

Actions #54

Updated by Marcos M 6 months ago

  • Status changed from Pull Request Review to Feedback
I believe the original issue description is related to the following two issues:
  • #14616 (a patch is available)
  • #12920 (a workaround exists)

The issue described in #note-36 should be resolved with #14725. A separate but related issue is #12947 which could use further testing.

As for PR 4595, I think it'd be best to revisit it after further testing/feedback on the above redmine issues.

Actions #55

Updated by Jim Pingle 5 months ago

  • Plus Target Version changed from 24.01 to 24.03
Actions #56

Updated by Azamat Khakimyanov 4 months ago

  • Status changed from Feedback to Resolved

Tested on 23.05_1 and on 23.09-BETA (built on Fri Oct 20 9:00:00 MSK 2023)

I was able to reproduce this issue on 23.05_1 by releasing and renewing the DHCP WAN IP, but only with IPv4 addresses.
Once I added IPv6 addresses, I no longer saw the issue.

I wasn't able to reproduce it on 23.09-BETA.

I marked this Regression as resolved.

Actions #57

Updated by Marcos M 4 months ago

  • Related to Bug #12920: Gateway behavior differs when the gateway does not exist in the configuration added
Actions #58

Updated by Marcos M 4 months ago

  • Related to Bug #14725: Primary IPv6 interface address may be incorrect when a ULA is set added
Actions #59

Updated by Marcos M 4 months ago

  • Related to Bug #12947: DHCP6 client does not take any action if the interface IPv6 address changes during renewal added
Actions #60

Updated by Marcos M 4 months ago

  • Status changed from Resolved to Closed
  • Target version deleted (CE-Next)
  • Plus Target Version deleted (24.03)
  • Release Notes changed from Default to Force Exclusion

There are a number of factors that can result in the issue described in the original post. These are detailed in separate redmines - see #note-54.
