pfSense

Related to Regression #12215: OpenVPN does not resync when running on a gateway group

Closed

Related to Bug #12771: Automatic filter reload with OpenVPN client gateway uplink happens too soon or not at all

Resolved

Viktor Gurov

Related to Bug #12613: DNS Resolver does not restart during link up/down events on a static IP address interface

Resolved

Viktor Gurov

Related to Bug #12811: Services are not restarted when PPP interfaces connect

Resolved

Jim Pingle

Related to Regression #14616: dpinger does not start after renewing DHCP

Resolved

Related to Bug #12920: Gateway behavior differs when the gateway does not exist in the configuration

Resolved

Related to Bug #14725: Primary IPv6 interface address may be incorrect when a ULA is set

Resolved

Related to Bug #12947: Old IPv6 addresses may continue to be used after DHCP or RA changes

Resolved

Updated by M L almost 5 years ago

I forgot to mention... this problem only seems to occur when you fail the main WAN by unplugging the WAN interface or powering off the modem, i.e. when the link itself goes down. If you fail the main WAN in a way that leaves the link up, for example by unplugging the coax to the cable modem or when the ISP goes down, everything works fine in both directions.

Updated by Viktor Gurov almost 5 years ago

related to #10716 and #11298 (?)

Updated by Viktor Gurov almost 5 years ago

M L wrote:

Failover back to main, not so great:

Plug in WAN1

WAN1 interface status shows link up with the IP. Check.

Gateway monitor shows pending/unknown.

The end. Default gateway fails to switch back to main, and obviously nothing else after that happens either.

Unable to reproduce this part - after a while the Gateway monitor shows "Online" and successfully restarts the filter/ovpn/ipsec on WAN1.

Maybe there is some kind of race condition

Updated by James Blanton almost 5 years ago

Viktor Gurov wrote:

M L wrote:

Failover back to main, not so great:

Plug in WAN1

WAN1 interface status shows link up with the IP. Check.

Gateway monitor shows pending/unknown.

The end. Default gateway fails to switch back to main, and obviously nothing else after that happens either.

Unable to reproduce this part - after a while the Gateway monitor shows "Online" and successfully restarts the filter/ovpn/ipsec on WAN1.

Maybe there is some kind of race condition

This sounds similar to my issue on Bug #11630.

Updated by Fred Latke almost 5 years ago

I can reproduce exactly the same behavior. If I lose connectivity to the ISP or disconnect the coaxial cable from my modem, the main WAN gateway is set as default again just fine after the outage. If I disconnect the UTP cable or turn off the router, then once everything is back up the interface status shows as up, but the Gateways widget shows the interface as "offline, packet loss".

Going into System > Routing and clicking save/apply without any changes fixes everything.

Updated by Marcos M over 4 years ago

It would seem this is fixed on 2.5.1/2.6 according to the comment on #11805

Hi, just want to report it's working fine now for me using the latest dev CE version 2.6.0.a.20210524.0100
More details: Running in Hyper-V, Gateway group Load balancing with 3 Tier 1 OpenVPN Gateways.
For me, 2.5.0-dev broke the Gateway Group. 2.5.1 broke Port forward and fixed Gateway Groups, 2.6.0.a fixed them both.

If you were/are having this issue, please test on either of these versions.

Updated by Jim Pingle over 4 years ago

Status changed from New to Feedback

Updated by Lars Möller over 4 years ago

We are having the same problem on SG-3100, XG-7100 and SG-5100. It occurs on 21.* up to 21.05.1. On 2.4.5 everything was fine.

The problem occurs if the main WAN is DHCP. In another setup where the main WAN is PPPoE, everything works fine.

Here 2 example setups:

Not working, it never switches back to main:
Main WAN: DHCP (LTE-Hybrid Router) (Interface is not going down, but has packet loss)
Backup WAN: DHCP (DSL-Router, very slow)
Gateway Group: "Packet Loss" or "Packet Loss or low latency"

Working fine in case of main WAN down (could not test packet loss case, main WAN is very reliable):
Main WAN: PPPoE (Fiber-Modem)
Backup WAN: fixed IPv4 (VDSL Lancom Router)
Gateway Group: "Packet Loss" or "Packet Loss or low latency"

The only work around we could find is to manually switch WANs. Our customers are getting more and more frustrated. When can we expect a solution?

Updated by Chris B over 4 years ago

I'm seeing this on 21.05.2-RELEASE too. Once failover from WAN to WAN2 happens, it never fails back. WAN gets a DHCP address but the gateway stays Pending. Even pulling out WAN2 completely just causes the default gateway to go away, and you end up with nothing. WAN never comes out of Pending until you bounce WAN.
WAN is Tier 1 and WAN2 is Tier 2.

#10

Updated by Marcos M over 4 years ago

Tested this on 22.01.a.20211013.0500 - it worked correctly (as in the default gateway did change under Diagnostics / Routes). The logging is somewhat inconsistent however:

Statically assigned:

Nov 2 20:47:24     rc.gateway_alarm     62185     >>> Gateway alarm: WAN1GW (Addr:192.0.2.1 Alarm:1 RTT:.383ms RTTsd:.133ms Loss:22%)
Nov 2 20:47:24     check_reload_status     384     updating dyndns WAN1GW
Nov 2 20:47:24     check_reload_status     384     Restarting IPsec tunnels
Nov 2 20:47:24     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 20:47:24     check_reload_status     384     Reloading filter
Nov 2 20:47:25     php-fpm     40189     /rc.dyndns.update: MONITOR: WAN1GW has packet loss, omitting from routing group WANGWGROUP
Nov 2 20:47:25     php-fpm     40189     192.0.2.1|192.0.2.2|WAN1GW|0.385ms|0.134ms|24%|down|highloss
Nov 2 20:47:25     php-fpm     40189     /rc.dyndns.update: Gateway, switch to: WAN2GW
Nov 2 20:47:25     php-fpm     40189     /rc.dyndns.update: Default gateway setting WAN2GW as default.
Nov 2 20:47:25     php-fpm     14272     /rc.openvpn: Gateway, switch to: WAN2GW
Nov 2 20:47:25     php-fpm     14272     /rc.openvpn: Default gateway setting WAN2GW as default.
Nov 2 20:47:25     php-fpm     14272     /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 20:47:26     php-fpm     40189     /rc.dyndns.update: phpDynDNS: updating cache file /conf/dyndns_WANGWGROUP_rfc2136_'sitea.dyndns.lab.arpa'_ns1.lab.arpa.cache: 192.0.2.244
Nov 2 20:47:40     php-fpm     97321     /rc.ipsec: IPSEC: One or more IPsec tunnel gateways have changed. Refreshing.
Nov 2 20:47:40     check_reload_status     384     Reloading filter
Nov 2 20:47:41     php-fpm     97321     /rc.ipsec: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 20:49:26     rc.gateway_alarm     4482     >>> Gateway alarm: WAN1GW (Addr:192.0.2.1 Alarm:0 RTT:.394ms RTTsd:.196ms Loss:5%)
Nov 2 20:49:26     check_reload_status     384     updating dyndns WAN1GW
Nov 2 20:49:26     check_reload_status     384     Restarting IPsec tunnels
Nov 2 20:49:26     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 20:49:26     check_reload_status     384     Reloading filter
Nov 2 20:49:27     php-fpm     13321     /rc.dyndns.update: MONITOR: WAN1GW is available now, adding to routing group WANGWGROUP
Nov 2 20:49:27     php-fpm     13321     192.0.2.1|192.0.2.2|WAN1GW|0.394ms|0.195ms|4%|online|none
Nov 2 20:49:27     php-fpm     13321     /rc.dyndns.update: Gateway, switch to: WAN1GW
Nov 2 20:49:27     php-fpm     13321     /rc.dyndns.update: Default gateway setting WAN1GW as default.
Nov 2 20:49:27     php-fpm     38053     /rc.openvpn: Gateway, switch to: WAN1GW
Nov 2 20:49:27     php-fpm     38053     /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 20:49:28     php-fpm     13321     /rc.dyndns.update: phpDynDNS: updating cache file /conf/dyndns_WANGWGROUP_rfc2136_'sitea.dyndns.lab.arpa'_ns1.lab.arpa.cache: 192.0.2.4
Nov 2 20:49:42     check_reload_status     384     Reloading filter

DHCP:

Nov 2 21:37:09     rc.gateway_alarm     82217     >>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:1 RTT:.855ms RTTsd:4.492ms Loss:21%)
Nov 2 21:37:09     check_reload_status     384     updating dyndns WAN1_DHCP
Nov 2 21:37:09     check_reload_status     384     Restarting IPsec tunnels
Nov 2 21:37:09     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 21:37:09     check_reload_status     384     Reloading filter
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: MONITOR: WAN1_DHCP has packet loss, omitting from routing group WANGWGROUP
Nov 2 21:37:10     php-fpm     45785     192.0.2.1|192.0.2.2|WAN1_DHCP|0.875ms|4.566ms|23%|down|highloss
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: Gateway, switch to: WAN2_DHCP
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: Default gateway setting Interface WAN2_DHCP Gateway as default.
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 21:37:10     php-fpm     45785     /rc.openvpn: route_add_or_change: Invalid gateway and/or network interface ipsec1
Nov 2 21:37:25     check_reload_status     384     Reloading filter
Nov 2 21:39:15     rc.gateway_alarm     94172     >>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:0 RTT:.408ms RTTsd:.142ms Loss:5%)
Nov 2 21:39:15     check_reload_status     384     updating dyndns WAN1_DHCP
Nov 2 21:39:15     check_reload_status     384     Restarting IPsec tunnels
Nov 2 21:39:15     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 21:39:15     check_reload_status     384     Reloading filter
Nov 2 21:39:31     php-fpm     19377     /rc.ipsec: IPSEC: One or more IPsec tunnel gateways have changed. Refreshing.
Nov 2 21:39:31     check_reload_status     384     Reloading filter
Nov 2 21:39:32     php-fpm     19377     /rc.ipsec: Gateway, none 'available' for inet6, use the first one configured. ''

Another try using DHCP:

Nov 2 21:58:51     rc.gateway_alarm     2969     >>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:1 RTT:.447ms RTTsd:.242ms Loss:22%)
Nov 2 21:58:51     check_reload_status     384     updating dyndns WAN1_DHCP
Nov 2 21:58:51     check_reload_status     384     Restarting IPsec tunnels
Nov 2 21:58:51     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 21:58:51     check_reload_status     384     Reloading filter
Nov 2 21:58:53     php-fpm     45785     /rc.dyndns.update: phpDynDNS: updating cache file /conf/dyndns_WANGWGROUP_rfc2136_'sitea.dyndns.lab.arpa'_ns1.lab.arpa.cache: 192.0.2.242
Nov 2 21:58:54     php-fpm     45785     /rc.dyndns.update: phpDynDNS: Not updating sitea.dyndns.lab.arpa A record because the IP address has not changed.
Nov 2 21:59:07     check_reload_status     384     Reloading filter
Nov 2 22:00:13     rc.gateway_alarm     16897     >>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:0 RTT:.699ms RTTsd:3.371ms Loss:6%)
Nov 2 22:00:13     check_reload_status     384     updating dyndns WAN1_DHCP
Nov 2 22:00:13     check_reload_status     384     Restarting IPsec tunnels
Nov 2 22:00:13     check_reload_status     384     Restarting OpenVPN tunnels/interfaces
Nov 2 22:00:13     check_reload_status     384     Reloading filter
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: MONITOR: WAN1_DHCP is available now, adding to routing group WANGWGROUP
Nov 2 22:00:15     php-fpm     19377     192.0.2.1|192.0.2.2|WAN1_DHCP|0.688ms|3.327ms|4%|online|none
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: Gateway, switch to: WAN1_DHCP
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: Default gateway setting Interface WAN1_DHCP Gateway as default.
Nov 2 22:00:15     php-fpm     45785     /rc.dyndns.update: Gateway, switch to: WAN1_DHCP
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 2 22:00:15     php-fpm     19377     /rc.openvpn: route_add_or_change: Invalid gateway and/or network interface ipsec1
Nov 2 22:00:15     php-fpm     45785     /rc.dyndns.update: phpDynDNS: updating cache file /conf/dyndns_WANGWGROUP_rfc2136_'sitea.dyndns.lab.arpa'_ns1.lab.arpa.cache: 192.0.2.2
Nov 2 22:00:16     php-fpm     45785     /rc.dyndns.update: phpDynDNS: Not updating sitea.dyndns.lab.arpa A record because the IP address has not changed.
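One visible difference in the captures above: in the first DHCP run, the Alarm:0 event at 21:39:15 is not followed by a "MONITOR: ... is available now" line, while the static run and the second DHCP run both log one. Since routing still switched correctly in that test, this is only a logging hint. The Python sketch below flags gateways whose alarm cleared without a later re-add message in a list of log lines. It is a hypothetical helper (the name `unrecovered_gateways` and both regexes are assumptions based only on the message formats shown here), not pfSense code.

```python
import re

# Matches rc.gateway_alarm lines such as
# ">>> Gateway alarm: WAN1_DHCP (Addr:192.0.2.1 Alarm:0 RTT:.408ms ...)"
ALARM_RE = re.compile(r">>> Gateway alarm: (\S+) \(Addr:\S+ Alarm:(\d)")
# Matches "MONITOR: WAN1_DHCP is available now, adding to routing group ..."
READD_RE = re.compile(r"MONITOR: (\S+) is available now")


def unrecovered_gateways(lines):
    """Return gateways whose most recent alarm cleared (Alarm:0) but which
    were not re-added to a gateway group later in the given log lines."""
    pending = set()
    for line in lines:
        alarm = ALARM_RE.search(line)
        if alarm:
            name, state = alarm.group(1), alarm.group(2)
            if state == "0":
                pending.add(name)      # alarm cleared: expect a re-add message
            else:
                pending.discard(name)  # alarm raised again: reset tracking
            continue
        readd = READD_RE.search(line)
        if readd:
            pending.discard(readd.group(1))
    return sorted(pending)
```

Fed the first DHCP capture, it would report `WAN1_DHCP`. That is a prompt to check Diagnostics / Routes, not proof that failback failed.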

#11

Updated by Viktor Gurov over 4 years ago

Status changed from Feedback to New

Same issue on 22.01.a.20211029.0500: once failover from WAN to LTE happens, it never fails back until I manually click 'Apply' on the System / Routing / Gateways page.

#12

Updated by dave wilson about 4 years ago

Does anyone have a good automated workaround? I have Starlink (DHCP) as primary WAN and LTE modem w/ethernet as backup. Should I try assigning static IPs for primary? The manual 'click apply' isn't ideal if I'm not available to execute it.

#13

Updated by Scott Silver about 4 years ago

I think I may have tracked down one of the problems here. It seems that in certain cases pfSense does not reset the gateway monitor when the WAN interface comes back up. In my case, the WAN comes back up with the same IP address as before. rc.newwanip, the script that runs when a WAN gets a new IP, does not reset the gateway monitor in that case, because it explicitly checks for an unchanged address (possibly as an optimization, possibly for other reasons I don't understand).

Here are the details:

One of my interfaces goes away, so pfSense loses one of its WANs.
When it comes back pfSense requests a new IP via DHCP.
Subsequently the script rc.newwanip runs; it is supposed to run whenever a WAN interface gets a new IP.
rc.newwanip guards this code with "isSameAsLastWANAddress()", and since my ISP issues the same address, pfSense does not run this code.
This code, in particular, would reset the gateway monitor. Since pfSense does not reset it, the old instance of the gateway monitor (dpinger) continues to run. However, it can never send out new ICMP echo requests, because its socket refers to the dead interface rather than the new one, so no replies come back.
Thus, dpinger never thinks the interface has come back.
So why does running dpinger from the command line work when the gateway monitor instance doesn't? When we run dpinger from the command line, it gets a working socket for the new interface.
The "quick but probably wrong" fix is to make the code at line 204 always run. Note that I OR'd a 1 into the conditional below.

if (/*added so we do this all the time*/ 1 || !is_ipaddr($oldip) || ($curwanip != $oldip) ||
    (!is_ipaddrv4($config['interfaces'][$interface]['ipaddr']) && ($config['interfaces'][$interface]['ipaddr'] != 'dhcp'))) {
    /*
     * Some services (e.g. dyndns, see ticket #4066) depend on
     * filter_configure() to be called before, otherwise pass out
     * route-to rules have the old ip set in 'from' and connections
     * do not go through the correct link
     */
    filter_configure_sync();

    /* reconfigure our gateway monitor, dpinger results need to be
     * available when configuring the default gateway */
    setup_gateways_monitor();

    /* ... remainder of the original rc.newwanip block omitted ... */
}
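To see why a DHCP WAN that gets the same lease back skips `setup_gateways_monitor()`, the conditional above can be modeled in a few lines. This is a Python sketch, not pfSense code: `is_ipaddr`/`is_ipaddrv4` are approximated with the standard `ipaddress` module, and `runs_full_reconfigure` plus its `force` flag (standing in for the `1 ||` hack) are hypothetical names.

```python
import ipaddress


def is_ipaddr(value):
    """Approximation of pfSense's is_ipaddr(): any valid IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def is_ipaddrv4(value):
    """Approximation of pfSense's is_ipaddrv4(): a valid IPv4 address only."""
    try:
        return isinstance(ipaddress.ip_address(value), ipaddress.IPv4Address)
    except ValueError:
        return False


def runs_full_reconfigure(oldip, curwanip, cfg_ipaddr, force=False):
    """Model of the rc.newwanip guard; True means the block that restarts
    the gateway monitor would run. force=True models the '1 ||' change."""
    return (force
            or not is_ipaddr(oldip)
            or curwanip != oldip
            or (not is_ipaddrv4(cfg_ipaddr) and cfg_ipaddr != "dhcp"))
```

With `cfg_ipaddr="dhcp"` and identical old/new addresses, every term is false, so dpinger keeps its old socket. A changed address, a missing old address, or `force=True` makes the block run.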

#14

Updated by Scott Silver about 4 years ago

Note that https://redmine.pfsense.org/issues/11142 was the earlier bug fix, which was trying to solve a different problem.

I suspect the correct fix will not touch the VPN and will only reset the gateway monitor.

#15

Updated by Viktor Gurov about 4 years ago

Related to Bug #11142: rc.newwanip restarts VPN services when the IP matches added

#16

Updated by Viktor Gurov about 4 years ago

Tracker changed from Bug to Regression

+ fix:
https://gitlab.netgate.com/pfSense/pfSense/-/merge_requests/516

#17

Updated by Viktor Gurov about 4 years ago

Partially fixed in https://github.com/pfsense/pfsense/commit/da836151dbd6dff0f8759ef165b24e0e173b078e

#18

Updated by Jim Pingle about 4 years ago

Assignee set to Viktor Gurov
Priority changed from High to Normal
Target version set to CE-Next
Plus Target Version set to 22.05

#19

Updated by Jim Pingle about 4 years ago

Status changed from New to Pull Request Review

#20

Updated by Viktor Gurov about 4 years ago

Related to Regression #12215: OpenVPN does not resync when running on a gateway group added

#21

Updated by Viktor Gurov about 4 years ago

Related to Bug #12771: Automatic filter reload with OpenVPN client gateway uplink happens too soon or not at all added

#22

Updated by Viktor Gurov about 4 years ago

Status changed from Pull Request Review to Feedback
% Done changed from 0 to 100

Applied in changeset ec73bb89489d830ec21c4e04ffa3ec401791b55d.

#23

Updated by Viktor Gurov about 4 years ago

Related to Bug #12613: DNS Resolver does not restart during link up/down events on a static IP address interface added

#24

Updated by → luckman212 about 4 years ago

Did this make it into 2.6 / 22.01, or do we need to use System Patches to get it? Edit: never mind, I see it's targeted at 22.05.

#25

Updated by Viktor Gurov about 4 years ago

Related to Bug #12811: Services are not restarted when PPP interfaces connect added

#26

Updated by Jim Pingle almost 4 years ago

Target version changed from CE-Next to 2.7.0

#27

Updated by Wayne Sherman almost 4 years ago

Setup:
2.6.0-RELEASE (amd64), dual WAN with both WANs on DHCP, and failover via Gateway groups. (default gateway = PreferWAN1)

Test:
Unplug one of the WAN network cables, wait for a few minutes, and then plug it back in

Problems:
1) dpinger does not monitor a WAN port after the port comes back up
2) If I manually restart dpinger, both gateways show as online, but the default gateway does not switch back to WAN1.

Fixed by patch:
After applying the patch, both problems above are fixed.
( https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/ec73bb89489d830ec21c4e04ffa3ec401791b55d )

New problem after patching:
After applying the patch referenced above, a new problem shows up in the logs with an error trying to restart unbound:
pfSense php-fpm[373]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1648663263] unbound[14890:0] error: bind: address already in use [1648663263] unbound[14890:0] fatal error: could not open ports'

Unbound error in context:
Mar 30 11:00:53 pfSense php-fpm[372]: /rc.linkup: DEVD Ethernet attached event for opt1
Mar 30 11:00:53 pfSense php-fpm[372]: /rc.linkup: HOTPLUG: Configuring interface opt1
Mar 30 11:01:00 pfSense check_reload_status[411]: rc.newwanip starting igb1
Mar 30 11:01:00 pfSense check_reload_status[411]: Restarting IPsec tunnels
Mar 30 11:01:01 pfSense php-fpm[373]: /rc.newwanip: rc.newwanip: Info: starting on igb1.
Mar 30 11:01:01 pfSense php-fpm[373]: /rc.newwanip: rc.newwanip: on (IP address: 192.168.12.150) (interface: WAN2[opt1]) (real interface: igb1).
Mar 30 11:01:03 pfSense php-fpm[373]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1648663263] unbound[14890:0] error: bind: address already in use [1648663263] unbound[14890:0] fatal error: could not open ports'
Mar 30 11:01:03 pfSense check_reload_status[411]: updating dyndns opt1
Mar 30 11:01:04 pfSense php-fpm[373]: /rc.newwanip: Resyncing OpenVPN instances for interface WAN2.
Mar 30 11:01:04 pfSense php-fpm[373]: /rc.newwanip: Creating rrd update script
Mar 30 11:01:07 pfSense php-fpm[373]: /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 192.168.12.150 -> 192.168.12.150 - Restarting packages.

#28

Updated by Jim Pingle almost 4 years ago

Subject changed from Gateway group doesn't failback from tier 2 to tier 1, worked properly in 2.4 to Gateway monitoring services is not always restarted on interface events, which may prevent a WAN from recovering back to an online state

#29

Updated by Viktor Gurov almost 4 years ago

Wayne Sherman wrote in #note-27:

Setup:
2.6.0-RELEASE (amd64), dual WAN with both WANs on DHCP, and failover via Gateway groups. (default gateway = PreferWAN1)

Test:
Unplug one of the WAN network cables, wait for a few minutes, and then plug it back in

Problems:
1) dpinger does not monitor a WAN port after the port comes back up
2) If I manually restart dpinger, both gateways show as online, but the default gateway does not switch back to WAN1.

Fixed by patch:
After applying the patch, both problems above are fixed.
( https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/ec73bb89489d830ec21c4e04ffa3ec401791b55d )

New problem after patching:
After applying the patch referenced above, a new problem shows up in the logs with an error trying to restart unbound:
pfSense php-fpm[373]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1648663263] unbound[14890:0] error: bind: address already in use [1648663263] unbound[14890:0] fatal error: could not open ports'

Unable to reproduce on pfSense-22.05.a.20220407.0600 - everything works fine, without unbound errors.
Please test on the latest snapshots, and if it happens again, provide unbound configuration.

#30

Updated by Jürgen Echter almost 4 years ago

Viktor Gurov wrote in #note-29:

Wayne Sherman wrote in #note-27:

Setup:
2.6.0-RELEASE (amd64), dual WAN with both WANs on DHCP, and failover via Gateway groups. (default gateway = PreferWAN1)

Test:
Unplug one of the WAN network cables, wait for a few minutes, and then plug it back in

Problems:
1) dpinger does not monitor a WAN port after the port comes back up
2) If I manually restart dpinger, both gateways show as online, but the default gateway does not switch back to WAN1.

Fixed by patch:
After applying the patch, both problems above are fixed.
( https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/ec73bb89489d830ec21c4e04ffa3ec401791b55d )

New problem after patching:
After applying the patch referenced above, a new problem shows up in the logs with an error trying to restart unbound:
pfSense php-fpm[373]: /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1648663263] unbound[14890:0] error: bind: address already in use [1648663263] unbound[14890:0] fatal error: could not open ports'

Unable to reproduce on pfSense-22.05.a.20220407.0600 - everything works fine, without unbound errors.
Please test on the latest snapshots, and if it happens again, provide unbound configuration.

I also added the patch, but I still have the same problem. If I disable monitoring in the Routing tab and re-enable it, it works again; otherwise it stays Pending on the dashboard and doesn't switch back to Online.

If you need any information, just tell me. pfSense 2.6.0

#31

Updated by Marcos M almost 4 years ago

What interface(s) does unbound have assigned? Is this a VM?

#32

Updated by Sage Badolato over 3 years ago

I can also confirm that I can replicate this exact issue on my pfSense, both as a VM and on bare metal.

I'm using an HP DL360p Gen6, previously as a Windows-based Hyper-V host and currently running pfSense on the same machine on bare metal. The machine has 2 on-board NICs, used for WAN and LAN, and a PCI-e Intel Pro Gigabit card for the failover WAN. All hardware is healthy and functioning. The primary ISP is a local cable provider (Gigabit/50 Mbps) and I have my own SB8200 for it. The failover ISP is a Verizon-powered CradlePoint MBR1400v2 with an MC200LE-VZ (Verizon LTE USB modem add-on) on a very limited data plan.

I can reproduce this issue by simply unplugging the Ethernet cable or power cycling either modem. The end result is that the gateway just shows Pending under Gateway Status. Worth noting: if I take down my primary WAN via power, wait for pfSense to fail over to the secondary WAN, reconnect my primary WAN (knowing it's a working connection, while pfSense still reports it incorrectly), and then take down the failover, the primary WAN returns to service with no issue and reports properly. I can replicate this vice versa, but then the failover gateway sits at Pending status.

One other item worth noting that I haven't seen anyone else mention, and it may be why this is hard to replicate: I've only had this issue on pfSense where a gateway group has existed for more than 24 hours. If I spin up a fresh pfSense (bare metal or VM), it fails over and falls back properly every time. However, after about 24 hours the fallback stops working, and we see the Pending status issue. The age of the pfSense install doesn't matter. My current bare metal setup has been running for roughly 2 months with no failover setup whatsoever; I just configured it again last week because I just got the new CradlePoint (the previous Jetpack was trash).

I hope this makes sense.

#33

Updated by Marcos M over 3 years ago

I suggest testing on 22.05 BETA if possible. If the issue persists there, it may be related to https://redmine.pfsense.org/issues/12920.

#34

Updated by Sage Badolato over 3 years ago

I cannot test 22.05, I'm on community edition.

#35

Updated by Jim Pingle over 3 years ago

Plus Target Version changed from 22.05 to 22.09

Sage Badolato wrote in #note-34:

I cannot test 22.05, I'm on community edition.

You can try a recent 2.7.0 snapshot as well.

I'm re-targeting this at 22.09. There were no changes here and if it is related to the other linked issue then it'll be solved then.

#36

Updated by → luckman212 over 3 years ago

File clipboard-202206071059-njc9h.png added

I experienced this this morning, on 22.05.b.20220531.0600

- dpinger showed my DHCP6 gateway as "down"
- I ran pgrep -lf dpinger and confirmed dpinger was running on the right interface
- but, it was bound to a local IPv6 (bogon) and thus could not send outbound pings
- ping6 2001:4860:4860::8888 worked normally, both from the WAN modem itself and from pfSense console
- stopping and restarting the dpinger service did NOT restore the WAN6 to online state
- I had to edit the WAN6 interface (no changes) and hit Save -- then it was green again
- I ran ifconfig ix2 before and after, and noticed that the IP addresses below had swapped positions. Not sure if this is related or just a side effect.
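The key symptom in the list above is that dpinger was bound to a non-routable IPv6 source, so its echo requests could never get a reply from the Internet. A quick way to classify a candidate source address is the standard `ipaddress` module; the helper name `usable_monitor_source` is hypothetical, and this is not how pfSense's `get_usable_interface_ipv6()` decides.

```python
import ipaddress


def usable_monitor_source(addr):
    """Return True if addr is a globally routable IPv6 address, i.e. a
    plausible source for pings to an Internet monitor IP."""
    ip = ipaddress.ip_address(addr.split("%")[0])  # drop a "%ix2"-style scope suffix
    # Link-local (fe80::/10) and ULA (fc00::/7) addresses are not global.
    return ip.version == 6 and ip.is_global
```

For example, `fe80::1%ix2` and a `fd..` ULA are rejected, while a global address in the same range as the `2001:4860:4860::8888` target that `ping6` reached is accepted.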

edit: not sure what I was thinking when I labeled that screenshot but the left side (before) should be "not working" and the right side (after) should be "working"

Not sure if there is a better/separate issue to report this on? Does it need a new issue, since in my case it's specific to DHCP6 + dpinger?

#37

Updated by Marcos M over 3 years ago

File 11570test.diff added

Tested on 22.05 RC.

I was not able to replicate this initially with WAN1 as DHCP and WAN2 as static. After testing a combination of DHCP/static on both, I was able to replicate the issue by doing the following:

Release WAN DHCP
- gateway status is pending (or missing if no gateway entry exists in config.xml - see #12920)
Renew WAN DHCP
- gateway status is pending

I then ran a diff between the previously working config and the broken config, and the difference was that a gateway entry existed in config.xml when it was working:

        <gateway_item>
            <interface>wan</interface>
            <gateway>dynamic</gateway>
            <name>WAN1_DHCP</name>
            <weight>1</weight>
            <ipprotocol>inet</ipprotocol>
            <descr><![CDATA[Interface WAN1_DHCP Gateway]]></descr>
        </gateway_item>

I was able to break/fix the issue multiple times by removing/adding that entry from config.xml. After many runs of testing however, I could no longer reproduce the issue even with the gateway entry missing. I don't know what the root cause is, but at the very least, it does seem like the missing gateway entry plays a part.

Attached is a test patch I'm using to work around this issue, though it seems to me both rc.newwanip and rc.newwanip6 need refactoring.
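Since the presence of a `gateway_item` for the interface seemed to matter in this test, a quick check of a downloaded config.xml can tell which case a box is in. The Python sketch below uses `xml.etree.ElementTree` and only assumes the element names shown in the snippet above (`gateway_item`, `interface`, `ipprotocol`); `has_gateway_entry` is a hypothetical helper, not part of pfSense.

```python
import xml.etree.ElementTree as ET


def has_gateway_entry(config_xml, interface, ipprotocol="inet"):
    """Return True if config_xml (str or bytes) contains a <gateway_item>
    for the given interface (e.g. "wan", "opt1") and IP protocol."""
    root = ET.fromstring(config_xml)
    # iter() searches the whole tree, so the exact parent element does not matter.
    for item in root.iter("gateway_item"):
        if (item.findtext("interface") == interface
                and item.findtext("ipprotocol") == ipprotocol):
            return True
    return False
```

Usage, reading the file as bytes so any XML declaration is handled by the parser: `has_gateway_entry(open("config.xml", "rb").read(), "wan")`.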

#38

Updated by → luckman212 over 3 years ago

I submitted a PR: https://github.com/pfsense/pfsense/pull/4595 that may help some of the cases being hit here.

#39

Updated by Jim Pingle over 3 years ago

Status changed from Feedback to Pull Request Review

#40

Updated by → luckman212 over 3 years ago

I've been running with the PR above for 2 days now, it's survived multiple reboots, and unplug/replug of the secondary WAN connection that provides my DHCPv6. So far so good. Just datapoint 1 of 1 but hopefully others can test and report.

#41

Updated by → luckman212 over 3 years ago

Pushed more updates to my PR #4595 (see over there for details).

I had a down V6 gateway this morning and upon investigation, noticed the IP that was being returned by the get_usable_interface_ipv6() function had the "detached" flag in ifconfig. Researching this, it seems it might be related to a FreeBSD bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263986

I created a few helper functions to clean those up and hooked them into rc.newwanipv6.

Tests indicate so far that this is working.
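The detection half of that idea can be sketched by scanning `ifconfig` output for `inet6` lines carrying the `detached` flag. This is a hypothetical Python helper, not the helper functions from PR #4595. It assumes FreeBSD prints the flag as a separate word after `prefixlen`, and the sample output in the usage note is illustrative, not captured from a real system.

```python
def detached_ipv6_addresses(ifconfig_output):
    """Return IPv6 addresses whose ifconfig line carries the 'detached' flag."""
    result = []
    for line in ifconfig_output.splitlines():
        fields = line.split()
        # Expected shape: inet6 <address> prefixlen <n> [flags...]
        if len(fields) >= 2 and fields[0] == "inet6" and "detached" in fields[2:]:
            result.append(fields[1])
    return result
```

For an illustrative line such as `inet6 2001:db8:1::10 prefixlen 64 detached autoconf`, the function returns `["2001:db8:1::10"]`. Removing the address would be a separate, deliberate step.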

#42

Updated by Alefe Ortiz over 3 years ago

hello guys

Configuration (Scope):
Interfaces: WAN-DHCP4|WAN2-DHCP4
Gateway Group: Failover (WAN_DHCP Gateway: 192.168.10.1 Tier1 Monitor IP: 208.67.220.220 | WAN2_DHCP2 Tier2 Gateway:192.168.5.1 Monitor IP: 1.1.1.1)

Diagnostic 1:
1-WAN- Cable Connected Status/interfaces (up | dhcp up)
2-WAN2- Cable Disconnected Status/interfaces (no carrier | dhcp down)
3-System/Routing Gateways (WAN_DHCP Tier1 Gateway: 192.168.10.1 Monitor IP: 208.67.220.220 | WAN2_DHCP Tier2 Gateway:"dynamic" Monitor IP: 1.1.1.1)
4-Status/Gateways (WAN_DHCP Gateway:192.168.10.1 Monitor:208.67.220.220 Status:Online | WAN2_DHCP Gateway:dynamic Monitor:"Empty" Status:Pending)
5-Action: Status/Interfaces (WAN): Release DHCP or disconnect interface cable
6-Status/Gateways (WAN_DHCP Gateway:dynamic Monitor:"Empty" Status:Pending | same for WAN2 | "note: at this point the dpinger service is stopped!")
7-Action: Status/Interfaces (WAN): DHCP Renew WAN or connect interface cable, or connect cable (WAN2)
8-Status/Gateways (Pending. "Note: dpinger service is still stopped!")
--------------------------------------------------------------------------
Diagnostic 2:
1-WAN- Cable Connected Status/interfaces (up | dhcp up)
2-WAN2- Cable Connected Status/interfaces (up | dhcp up)
3-System/Routing Gateways (WAN_DHCP Tier1 Gateway: 192.168.10.1 Monitor IP: 208.67.220.220 | WAN2_DHCP Tier2 Gateway:192.168.5.1 Monitor IP: 1.1.1.1)
4-Status/Gateways (WAN_DHCP Gateway:192.168.10.1 Monitor:208.67.220.220 Status:Online | WAN2_DHCP Gateway:192.168.5.1 Monitor:1.1.1.1 Status:Online)
5-Action: Status/Interfaces (WAN): Release DHCP or disconnect interface cable
6-Status/Gateways (WAN_DHCP Gateway:dynamic Monitor:"Empty" Status:Pending | WAN2_DHCP Gateway:192.168.5.1 Monitor:1.1.1.1 Status:Online | "note that the dpinger service is running and healthy")
7-Action: Status/Interfaces (WAN): DHCP Renew WAN or connect interface cable
8-Status/Gateways (WAN_DHCP Gateway:dynamic Monitor:"Empty" Status:Pending, does not switch to Online | WAN2_DHCP Gateway:192.168.5.1 Monitor:1.1.1.1 Status:Online | "note that the dpinger service is running and healthy")

Notes:
The problem does not occur with static IP interfaces.
*The problem also occurs with PPPoE interfaces (Action: Disconnect or Reconnect PPPoE. Note: everything seems to start the moment the interface goes down with no carrier and loses its IP addressing.)

Questions:

1-What can I do to temporarily resolve the issue?
2-Is this problem a bug in this version, and will it be fixed in version 2.7.0?

Firmware Version: (2.6.0-RELEASE (amd64))

#43

Updated by Jim Pingle over 3 years ago

Plus Target Version changed from 22.09 to 22.11

#44

Updated by Jim Pingle over 3 years ago

Plus Target Version changed from 22.11 to 23.01

#45

Updated by Jim Pingle over 3 years ago

Assignee deleted (Viktor Gurov)
Start date deleted (02/27/2021)
Plus Target Version changed from 23.01 to 23.05

#46

Updated by robi robi about 3 years ago

I ran into this on my 2.6.0-RELEASE (amd64) system, which has two WANs, one PPPoE and one DHCP. On the DHCP WAN, the gateway occasionally had to be refreshed manually.

Applying the patch from note https://redmine.pfsense.org/issues/11570#note-27 fixed the issue.

#47

Updated by Jim Pingle almost 3 years ago

Category changed from Gateways to Gateway Monitoring

#48

Updated by Jim Pingle almost 3 years ago

Plus Target Version changed from 23.05 to 23.09

#49

Updated by LTC Tech over 2 years ago

We have an office that uses Starlink (CGNAT DHCP IP) and a slow FWA (Public Static IP) connection as backup. If the office loses power then Starlink takes a while to connect. When Starlink finally does connect dpinger is either active with a stale binding address or missing from processes altogether. Saving the gateway brings up dpinger with correct source address and everything starts working through Starlink.

Rejecting DHCP leases from the Starlink server at 192.168.100.1 doesn't seem to help. The Starlink router is in bypass mode, but it appears to hand out 192.168.100.0/24 addresses via DHCP when it has no internet connection. In normal operation, both the host and gateway IPs should be within the CGNAT range 100.64.0.0/10.

Might be worthwhile to write a watchdog for dpinger...
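A minimal sketch of the detection half of such a watchdog, covering the two failure modes described above (dpinger missing, or running with a stale bind address). It assumes dpinger is started with `-i <gateway name>` and `-B <bind address>`, and that the caller supplies the process command lines (for example from `ps -axww -o args`) and each gateway's current source IP. The function name `check_dpinger` is hypothetical and is not part of pfSense.

```python
import shlex


def check_dpinger(cmdlines, expected):
    """Return {gateway: problem} for gateways whose dpinger is missing or stale.

    cmdlines: iterable of full process command lines (strings).
    expected: {gateway_name: current_source_ip}.
    Assumption: dpinger is invoked with `-i <gateway>` and `-B <bind address>`.
    """
    running = {}  # gateway name -> bind address of its dpinger process
    for line in cmdlines:
        argv = shlex.split(line)
        if not argv or not argv[0].endswith("dpinger"):
            continue  # not a dpinger process
        opts = {}
        i = 1
        # Pick out only the -i and -B values; other tokens are skipped.
        while i < len(argv) - 1:
            if argv[i] in ("-i", "-B"):
                opts[argv[i]] = argv[i + 1]
                i += 2
            else:
                i += 1
        if "-i" in opts:
            running[opts["-i"]] = opts.get("-B")

    problems = {}
    for gw, ip in expected.items():
        if gw not in running:
            problems[gw] = "missing"
        elif running[gw] != ip:
            problems[gw] = f"stale bind address {running[gw]} (expected {ip})"
    return problems
```

Restarting the flagged instances is deliberately left out. In this thread, the manual fix was re-saving the gateway.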

#50

Updated by Jim Pingle over 2 years ago

Target version changed from 2.7.0 to CE-Next

#51

Updated by Darius ITGuys.net over 2 years ago

I might have something to add. While inspecting my downloaded config.xml (CE 2.6.0) I noticed this:
<gateways>
<defaultgw4>Spectrum_Static</defaultgw4>
<defaultgw6>-</defaultgw6>
</gateways>
It references a WAN/gateway I no longer have, yet the GUI under System > Routing > Gateways showed "Default gateway IPv4" set to "Automatic".
This caused pfSense to not have a default route listed at all in the Diagnostics>Routes>IPv4 Routes table.
Leaving it on Automatic and saving, and also re-saving the gateway (which may have fixed this for me in the past), neither solved it nor changed the incorrect value in the backup config.xml.

Manually changing the Default gateway IPv4 dropdown box to my actual gateway, "WAN_DHCP" solves the issue for me and fixes the config.xml.
Afterwards, switching it back to Automatic continues to work. (I haven't yet tested whether "Automatic" works after reboots or WAN down scenarios.)
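A rough sketch for spotting this condition in a downloaded config.xml. It assumes that configured gateways appear as `<gateway_item><name>` entries under `<gateways>`, that an empty value means "Automatic", and that `-` means "None". Dynamic gateways such as WAN_DHCP may not have a `gateway_item` entry, so the caller can pass those names separately. The function `stale_default_gateways` is hypothetical.

```python
import xml.etree.ElementTree as ET


def stale_default_gateways(config_xml, extra_names=()):
    """Return {defaultgwN: value} for default-gateway entries naming unknown gateways."""
    root = ET.fromstring(config_xml)
    # Accept either a full <pfsense> config or just the <gateways> fragment.
    gateways = root if root.tag == "gateways" else root.find("gateways")
    if gateways is None:
        return {}
    names = {n.text for n in gateways.findall("gateway_item/name") if n.text}
    names.update(extra_names)  # e.g. dynamic gateways seen in Status > Gateways
    stale = {}
    for key in ("defaultgw4", "defaultgw6"):
        value = (gateways.findtext(key) or "").strip()
        # Assumption: "" means Automatic and "-" means None; neither is a reference.
        if value in ("", "-"):
            continue
        if value not in names:
            stale[key] = value
    return stale
```

Run against the fragment quoted above with `extra_names={"WAN_DHCP"}`, it reports that `defaultgw4` still points at `Spectrum_Static`.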

#52

Updated by Jim Pingle over 2 years ago

Plus Target Version changed from 23.09 to 24.01

The PR has conflicts and still needs work/testing.

#53

Updated by Marcos M over 2 years ago

Related to Regression #14616: dpinger does not start after renewing DHCP added

#54

Updated by Marcos M over 2 years ago

Status changed from Pull Request Review to Feedback

I believe the original issue description is related to the following two issues:

#14616 (a patch is available)
#12920 (a workaround exists)

The issue described in #note-36 should be resolved with #14725. A separate but related issue is #12947 which could use further testing.

As for PR 4595, I think it'd be best to revisit it after further testing/feedback on the above redmine issues.

#55

Updated by Jim Pingle over 2 years ago

Plus Target Version changed from 24.01 to 24.03

#56

Updated by Azamat Khakimyanov over 2 years ago

Status changed from Feedback to Resolved

Tested on 23.05_1 and on 23.09-BETA (built on Fri Oct 20 9:00:00 MSK 2023)

I was able to reproduce this issue on 23.05_1 by releasing and renewing DHCP WAN IP but only with IPv4 addresses.
But when I added IPv6 addresses, I didn't see this issue.

I wasn't able to reproduce it on 23.09-BETA.

I marked this Regression as resolved.

#57

Updated by Marcos M over 2 years ago

Related to Bug #12920: Gateway behavior differs when the gateway does not exist in the configuration added

#58

Updated by Marcos M over 2 years ago

Related to Bug #14725: Primary IPv6 interface address may be incorrect when a ULA is set added

#59

Updated by Marcos M over 2 years ago

Related to Bug #12947: Old IPv6 addresses may continue to be used after DHCP or RA changes added
