Project

General

Profile

Bug #5180

DDNS sometimes not updating after primary WAN failure

Added by ben selinger almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
High
Category:
Dynamic DNS
Target version:
Start date:
09/21/2015
Due date:
% Done:

0%

Estimated time:
Affected Version:
All
Affected Architecture:

Description

Since update 2.2.2, DDNS (noip) is not updating the secondary WAN address after the primary WAN fails.
WAN1 and WAN2 are grouped in Balance0. DDNS is configured to monitor Balance0. This configuration was working before the most recent update.

Currently, when WAN1 fails, the gateway failover occurs correctly and all routing continues as normal on WAN2. DDNS detects this failure and updates NoIP.com, but updates with the IP of WAN1.
Hitting force update does not result in the WAN2 IP being sent. If I manually log into NoIP and set the WAN2 address through their interface, everything works correctly.
When WAN1 comes back up and becomes primary again, I see that DDNS does correctly update with the WAN1 IP.

Here's what I'm seeing in the log: (replaced WAN1 IP with 9.9.9.9 for security)

Sep 21 14:09:38 php-fpm[21361]: /services_dyndns_edit.php: Gateways status could not be determined, considering all as up/active. (Group: balance0)
Sep 21 14:00:56 php-fpm[82788]: /rc.dyndns.update: phpDynDNS (cranesmartvoip.ddns.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
Sep 21 14:00:56 php-fpm: /services_dyndns_edit.php: phpDynDNS (cranesmartvoip.ddns.net): (Success) DNS hostname update successful.
Sep 21 14:00:56 php-fpm: /services_dyndns_edit.php: phpDynDNS (cranesmartvoip.ddns.net): (Success) DNS hostname update successful.
Sep 21 14:00:56 php-fpm: /services_dyndns_edit.php: phpDynDNS: updating cache file /conf/dyndns_balance0noip-free'cranesmartvoip.ddns.net'0.cache: 9.9.9.9
Sep 21 14:00:56 php-fpm: /services_dyndns_edit.php: phpDynDNS: updating cache file /conf/dyndns_balance0noip-free'cranesmartvoip.ddns.net'0.cache: 9.9.9.9
Sep 21 14:00:46 php-fpm: /services_dyndns_edit.php: phpDynDNS (cranesmartvoip.ddns.net): (Success) DNS hostname update successful.
Sep 21 14:00:46 php-fpm: /services_dyndns_edit.php: phpDynDNS (cranesmartvoip.ddns.net): (Success) DNS hostname update successful.
Sep 21 14:00:46 php-fpm: /services_dyndns_edit.php: phpDynDNS: updating cache file /conf/dyndns_balance0noip-free'cranesmartvoip.ddns.net'0.cache: 9.9.9.9
Sep 21 14:00:46 php-fpm: /services_dyndns_edit.php: phpDynDNS: updating cache file /conf/dyndns_balance0noip-free'cranesmartvoip.ddns.net'0.cache: 9.9.9.9
Sep 21 14:00:46 php-fpm: /services_dyndns_edit.php: Message sent to itsupport@cranesmart.com OK
Sep 21 14:00:46 php-fpm: /services_dyndns_edit.php: Message sent to itsupport@cranesmart.com OK
Sep 21 14:00:45 php-fpm: /services_dyndns_edit.php: DynDNS (cranesmartvoip.ddns.net): Processing dummy update on No-IP free account. IP temporarily set to 192.168.1.1
Sep 21 14:00:45 php-fpm: /services_dyndns_edit.php: DynDNS (cranesmartvoip.ddns.net): Processing dummy update on No-IP free account. IP temporarily set to 192.168.1.1
Sep 21 14:00:45 check_reload_status: Syncing firewall
Sep 21 14:00:44 check_reload_status: Reloading filter
Sep 21 14:00:44 check_reload_status: Restarting OpenVPN tunnels/interfaces
Sep 21 14:00:44 check_reload_status: Restarting ipsec tunnels
Sep 21 14:00:44 check_reload_status: updating dyndns GWtg
Sep 21 14:00:16 php-fpm[49456]: /services_dyndns.php: MONITOR: GWtg is down, omitting from routing group balance0

As mentioned, there have been no config changes since before the update to 2.2.2, and ddns WAS working correctly prior to update.

History

#1 Updated by c0n flux almost 4 years ago

Seeing what I think is a related (though different) issue in 2.2.4

DDNS set to monitor gateway group {WAN1_WAN2_FAILOVER} (use WAN1, failover to WAN2)

On forcing WAN1 Gateway down via System->Routing, hosts failover to WAN2 connection, however DDNS shows WAN1 IP in red (stale entry)

Leaving in this state for ~30 minutes, DDNS never updates to reflect WAN2 address

Performing "Save & Force Update", DDNS updates to reflect WAN2 address as expected

On restoring WAN1 Gateway via System->Routing, new connections are established over WAN1, however DDNS still reflects WAN2 IP

Leaving in this state for ~60 minutes, DDNS never updates until once again forced

This seems to be an issue with Gateway state changes not properly triggering DDNS updates

#2 Updated by Chris Buechler almost 4 years ago

  • Assignee set to Chris Buechler
  • Affected Architecture deleted (amd64)

looking into this while I'm working on #4990

#3 Updated by Chris Buechler almost 4 years ago

  • Status changed from New to Feedback

c0n flux: your issue is #5214, I just fixed that.

Ben: not seeing any issues along those lines. Granted I'm testing on 2.2.5, which is the same as 2.2.4 in that regard, but don't think anything has changed in that area since 2.2.2. The "Gateways status could not be determined, considering all as up/active" is suspect, what does your Status>Gateways look like when that update occurs?

#4 Updated by ben selinger almost 4 years ago

WAN0 shows as down and WAN1 up.
I'm beginning to suspect that something in the kernel went wonky because I believe queues also started going sideways as well. Will try to reproduce the issue this weekend and post results.
FYI, when I did test this the weekend after the outage, it reliably did not update DDNS when WAN0 went down, but reliably did update when WAN0 came back up. DDNS just never updated to WAN1 IP.

#5 Updated by Chris Buechler almost 4 years ago

  • Subject changed from DDNS not updating after primary WAN failure to DDNS sometimes not updating after primary WAN failure
  • Affected Version changed from 2.2.2 to All

there's a race condition of some sort here. I've only had success replicating once, and only after binding an IPsec connection to the gateway group as well (though that could be a coincidence, it happened 0 times in ~20 tries before doing that, and on the first try after though not any of the subsequent ~10 tries thus far).

The dynamic DNS update still sees the downed WAN as the IP it's supposed to be updating and skips the update as it sees no change in the cases we've been able to replicate. That seems to match the logs here too.

#6 Updated by Chris Buechler almost 4 years ago

  • Status changed from Feedback to Confirmed

this always coincides with "Gateways status could not be determined, considering all as up/active", which makes it pull the highest-tier interface's IP, and skip the dynamic DNS update because the IP hasn't changed. That's why it's never an issue when failing back to the highest tier.

#7 Updated by Chris Buechler almost 4 years ago

  • Target version set to 2.3

it doesn't necessarily always coincide with "Gateways status could not be determined, considering all as up/active", but that's the most common circumstance.

The root cause is a couple different apinger issues. Where you end up with "status could not be determined", apinger.status isn't populated. In another case, apinger.status doesn't update immediately upon receiving a -HUP with the new config.

Using this bit in function return_gateway_groups_array in gwlb.inc to save a copy of apinger.status and the gateways_status array to help troubleshoot.

        $timestamp = exec("date +'%Y%m%d%H%M%s'");
        $filename = "/root/" . $timestamp . "-gateways_status-array.txt";
        file_put_contents($filename, print_r($gateways_status, true));
        exec("cp /var/run/apinger.status /root/$timestamp-apinger.status");

#8 Updated by ben selinger almost 4 years ago

Okay, I'm not sure why this changed, as I've made no changes to config since opening the ticket, but I noticed this while testing tonight.
Looks to me like the script used to determine the active interfaces public IP get's confused if we're NAT'd. GWCP has an IP of 192.168.16.2, which goes through a router at 192.168.16.1, which provides WAN access via a 4G connection. I'm positive this wasn't an issue a few releases back, as our primary connection goes down all the time and all our VOIP phones would fail if the DDNS update didn't work.

Here's what I'm seeing in the log now:

Oct 9 21:25:48    php-fpm[8906]: /services_dyndns.php: Default gateway down setting GWCP as default!
Oct 9 21:25:47    php-fpm[8906]: /services_dyndns_edit.php: DynDNS (cranesmartvoip.ddns.net) There was an error trying to determine the public IP for interface - balance0(em2_vlan11). Probably interface is not a WAN interface.
Oct 9 21:25:47    php-fpm[8906]: /services_dyndns_edit.php: MONITOR: GWtg is down, omitting from routing group balance0
Oct 9 21:25:47    php-fpm[8906]: /services_dyndns_edit.php: Default gateway down setting GWCP as default!
Oct 9 21:25:47    php-fpm[8906]: /services_dyndns_edit.php: DynDNS (cranesmartvoip.ddns.net): running get_failover_interface for balance0. found em2_vlan11
Oct 9 21:25:47    php-fpm[8906]: /services_dyndns_edit.php: MONITOR: GWtg is down, omitting from routing group balance0
Oct 9 21:25:47    php-fpm[8906]: /services_dyndns_edit.php: Default gateway down setting GWCP as default!
Oct 9 21:25:47    php-fpm[8906]: /services_dyndns_edit.php: Dyndns debug information (cranesmartvoip.ddns.net): Could not resolve checkip.dyndns.org to IP using interface IP 192.168.16.2.
Oct 9 21:25:34    php-fpm[8906]: /services_dyndns_edit.php: MONITOR: GWtg is down, omitting from routing group balance0
Oct 9 21:25:34    php-fpm[8906]: /services_dyndns_edit.php: Default gateway down setting GWCP as default!
Oct 9 21:25:34    php-fpm[8906]: /services_dyndns_edit.php: DynDns: updatedns() starting

#9 Updated by Chris Buechler almost 4 years ago

In that last log snippet, you didn't have functional DNS resolution. That "Could not resolve checkip.dyndns.org" log means gethostbyname got no valid response after 3 attempts.

#10 Updated by ben selinger almost 4 years ago

Yup, I saw that and dug through the code and believe I've figured out what's going on (although I seem to have two separate issues... the lookup issue is the one I've figured out).

I had 4 DNS servers configured for our system. The first two are from our primary ISP, and they only allow lookups from their customers. The other two are 8.8.8.8 and 8.8.4.4. 8.8.8.8 was also my monitor IP for WAN0 while 8.8.4.4 was the monitor IP for WAN1. Because 8.8.4.4 is the monitor IP for WAN1, and pfSense uses DNS servers in a round robin fashion, 8.8.4.4 was being used for lookups. Because it's also the monitor IP for WAN1, those lookups would go over WAN1, even when WAN0 was still up (firewall rules force all traffic destined for 8.8.4.4 out via WAN1) (gateway monitoring).
Because WAN1 is a 4G, and is exorbitantly expensive, I decided to disable DNS lookups for 8.8.4.4, which brought WAN1 traffic down to 0 when it was not in use.

Incidentally however, I'd forgotten that my other three remaining DNS servers were either bound to WAN0's ISP, or 8.8.8.8 (monitor IP). Thus, when WAN0 went down, we had no DNS servers available. I guess I didn't notice this because we also have an internal DNS, so no one ever had any issue with name resolution, even though pfSense did.

I'm still using 8.8.4.4 for WAN1 monitor, but added a fourth DNS server, and failover and back now seems to work. It still doesn't answer the original problem, but that problem seems to have magically resolved itself.
I'll continue monitoring and come back here if the original issue pops up again, but for now, all seems smooth.

#11 Updated by Chris Buechler almost 4 years ago

Thanks for the feedback. The original issue doesn't always happen, so as long as the config is otherwise fine, you might only hit that 1 in 10 or 20 or more tries.

#12 Updated by Chris Buechler over 3 years ago

  • Status changed from Confirmed to Resolved

Confirmed fixed in 2.3, update always happens correctly. The race condition in apinger that caused this doesn't exist in dpinger.

Also available in: Atom PDF