Bug #4094: Gateway Status can report Online when gateway is waiting for DHCP - pfSense - pfSense bugtracker

Bug #4094

closed

Gateway Status can report Online when gateway is waiting for DHCP

Added by Phillip Davis over 10 years ago. Updated over 10 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Chris Buechler

Category:

Gateways

Target version:

2.2

Start date:

12/10/2014

% Done:

100%

Affected Version:

2.2

Description

Example system: 2 WANs (WAN and OPT1), both DHCP, that uplink to 2 different ISPs. In the screenshots they appear as gateway WAN_DHCP on interface WANGENERAL and gateway OPT1_DHCP on interface OPT1SUBISU.
WAN has monitor IP 8.8.4.4
OPT1 has monitor IP 8.8.8.8

The cable for OPT1 goes to a switch that then has a cable up to a rooftop ISP device that does the uplink. Cable to rooftop is unplugged (simulating a fault). pfSense OPT1 has a physical connection to the switch. Boot like this and OPT1 is waiting/trying to get DHCP.

WAN comes up fine, getting DHCP. A route to WAN monitor IP 8.8.4.4 is added through WAN gateway (10.172.1.1 learned from DHCP) - good.
There is no specific route to OPT1 monitor IP 8.8.8.8 because there is no gateway for OPT1 known yet.

"Enable default gateway switching" is off and OPT1 is the default gateway, but somehow there is a default route through WAN gateway 10.172.1.1 - that's handy, but I didn't expect it to happen.

So OPT1 monitor IP 8.8.8.8 can be reached happily out WAN. apinger is happily monitoring it and getting responses, so it considers OPT1 to be Online. Hence the misleading screenshots of gateway status that show both gateways online, even though the OPT1 gateway IP still says "dynamic".
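
To illustrate the route lookup described above, here is a minimal sketch of longest-prefix matching over the two routes mentioned in this report (the host route to 8.8.4.4 and the default route via 10.172.1.1). It is not pfSense or apinger code; the `lookup` helper and the `routes` list are hypothetical names used only for this example.

```python
import ipaddress

def lookup(routes, dst):
    """Return the gateway of the longest-prefix route matching dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, gateway in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway)
    return best[1] if best else None

# Routes as described in the report: a host route for the WAN monitor IP and
# a default route, both through the WAN gateway. There is no host route for
# the OPT1 monitor IP because OPT1 has no gateway yet.
routes = [
    ("8.8.4.4/32", "10.172.1.1"),
    ("0.0.0.0/0", "10.172.1.1"),
]

# The OPT1 monitor IP falls through to the default route, i.e. out WAN.
print(lookup(routes, "8.8.8.8"))
```

With no host route for 8.8.8.8, the only matching entry is the default route, so the probe leaves via the WAN gateway and gets an answer.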

My failover rules that prioritize some traffic out WAN and some out OPT1 are doing something weird - for example I am coming from a client on OPT2WIFI. Here are the rules generated for that:
--------
pass in quick on $OPT2WIFI inet from 10.49.212.250/22 to $INF_subnets tracker 1397570125 keep state label "USER_RULE: Allow packets to INF subnets"

returning at dst == "/" label "USER_RULE: Allow packets to Subisu WAN local subnet"
pass in quick on $OPT2WIFI inet from 10.49.212.250/22 to 10.172.1.0/24 tracker 1418222712 keep state label "USER_RULE: Allow packets to WAN local subnet"
rule Subisu Internal always to Subisu WAN disabled because gateway OPT1_DHCP is down label "USER_RULE: Subisu Internal always to Subisu WAN"
pass in quick on $OPT2WIFI $GWWAN1 inet from 10.49.212.250/22 to $INFemail tracker 1397451985 keep state label "USER_RULE: INF email special"
rule Allow all on WiFi disabled because gateway InetGeneral is down label "USER_RULE: Allow all on WiFi"
--------
and the gateway groups the system has decided it will define:
--------
Gateways
GWWAN_DHCP = " route-to ( vr0_vlan70 10.172.1.1 ) "
GWVPNclients = " route-to { ( vr0_vlan70 10.172.1.1 ) } "
GWWAN1 = " route-to { ( vr0_vlan70 10.172.1.1 ) } "
--------

Somehow it has decided that the InetGeneral Gateway Group is down - but that has OPT1 tier 1 (which is down) and WAN tier 2, which is up - so why is InetGeneral considered down?
As a result, the last rule quoted above "rule Allow all on WiFi disabled because gateway InetGeneral is down" has been disabled, and so the leftover traffic that was being directed to InetGeneral is going nowhere - most internet access does not work.
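
The behaviour the reporter expects from a tiered gateway group can be sketched as follows. This is an assumption-labelled illustration of "use the lowest-numbered tier that has an online member", not the actual pfSense implementation; `active_members` and the status strings are hypothetical.

```python
def active_members(group, status):
    """Pick the online gateways in the best (lowest-numbered) tier.

    group:  dict mapping gateway name -> tier number
    status: dict mapping gateway name -> "online" or "down"
    Returns a sorted list of gateway names; empty means the group is down.
    """
    online = {gw: tier for gw, tier in group.items()
              if status.get(gw) == "online"}
    if not online:
        return []
    best_tier = min(online.values())
    return sorted(gw for gw, tier in online.items() if tier == best_tier)

# InetGeneral as described: OPT1 is tier 1, WAN is tier 2.
inet_general = {"OPT1_DHCP": 1, "WAN_DHCP": 2}

# With OPT1 down and WAN online, the group should fall back to WAN
# rather than being treated as down.
print(active_members(inet_general, {"OPT1_DHCP": "down", "WAN_DHCP": "online"}))
```

Under this model the group is only down when every member is down, which is why the disabled "Allow all on WiFi" rule looks wrong.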

But looking at all the green on the dashboard, a network admin could easily miss the fact that OPT1 is down.

2 problems I see here:

1) The gateway group that contains OPT1 (waiting for DHCP) and WAN (got DHCP already) is being considered down, and rules using it are being disabled.

2) The Gateways Status and dashboard Gateways Widget are showing green Online for OPT1 when it does not even have an IP address yet.

I suspect that I could generate a scenario like this on 2.1.n also - never done this level of testing before.
As soon as I plug in the cable to the rooftop device on OPT1 and it gets an IP address (even if the ISP behind it is down), the system starts correctly monitoring OPT1 monitor IP 8.8.8.8, OPT1 shows as offline, all the gateway groups do their thing and internet comes back for all sites, failing everything over to WAN. So the problems all occur while OPT1 has not got its DHCP address yet.

Files

Gateway-Groups.png (27.3 KB), Phillip Davis, 12/10/2014 10:51 AM
Gateways.png (21 KB), Phillip Davis, 12/10/2014 10:51 AM
Gateways-Status.png (15.5 KB), Phillip Davis, 12/10/2014 10:51 AM
Gateways-Widget.png (4.05 KB), Phillip Davis, 12/10/2014 10:51 AM
Routes.png (47.5 KB), Phillip Davis, 12/10/2014 10:51 AM

Updated by Phillip Davis over 10 years ago

If I physically unplug OPT1, then everything fails over correctly to WAN. The issue seems to be only if a DHCP WAN-type interface is sitting plugged in waiting for DHCP and is not getting it.

Updated by Phillip Davis over 10 years ago

Note: Another hardware scenario where this can happen is if you have your upstream WAN devices connected to pfSense on a single ethernet trunk going to a VLAN switch. Each WAN is on a separate tagged VLAN on the trunk. Each upstream WAN device is on a normal untagged port on the VLAN switch that belongs to the appropriate VLAN.
pfSense sees the physical trunk as being always up, because the cable from pfSense to the VLAN switch is there and the VLAN switch has power.
If one or more of the WANs does not receive its IP address by DHCP then the sequence here will happen - Gateway Status looks like both gateways are Online, rules for gateway groups that have the down gateway as top tier seem to get disabled.

Updated by Jim Thompson over 10 years ago

Assignee set to Chris Buechler

assigned to CMB for now. (Evaluation).

I can think of a bunch of scenarios that are "racy" (DHCP can take a while). Haven't thought through what happens if we define "up" as "PHY is happy, 1 (or more) IP addresses assigned". What to do on address change events, for instance. (?)

Updated by Chris Buechler over 10 years ago

I'm almost to a point of confirming what Phil describes. Broke my system earlier and killed my VPN to where the test box was, just about have that fixed but going to be short on time to get too far into this until tomorrow.

When a WAN has no IP, or has 0.0.0.0 in the case of a DHCP client interface trying to reach out, it should consider the gateway as down. It seems like something's changed there, I'll get to the bottom of it tomorrow.

"Up" means the interface has an IP and gateway, and can ping its monitor IP via that source IP out via that gateway. Link status isn't taken into account (lack of link will make that test fail without actually checking link). For some reason, either that monitor IP can be pinged via a diff path, or it's ignoring the fact that it's not responding. My suspicion is it's somehow finding a way out another WAN, but that's largely a gut feel at this point.
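
The definition of "Up" given above can be written down as a small predicate. This is only a sketch of the stated rule, not code from pfSense or apinger; `gateway_is_up` and the `probe` callable are hypothetical, and the address 10.172.1.50 in the example is made up for illustration.

```python
def gateway_is_up(interface_ip, gateway_ip, probe):
    """Apply the rule: interface has an IP, has a gateway, and the monitor
    IP answers when pinged from that source IP via that gateway.

    probe(source_ip, gateway_ip) -> bool is supplied by the caller and stands
    in for the real ping test.
    """
    # No address, or the 0.0.0.0 placeholder of a DHCP client still asking.
    if not interface_ip or interface_ip == "0.0.0.0":
        return False
    # Gateway not learned yet (shown as "dynamic" in the status page).
    if not gateway_ip or gateway_ip == "dynamic":
        return False
    return bool(probe(interface_ip, gateway_ip))

# OPT1 while waiting for DHCP: down, regardless of what a ping would return.
print(gateway_is_up("0.0.0.0", "dynamic", lambda src, gw: True))
# An interface with a lease and a reachable monitor IP (addresses assumed).
print(gateway_is_up("10.172.1.50", "10.172.1.1", lambda src, gw: True))
```

The point of the first two checks is that the probe result is never consulted for an interface without an address, so a reply that arrives over another WAN cannot mark the gateway Online.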

Updated by Phillip Davis over 10 years ago

Because the interface has no IP address/gateway yet, there is no way for pfSense to set a specific route to the monitor IP - the gateway for that route is not yet determined.
The lack of a specific route means that the probes sent by apinger take the default route. If there is another gateway up that has the default route, then the probes succeed, and thus apinger thinks that things are good.
There could be some kludge to set a specific route to the monitor IP that sends it to some "black-hole" gateway that will never work. Then apinger will get 100% packet loss and probably the rest of the resulting events will make the interface/gateway really look down.
But there could be a much better solution also!

Updated by Phillip Davis over 10 years ago

Got annoyed about my DynDNS status attached to a gateway group showing the IP address in red, and realised it was a down-stream effect of this bug. After drilling down for a while, it seems that all these symptoms are fixed by not letting apinger monitor a gateway for an interface that still has IPv4 address 0.0.0.0
https://github.com/pfsense/pfsense/pull/1414
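
The idea described in this comment, leaving out any gateway whose interface still has IPv4 address 0.0.0.0 when building the list of monitor targets, could look roughly like the sketch below. It does not reproduce the contents of the pull request; the function name, the dictionary layout and the 10.172.1.50 address are assumptions made for the example.

```python
def gateways_to_monitor(gateways, interface_ipv4):
    """Return only the gateways whose interface has a usable IPv4 address.

    gateways:       list of dicts with "name", "interface" and "monitor" keys
    interface_ipv4: dict mapping interface name -> current IPv4 address
    """
    usable = []
    for gw in gateways:
        addr = interface_ipv4.get(gw["interface"])
        # Skip interfaces with no address or still at the DHCP placeholder.
        if not addr or addr == "0.0.0.0":
            continue
        usable.append(gw)
    return usable

gateways = [
    {"name": "WAN_DHCP", "interface": "wan", "monitor": "8.8.4.4"},
    {"name": "OPT1_DHCP", "interface": "opt1", "monitor": "8.8.8.8"},
]
# WAN has a lease (address assumed), OPT1 is still waiting for DHCP.
addresses = {"wan": "10.172.1.50", "opt1": "0.0.0.0"}

print([gw["name"] for gw in gateways_to_monitor(gateways, addresses)])
```

A gateway that is never handed to the monitor cannot be reported Online by it, which matches the observation that the downstream symptoms disappear with this change.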
