Bug #3179
closed
Gateway failure not properly detected in certain cases using a monitor IP outside of the WAN's subnet
Added by Jim Pingle about 11 years ago.
Updated about 11 years ago.
Description
Still researching this a bit but it needs an entry so things don't get lost.
Currently, I have two WANs, DSL and CABLE. DSL is the default. The CABLE WAN is having issues and at times experiencing 80-90% loss. The monitor IP on CABLE is 8.8.8.8. Apinger, however, is not reporting loss during these times. It is showing the gateway as online, even though the circuit is experiencing massive loss confirmed by other methods.
Because of the loss and lack of detected failure, manually reconfiguring the gateway groups is necessary to regain usable connectivity from behind the firewall.
This may be due to the removal of static routes for apinger targets, but testing is needed to confirm.
I can provide some input on this issue as well.
On 2 of my 8 firewalls this problem happens consistently. On the remaining 6, it usually shows up after the default gateway fails over once.
I have WAN and OPT1 interfaces. Default GW is on WAN. Neither monitor IP is on its gateway's subnet. A packet capture on the OPT1 interface shows none of the ICMP packets. On the WAN interface I see ICMP packets to the monitor IPs of both WAN's GW and OPT1's GW. The ICMP destined for the monitor IP of OPT1's GW has the OPT1 interface address as its source IP, but the packet itself is sent out the WAN interface.
My workaround right now is to add static routes for the monitor IPs.
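The capture result above is consistent with ordinary longest-prefix route selection. Here is a minimal sketch of that reasoning, not taken from pfSense or apinger. All addresses, the `lookup` helper, and the route table are hypothetical, using documentation ranges. With no host route, a monitor IP outside the OPT1 subnet matches only the default route via WAN; adding a /32 static route sends it out OPT1.

```python
import ipaddress

def lookup(routes, dst):
    """Return the interface of the most specific route matching dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, ifname) for net, ifname in routes if addr in net]
    # Longest-prefix match: the route with the largest prefix length wins.
    net, ifname = max(matches, key=lambda m: m[0].prefixlen)
    return ifname

# Hypothetical routing table, for illustration only.
routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "WAN"),        # default route via the WAN gateway
    (ipaddress.ip_network("198.51.100.0/24"), "WAN"),  # WAN subnet
    (ipaddress.ip_network("203.0.113.0/24"), "OPT1"),  # OPT1 subnet
]

# Hypothetical monitor IP for the OPT1 gateway, outside the OPT1 subnet.
monitor_ip = "192.0.2.53"

# Without a host route, only the default route matches.
before = lookup(routes, monitor_ip)

# Workaround: a /32 static route for the monitor IP via OPT1.
routes_with_static = routes + [(ipaddress.ip_network("192.0.2.53/32"), "OPT1")]
after = lookup(routes_with_static, monitor_ip)

print(before, after)  # WAN OPT1
```

This models only destination-based route selection. It does not model source-address binding or any policy routing the firewall may apply.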
Another observation: behavior is unexpected when a DNS server configured to be queried through one GW is also used as the monitor IP for another GW. Assigning a specific gateway to a DNS server adds a static route for it.
Attaching a capture file that shows the ICMP is actually going out the right interface and experiencing loss, yet at the same time apinger reports 0.0% loss on that WAN.
So the static routes do help certain scenarios, but not all.
- Status changed from New to Feedback
It now appears that apinger sees the gateway as down but does not report or graph the result as expected.
If you change the 'down' time to a value longer than the number of samples required for calculation (e.g. 30) the graph is correct.
So the problem appears mostly if the down time is at the default value of 10 (or less) since it uses 10 samples for calculation.
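To illustrate why the sample count matters, here is a minimal sketch of a loss figure computed over a sliding window of 10 probe results. It is not apinger's code, and it does not model the down time setting. The `simulate` and `loss_percent` helpers and the one-probe-per-step assumption are hypothetical. It only shows that a 10-sample window needs 10 consecutive lost probes before it reads 100% loss.

```python
from collections import deque

def loss_percent(window):
    """Percentage of probes in the window that got no reply."""
    lost = sum(1 for ok in window if not ok)
    return 100.0 * lost / len(window)

def simulate(replies, samples=10):
    """Return the windowed loss figure after each probe result."""
    window = deque(maxlen=samples)  # keeps only the most recent `samples` results
    history = []
    for ok in replies:
        window.append(ok)
        history.append(loss_percent(window))
    return history

# Hypothetical probe sequence: 10 replies received, then 10 probes lost.
history = simulate([True] * 10 + [False] * 10)

print(history[9])   # 0.0   (last probe before the outage)
print(history[10])  # 10.0  (one lost probe in the window)
print(history[14])  # 50.0  (five lost probes in the window)
print(history[19])  # 100.0 (window fully replaced by lost probes)
```

Under these assumptions, any decision taken within the first 10 probes of an outage sees a loss figure that has not yet caught up with the real state of the link.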
- Status changed from Feedback to Resolved
This particular issue is fixed. The issue with 10 vs. 30 seconds with packet loss still exists but isn't a regression; I'll open a separate ticket for that.