Bug #9054

closed

Gateway Group slow (or never) to switch back to Tier 1

Added by Mitch Claborn about 6 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Multi-WAN
Target version: -
Start date: 10/22/2018
Due date:
% Done: 0%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: 2.4.4
Affected Architecture:

Description

See https://forum.netgate.com/topic/136852/2-4-4-gateway-group-slow-or-never-to-switch-back-to-tier-1. (No responses yet as of this posting.)

I have a gateway group with 2 gateways, one at Tier 1 and the other at Tier 2. I've been having lots of trouble with my Tier 1 link lately and pfSense will switch over to the Tier 2 link, but when the Tier 1 gateway comes back within limits (latency, packet loss) the routing does not switch back to the Tier 1 gateway. The Gateways widget on the home page shows the Tier 1 as "online" as does Status -> Gateways and Status -> Gateway Groups. The log file shows an alarm for latency and then cleared for latency.

I've set that gateway group as the default gateway and am also sending traffic to it with a LAN firewall rule.
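The expected failover behavior described above can be sketched as follows. This is only a minimal model of what the reporter expects, not pfSense's actual code; the `Gateway` class, the `usable` flag, and `pick_active` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Gateway:
    name: str     # hypothetical label, e.g. "WAN1"
    tier: int     # lower number is preferred (Tier 1 before Tier 2)
    usable: bool  # True when the monitor reports the gateway within limits


def pick_active(gateways: Sequence[Gateway]) -> Optional[Gateway]:
    """Return a usable gateway from the lowest-numbered tier, or None."""
    usable = [gw for gw in gateways if gw.usable]
    return min(usable, key=lambda gw: gw.tier, default=None)


# Tier 1 goes out of limits, then recovers.
degraded = [Gateway("WAN1", 1, False), Gateway("WAN2", 2, True)]
recovered = [Gateway("WAN1", 1, True), Gateway("WAN2", 2, True)]

print(pick_active(degraded).name)   # WAN2
print(pick_active(recovered).name)  # WAN1
```

In this model the second call returns the Tier 1 gateway as soon as it is usable again. The bug reported here is that routing stays on the Tier 2 gateway at that point.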

Actions #1

Updated by Mitch Claborn about 6 years ago

If I set the Tier 1 gateway as "Mark Gateway as Down" then turn that setting back off, the routing will correct itself and switch back to the Tier 1 gateway.

Actions #2

Updated by Mitch Claborn about 6 years ago

To make things even more complicated, in the workaround mentioned above, the routing actually changes back to the Tier 1 gateway when I mark it as down, so that when the status is "forced offline" it is still routing through that gateway. When I undo the "mark as down" it continues to route through that gateway.

Actions #3

Updated by Mitch Claborn about 6 years ago

The Gateway Group was set as Trigger Level: Packet Loss or High Latency. I changed that to "Member Down" and now the routing seems to be switching back to the Tier 1 gateway as it should. I'm going to revert to "Packet Loss or High Latency" as a test to see if that triggers the problem.

Actions #4

Updated by Mitch Claborn about 6 years ago

With the Gateway Group set to "Packet Loss or High Latency" this problem definitely shows up much more often.
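One plausible reading of why the trigger level matters is that "Packet Loss or High Latency" makes a member count as failed in more states than "Member Down" does, so there are more failover and recovery transitions in which the switch back can go wrong. The sketch below illustrates that difference only. The `Status` values and `counts_as_failed` are hypothetical names, and this is not taken from the pfSense source.

```python
from enum import Enum


class Trigger(Enum):
    MEMBER_DOWN = "Member Down"
    LOSS_OR_LATENCY = "Packet Loss or High Latency"


class Status(Enum):
    # Hypothetical monitor states used only for this illustration.
    ONLINE = "online"
    LOSS = "loss"
    DELAY = "delay"
    DOWN = "down"


def counts_as_failed(status: Status, trigger: Trigger) -> bool:
    """Decide whether a group member should be skipped for routing."""
    if status is Status.DOWN:
        return True  # a down member fails under either trigger level
    if trigger is Trigger.LOSS_OR_LATENCY:
        return status in (Status.LOSS, Status.DELAY)
    return False


for status in Status:
    print(status.value, [counts_as_failed(status, t) for t in Trigger])
```

Under "Member Down", a latency alarm alone never moves traffic off Tier 1, so no switch back is needed when the alarm clears.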

Actions #5

Updated by Vasyl Semenchuk about 6 years ago

I have the same problem on all my devices (20 devices) after upgrading.

Actions #6

Updated by Mitch Claborn about 6 years ago

@VasylSemenchuk Are your gateway groups set to trigger level "Packet Loss or High Latency" or "Member Down"? Does it work better if set to "Member Down"?

Actions #7

Updated by Vasyl Semenchuk about 6 years ago

They are set to trigger level "Packet Loss or High Latency".
I will set the trigger level to "Member Down" and let you know on Monday or Thursday.

Actions #8

Updated by Vasyl Semenchuk about 6 years ago

Did you try restarting the dpinger service? In my case this helps it switch back to WAN1.

Actions #9

Updated by Vasyl Semenchuk about 6 years ago

I also noticed that in my case it helps to restart the OpenVPN client.
After restarting OpenVPN, VPN and other traffic switches back to WAN1.

Actions #10

Updated by Vasyl Semenchuk about 6 years ago

Hi! After some tests I noticed that the problem appears only when my gateway group is set as the default gateway.
If WAN1 or WAN2 is set as the default gateway, switching works fine.

Actions #11

Updated by Bob Guo over 5 years ago

I have generally the same problem here, but I have it even when the gateway group isn't the pfSense default gateway. After digging a little deeper, I found that the problem lies in a failure to generate the correct configuration in /tmp/rules.debug. As far as I can see, nothing suggests that pfctl has a problem reading the rules; therefore, I think the problem is in the code that generates the rules or that triggers rule generation. For example, when the loss or delay alarms cleared, pfSense didn't call the rule generation process.
I am currently trying "Member Down" as the only trigger, and I'd like to know whether it works for you.
As I noticed, when the problem is present, a simple filter reload gets everything back to normal. I don't know if it will work for you. If it does, a temporary workaround might be a cron-run script that monitors whether the gateway is correct.
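A rough sketch of such a cron-run check is shown below, assuming a Python 3 interpreter is available on the firewall. Several things here are assumptions that this ticket does not confirm: the single-target macro format `GW<NAME> = " route-to ( <iface> <ip> ) "` in /tmp/rules.debug, the use of `/etc/rc.filter_configure` to trigger a filter reload, and all function names. How the caller learns that Tier 1 is online again is deliberately left out.

```python
import re
import subprocess
from typing import Callable, Optional

# Assumed (unconfirmed) form of a failover group macro in /tmp/rules.debug:
#   GWFAILOVER = " route-to ( igb0 203.0.113.1 ) "
# Load-balancing groups with several targets are not handled by this pattern.
ROUTE_TO = re.compile(
    r'^GW(?P<group>\S+)\s*=\s*"\s*route-to\s*\(\s*'
    r'(?P<iface>\S+)\s+(?P<gw>\S+)\s*\)'
)


def routed_gateway(rules_text: str, group: str) -> Optional[str]:
    """Return the gateway IP the group macro routes to, or None if not found."""
    for line in rules_text.splitlines():
        match = ROUTE_TO.match(line.strip())
        if match and match.group("group") == group:
            return match.group("gw")
    return None


def check_once(rules_text: str, group: str, tier1_gw: str,
               tier1_online: bool, reload: Callable[[], None]) -> bool:
    """Call reload() if Tier 1 is online but the rules route elsewhere.

    Returns True when a reload was requested. Does nothing when Tier 1 is
    offline or when the group macro cannot be found.
    """
    if not tier1_online:
        return False
    current = routed_gateway(rules_text, group)
    if current is None or current == tier1_gw:
        return False
    reload()
    return True


def reload_filter() -> None:
    # Assumption: running this script asks pfSense to regenerate and reload rules.
    subprocess.run(["/etc/rc.filter_configure"], check=False)
```

A cron job would read /tmp/rules.debug, determine the Tier 1 status by some other means, and call `check_once(text, "FAILOVER", "203.0.113.1", tier1_online, reload_filter)`, where the group name and address are placeholders.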

Actions #12

Updated by Bob Guo over 5 years ago

Mitch Claborn wrote:

@VasylSemenchuk Are your gateway groups set to trigger level "Packet Loss or High Latency" or "Member Down"? Does it work better if set to "Member Down"?

Do you have the problem if the trigger is "Member Down" only?

Actions #13

Updated by Bob Guo over 5 years ago

Vasyl Semenchuk wrote:

Did you try restarting the dpinger service? In my case this helps it switch back to WAN1.

Actually dpinger is pretty reliable in giving gateway status.

Actions #14

Updated by Jim Pingle over 5 years ago

  • Category set to Multi-WAN

Actions #15

Updated by alex alex over 4 years ago

Any update on this? I experience the very same problem on version 2.4.4-RELEASE-p3 (amd64).

Actions #16

Updated by Viktor Gurov over 4 years ago

Seems related to #10716

Actions #17

Updated by Jörn Greszki about 4 years ago

I am not sure whether my issue:

https://forum.netgate.com/topic/156890/dpinger-broken-or-dashboard-broken-or-my-brain-is-broken?_=1600704062714

is related to what you describe.

If not, I would open a separate bug ticket; if so, I would contribute information and further testing if needed.

Right now, for me, Multi-WAN is working, but it is not able to recover from 100% packet loss events.

Actions #18

Updated by Rodrigo Gonçalves about 4 years ago

I have the same issue here.
Once pfSense switches to the Tier 2 gateway, the only way to make it come back to the Tier 1 is by disabling ("Never") and reenabling the Tier 2 gateway.

Actions #19

Updated by Renato Botelho about 4 years ago

  • Target version set to 2.5.0

Actions #20

Updated by Viktor Gurov about 4 years ago

  • Status changed from New to Resolved

No such issue on 2.5.0.a.20201022.1850; resolved in #10716.
Failover and load-balance gateway groups were tested.

Actions #21

Updated by alex alex about 4 years ago

Viktor Gurov wrote:

No such issue on 2.5.0.a.20201022.1850; resolved in #10716.
Failover and load-balance gateway groups were tested.

When is version 2.5.0 going to be released? Is there a roadmap for that?

Actions #22

Updated by Jim Pingle about 4 years ago

  • Target version deleted (2.5.0)

Actions #23

Updated by Jörn Greszki about 4 years ago

Now tested with 2.5.0.a.20201101.1850

For unknown reasons, I still sometimes get partial or full loss on the alive ping at one of the 2 WAN interfaces, but this is not the issue.

Nov 2 10:37:56 dpinger 16236 WAN_PHY1_IGB0GW 8.8.4.4: Alarm latency 0us stddev 0us loss 100%

The problem is that this status remains until any change is made to the gateway group; then it works immediately.

Either dpinger is not reattempting to reach the defined IP, or the process maintaining the operational status is not picking up the changes.

https://forum.netgate.com/topic/156890/dpinger-broken-or-dashboard-broken-or-my-brain-is-broken/23
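To track how long such an alarm persists, the dpinger log line quoted above can be parsed into fields. The sketch below is built only from that single sample line, so the pattern and field names are assumptions; the event word is matched generically because only "Alarm" appears in this ticket.

```python
import re
from typing import Optional

# Pattern derived from the one sample line in this ticket; other dpinger
# messages may not match it.
DPINGER_LINE = re.compile(
    r"dpinger \d+ (?P<gateway>\S+) (?P<target>\S+): "
    r"(?P<event>\w+) latency (?P<latency_us>\d+)us "
    r"stddev (?P<stddev_us>\d+)us loss (?P<loss_pct>\d+)%"
)


def parse_dpinger(line: str) -> Optional[dict]:
    """Extract gateway, target, event, and measurements from a log line."""
    match = DPINGER_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("latency_us", "stddev_us", "loss_pct"):
        fields[key] = int(fields[key])
    return fields


sample = ("Nov 2 10:37:56 dpinger 16236 WAN_PHY1_IGB0GW 8.8.4.4: "
          "Alarm latency 0us stddev 0us loss 100%")
print(parse_dpinger(sample))
```

Feeding successive log lines through `parse_dpinger` and keeping the latest event per gateway would show whether a 100% loss alarm is ever followed by a later event for the same gateway.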

Actions #24

Updated by Viktor Gurov about 4 years ago

Jörn Greszki wrote:

Now tested with 2.5.0.a.20201101.1850

For unknown reasons, I still sometimes get partial or full loss on the alive ping at one of the 2 WAN interfaces, but this is not the issue.

Nov 2 10:37:56 dpinger 16236 WAN_PHY1_IGB0GW 8.8.4.4: Alarm latency 0us stddev 0us loss 100%

The problem is that this status remains until any change is made to the gateway group; then it works immediately.

Either dpinger is not reattempting to reach the defined IP, or the process maintaining the operational status is not picking up the changes.

https://forum.netgate.com/topic/156890/dpinger-broken-or-dashboard-broken-or-my-brain-is-broken/23

Please create a new Redmine issue for that.
It may be related to #8136.
