Bug #5090

closed

Wan failover fails to recover normal behaviour when all wans work again

Added by Alberto Iglesias over 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Multi-WAN
Target version: -
Start date: 09/03/2015
Due date:
% Done: 0%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version:
Affected Architecture:

Description

It seems to me that WAN failover is not working correctly. I have two WAN connections configured for failover (Tier 1 and Tier 2).

- When both are OK, it uses the Tier 1 WAN (wan1), which is correct. If I disconnect wan1, it starts using the Tier 2 WAN (wan2), which is also correct. After 2-3 minutes I reconnect wan1, but the "Status Gateway" page doesn't recover wan1 and keeps showing it as offline. If I restart the apinger service, wan1 appears as online again.

- Besides that, although wan1 is back online, pfSense keeps using wan2 until I make some configuration change or manually make wan2 fail.

To sum up, the switchover works when a gateway fails, but when the failed gateway recovers, nothing comes back automatically. I've tested it on two different machines, and the same thing always happens.

Actions #1

Updated by Chris Buechler over 9 years ago

  • Status changed from New to Feedback

guessing this is just already-established connections staying on the other WAN? That's the expected behavior, only new connections would go across the re-established WAN.

Actions #2

Updated by Alberto Iglesias over 9 years ago

No, new connections also keep using wan2. You can run a ping, a wget or whatever you want, and it only uses wan2.

Actions #3

Updated by Chris Buechler over 9 years ago

what does Status>Gateways show?

Actions #4

Updated by Alberto Iglesias over 9 years ago

Imagine wan1 (Tier 1) and wan2 (Tier 2) are working and you restart pfSense (just to show you my test process). Then:

1. "Status>Gateways" shows both gateways as Online and traffic goes through wan1. That's the normal state.
2. I disconnect wan1. After a few seconds "Status>Gateways" shows the wan1 gateway as Offline and pfSense redirects traffic through wan2. That's OK.
3. After 2-3 minutes, I reconnect wan1. First problem: even though wan1 is ready and working, "Status>Gateways" keeps showing the wan1 gateway as Offline. I've waited several minutes and nothing changes. Then I restart the apinger service, and the wan1 gateway appears as Online again.
4. At this point, "Status>Gateways" shows all gateways as Online again, but traffic keeps going through wan2. If I ping google.es, wget any page, etc., everything keeps going through wan2. That's not correct, because wan1 has Tier 1, so once it has recovered it should be used instead of wan2 (which has Tier 2).
5. And to my surprise, if I make some change in the wan2 gateway configuration (change its weight, for example) and save it, then traffic starts going through wan1. Obviously, it also goes back if I disconnect wan2 (but after that, same problems: wan2 doesn't recover automatically).

So, the switchover works when a gateway goes down, but after that nothing comes back automatically, neither the gateway status nor the use of the recovered WAN.
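The behaviour expected in steps 1 to 4 can be written down as a small sketch. This is only a toy model of tier-based selection, not pfSense code: the function name `select_gateway` and the data layout are assumptions made for illustration, and it ignores load balancing between gateways that share a tier.

```python
from typing import Dict, Optional, Set


def select_gateway(tiers: Dict[str, int], online: Set[str]) -> Optional[str]:
    """Return the online gateway with the lowest tier number, or None if all are down."""
    candidates = [gw for gw in tiers if gw in online]
    if not candidates:
        return None
    return min(candidates, key=lambda gw: tiers[gw])


# Hypothetical failover group matching the report: wan1 is Tier 1, wan2 is Tier 2.
failover = {"wan1": 1, "wan2": 2}

print(select_gateway(failover, {"wan1", "wan2"}))  # step 1: both online
print(select_gateway(failover, {"wan2"}))          # step 2: wan1 disconnected
print(select_gateway(failover, {"wan1", "wan2"}))  # step 4: wan1 back online
```

Under this model the third call picks wan1 again, which is the result the report says does not happen for new connections.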

Actions #5

Updated by Catalin Enache over 9 years ago

Hi. I also see the same behavior. After WAN1 comes back up, traffic still goes through WAN2 unless I make some change in the routing configuration (e.g. a gateway description change) or reboot the router. The pfSense 2.2.5 beta from 4th September has the same behavior.

Actions #6

Updated by Chad Monroe over 9 years ago

I too am seeing this problem on 2.2.4 embedded; both variants (32-bit w/ALIX and 64-bit w/APU hardware) show the problem in my case. I have about 100 pfSense gateways in the field, 20 of which have dual WAN, and of those 20 about 12 are connected to a backup WAN via 4G LTE (using an Ethernet bridge; to pfSense it just looks like a normal DHCP Eth GW). In my case I have the same type of config as Alberto (a Tier 1 GW which should always be the active GW when up, and a Tier 2 GW which is a standby/failover GW). Similar symptoms as reported by both Alberto and Catalin.

One other oddity I noticed today during a failover event (WAN1/Tier1 GW was down, WAN2/Tier2 GW was active): there was a state in the firewall table (for a SIP connection) which, no matter how many times I cleared it, would always route back out the WAN1/Tier1 GW, which was down. I cleared the state probably 20 times trying various things (marking the primary GW forced offline and applying, actually clicking the green box next to the GW to mark it as disabled and applying, turning the physical Ethernet port down and applying, etc.). Even when the GW was marked offline AND the interface was physically disabled, I could reboot the LAN device (takes about 30 seconds), clear the bogus state from the table, and on device init, when it sent the first SIP REGISTER message, the new state still went out WAN1/Tier1 (marked offline and physically disabled at the port level). All other states for the device (provisioning via HTTPS, syslog etc.) would use the WAN2/Tier2 GW as expected. I finally gave up and rebooted the firewall with the WAN1/Tier1 GW marked offline + port disabled, and on recovery it finally sent the stale SIP state/connection, which I'd previously tried to clear several times, out of WAN2 like it was supposed to. I'm happy to open a separate bug for this if anyone thinks the issue is unrelated; however, given the primary issue (not properly failing back to the primary/Tier 1 WAN), I at least wanted to mention this issue with states here in case it helps debug the issue. I'm not 100% convinced it's related, as the failover issue seems like it may be related to apinger while my stuck state issue may be somewhere else, but this is just a (partially) educated guess from poking around in the code.

I have a test dual WAN setup in the lab and am happy to gather any logs/debugs/etc. necessary, perform tests, test builds etc., as I know dual WAN routers aren't quite as easy to come by in the field. Given this is very easily reproducible, we should be able to verify any fixes/tests quite thoroughly, I'd imagine. Thanks for any help guys, it's much appreciated!

Actions #7

Updated by Catalin Enache over 9 years ago

Hi. Any news? Can we submit any logs? How is this bug being approached?

One more note: any system change (GW, routing, NAT) triggers the recovery to the main WAN (switch to main).

Thanks

Actions #8

Updated by Kill Bill over 9 years ago

@Catalin: Until the FUBAred apinger gets replaced with something working, I wouldn't expect any solution here.

Actions #9

Updated by Bipin Chandra over 9 years ago

apinger just doesn't work well. It could probably be replaced by something that monitors the actual PPP etc. connection the way normal routers do: make it monitor mpd etc. and find out the connection state.

Actions #10

Updated by Alberto Iglesias over 9 years ago

New installation with 2.2.4 and two WAN connections, and same problem. In this case also with two LANs, each one with a failover group in one direction (one with wan1->wan2 and the other with wan2->wan1). None of the groups recovers its main WAN automatically when it's back online.

Can anybody at least officially confirm this bug? It's really easy to test and verify, so at least we could know that we're right and failover doesn't work, or know if we're doing something wrong.

Regards

Alberto

Actions #11

Updated by Catalin Enache over 9 years ago

Alberto Iglesias wrote:

New installation with 2.2.4 and two WAN connections, and same problem. In this case also with two LANs, each one with a failover group in one direction (one with wan1->wan2 and the other with wan2->wan1). None of the groups recovers its main WAN automatically when it's back online.

Can anybody at least officially confirm this bug? It's really easy to test and verify, so at least we could know that we're right and failover doesn't work, or know if we're doing something wrong.

Regards

Alberto

Hi Alberto. Can you please let us know what type of connections you have on both WANs? Thanks

Actions #12

Updated by Alberto Iglesias over 9 years ago

In this last case they are cable connections (30M/6M), with routers doing NAT and a DMZ configured pointing to the pfSense interface.

But I've also seen this problem with fiber connections using PPPoE, with DSL connections and also with cable connections in bridge mode (no NAT and a public IP on the pfSense interface).

In all our tests, we've been unable to see traffic redirected automatically to the main WAN.

Actions #13

Updated by Catalin Enache over 9 years ago

Alberto Iglesias wrote:

In this last case they are cable connections (30M/6M), with routers doing NAT and a DMZ configured pointing to the pfSense interface.

But I've also seen this problem with fiber connections using PPPoE, with DSL connections and also with cable connections in bridge mode (no NAT and a public IP on the pfSense interface).

In all our tests, we've been unable to see traffic redirected automatically to the main WAN.

Ok. At some point I was tempted to believe it was related only to PPPoE/4G connections. It's strange, though: some people report that their fallback to the main WAN works fine.

Actions #14

Updated by Chris Buechler about 9 years ago

  • Affected Version deleted (2.2.4)

I went through and re-tested multi-WAN in general on 2.2.5 (which is the same as 2.2.4 in that regard) and it fails over and back as it should just fine every time.

There may be some edge case but nothing here to suggest what that might be.

Actions #15

Updated by Chad Monroe about 9 years ago

Hi Chris,

I can fairly reliably reproduce this on 2.2.4 with two Ethernet WANs. What logs or debugs would help you get a better idea of what we are seeing? I know there is a new option for enhanced apinger logs but (despite the comments bashing apinger) I'm not sure this is even the problem. It seems (for me at least) that some states simply get stuck on the primary WAN even if it goes down (or is forced down), and there's no way to get them to fail over to the backup WAN short of a reboot; clearing states doesn't even do it. Unfortunately I'm much better versed in the Linux kernel than BSD, but I am willing to learn and do some debugging if you point me in the right direction. Assuming my guess is correct and it's related to either state tables or gateway states/new connection setup, is there anything you recommend I run debug-wise? I'm not against making my own builds with certain debug flags on if needed. While it's very possible we are all hitting a corner case (or have a misconfiguration, for that matter), the number of reports seems a bit high to me. Something about our network setup causes this issue to pop up at nearly 20+ multi-WAN customers. Luckily the primary WAN doesn't fail too often, and when it does most critical services do fail over, but it can turn into an urgent problem if one of the "stuck" states happens to be for a VoIP phone, for example.

Anyway, I'm sure you guys are swamped but if I can do anything to help debug please let me know. I'll have time this weekend if you happen to get back to me by then but will of course make time whenever. Thanks again for everything you guys do,

-Chad

Actions #16

Updated by Chris Buechler about 9 years ago

There are a variety of potential configuration issues, and nearly all the support cases we undertake with this description end up being some kind of config problem. If you'd like help specifically with your scenario, purchase support and we'll be able to help. https://portal.pfsense.org/support-subscription.php If it's a bug, we'll be able to get it fixed and won't count it towards your purchased incidents.

Actions #17

Updated by xavier Lemaire about 9 years ago

Hi guys,

I'm also coming back to this little problem. I've done a lot of tests with many different configurations (ADSL, SDSL, optical fiber), all on 2.2.5.
The real problem is with VoIP. Of course I deactivated the check box "State Killing on Gateway Failure:
The monitoring process will flush states for a gateway that goes down if this box is not checked. Check this box to disable this behavior."

When the Tier 1 gateway goes down, my VoIP connections correctly move to Tier 2.

When Tier 1 is back, all my VoIP connections stay on Tier 2.

The "Status Gateway" page is OK and the gateway log tells me: apinger: alarm canceled: WANGW * down *

But traffic doesn't go back to the correct gateway; I have to reset the states for the UDP VoIP system.

Actions #18

Updated by Alberto Iglesias about 9 years ago

Yes, it keeps happening to me too, consistently, on all systems. I'm pretty sure that the configuration is OK. In fact, as you say, if you manually flush states it starts working again. It's the automatic process that fails.

Actions #19

Updated by Chris Buechler almost 9 years ago

  • Status changed from Feedback to Resolved

the issues here are attributable to issues in apinger, which has been replaced in 2.3 with dpinger, which doesn't have problems with math or race conditions in status reporting.

Actions #20

Updated by xavier Lemaire over 8 years ago

Chris Buechler wrote:

the issues here are attributable to issues in apinger, which has been replaced in 2.3 with dpinger, which doesn't have problems with math or race conditions in status reporting.

I have been so sad since yesterday.
I was following and testing the 2.3 beta, and this was working well again.

Yesterday I upgraded from the beta to the released 2.3 and the issue is back.

What I am trying to do:

I have pfSense with 2 WANs. I have 2 gateway groups, one for data and one for VoIP, with Tier 1 and Tier 2 on different WANs.
I have a LAN rule that sends traffic for the (outside) SIP server to the VoIP gateway group and all other traffic to the data gateway group.

I cut my "voip" gateway and my phone re-registers in about 30 seconds on the backup gateway without trouble.

But when my "voip" gateway is back, the phone stays on the backup gateway. Even after waiting for hours, it stays there.

I tried changing the UDP timeout in the advanced configuration; it changes nothing.
I tried toggling "State Killing on Gateway Failure"; it changes nothing.

But for HTTP traffic the same test works fine: for example, a "what is my IP" website shows the address changing correctly.

Must I create a new ticket?

Best regards

Actions #21

Updated by Chris Buechler over 8 years ago

xavier: that's how it's supposed to work at this point. Sounds like you want state killing on failback, which doesn't exist at this time. feature #855 covers that
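The distinction between new connections and existing states can be illustrated with a toy model. This is a sketch under stated assumptions, not how pf or pfSense is implemented: the `Firewall` class and its method names are hypothetical, and it only assumes that a state keeps the gateway chosen when it was created, that states on a failed gateway are killed, and that nothing is killed on failback.

```python
class Firewall:
    """Toy model: a state remembers the gateway chosen when it was created."""

    def __init__(self, tiers):
        self.tiers = tiers          # gateway name -> tier number
        self.online = set(tiers)    # all gateways start online
        self.states = {}            # flow id -> gateway pinned at creation

    def preferred(self):
        # Lowest-tier gateway that is currently online, or None.
        up = [gw for gw in self.tiers if gw in self.online]
        return min(up, key=self.tiers.get) if up else None

    def packet(self, flow):
        # An existing state keeps its gateway; only a new flow consults the group.
        if flow not in self.states:
            self.states[flow] = self.preferred()
        return self.states[flow]

    def gateway_down(self, gw):
        self.online.discard(gw)
        # Assumed state killing on gateway failure: drop states on the dead gateway.
        self.states = {f: g for f, g in self.states.items() if g != gw}

    def gateway_up(self, gw):
        # No state killing on failback: existing states are left untouched.
        self.online.add(gw)


fw = Firewall({"wan1": 1, "wan2": 2})
fw.packet("sip-phone")            # registers via wan1
fw.gateway_down("wan1")
fw.packet("sip-phone")            # re-registers via wan2
fw.gateway_up("wan1")
sip = fw.packet("sip-phone")      # keepalive traffic reuses the existing state
http = fw.packet("http-new")      # a brand new connection
print(sip, http)                  # wan2 wan1
```

In this model a long-lived flow that keeps sending traffic, such as a SIP registration, never expires and so never gets re-evaluated, while short-lived HTTP connections return to wan1 as soon as new ones are opened.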

Actions #22

Updated by James M over 8 years ago

I am able to reproduce the same issue running a clean install of v2.3.1.

2 WANs set up in a gateway group called "failover":
WAN1 - tier 1
WAN2 - tier 2

LAN firewall rule specifying the gateway as the failover gateway group.

If WAN1 goes down, all traffic fails over to WAN2 as expected; you can see this in Diag > States and can confirm it by doing a traceroute from any LAN device.
When WAN1 comes back up (Status > Gateways confirms "online"), some states remain on WAN2. Standard HTTP traffic will revert to WAN1 within a few minutes. However, traffic such as VoIP/SIP remains on WAN2, and the Diag > States table confirms this. It can be left for 8hrs+ and remains the same.

I assume this is still a bug...?
Should there not be a "failback" option? I have seen something similar in the newer firmware on Draytek routers, which appears to have been added to resolve this same issue.

Many thanks
James

Actions #23

Updated by Chris Buechler over 8 years ago

James M wrote:

I assume this is still a bug...?

No, see my last comment just above yours.

Actions #24

Updated by James M over 8 years ago

Hi Chris

I note your previous comment. However, how would the state killing feature work? I don't fully understand how the states are refreshed for example.

But in simple terms, take a VoIP/SIP phone service: if a connection fails over from the primary WAN1 connection to a secondary WAN2 connection, at what point should that VoIP/SIP connection be expected to fall back onto the WAN1 connection when it becomes available again? Are you saying that with state killing on failback it would move these sessions immediately?
Or how long would/should the state remain open on the WAN2 connection?

We are currently having real problems with this on 2 client sites set up as follows:

WAN1 - ADSL connection just used for VoIP traffic
WAN2 - EFM higher bandwidth connection used for all internet access, VPN etc.

Gateway group named "EFMFirst"
WAN2 EFM - Tier 1
WAN1 ADSL - Tier 2

Gateway group named "DSLFirst"
WAN1 ADSL - Tier 1
WAN2 EFM - Tier 2

Firewall Rules for Voice network:
Traffic set to Gateway: DSLFirst

Firewall Rules for LAN network:
Traffic set to Gateway: EFMFirst

The problem is that if the ADSL line drops, the VoIP traffic goes onto the EFM connection. This is fine for a short period of time, but due to the other traffic on this line the bandwidth is not enough so we can get call quality issues. This is not a problem for a short period of time (better to have some phone service than none at all).

When the ADSL line comes back online (Status>Gateways confirms this), the VoIP traffic stays over the EFM connection. Looking at the State table you can see the TCP & UDP traffic stuck to WAN2.

It can be left for 24hrs and still the VoIP traffic will be on the wrong WAN. It will never move the traffic back onto the ADSL connection where it should be. Therefore the call quality issues remain due to the lack of bandwidth.

What would you suggest, is this truly not a bug?
Is there not something that can force the states to re-associate with the firewall rule, and therefore the correct WAN gateway, after a specified period of time perhaps?

Thanks
James
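As a rough illustration of what "re-associating" states would mean for the setup above, the sketch below lists the states that sit on a gateway other than the one their gateway group currently prefers. Everything here is hypothetical: `stale_states`, the tuple layout and the flow labels are made up for the example, and nothing in this thread confirms that pfSense exposes states this way.

```python
def preferred(tiers, online):
    """Lowest-tier online gateway of a group, or None if every member is down."""
    up = [gw for gw in tiers if gw in online]
    return min(up, key=tiers.get) if up else None


def stale_states(states, groups, online):
    """Return flows whose current gateway differs from their group's preferred one.

    states: list of (flow, group_name, gateway) tuples (hypothetical layout).
    groups: group name -> {gateway: tier}.
    """
    return [
        flow
        for flow, group, gateway in states
        if gateway != preferred(groups[group], online)
    ]


# Gateway groups as described in the comment above.
groups = {
    "DSLFirst": {"ADSL": 1, "EFM": 2},
    "EFMFirst": {"EFM": 1, "ADSL": 2},
}

# Situation after the ADSL line has recovered: the phone is still on EFM.
states = [
    ("phone-1 SIP", "DSLFirst", "EFM"),
    ("pc-1 HTTPS", "EFMFirst", "EFM"),
]

print(stale_states(states, groups, {"ADSL", "EFM"}))  # ['phone-1 SIP']
```

Only the VoIP state is flagged, because the LAN traffic is already on the gateway its group prefers. Killing just those flagged states on failback is one way to read the behaviour being asked for here.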

Actions #25

Updated by James M over 8 years ago

Also, if you kill the 2 states for each VoIP phone in the Diagnostics > States section, they reappear straight away on the same ports and interfaces as before.
This is done by filtering the state list by the IP address of the device. You can then see both UDP states (one on the internal network and one on the WAN). Then press the "Kill States" button. This removes the 2 states very briefly, but then they reappear, still on the wrong WAN interface.
They have definitely been cleared, since the byte count drops back to 0KB and starts counting again.
Surely clearing the state should have forced it to reconnect and follow the current rule and gateway group to the correct gateway?
