Feature #8

Clear states after failover

Added by Perry Mason over 10 years ago. Updated almost 5 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: Rules/NAT
Target version:
Start date: 05/10/2009
Due date:
% Done: 0%
Estimated time:

Description

In a multi-WAN setup, some VoIP phones need a little help after a failover.
Solution previously outlined in http://forum.pfsense.org/index.php/topic,7808.0.html works but can hopefully be improved.

- Replace afterfilterchange with apinger so states only get reset when a failover happens and not on every filter change.
- When states get cleared, only the states for a single IP or a group of IPs are affected, not every entry in the state table.
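
A minimal sketch of the second point, assuming the set of hosts to reset (for example the VoIP phones) is already known. The function name and the sample address are hypothetical; `pfctl -k` is the stock option that kills states originating from a host. The sketch only builds the command lines and runs nothing.

```python
import ipaddress


def build_kill_commands(hosts):
    """Return one pfctl argument list per host, without executing anything."""
    commands = []
    for host in hosts:
        # Raises ValueError on a malformed address, so a typo never reaches pfctl.
        ipaddress.ip_address(host)
        commands.append(["/sbin/pfctl", "-k", host])
    return commands


if __name__ == "__main__":
    # Hypothetical phone address, used only for illustration.
    for cmd in build_kill_commands(["192.168.11.20"]):
        print(" ".join(cmd))
```

Limiting the kill to a list of hosts leaves every other entry in the state table untouched, which is the behaviour the request asks for.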

Associated revisions

Revision 102ab75d (diff)
Added by Scott Ullrich over 9 years ago

Clear states for an interface if it is down Ticket #8

Revision 64e9ae07 (diff)
Added by Scott Ullrich over 9 years ago

Use correct argument for pfctl -b Ticket #8

Revision 05f3ffa4 (diff)
Added by Ermal Luçi over 9 years ago

Ticket #8. Delete states after link fails down.

Revision 79b7f498 (diff)
Added by Ermal Luçi about 9 years ago

Ticket #8. Use proper IP to pfctl -b and run the command after the rules have been removed for the not 'down' interfaces.

Revision 1bb03150 (diff)
Added by Ermal Luçi almost 9 years ago

Ticket #8. Try to kill all states regarding source nodes and states with source address the gateway ip.

Revision 7b877108 (diff)
Added by Ermal Luçi almost 9 years ago

Ticket #8. Try to kill even all states from hosts and destination the gateway ip.

Revision fa2f5379 (diff)
Added by Ermal Luçi almost 9 years ago

Ticket #8. Actually use the new functionality of pfctl -b to kill even states referencing down gateways in their route-to cached parameter.

Revision b439dc01 (diff)
Added by Ermal Luçi almost 9 years ago

Pass the new arguments to pfctl -b and correct variable name as reported on Ticket #8.

Revision eed02caf (diff)
Added by Ermal Luçi almost 9 years ago

Ticket #8. A try at correctly finding the gateway ip of down interfaces or gateways marked as down. Currently it cannot be gotten from apinger.

Revision b7142c29 (diff)
Added by Ermal Luçi almost 9 years ago

Ticket #8. Use correct variable. Discovered-by: Mark Huijgen

History

#1 Updated by Ermal Luçi about 10 years ago

I have done this for pppoe/pptp/l2tp interfaces through pfctl -b in 2.0.
For the other cases some more code analysis is needed to find when the interface fails.

Though it's more a matter of writing the code than of how-to analysis.

AFAIK the patch is present in pf(4) for 7.2 sources too.

#2 Updated by Scott Ullrich almost 10 years ago

  • Assignee deleted (Core Team)
  • Affected Version set to 2.0

#3 Updated by Scott Ullrich over 9 years ago

  • Priority changed from Normal to Very Low

#4 Updated by Chris Buechler over 9 years ago

  • Priority changed from Very Low to High

This is important, services that maintain a connection never come back after failover without this. VoIP is a good example, that state will never clear, so phones won't fail over without manually clearing states. The per-interface state kill that I believe Ermal added to pfctl fixes this.

#5 Updated by Chris Buechler over 9 years ago

  • Category changed from Unknown to Rules/NAT
  • Affected Version changed from 2.0 to All

pfctl -b <IP of interface>

does this reportedly (haven't tested thoroughly, it's in our code and appears to work).

#6 Updated by Scott Ullrich over 9 years ago

I looked through the tree last night and I did not see pfctl -b in use anywhere. Can someone point me to where it's currently being used?

#7 Updated by Chris Buechler over 9 years ago

I don't think it is being used, it needs to be tested to ensure it works, then added to run appropriately when a WAN fails.

#8 Updated by Ermal Luçi over 9 years ago

As I already said:
"I have done this for pppoe/pptp/l2tp interfaces through pfctl -b in 2.0."

Look at the link down scripts in /usr/local/[s]bin

#9 Updated by Chris Buechler over 9 years ago

That's good Ermal, that fixes a different problem with the same cause where upon PPPoE reconnect (primarily in countries where they force a daily reconnect like Germany) stale states can hang around.

This fix is sort of like that, though needs to run when a gateway is marked down.

#10 Updated by Scott Ullrich over 9 years ago

Well, after a filter_configure() run we can simply loop through the gateways, checking for any that are down. If they are down, zap the states. That should be pretty easy. I'll take care of it.
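
The loop proposed here could look roughly like the sketch below. It assumes a hypothetical dictionary of gateways keyed by name, each with a `status` string; the real pfSense data structure may differ. It only selects the gateways to act on and does not clear any states itself.

```python
def down_gateways(gateways):
    """Return the names of gateways whose status mentions 'down' (case-insensitive)."""
    return [
        name
        for name, gw in gateways.items()
        if "down" in gw.get("status", "").lower()
    ]


if __name__ == "__main__":
    # Hypothetical gateway table, for illustration only.
    sample = {
        "WAN_GW": {"status": "down"},
        "WAN2_GW": {"status": "none"},
    }
    print(down_gateways(sample))
```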

#11 Updated by Scott Ullrich over 9 years ago

  • Status changed from New to Feedback

I have committed code for this. Please test!

#12 Updated by Ermal Luçi over 9 years ago

Just be aware that the IP specified to pfctl -b is the IP of the interface, not of the gateway.

Basically it is meant like this:
$WAN has ip=192.168.10.10 gw=192.168.10.1
$WAN reloads and gets ip=192.168.10.9 gw=192.168.10.1
The pfctl command in this case should be pfctl -b 192.168.10.10 and not pfctl -b 192.168.10.1.

This is the hard part of it.
That is why I said that I handled this for PPTP/PPPoE/L2TP.
For DHCP, a check for whether the IP has changed can be added to the script.
For the static IP/CARP cases it needs to happen on POST of interfaces.php, where you can still access the old IP in the config (before it is overwritten by the new one).
For OpenVPN it should be handled by the link down script. But I saw that Seth now calls filter_configure() in that case, and that needs to be replaced by a new script doing both.

I will study this more thoroughly when i have more time.

What you have checked in is not wrong (though you need to specify the interface IP).
Though I would say that if this is the way to handle this, through the gateway status detection code, it is better to modify apinger, or whatever script is run to get this info, and teach it to run the pfctl -b command.

If you agree, I will look into modifying that path/code.
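
The address choice described in this comment can be sketched as follows: when an interface comes back with a new address, the states to flush belong to the previous interface address, not to the gateway. The function name is hypothetical, and `-b` is the pfctl option discussed in this ticket, not something the sketch verifies. Nothing is executed.

```python
def flush_command_for_address_change(old_ip, new_ip):
    """Return the pfctl argument list for the previous address, or None if nothing changed."""
    if old_ip is None or old_ip == new_ip:
        # First configuration or unchanged address: no stale states to target.
        return None
    # The old interface address is what the stale states still reference.
    return ["/sbin/pfctl", "-b", old_ip]


if __name__ == "__main__":
    # Addresses taken from the example in this comment.
    print(flush_command_for_address_change("192.168.10.10", "192.168.10.9"))
```

The caller has to capture the old address before the configuration is overwritten, which is why each interface type needs its own hook.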

#13 Updated by Scott Ullrich over 9 years ago

Huh!? What if an interface uses more than that one gateway? It's going to wipe all the states for that interface? This seems wrong to me.

#14 Updated by Scott Ullrich over 9 years ago

  • Status changed from Feedback to New

#15 Updated by Scott Ullrich over 9 years ago

  • Status changed from New to Feedback

OK, I figured it out. Please test.

#16 Updated by Ermal Luçi over 9 years ago

I do not have an environment now, but that code seems to have grown more than necessary.
return_gateways_array() returns even the interface where the gateway is supposed to be.
No need to go to that length to find the interface.
Seth can help there!

#17 Updated by Scott Ullrich over 9 years ago

Yep, you are right. gateway['interface'] had the goods so I adjusted the code. Please test!

#18 Updated by Perry Mason over 9 years ago

With the latest snapshot, VoIP phone states don't get cleared. Do I need to tick a box somewhere?

#19 Updated by Dan Swartzendruber over 9 years ago

Here is what I have found out: the code that runs in delete_states_for_down_gateways() is not working correctly, at the least for pppoe (which is what I have.) It does this:

if($gateway['monitor'] == "down") {

But from a debug message I put in, I can see that the state of the gateway is 'dynamic', not 'down'. That is not the only problem, though: when I tried printing out the IP address as it is fetched for the pfctl command like this:

$int_ip = get_interface_ip($gateway['interface']);

at the point this routine executes, the pppoe interface has (I think) been killed, so there is no IP address to fetch. I was trying to figure out where this all happens, but got kind of lost. It would be nice to get this working, but for the moment, I think I might just patch the code that runs on interface bring-up to do 'pfctl -b' on the LAN IP of the asterisk server, as opposed to the WAN IP of the gateway. It isn't like that address is going to be changing randomly.

#20 Updated by Chris Buechler over 9 years ago

  • Status changed from Feedback to New

Sounds like a suitable workaround. Also, please report back how pfctl -b works for you, Dan; it hasn't been tested much, if at all.

Re-opening this as it's clearly broken

#21 Updated by Ermal Luçi over 9 years ago

  • Status changed from New to Feedback

It seems the pfctl -b call in ppp-linkdown was lost somewhere in the history.
I put it back, so it should be OK even for PPP links such as PPPoE/PPTP/etc.

#22 Updated by Chris Buechler about 9 years ago

  • Status changed from Feedback to New

This still doesn't work, when a link fails its states aren't deleted. Running 'pfctl -b $interfaceip' manually also doesn't work.

#23 Updated by Chris Buechler about 9 years ago

I should clarify - seems only the NAT states are still around.

#24 Updated by Ermal Luçi about 9 years ago

Can you please clarify what you mean by NAT states?

#25 Updated by Chris Buechler about 9 years ago

  • Status changed from New to Resolved

It appeared previously that firewall states would all be gone but states such as:

10.0.0.22:52559 -> 7.2.0.15:15963 -> 18.1.0.4:53

would still be around. But that's not the case after further testing on an August 3 build.

#26 Updated by Perry Mason about 9 years ago

I still have this issue on an August 4 build

udp    87.54.25.131:5060 <- 192.168.11.20:5060    MULTIPLE:MULTIPLE
udp    192.168.11.20:5060 -> 192.168.102.100:5060 -> 87.54.25.131:5060    MULTIPLE:MULTIPLE
Those states don't get cleared when the failover happens.

#27 Updated by Chris Buechler about 9 years ago

  • Status changed from Resolved to New

True, running pfctl -b manually does work, but it's not run properly at failover.

#28 Updated by Mark Huijgen about 9 years ago

I have done some testing regarding this issue around 24th of June.

Calling pfctl -b manually did not do enough at that time. It cleared only one of the two states in #26 (the 2nd one, which has the gateway address in it: pfctl -b 192.168.102.100).
The catch is that the kernel seems to instantiate the 2nd state again when another packet is sent. Only if both states were removed would they remain gone.

See for more extensive info my post on the forum http://forum.pfsense.org/index.php/topic,26235.msg136537.html#msg136537

Has the behavior of pfctl -b been changed since June 24th? If so I could test it again.

#29 Updated by Mark Huijgen about 9 years ago

In my last post I meant comment 26; it seems Redmine linked issue 26 for the text "poundsign26".

#30 Updated by Mark Huijgen about 9 years ago

Ok, had some time spare and tested it with today's snapshot (built on Sun Aug 8 10:20:52 EDT 2010) and behaviour is still the same. Only the 2nd state gets killed by pfctl -b and is recreated on the next packet.

So running pfctl -b manually does not seem to work ok either.

#31 Updated by Ermal Luçi about 9 years ago

  • Status changed from New to Feedback

I added a solution to run the command after the rules referencing this interface are removed from the working config.
It should behave correctly now.

#32 Updated by Mark Huijgen about 9 years ago

I have tested with latest snapshot built on Tue Aug 10 22:06:38 EDT 2010 which includes the change.
It seems failover is broken now. Same for gateway applet on front page and Status->gateway too. Might be related to the apinger output change?

From what I can tell this latest change will make sure pfctl -b is using the correct IP, which of course is also needed for the full fix.
However pfctl -b src_ip fails to clear enough states if NAT is involved.

The following test was done before the last update, since the last update seems to have broken failover.
On a client on the LAN I started a ping to ip 74.125.95.93, which sends a ping packet every second.
This results in the following states in the firewall

all icmp 74.125.95.93:23567 <- 10.10.10.10 0:0
all icmp 10.10.10.10:23567 -> 192.168.103.2:53279 -> 74.125.95.93 0:0

Then I made the gateway on interface 192.168.103.2 go down, watched the route-to rule update in /tmp/rules.debug and I manually ran pfctl -b 192.168.103.2.
For a very short moment only the first of the 2 states remained in the table; the 2nd one was wiped out by the pfctl -b command.
However, on the next ping packet from the client the 2nd state is automatically added again.

pfctl -b 192.168.103.2 should have wiped out both of the states.

#33 Updated by Erik Fonnesbeck about 9 years ago

There is not yet a new enough build to have these latest changes. The next one may possibly have all the changes, but it might not if the builder needs to be restarted to pick up the latest change from the builder tools repository.

#34 Updated by Mark Huijgen about 9 years ago

Updated to 2.0-BETA4 (i386) built on Sun Aug 15 18:27:24 EDT 2010
Gateway applet working again and the rules.debug file indicates fail over is happening again (gateway which went down is removed from the groups route-to rule).

However, pfctl -b is still not removing enough states.

The following hack works:
Before calling pfctl -b, grep the state table for the IP that will be passed to pfctl -b, to find all NAT states of the form:
all (proto) (ip1) -> pfctl_-b_IP (:port) -> (ip2)

Then do a pfctl -k on all ip2 ip1 from all matches. However, this will also kill connections from ip1 -> ip2 that might exist over the other gateways that are still up.

#35 Updated by Mark Huijgen about 9 years ago

Should have been "pfctl -k ip1 -k ip2" in my last comment.
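
A rough sketch of this hack, assuming the state table has already been dumped to text lines in the format quoted in this ticket (IPv4 only). The function names are hypothetical. It finds states of the form `ip1 -> NAT address -> ip2` whose middle address matches the address passed to pfctl -b, and builds the corresponding `pfctl -k ip1 -k ip2` command lines without running them.

```python
def strip_port(endpoint):
    """Drop a trailing ':port' from an IPv4 endpoint such as '10.0.0.1:5060'."""
    return endpoint.split(":", 1)[0]


def nat_kill_commands(state_lines, nat_ip):
    """Build 'pfctl -k ip1 -k ip2' argument lists for NAT states translated via nat_ip."""
    commands = []
    for line in state_lines:
        tokens = line.split()
        # Look for the token pattern: SRC -> VIA -> DST
        for i in range(len(tokens) - 4):
            if tokens[i + 1] == "->" and tokens[i + 3] == "->":
                src = strip_port(tokens[i])
                via = strip_port(tokens[i + 2])
                dst = strip_port(tokens[i + 4])
                if via == nat_ip:
                    cmd = ["/sbin/pfctl", "-k", src, "-k", dst]
                    if cmd not in commands:  # avoid duplicate kills
                        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # State lines as quoted earlier in this ticket.
    states = [
        "all icmp 74.125.95.93:23567 <- 10.10.10.10 0:0",
        "all icmp 10.10.10.10:23567 -> 192.168.103.2:53279 -> 74.125.95.93 0:0",
    ]
    for cmd in nat_kill_commands(states, "192.168.103.2"):
        print(" ".join(cmd))
```

As noted above, killing by source and destination pair is broader than needed: it also removes states between the same two hosts that run over gateways that are still up.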

#36 Updated by Ermal Luçi about 9 years ago

Actually no, pfctl is killing the right states. That is all it has to kill.
Can you please show me the src nodes states after running the pfctl -b command.

Should be 'pfctl -v -s Sources'

I suspect that it might have an impact on this.

#37 Updated by Mark Huijgen about 9 years ago

This command seems to be giving no output at all, whether I run it before or after the pfctl -b.

#38 Updated by Ermal Luçi almost 9 years ago

Try the latest snapshots; more state killing has been added.

#39 Updated by Mark Huijgen almost 9 years ago

Tried the extra state kills on the CLI since snapshot hasn't been updated yet:

It still doesn't do enough. For the ping I'm doing from the LAN IP 10.10.10.144 to 74.125.95.93 I get two states:
icmp 74.125.95.93:63260 <- 10.10.10.144 0:0
icmp 10.10.10.144:63260 -> 192.168.102.2:34763 -> 74.125.95.93 0:0

After running
pfctl -b 192.168.102.2 ; pfctl -K 192.168.102.2 ; pfctl -k 192.168.102.2 ; pfctl -k 0.0.0.0/0 -k 192.168.102.2 ;
killed 2 states from 1 gateway
killed 0 src nodes from 1 sources and 0 destinations
killed 0 states from 1 sources and 0 destinations
killed 0 states from 1 sources and 1 destinations

Only the pfctl -b kills states. One state is from the apinger and the other is
the state below.
icmp 10.10.10.144:63260 -> 192.168.102.2:34763 -> 74.125.95.93 0:0

The other state remains. When the next ping comes in, both states are back and the ping still times out.
Is there any command run yet that kills the following state?
icmp 74.125.95.93:63260 <- 10.10.10.144 0:0

When I do
pfctl -k 10.10.10.144 -k 74.125.95.93
killed 2 states from 1 sources and 1 destinations

ping starts working again.

#40 Updated by Ermal Luçi almost 9 years ago

OK, I got a handle on it with your help.
We have to kill states of the form 'icmp 74.125.95.93:63260 <- 10.10.10.144 0:0' only if they have cached the 'downed' gateway. I implemented this in pf(4), so test the latest snapshot to see if it actually behaves as expected.

#41 Updated by Mark Huijgen almost 9 years ago

I'm happy to report we are close to solving this issue!

Manually executing:
pfctl -b 192.168.102.2 -b 192.168.102.1
after the gw in the previous example went down, kills both states now!

However the code in filter.inc is never called, see function filter_delete_states_for_down_gateways() in filter.inc:

foreach ($a_gateways as $gwip => $gateway) {
    if (stristr($status['status'], "down")) {

$status['status'] is not defined and thus this statement will never be true, changing it to $gateway['status'] fixed this.

The second problem is that pfctl -b is called with the wrong 2nd IP.
In the example above it calls pfctl with the following args:

mwexec("/sbin/pfctl -b {$gateway['srcip']} -b {$gwip}");

For the example earlier this results in "pfctl -b 192.168.102.2 -b 74.125.77.99" on my system.
74.125.77.99 is the IP used by apinger to test if the gateway is up, it is not the ip of the gateway itself (which is 192.168.102.1).
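
The two fixes described in this comment can be illustrated with the sketch below. It is not the filter.inc code; the dictionary layout and the key names `status`, `srcip`, `gateway` and `monitor` are assumptions made for the example. It checks each gateway's own status and passes the gateway address, not the apinger monitor address, as the second argument. Nothing is executed.

```python
def flush_commands(gateways):
    """Build 'pfctl -b <local ip> -b <gateway ip>' argument lists for down gateways."""
    commands = []
    for name, gw in gateways.items():
        # Fix 1: test the status of the gateway being iterated over.
        if "down" not in gw["status"].lower():
            continue
        # Fix 2: use the gateway's own address, not the monitor address.
        commands.append(["/sbin/pfctl", "-b", gw["srcip"], "-b", gw["gateway"]])
    return commands


if __name__ == "__main__":
    # Addresses from the example in this comment; the key names are hypothetical.
    sample = {
        "WAN2": {
            "status": "down",
            "srcip": "192.168.102.2",
            "gateway": "192.168.102.1",
            "monitor": "74.125.77.99",
        }
    }
    for cmd in flush_commands(sample):
        print(" ".join(cmd))
```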

#42 Updated by Ermal Luçi almost 9 years ago

Try the latest snapshot or the commits related to this.
They should get the proper IPs to pass to pfctl, though I plan to improve this more later on.

#43 Updated by Mark Huijgen almost 9 years ago

Almost works, 2nd arg is set correctly with gw ip now, but local ip is missing now.
Line 163 filter.inc

$cmd = "/sbin/pfctl -b {$gateway['srcip']} ";

Should read

$cmd = "/sbin/pfctl -b {$gwstatus['srcip']} ";

After this change fail over works as expected.

#44 Updated by Ermal Luçi almost 9 years ago

Thank you for the support on this.
If you confirm it works as expected we can close this.

#45 Updated by Mark Huijgen almost 9 years ago

Confirmed fail over is working as expected for me in build 2.0-BETA4 (i386) built on Thu Aug 26 09:00:28 EDT 2010
Tested with 3 statically configured (VLAN) interfaces setup to do NAT for the LAN network and put into a gateway group. I'm unable to test it with pppoe.

#46 Updated by Perry Mason almost 9 years ago

When WAN fails over to WAN2, the states do get cleared. But when the WAN connection recovers, the states don't get cleared and my VoIP connection can't receive calls.

#47 Updated by Chris Buechler almost 9 years ago

  • Status changed from Feedback to Resolved

The point of this particular issue is to clear states on a failed connection, which works now. Killing them on a connection that's up when another connection comes online is a completely separate feature that isn't going to make 2.0. Sticky connections will keep that from being a problem, though it will prevent it from switching back over without manual intervention. Killing states on a connection that's up is going to be undesirable in most all circumstances, though it's something to consider in the future.

#48 Updated by Perry Mason almost 9 years ago

I disagree, strangely enough :). The feature request took something of a turn, as it could solve other problems, but to me it seems obvious that the VoIP failover part would also include the recovery part.
Nevertheless, I do think a solution can make it into 2.0 if a field were available, based on what was outlined in the description.

- Replace afterfilterchange with apinger so states only get reset when a failover happens and not on every filter change.
- When states get cleared, only the states for a single IP or a group of IPs are affected, not every entry in the state table.

#49 Updated by Chris Buechler almost 9 years ago

Shocking. ;) Anything beyond what's already done isn't critical in the vast majority of cases, has an available workaround, and is a lot of work to implement, with tons of possibilities for introducing problems. We're going RC soon, so we're not going to do anything further than what's already done for now. The feature request for the remainder is #855; if you have other related ideas, please add them there.

#50 Updated by M Skenderian almost 5 years ago

State killing on gateway failure is great. Is there a way to do the opposite, like state killing on gateway success?

I have two WANs, one being a wireless WiMAX connection. We have a dozen or so IP phones, and this feature has helped us redirect traffic to WAN2 (the wireless uplink). The problem is that WAN2 is very slow, but it's better than nothing. Is there a way to kill the states when WAN comes back online, so that the VoIP traffic routes back to the original default route (WAN)?
