Bug #8555
closed
Selectively killing states on WAN failure
0%
Description
The current options on a WAN failure are to kill all states or none at all. In a scenario where a wireless link is installed as a backup, this means either every connection is dropped when the wireless backup link goes offline, or no states are dropped and devices fail to move over to the backup link when the main link goes offline. With something like VoIP, this can result in dropped calls when the backup connection fails, or phones going dead and not failing over if the main link fails.
Killing states was looked at in Bug #3181, and there is a comment "Wiping the entire state table is overkill, but will have to suffice for 2.1", but the code does not appear to have changed since then.
There is code in /etc/rc.kill_states that attempts to selectively kill states based on the states found on a failed interface. I have taken this, modified it, and added it to /etc/inc/filter.inc to try to handle these situations, so that connections fail over to a backup gateway without the need to kill all active states on non-failed gateways.
I have attached two patches. One takes the code from /etc/rc.kill_states and only kills the connections based on IPs which match associated NAT states, along with all connections on the interface. The other expands this code and finds and kills all connections based on IPs which match any connection state on that interface, NAT or not, IPv4 and IPv6.
There is a situation where, if certain IP pairs have connections out two different gateways (for example, if different connections from the same source to the same destination were routed out two different gateways), the connections going through the non-failed gateway will be dropped as well. This is still less of an impact than killing all states in the state table.
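To illustrate the idea (this is not the attached patch, which is PHP in /etc/inc/filter.inc), here is a minimal Python sketch of the selection logic: collect the source/destination IP pairs seen in states on the failed interface, then build one `pfctl -k <src> -k <dst>` command per pair. The function names, the interface names `em0`/`em1`, and the sample state lines are hypothetical, and the parser assumes state lines shaped like `pfctl -ss` output (interface, protocol, endpoints around an arrow, state), which may differ between versions.

```python
def strip_port(endpoint):
    """Drop NAT parentheses and the port from 'addr:port' or 'addr[port]'."""
    endpoint = endpoint.strip("()")
    if endpoint.endswith("]"):          # IPv6 style: 2001:db8::1[443]
        return endpoint[: endpoint.index("[")]
    if endpoint.count(":") == 1:        # IPv4 style: 192.0.2.1:443
        return endpoint.split(":")[0]
    return endpoint                     # bare address, no port


def ip_pairs_on_interface(state_lines, ifname):
    """Return the set of (src, dst) IP pairs for states on ifname."""
    pairs = set()
    for line in state_lines:
        tokens = line.split()
        # Expect at least: iface proto endpoint arrow endpoint state
        if len(tokens) < 6 or tokens[0] != ifname:
            continue
        body = tokens[2:-1]             # endpoints and arrow, without state
        arrow = "->" if "->" in body else "<-" if "<-" in body else None
        if arrow is None:
            continue
        idx = body.index(arrow)
        left = [strip_port(t) for t in body[:idx]]
        right = [strip_port(t) for t in body[idx + 1:]]
        if arrow == "<-":               # arrow points at the source side
            left, right = right, left
        # A NAT state carries two addresses on one side; pair up both.
        for src in left:
            for dst in right:
                pairs.add((src, dst))
    return pairs


def kill_commands(pairs):
    """Build (but do not run) one pfctl kill command per IP pair."""
    return [["/sbin/pfctl", "-k", src, "-k", dst] for src, dst in sorted(pairs)]


# Hypothetical sample data: em0 is the failed WAN, em1 is still up.
sample = [
    "em0 tcp 198.51.100.2:61000 (192.168.1.10:51234) -> 203.0.113.5:443 ESTABLISHED:ESTABLISHED",
    "em0 udp 2001:db8::10[5060] -> 2001:db8:1::5[5060] MULTIPLE:MULTIPLE",
    "em1 tcp 192.168.1.10:40000 -> 203.0.113.5:443 ESTABLISHED:ESTABLISHED",
    "em1 tcp 192.168.1.20:40001 -> 203.0.113.9:22 ESTABLISHED:ESTABLISHED",
]
failed_pairs = ip_pairs_on_interface(sample, "em0")
commands = kill_commands(failed_pairs)
```

Because `pfctl -k` matches on hosts rather than on interface, the pair 192.168.1.10 to 203.0.113.5 collected from em0 would also match the em1 state between the same two hosts, which is the side effect described above. The unrelated em1 state (192.168.1.20 to 203.0.113.9) is left alone.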
Possible improvements to these patches:
- Move this code into its own function if the logic can be shared by these two areas.
- Fix the code path so that routing fails over to a backup gateway before the states are killed. The code to kill states seems to be called multiple times (some in different threads) on gateway failover. I've noted that after the first call to kill states, new connection attempts made immediately afterwards may still go out the failed gateway. Further calls to kill states happen subsequently and the connections will eventually fail over, but this seems to take longer than necessary.
On a side note, I also discovered that the original code in /etc/rc.kill_states has a bug preventing it from working as expected (Bug #8554).
Files
Related issues