Bug #9450

Multiwan gateway group fail-over not working as expected (possible race condition)

Added by nasir ahmed about 2 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Multi-WAN
Target version:
Start date: 04/03/2019
Due date:
% Done: 0%
Estimated time:
Affected Version: 2.4.x
Affected Architecture: All

Description

Multi-WAN gateway group failover does not work as expected. When dpinger triggers a link state change (rc.gateway_alarm is called) because a higher-priority link has recovered, the rc.filter_configure_sync script can fail to add the recovered gateway back to the gateway group because of a race condition.

The race condition:
rc.filter_configure_sync calls the get_gwgroup_members_inner() function in gwlb.inc, which manages gateway inclusion and exclusion. This function relies on get_dpinger_status(), which reads the live gateway status from the dpinger instance monitoring the triggering gateway. By the time get_dpinger_status() reads the current values, they may have fluctuated back into the gateway-down range.
This prevents get_gwgroup_members_inner() from reactivating the gateway. In contrast, dpinger waits for the "Alert interval" period, which defaults to 1000 milliseconds, before checking again for alarm conditions. By that time the average loss and delay values may have moved out of the alarm range again, so dpinger may not trigger another gateway-down alarm.
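
To make the timing easier to follow, here is a minimal Python sketch of the sequence described above. It is a simplified model, not pfSense or dpinger code: the threshold values, the tick-based timeline, and the names in_alarm(), simulate(), and reload_delay are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical alarm thresholds, chosen only for this illustration.
LOSS_ALARM_PCT = 20.0
DELAY_ALARM_MS = 500.0


@dataclass
class Sample:
    loss_pct: float
    delay_ms: float


def in_alarm(sample):
    """True when the sample falls in the gateway-down range."""
    return sample.loss_pct >= LOSS_ALARM_PCT or sample.delay_ms >= DELAY_ALARM_MS


def simulate(timeline, alert_interval, reload_delay):
    """Model the monitor and the filter reload on a shared timeline.

    The monitor evaluates alarm conditions only every `alert_interval` ticks.
    Each state change triggers a reload that re-reads the LIVE values
    `reload_delay` ticks later instead of using the triggering values.
    """
    monitor_down = True   # the gateway starts in the down state
    firewall_down = True
    events = []
    for tick, sample in enumerate(timeline):
        if tick % alert_interval != 0:
            continue
        now_down = in_alarm(sample)
        if now_down != monitor_down:
            monitor_down = now_down
            events.append((tick, "down" if now_down else "up"))
            # The reload runs a little later and sees whatever is live then.
            live = timeline[min(tick + reload_delay, len(timeline) - 1)]
            firewall_down = in_alarm(live)
    return monitor_down, firewall_down, events


# Loss recovers at tick 4, spikes at tick 5, then settles again by tick 8.
losses = [30, 30, 30, 30, 10, 25, 10, 10, 10]
timeline = [Sample(loss_pct=l, delay_ms=20.0) for l in losses]
result = simulate(timeline, alert_interval=4, reload_delay=1)
```

In this model the monitor reports the gateway as up at tick 4, the reload reads the spike at tick 5 and keeps the gateway out of the group, and the next check at tick 8 sees no state change, so no further event corrects the mismatch.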

This bug results in pfSense and dpinger maintaining mismatched internal states for that particular gateway. In the described scenario the gateway status will be UP in dpinger and DOWN in pfSense. The mismatch persists until a new gateway event is triggered or a filter reload is called for any other reason.

As a solution, I would suggest that the dpinger trigger values be written to a file by rc.gateway_alarm, kept for at least the "Alert interval" period, and then deleted. Furthermore, get_gwgroup_members_inner() should read the gateway loss and delay values from this file for as long as it exists.
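
The suggested approach could look roughly like the following Python sketch. It only illustrates the idea of a short-lived snapshot file and is not a patch for rc.gateway_alarm or gwlb.inc: the JSON layout, the file name, and the functions write_trigger_snapshot(), read_gateway_status(), and demo() are hypothetical.

```python
import json
import os
import tempfile
import time


def write_trigger_snapshot(path, loss_pct, delay_ms, now=None):
    """Store the values that triggered the alarm, with a timestamp."""
    record = {
        "loss_pct": loss_pct,
        "delay_ms": delay_ms,
        "written_at": time.time() if now is None else now,
    }
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(record, fh)
    os.replace(tmp_path, path)  # atomic rename, readers never see a partial file


def read_gateway_status(path, live_status, alert_interval_s, now=None):
    """Prefer the snapshot while it is fresh, otherwise use live values."""
    now = time.time() if now is None else now
    try:
        with open(path) as fh:
            record = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return live_status
    if now - record["written_at"] <= alert_interval_s:
        return {"loss_pct": record["loss_pct"], "delay_ms": record["delay_ms"]}
    # The snapshot has expired: delete it and fall back to live values.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
    return live_status


def demo():
    """Read once inside the alert interval and once after it has passed."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "wan1_trigger.json")  # hypothetical name
        live = {"loss_pct": 25.0, "delay_ms": 40.0}       # values mid-spike
        write_trigger_snapshot(path, loss_pct=5.0, delay_ms=30.0, now=100.0)
        fresh = read_gateway_status(path, live, alert_interval_s=1.0, now=100.5)
        stale = read_gateway_status(path, live, alert_interval_s=1.0, now=102.0)
        return fresh, stale, os.path.exists(path)
```

Here the reader deletes the file once it has expired. The proposal leaves open which component performs the deletion, so that choice is an assumption of the sketch.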

History

#1 Updated by Jim Pingle about 2 months ago

  • Category set to Multi-WAN
  • Priority changed from High to Normal
  • Target version set to 2.5.0
  • Affected Version set to 2.4.x
  • Affected Architecture set to All
