Multiwan gateway group fail-over not working as expected (possible race condition)
Multiwan gateway group fail-over is not working as expected. When dpinger triggers a link state change (rc.gateway_alarm is called) because a higher-priority link has recovered, the rc.filter_configure_sync script fails to add the recovered gateway back to the gateway group because of a race condition.
The race condition:
rc.filter_configure_sync calls the get_gwgroup_members_inner() function in gwlb.inc, which manages gateway inclusion/exclusion. But this function relies on get_dpinger_status(), which reads the live gateway status from the dpinger instance monitoring the triggering gateway. By the time get_dpinger_status() reads the current values, they may have fluctuated back into the gateway-down range.
This prevents get_gwgroup_members_inner() from reactivating the gateway. In contrast, dpinger waits for the "Alert interval" period, which defaults to 1000 milliseconds, before checking again for alarm conditions. By that time the average loss and delay values may have moved out of the alarm range again, so dpinger may not trigger another gateway-down alarm.
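To make the timing issue concrete, here is a minimal Python sketch of the two evaluations (not pfSense code). The thresholds, sample values, and the is_down() helper are all made up for illustration; the real values come from the gateway settings and from dpinger.

```python
# Hypothetical thresholds; the real ones come from the gateway configuration.
LOSS_DOWN_PCT = 20
LATENCY_DOWN_MS = 500


def is_down(loss_pct, latency_ms):
    """Return True when either average is inside the gateway-down range."""
    return loss_pct >= LOSS_DOWN_PCT or latency_ms >= LATENCY_DOWN_MS


# Averages dpinger evaluated when it raised the recovery alarm (assumed values).
at_alarm = {"loss_pct": 15, "latency_ms": 40}
# Averages read live a moment later by the filter reload (assumed values).
at_read = {"loss_pct": 22, "latency_ms": 40}

# dpinger decided on the first sample, the filter reload on the second one.
dpinger_state = "down" if is_down(**at_alarm) else "up"
pfsense_state = "down" if is_down(**at_read) else "up"

print(dpinger_state, pfsense_state)
```

With these sample values the two sides disagree, which is the mismatch described in this report.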
This bug results in pfSense and dpinger holding mismatched internal states for that particular gateway. In the described scenario the gateway status will be UP in dpinger and DOWN in pfSense. The mismatch persists until a new gateway event is triggered or a filter reload is called for any other reason.
As a solution I would suggest that the dpinger trigger values be written to a file by rc.gateway_alarm, kept for at least the "Alert interval" period, and then deleted. Furthermore, get_gwgroup_members_inner() should read the gateway loss and delay values from this file for as long as it exists.
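A rough sketch of the suggested approach, written in Python only to illustrate the idea (the actual change would go into rc.gateway_alarm and gwlb.inc). The file layout, the function names write_snapshot() and read_status(), and the live_reader callback are hypothetical and not part of pfSense.

```python
import json
import os
import time

# "Alert interval" default of 1000 ms, expressed in seconds.
ALERT_INTERVAL_S = 1.0


def write_snapshot(path, loss_pct, latency_ms, now=None):
    """Store the values that triggered the alarm, with an expiry time."""
    now = time.time() if now is None else now
    data = {
        "loss_pct": loss_pct,
        "latency_ms": latency_ms,
        "expires": now + ALERT_INTERVAL_S,
    }
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(data, f)
    # Atomic rename so a reader never sees a half-written file.
    os.replace(tmp, path)


def read_status(path, live_reader, now=None):
    """Prefer the snapshot while it is valid, otherwise read live values."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            snap = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return live_reader()
    if now >= snap["expires"]:
        # Snapshot is older than the alert interval: delete it and go live.
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
        return live_reader()
    return {"loss_pct": snap["loss_pct"], "latency_ms": snap["latency_ms"]}
```

In this sketch the alarm handler would call write_snapshot() with the values dpinger reported, and the group membership check would call read_status() with a callback that queries dpinger directly. Using an expiry timestamp inside the file avoids needing a separate cleanup job.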