WireGuard MultiWAN Not Failing Back to Tier 1
When using a gateway group for WAN failover, WireGuard fails over to Tier 2 when the Tier 1 gateway goes down. However, when the Tier 1 gateway is restored, WireGuard does not revert back to Tier 1.
I have run pings on 220.127.116.11 and I see traffic revert to Tier 2 and then back to Tier 1 as expected. However, I am not seeing this with WireGuard in pfSense Plus 21.02_1. I had performed similar tests previously using one of the 2.5.0 CE nightly builds and it worked flawlessly.
I'm using the latest stable OS on an SG-3100 and an XG-7100-1U.
#4 Updated by Christian McDonald about 1 month ago
I'm seeing this on 2.5.0 as well. I have a failover group set as the default IPv4 gateway. WAN1 dropped out and WG started going out WAN2 as expected. But when WAN1 came back, WG didn't revert back to WAN1 as expected.
Edit1: I was able to "fix" this by going into the remote WG peer (which happens to also be a pfSense VM) and re-saving the tunnels under VPN > WireGuard.
Edit2: Weird, even though gateway monitoring against the remote peer is dead, it is only dead in one direction. For instance, I have two pfSense nodes. PF1 has a static public address and PF2 is a dynamic peer. PF2 has two WANs in a failover group (WAN1 preferred over WAN2). When WAN1 went down, PF2 started pushing WG traffic out WAN2 as expected, and running 'wg show' on PF1 showed the correct endpoint address change. However, as soon as WAN1 came back, gateway monitoring of PF1 from PF2 appeared "online", but monitoring of PF2 from PF1 stayed "offline" until I re-saved the tunnels as in Edit1. What's even weirder is that even though gateway monitoring of PF2 from PF1 appeared "offline", my OSPF adjacencies remained up and I was still able to route traffic over the tunnel bi-directionally.
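For anyone trying to reproduce this, the endpoint check I did on PF1 can be scripted with the standard wg(8) query `wg show <interface> endpoints`. A minimal sketch; the interface name is a placeholder and the helper itself is hypothetical:

```shell
# Hypothetical helper: print the endpoint the kernel currently has
# cached for each peer on a WireGuard interface. After a failover you
# should see the dynamic peer's new WAN address here.
show_endpoints() {
  wg show "$1" endpoints
}

# Usage (as root, on a box with an active tunnel):
#   show_endpoints wg0
```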
#5 Updated by James Blanton about 1 month ago
What I've found is that the tunnels don't fail back unless you do something to interfere with WireGuard, such as disabling and re-enabling the tunnel or stopping all traffic for at least 60 seconds. Unfortunately, the "wg-quick" tools are also not included, so there is no way to systematically bounce the tunnel; it must be done manually.
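By "bounce the tunnel" I mean something like the following. This is only a sketch, assuming shell access on the pfSense box, root privileges, and a placeholder interface name; cycling the interface with ifconfig is the only mechanism assumed here:

```shell
# Hypothetical workaround: cycle a WireGuard interface so it
# re-establishes its handshake (and, in my testing, picks the
# preferred gateway again). $1 is the wg interface name, e.g. wg1.
bounce_wg() {
  ifconfig "$1" down && ifconfig "$1" up
}

# Usage (as root): bounce_wg wg1
```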
This is quite an issue for me, as we just ordered a couple of Netgate devices for a pilot of a new network. If we deploy this, I'm going to need around 15 appliances, but I can't really justify moving forward with testing until this issue is corrected. I'm hoping that someone from Netgate can help out with this soon.
#6 Updated by Christian McDonald about 1 month ago
If anybody from Netgate would like to jump into a Zoom meeting so that they can observe this edge case, just reach out to me.
Here is a video of the behavior: https://i.imgur.com/nJFI4P1.gif
In this video, you can see that wg1 (10.1.14.0/31) is dead. The video is from the static endpoint: 10.1.14.0 is the static peer, and 10.1.14.1 is the remote dynamic peer. From the perspective of the peer in the video, 10.1.14.1 is not pingable. Though not shown here, from the perspective of the remote peer, 10.1.14.0 is pingable. Strange. In the video I just bounce the wg1 tunnel, and you can observe that 10.1.14.1 becomes pingable again.
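For what it's worth, instead of bouncing the whole interface, it should also be possible to overwrite the stale cached endpoint directly with the standard wg(8) command `wg set <if> peer <pubkey> endpoint <host:port>`. A sketch with placeholder arguments; I haven't confirmed this avoids the bug, it's just an alternative to a full bounce:

```shell
# Hypothetical alternative to bouncing the interface: point the kernel
# at the peer's current address so the next handshake goes to the right
# place. All three arguments are placeholders: interface name, peer
# public key (base64), and host:port.
rebind_endpoint() {
  wg set "$1" peer "$2" endpoint "$3"
}

# Usage (as root):
#   rebind_endpoint wg1 'PEER_PUBKEY_BASE64=' 203.0.113.10:51820
```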
Edit1: James, by chance do you have a persistent keepalive configured on your problematic tunnels?
#7 Updated by James Blanton about 1 month ago
Nope! I explored that line of thought as well. I did have it set up at one point, but then I removed it. I then completely deleted and rebuilt the tunnels, factory reset the routers and rebuilt the tunnels again, and finally set it up in a lab with VMs running the 2.5.0 stable release. All of them behave the same way.