Project

General

Profile

Actions

Bug #12399

open

WireGuard v0.1.5 - Tunnel Will Never Handshake Again After WAN Reset

Added by Ryan Roosa 29 days ago. Updated 4 days ago.

Status:
New
Priority:
High
Category:
WireGuard
Target version:
Start date:
09/23/2021
Due date:
01/01/2022 (Due in about 2 months)
% Done:

0%

Estimated time:
Plus Target Version:
Affected Version:
2.5.2
Affected Plus Version:
Affected Architecture:
amd64

Description

Hi Christian,
Really appreciate your work on the Wireguard package for pfSense :)

Sadly, there seems to be a show-stopping bug for me in the package. If my WAN connection drops due to an intentional modem reboot or other ISP issue, and is subsequently restored, my configured WireGuard tunnel in pfSense ceases to handshake after the WAN comes back even if I restart WireGuard services and/or restart pfSense completely. The only way I can get it to handshake again is by removing the configured Peer, completely removing the assigned interface from the Tunnel, deleting the tunnel, then recreating everything and re-associating all my firewall rules with the newly re-created interface. As you an can imagine this a quite a painful process.

I have tried simply disabling/re-enabling the assigned tunnel interface with an without the WireGuard services running but that does not make any different; the tunnel has to be completely removed and re-created for handshaking to occur. This issue is easily reproduceable by restarting my ISP modem. Below are the pfSense system logs during the modem reboot (WAN drop and re-establishment) along with the ifconfig output of my tun_wg0 interface, not sure if they are helpful:

Sep 23 09:24:58 rc.gateway_alarm 29027 >>> Gateway alarm: WANGW (Addr:192.168.7.1 Alarm:0 RTT:.476ms RTTsd:.426ms Loss:5%)
Sep 23 09:24:58 check_reload_status 375 updating dyndns WANGW
Sep 23 09:24:58 check_reload_status 375 Restarting ipsec tunnels
Sep 23 09:24:58 check_reload_status 375 Restarting OpenVPN tunnels/interfaces
Sep 23 09:24:58 check_reload_status 375 Reloading filter
Sep 23 09:25:00 xinetd 37636 Starting reconfiguration
Sep 23 09:25:00 xinetd 37636 Swapping defaults
Sep 23 09:25:00 xinetd 37636 readjusting service 6969-udp
Sep 23 09:25:00 xinetd 37636 Reconfigured: new=0 old=1 dropped=0 (services)

tun_wg0: flags=80c1<UP,RUNNING,NOARP,MULTICAST> metric 0 mtu 1420
description: WireGuardVPN
options=80000<LINKSTATE>
inet 10.7.0.2 netmask 0xffffffff
groups: wg WireGuard
nd6 options=101<PERFORMNUD,NO_DAD>

I do not have any other WireGuard or OpenVPN tunnels besides the one and am not using any sort of gateway groups for failover; just a very simple configuration. The 'server' peer I am connecting to from my local pfSense install is a VPS running WireGuard v1.0.20200513 and I am using a Pre-Shared key for 'client' peers including pfSense. I am using a keep-alive value of 30 seconds for the pfSense peer. Please let me know if I can supply any additional information to assist in troubleshooting. Screenshots attached showing the WireGuard in the state of disconnect after a WAN reset.


Files

2021-09-23_10h16_55.png (52.9 KB) 2021-09-23_10h16_55.png Failed handshake status Ryan Roosa, 09/23/2021 09:22 AM
2021-09-23_10h18_03.png (69.2 KB) 2021-09-23_10h18_03.png Tunnel config Ryan Roosa, 09/23/2021 09:22 AM
2021-09-23_10h20_37.png (78.1 KB) 2021-09-23_10h20_37.png Assigned Interface config Ryan Roosa, 09/23/2021 09:23 AM
2021-09-23_10h21_32.png (63.5 KB) 2021-09-23_10h21_32.png Assigned interface gateway config Ryan Roosa, 09/23/2021 09:24 AM
2021-09-23_10h19_09.png (88.2 KB) 2021-09-23_10h19_09.png Peer config Ryan Roosa, 09/23/2021 09:25 AM
Actions #1

Updated by Christian McDonald 29 days ago

Thanks for the tag, I will investigate this and circle back.

Actions #2

Updated by Ryan Roosa 29 days ago

Christian McDonald wrote in #note-1:

Thanks for the tag, I will investigate this and circle back.

Awesome! Greatly appreciate it :)

Actions #3

Updated by Ryan Roosa 29 days ago

In rebooting my ISP modem many times and tracking the behavior of pfSense and WireGuard, I observed that when the modem starts to come up and my LAN can reach the Internet again briefly, the WAN interface is reloaded by pfSense and WireGuard is able to reach the remote peer again completing one handshake successfully. However, as part of the ISP modem startup sequence (internal healthchecks, loading ISP routes, etc.) it appears the Internet becomes briefly inaccessible again for 30-60 seconds before this connectivity returns completely. It is during this second window of Internet inaccessibility after the intial modem restart that WireGuard reaches its customary 2 minute handshake interval and tries to handshake with the remote peer for a second time (after being able to reach it once already). This second handshake never takes place due to Internet connectivity temporarily being disrupted. Even when the Internet is fully reachable again by all LAN devices, no subsequent handshake attempts ever occur at this stage.

So, to summarize, if Internet connectivity to the remote peer is disrupted in the window of time between WireGuard's first and second handshake attempts after a pfSense WAN reload (due to restart of ISP modem connected to WAN), the configured tunnel in pfSense will never handshake to the assigned peer again on its own. Restarting WireGuard services or pfSense at this point yield no improvement. I cannot say with certainty at this point without further testing, but it would seem logical to me that this scenario may in fact reveal itself during other periods of WAN link instability where consecutive handshake attempts occurring at the ~2 minute interval are unsuccessful due to temporal WAN instability due to myriad reasons. Given these findings, if focus could be directed at handshake workflow and logic surrounding handshake retry attempts after a failed attempt I believe an ultimate solution can be found.

In the meantime, a workaround I found to get the tunnel handshaking with the remote peer again after this scenario occurs is to modify the tunnel listening port arbitrarily (i.e. changing it form '51820' to '51821'); saving/applying this change results in successful handshakes again. In the error scenario with my ISP modem, since WireGuard was observed to handshake once successfully after pfSense resets the WAN interface, another work around may be to re-execute the actions pfSense takes when the WAN is reloaded if handshaking attempts are observed to have ceased after a period of WAN instability. Taking that line of thought a step further, I suppose it would be possible to automate either workaround (tunnel port change or WAN reload actions) with a combo of shell scripting and the pfSense PHP Shell playback functionality but I have not gone down this road yet myself and would hope that the root cause could be resolved without resorting to band-aids like these :)

FWIW as a data point, I see that a Reddit user who replied to my OP in r/PFSENSE (https://www.reddit.com/r/PFSENSE/comments/ptx06g/wireguard_v015_tunnel_will_never_handshake_again/) indicated that they experienced similar behavior in v0.1.5 and that a downgrade to the previous release did not exhibit the issue.

Thanks again!

Actions #4

Updated by Christian McDonald 28 days ago

Just tested this on my 2100. I will test more next week.

I have a WireGuard tunnel to Mullvad.

  1. Started a persistent ping on LAN client
  2. Unplugged WAN, pings stop as expected.
  3. Plug in WAN, it takes a hot minute but eventually the first handshake is performed.
  4. As soon as I see the first handshake, I unplugged the WAN again.
  5. Now at this point we are in the period of time between the first and the second handshake.
  6. I reattach WAN...again it takes as hot minute, but eventually connectivity is restored and works correctly.

Hmm. I will test some other scenarios next week.

Actions #5

Updated by Christian McDonald 28 days ago

Ryan, out of curiosity, are you using DHCP are static addressing on your WAN?

Actions #6

Updated by Ryan Roosa 28 days ago

Christian McDonald wrote in #note-5:

Ryan, out of curiosity, are you using DHCP are static addressing on your WAN?

My WAN uses static addressing.

Actions #7

Updated by Christian McDonald 28 days ago

Interesting... I can replicate this if my WAN is using DHCP, but as soon as I switch to a static address I can unplug and replug WAN all day at any cadence and it works fine.

What is your system default gateway set to? If it is set to Automatic, set it to your WAN explicitly.

Actions #8

Updated by Ryan Roosa 28 days ago

Christian McDonald wrote in #note-7:

Interesting... I can replicate this if my WAN is using DHCP, but as soon as I switch to a static address I can unplug and replug WAN all day at any cadence and it works fine.

What is your system default gateway set to? If it is set to Automatic, set it to your WAN explicitly.

Thank you for the feedback and suggestions! I am not home at the moment but will double check when I get back that my WAN has not reverted to DHCP for some reason. Will also check my gateway setting but believe I have WAN explicitly set as much default gateway already.

Actions #9

Updated by Christian McDonald 28 days ago

Thanks.

We might need to hook the gateway alarm and trigger WireGuard service to be restarted when gateway status changes...like we already do for ipsec, openvpn, dyndns, etc.

Actions #10

Updated by Ryan Roosa 28 days ago

Christian McDonald wrote in #note-9:

Thanks.

We might need to hook the gateway alarm and trigger WireGuard service to be restarted when gateway status changes...like we already do for ipsec, openvpn, dyndns, etc.

I confirmed my WAN is static and set as default gateway. Would hooking the gateway alarm to restart WireGuard restart it in a different way than what is already available in the GUI? I restarted the services manually in the GUI a few times and it did not seem to help. The only thing that helped seemed to be changing the tunnel port number.

Actions #11

Updated by Samuel Hanna 13 days ago

The problem still persist on wireguard 0.1.5_1.
even after changing the keys and ports nothing seems to help.
wish that you can fix this issue ASAP.
my wan connection is PPPOE with 1492 mtu value, tried to lower the mtu of WG interface lower than 1420, also didn't help.

Actions #12

Updated by Ryan Roosa 7 days ago

Samuel Hanna wrote in #note-11:

The problem still persist on wireguard 0.1.5_1.
even after changing the keys and ports nothing seems to help.
wish that you can fix this issue ASAP.
my wan connection is PPPOE with 1492 mtu value, tried to lower the mtu of WG interface lower than 1420, also didn't help.

Just an update from my side...
My 'after WAN reset' wording in the bug description may be somewhat of a misnomer based on my further testing. After many hours of playing with this (as well as having it break organically again at least twice) I've been able to consistently reproduce the issue by:

  1. Taking my WAN connection offline without disrupting its Ethernet connection to pfSense (logging into my modem interface and disconnecting from the upstream ISP without disconnecting the Ethernet connection).
  2. Waiting the equivalent of two handshake intervals (2-4 minutes depending on when the last successful handshake occurred).
  3. Taking my WAN connection back online.

After step 3, tunnel handshaking suspends indefinitely and restarting Wireguard services or the WAN interface in pfSense does nothing to correct this error state even though the rest of my LAN is back online without issue.

Since I do not see any updates on a fix for this bug as of yet, I created a shell script equivalent of the workaround I discovered (changing the tunnel port arbitrarily then later reverting back to the port originally configured for consistency within the package). I really dislike using bandaids such as this but I'm not sure when this bug might be fixed and I need to insure against unknown future WAN instability killing my Wireguard tunnels.

The following is my shell script which is called every 10 minutes (*/10 * * * *) via the 'Cron' package in pfSense:

#!/bin/sh
[ $(( $(date +s) - $(wg show tun_wg0 latest-handshakes | awk '{print $2}') )) -gt 300 ] &x%x \
wg set tun_wg0 listen-port 51$(tr -cd 0-9 </dev/urandom | head -c 3) && sleep 300 && \
wg syncconf tun_wg0 /usr/local/etc/wireguard/tun_wg0.conf && \
echo "$(date) - Had to jiggle the tunnel port to fix handshaking!" >> /var/log/wg_bandaid.log
exit 0

(Updated script; do not use the one above. See: https://redmine.pfsense.org/issues/12399#note-16 .)

Breaking it down line by line:

  1. The script starts by checking if it has been longer than 5 minutes since the last handshake. If not, then the script exits without taking any action.
  2. If the last handshake took place longer than 5 minutes ago, the tunnel listen port is changed arbitrarily ('51NNN'). The script then waits 5 minutes to allow the tunnel to handshake twice successfully again before proceeding.
  3. Once the waiting interval has passed, a configuration sync is executed which updates the listen port back to what it was prior to the script changing it (it does this by comparing the running configuration to the configuration file saved at OS level by the package).
  4. A log of the script's intervention is written to '/var/log/wg_bandaid.log' for posterity.
  5. Finally, the script exits with a success code (0).

Technically the script body is just a long 'one-liner' which is broken up into multiple lines for readability and easier explanation. Of note, I found that the 'sleep' command must be at least 4-5 minutes (I settled on 5 minutes = 300 seconds) by trial and error testing. If I reverted the port any sooner, the tunnel would stop handshaking again. Maybe this is a clue to the source of the bug? Not sure, but thought it was worth sharing for those wondering why this specific interval of wait time was chosen (as an aside, sleeps are not really a hallmark of great programming in general but since this is a bandaid, I've decided to use it regardless).

One other interesting thing I found is that the tunnel configuration file stored at OS level ('/usr/local/etc/wireguard/tun_wg0.conf') is always missing the tunnel endpoint IP information. This info is obviously stored elsewhere in the pfSense package configuration since it shows up in the web interface so I'm not sure if its omission in the OS level configuration file is intentional or not but this is why I chose to use 'syncconf' as opposed to 'setconf' when having the script revert the tunnel port after arbitrarily changing it (and, as the 'wg' man page indicates, 'syncconf' is more efficient anyways with the added benefit of not further disrupting the active peer connection).

I provide no warranty with this script and it assumes your only tunnel is 'tun_wg0'. More complex setups may necessitate tweaks and/or additional logic. Caveat emptor.

Actions #13

Updated by Christian McDonald 7 days ago

Thank you for the detailed report here. This is immensely helpful. I will continue to poke at this next week and report back. It would be interesting to see if this behavior also happens on vanilla FreeBSD, I will test this as well. If this happens to be reproducible there as well, then that's a good indicator of an upstream issue.

And yes, the endpoint is not stored in the configuration file because of reasons why how WireGuard handles name resolution. WireGuard tools only performs resolution once, at tunnel creation. So it is left up to additional tooling to nudge WireGuard to re-resolve these endpoints. The rationale here, is that endpoint addresses at the level of the kernel code is fixed length, 32-128bit, so leaving the DNS resolution out of the kernel and leaving it to userspace tooling eliminates parsing vuls, etc.

Actions #14

Updated by Samuel Hanna 7 days ago

Ryan Roosa wrote in #note-12:

Samuel Hanna wrote in #note-11:

The problem still persist on wireguard 0.1.5_1.
even after changing the keys and ports nothing seems to help.
wish that you can fix this issue ASAP.
my wan connection is PPPOE with 1492 mtu value, tried to lower the mtu of WG interface lower than 1420, also didn't help.

Just an update from my side...
My 'after WAN reset' wording in the bug description may be somewhat of a misnomer based on my further testing. After many hours of playing with this (as well as having it break organically again at least twice) I've been able to consistently reproduce the issue by:

  1. Taking my WAN connection offline without disrupting its Ethernet connection to pfSense (logging into my modem interface and disconnecting from the upstream ISP without disconnecting the Ethernet connection).
  2. Waiting the equivalent of two handshake intervals (2-4 minutes depending on when the last successful handshake occurred).
  3. Taking my WAN connection back online.

After step 3, tunnel handshaking suspends indefinitely and restarting Wireguard services or the WAN interface in pfSense does nothing to correct this error state even though the rest of my LAN is back online without issue.

Since I do not see any updates on a fix for this bug as of yet, I created a shell script equivalent of the workaround I discovered (changing the tunnel port arbitrarily then later reverting back to the port originally configured for consistency within the package). I really dislike using bandaids such as this but I'm not sure when this bug might be fixed and I need to insure against unknown future WAN instability killing my Wireguard tunnels.

The following is my shell script which is called every 10 minutes (*/10 * * * *) via the 'Cron' package in pfSense:

#!/bin/sh
[ $(( $(date +s) - $(wg show tun_wg0 latest-handshakes | awk '{print $2}') )) -gt 300 ] &x%x \
wg set tun_wg0 listen-port 51$(tr -cd 0-9 </dev/urandom | head -c 3) && sleep 300 && \
wg syncconf tun_wg0 /usr/local/etc/wireguard/tun_wg0.conf && \
echo "$(date) - Had to jiggle the tunnel port to fix handshaking!" >> /var/log/wg_bandaid.log
exit 0

Breaking it down line by line:

  1. The script starts by checking if it has been longer than 5 minutes since the last handshake. If not, then the script exits without taking any action.
  2. If the last handshake took place longer than 5 minutes ago, the tunnel listen port is changed arbitrarily ('51NNN'). The script then waits 5 minutes to allow the tunnel to handshake twice successfully again before proceeding.
  3. Once the waiting interval has passed, a configuration sync is executed which updates the listen port back to what it was prior to the script changing it (it does this by comparing the running configuration to the configuration file saved at OS level by the package).
  4. A log of the script's intervention is written to '/var/log/wg_bandaid.log' for posterity.
  5. Finally, the script exits with a success code (0).

Technically the script body is just a long 'one-liner' which is broken up into multiple lines for readability and easier explanation. Of note, I found that the 'sleep' command must be at least 4-5 minutes (I settled on 5 minutes = 300 seconds) by trial and error testing. If I reverted the port any sooner, the tunnel would stop handshaking again. Maybe this is a clue to the source of the bug? Not sure, but thought it was worth sharing for those wondering why this specific interval of wait time was chosen (as an aside, sleeps are not really a hallmark of great programming in general but since this is a bandaid, I've decided to use it regardless).

One other interesting thing I found is that the tunnel configuration file stored at OS level ('/usr/local/etc/wireguard/tun_wg0.conf') is always missing the tunnel endpoint IP information. This info is obviously stored elsewhere in the pfSense package configuration since it shows up in the web interface so I'm not sure if its omission in the OS level configuration file is intentional or not but this is why I chose to use 'syncconf' as opposed to 'setconf' when having the script revert the tunnel port after arbitrarily changing it (and, as the 'wg' man page indicates, 'syncconf' is more efficient anyways with the added benefit of not further disrupting the active peer connection).

I provide no warranty with this script and it assumes your only tunnel is 'tun_wg0'. More complex setups may necessitate tweaks and/or additional logic. Caveat emptor.

thanks for your help. but i have detected that the handshake is getting interrupted by my ISP. as they started to block whole wireguard protocol all over the country. :D

Actions #15

Updated by Ryan Roosa 4 days ago

Christian McDonald wrote in #note-13:

Thank you for the detailed report here. This is immensely helpful. I will continue to poke at this next week and report back. It would be interesting to see if this behavior also happens on vanilla FreeBSD, I will test this as well. If this happens to be reproducible there as well, then that's a good indicator of an upstream issue.

And yes, the endpoint is not stored in the configuration file because of reasons why how WireGuard handles name resolution. WireGuard tools only performs resolution once, at tunnel creation. So it is left up to additional tooling to nudge WireGuard to re-resolve these endpoints. The rationale here, is that endpoint addresses at the level of the kernel code is fixed length, 32-128bit, so leaving the DNS resolution out of the kernel and leaving it to userspace tooling eliminates parsing vuls, etc.

No problem! Appreciate your continued attention on this one.

Thanks for highlighting the additional information surrounding DNS resolution which explains why the endpoint is not stored in the OS level configuration file. Your explanation leads me to some additional questions:

  1. Does the WireGuard package for pfSense (or WG in general) still try resolution even if the tunnel is configured using an IP address for the endpoint?
  2. In the case that the tunnel endpoint is specified as an IP address should the best practice be to set "Endpoint Hostname Resolve Interval" to "0" to disable this periodic resolution check in order to prevent some issue that might be encountered when attempting to resolve a host entered as an IP?
Actions #16

Updated by Ryan Roosa 4 days ago

FWIW, just wanted to share updates I've made to my bandaid script. I found that 'head -c' usage on '/dev/urandom' likes to hang and not close the device properly when the script runs in background as a cron job. So I switched that part to use 'jot' for random port number generation and wrapped the core one-liner with some basic pidfile logic to keep the script from stacking executions. The latter allows calling the script more frequently (I have it running every 5 minutes now). Finally, added very rudimentary log cleanup to keep it from growing too large in case something goes nuts with it.

#!/bin/sh

LOGFILE=/var/log/wg_bandaid.log
PIDFILE=/var/run/wg_bandaid.pid
TUNNAME=tun_wg0
TUNCONF=/usr/local/etc/wireguard/$TUNNAME.conf

# Preliminary logic to ensure this only runs one instance at a time
if [ -f $PIDFILE ]
then
  PID=$(cat $PIDFILE)
  ps -p $PID > /dev/null 2>&1
  if [ $? -eq 0 ]
  then
    echo "$(date) - Process already running. Exiting." >> $LOGFILE
    exit 1
  else
    echo $$ > $PIDFILE
    if [ $? -ne 0 ]
    then
      echo "$(date) - Could not create PID file. Exiting." >> $LOGFILE
      exit 1
    fi
  fi
else
  echo $$ > $PIDFILE
  if [ $? -ne 0 ]
  then
    echo "$(date) - Could not create PID file. Exiting." >> $LOGFILE
    exit 1
  fi
fi

# Main command
[ $(( $(date +%s) - $(wg show $TUNNAME latest-handshakes | awk '{print $2}') )) -gt 300 ] && \
wg set $TUNNAME listen-port 51$(jot -r 1 100 999) && sleep 300 && \
wg syncconf $TUNNAME $TUNCONF && \
echo "$(date) - Had to jiggle $TUNNAME listen-port to fix handshaking!" >> $LOGFILE

rm $PIDFILE

# Keep log from getting too large
echo "$(tail -1000 $LOGFILE)" > $LOGFILE
exit 0
Actions

Also available in: Atom PDF