Project

General

Profile

Actions

Bug #15435

open

Long boot time when using FQDN for WireGuard VPN endpoint

Added by Patrik Stahlman 12 months ago. Updated 2 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
WireGuard
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Affected Version:
Affected Plus Version:
24.03
Affected Architecture:
All

Description

(Additional details in https://forum.netgate.com/post/1163707)

My reason for investigating this was boot times of more than five minutes, seemingly not having a real cause. The syslog didn't provide any answers as it is tuned off for a long period of time, and subsequent dumping of buffered log items have incorrect time stamps:

Apr 19 22:29:33     kernel         pflog0: promiscuous mode disabled
Apr 19 22:29:33     syslogd         kernel boot file is /boot/kernel/kernel
Apr 19 22:25:04     syslogd         exiting on signal 15
Apr 19 22:25:04     reboot     54300     rebooted by root

I took a video of the console as pfSense was booting and the only thing going on was the slow establishment of wireguard tinnels (screenshot in the forum post). Video available on demand (not Oscars' material)

I subsequently hacked /etc/rc.bootup to get my bearings on what might be taking up the time. system_do_shell_commands(1) stood out as being the culprit to investigate:


24/04/2024 11:09:19 L248: system_do_shell_commands(1);
24/04/2024 11:12:48 L252: load_thermal_hardware();

Full trace:

Logging /etc/rc.bootup_with_trace
24/04/2024 11:09:16 L75: led_normalize();
24/04/2024 11:09:16 L122: notices_setup();
24/04/2024 11:09:16 L138: system_dmesg_save();
24/04/2024 11:09:16 L141: system_check_reset_button();
24/04/2024 11:09:16 L167: mute_kernel_msgs();
24/04/2024 11:09:16 L168: start_devd();
24/04/2024 11:09:18 L169: set_device_perms();
24/04/2024 11:09:18 L170: unmute_kernel_msgs();
24/04/2024 11:09:18 L174: parse_config_bootup();
24/04/2024 11:09:18 L177: mwexec("/usr/sbin/gnid > {$g['vardb_path']}/uniqueid 2>/dev/null");
24/04/2024 11:09:18 L194: convert_config();
24/04/2024 11:09:19 L230: load_crypto();
24/04/2024 11:09:19 L235: system_setup_sysctl();
24/04/2024 11:09:19 L248: system_do_shell_commands(1);
24/04/2024 11:12:48 L252: load_thermal_hardware();
24/04/2024 11:12:48 L255: system_timezone_configure();
24/04/2024 11:12:48 L258: system_hostname_configure();
24/04/2024 11:12:50 L261: system_hosts_generate();
24/04/2024 11:12:55 L264: enable_carp(false);
24/04/2024 11:12:55 L267: interfaces_loopback_configure();
24/04/2024 11:12:55 L270: ca_setup_trust_store();
24/04/2024 11:12:55 L277:       exec("/sbin/zfs get -H compression /var/log | /usr/bin/awk '{print $3;}'", $varlog_compression_state);
24/04/2024 11:12:55 L285: system_syslogd_start();
24/04/2024 11:12:57 L288: set_pam_auth();
24/04/2024 11:12:57 L296: switches_configure();
24/04/2024 11:12:57 L300: setup_microcode();
24/04/2024 11:12:58 L321: interfaces_configure();
24/04/2024 11:13:07 L322: interfaces_sync_setup();
24/04/2024 11:13:07 L328: system_hosts_generate();
24/04/2024 11:13:07 L334:       send_event("service reload sshd");
24/04/2024 11:13:07 L340: openvpn_resync_all();
24/04/2024 11:13:07 L344: system_resolvconf_generate();
24/04/2024 11:13:07 L347: filter_pflow_configure();
24/04/2024 11:13:07 L350: filter_configure_sync();
24/04/2024 11:13:09 L354: filter_pflog_start();
24/04/2024 11:13:09 L358: setup_gateways_monitor();
24/04/2024 11:13:11 L362: system_routing_configure();
24/04/2024 11:13:12 L371:       services_dnsmasq_configure();
24/04/2024 11:13:12 L374:       services_unbound_configure();
24/04/2024 11:13:22 L378: local_reset_accounts();
24/04/2024 11:13:22 L383: configure_cron();
24/04/2024 11:13:22 L387: system_routing_enable();
24/04/2024 11:13:22 L424:       mwexec("/usr/bin/timeout -k 45 30 /usr/local/sbin/ntpd -g -q -c /dev/null {$ntp_boot_time_servers}");
24/04/2024 11:13:29 L429: system_ntp_configure();
24/04/2024 11:13:29 L437:       system_webgui_start();
24/04/2024 11:13:31 L441: console_configure();
24/04/2024 11:13:31 L444: services_dhcpd_configure();
24/04/2024 11:13:31 L447: system_dhcpleases_configure();
24/04/2024 11:13:31 L450: services_dhcrelay_configure();
24/04/2024 11:13:31 L453: services_dhcrelay6_configure();
24/04/2024 11:13:31 L456: send_event("service reload dyndnsall");
24/04/2024 11:13:31 L459: filter_configure_sync();
24/04/2024 11:13:32 L462: vpn_setup();
24/04/2024 11:13:32 L465: captiveportal_configure();
24/04/2024 11:13:32 L468: voucher_configure();
24/04/2024 11:13:32 L471: system_do_shell_commands();
24/04/2024 11:13:32 L474: $ipsec_dynamic_hosts = ipsec_configure();
24/04/2024 11:13:34 L477: services_snmpd_configure();
24/04/2024 11:13:36 L480: system_set_harddisk_standby();
24/04/2024 11:13:36 L483: reload_ttys();
24/04/2024 11:13:36 L486: enable_rrd_graphing();
24/04/2024 11:13:37 L489: enable_watchdog();
24/04/2024 11:13:37 L519: services_igmpproxy_configure();
24/04/2024 11:13:37 L522: upnp_start();
24/04/2024 11:13:37 L525: activate_powerd();
24/04/2024 11:13:37 L528: prefer_ipv4_or_ipv6();
24/04/2024 11:13:37 L536:       register_all_installed_packages();
24/04/2024 11:13:37 L541: system_syslogd_start();
24/04/2024 11:13:39 L548: enable_carp();
24/04/2024 11:13:39 L572:       filter_configure();
24/04/2024 11:13:39 L576: led_normalize();
24/04/2024 11:13:39 L578: notify_all_remote("Bootup complete");

Apart from System Patches, there was only item in the earlyshellcmd list:

<earlyshellcmd>service wireguardd start</earlyshellcmd>

The journey continued into wireguard land.

I hacked /usr/local/pkg/wireguard/includes/wg_service.inc to provide to trace logs (attached).

The trace logs from wireguardd start during boot shows timestamps indicating significant delays, or sleep periods prolonging the boot.There are three calls into wireguard/wg_service.inc of which the first is taking the longest.

'service wireguardd start' 
(earlyshellcmd) runs
/usr/local/etc/rc.d/wireguardd

which calls

/usr/local/bin/php_wg -f /usr/local/pkg/wireguard/includes/wg_service.inc start

and the tracings are from my hacked /usr/local/pkg/wireguard/includes/wg_service.inc (attached)

Overview of timing in the boot trace file

The timestamps below are the unique timestamps in the log, when execution is taking place wg_service.inc. They were extracted by:

awk '{ print $2 }' wireguardd-service-start-bootup-trace.txt| sort | uniq

After that it is quite easy to find the instruction that is blocking the execution by finding the last of of any particular timestamp,
17:37:28 for example:


24/04/2024 17:37:28 /usr/local/pkg/wireguard/includes/wg_service.inc:515        fclose($h_lock);
24/04/2024 17:37:28 /usr/local/pkg/wireguard/includes/wg_service.inc:518        $sig_test = posix_kill($pid, 0);
24/04/2024 17:37:32 /usr/local/pkg/wireguard/includes/wg_service.inc:501        global $wgg;


My interpretation here is this call to posix_kill() adds a four second wait/block to the execution. Follow the trace and a pattern starts to emerge...
'service wireguardd start'
17:37:13 wg_service.inc starts
17:37:17 posix_kill() => 7s
17:37:24 posix_kill() => 4s
17:37:28 posix_kill() => 4s
17:37:32 posix_kill() => 4s
17:37:36 posix_kill() => 4s
17:37:40 posix_kill() => 4s
17:37:44 posix_kill() => 5s
17:37:49 posix_kill() => 8s
17:37:57 posix_kill() => 11s
17:38:08 posix_kill() => 15s
17:38:23 posix_kill() => 17s
17:38:40 posix_kill() => 21s
17:39:01 posix_kill() => 24s
17:39:25 posix_kill() => 28s
17:39:53 posix_kill() => 27s
17:40:20 posix_kill() => 25s
17:40:45 wg_service.inc completes 3.5 min later!

Not sure what triggeres the last two execeutions of wg_service.inc. One is "package start" I believe. But as they take hardly any time to execute they are of little interest.

'?' 
17:41:38
17:41:42

'?'
17:41:44
17:41:48

Additional thoughts:
  • This might not be specific to 24.03, it was just that I noticed it during the testing of the 24.03-RC, as reboots aren't generally needed with pfSense :)
  • my pfSense is configured with 15 wireguard tunnels, 14 clients and one server.
  • a manual 'services wireguardd start' takes about 1m20

Files

wireguardd-service-start-bootup-trace.txt (16.7 KB) wireguardd-service-start-bootup-trace.txt trace out from hacked wg_service.inc. Patrik Stahlman, 04/24/2024 07:51 PM
wg_service.inc (11.5 KB) wg_service.inc hack to enable tracing of PHP code with time stamping Patrik Stahlman, 04/24/2024 07:53 PM
rc.bootup_with_trace (17.2 KB) rc.bootup_with_trace patched /etc/rc.bootup with built-in trace system Patrik Stahlman, 04/24/2024 08:09 PM
console-log-pfsense-boot.txt (34 KB) console-log-pfsense-boot.txt Patrik Stahlman, 04/26/2024 10:27 AM
Actions #1

Updated by Patrik Stahlman 12 months ago

Just to document what I did, completely. To generate the first general timing trace of /etc/rc.bootup:

1. apply this patch


--- /etc/pfSense-rc.24.03       2024-04-19 01:52:38.000000000 +0200
+++ /etc/pfSense-rc     2024-04-24 11:07:05.824092000 +0200
@@ -760,7 +760,12 @@
        fi
 fi

-/etc/rc.bootup
+if [ -f /root/run_rc_bootup_with_trace ]; then
+    /bin/rm /root/run_rc_bootup_with_trace
+    /etc/rc.bootup_with_trace
+else
+    /etc/rc.bootup
+fi

 # /etc/rc.bootup unset $g['booting'], and removes file
 # Be sure the file is removed to not create troubles after

2. install the patched /etc/rc.bootup_with_trace (attached)

3. trigger the trace during the next boot only, one-shot

touch /root/run_rc_bootup_with_trace

4. reboot

After the reboot the /etc/rc.boot timing can be found in /tmp/rc.bootup.log

Notes:
  • to trace any php statement, just add "rcbLog();" before it on the same line. rcbLog(); must occur first on the line, followed by any PHP statement you want to trace.
Actions #2

Updated by Patrik Stahlman 12 months ago

I bought a console cable and captured the reboot output (attached). It doesn't really show anything new. Most of the boot time is spent here

Executing early shell commands...Configuring WireGuard tunnels
:
<snip>
:
tun_wg14: link state changed to UP
done.
done.
coretemp0: <CPU On-Die Thermal Sensors> on cpu0

Actions #3

Updated by Patrik Stahlman 2 days ago

This issue is still present in 25.03-BETA (25.03.b.20250414.1838) as mentioned here: https://forum.netgate.com/topic/197151/observations-25-03-b-20250414-1838

A possible cause for this occured to me this morning: some of the wireguard peer end point configurations uses FQDN. If the wireguard configurator tries to resolve those at boot time there is no chance of success as there is no resolver running, nor any interfaces to query over.

I would prefer a solution where I don't have to manually configure the wireguard peers with IP addresses as the WG configuration is received from the provider with FQDNs.

Actions #4

Updated by Patrik Stahlman 2 days ago

I am happy to confirm that the long boot time (or boot failure in case of 25.03-BETA) occurs when using FQDN in wireguard peer end-point addresses.

As a test I changed the FQDNs to IP addresses, and the system booted quickly.

But wireguard still needs to be fixed. The WG configuration I receive from one provider uses FQDNs which resolves pools of IP addresses so FQDNs needs to work and not cause boot delay/failure.

Actions #5

Updated by Patrik Stahlman 2 days ago

I found a simple solution: don't start wireguardd through the early shell command mechanism in config.xml

While I was testing WG peers with FQDNs, I changed the two lines in config.xml relating to wireguard from

<earlyshellcmd>service wireguardd start</earlyshellcmd>
<cmd>service wireguardd start</cmd>

to

<earlyshellcmd>echo service wireguardd start</earlyshellcmd>
<cmd>echo service wireguardd start</cmd>

The next boot was quick, no boot delay nor failure, and the wireguard service was running when I logged in. Wireguard must have been started by some other means, the log show this which might explain why:

Starting package nut...done.
Starting package WireGuard...done.
Starting package System Patches...done.

Looking at the descriptions of the various types of shell commands in the Shell command configuration, perhaps starting wireguardd as a regular shellcmd which will run the command specified towards the end of the boot process , instead of an earlyshellcmd which will run the command specified at the beginning of the boot process would be a better suited for wireguard which rely on DNS to be available?

Actions #6

Updated by Patrik Stahlman 2 days ago

Using the Shellcmd package, I tried changing wireguardd start from earlyshellcmd to shellcmd but that did not work, the wireguard service did not start automatically. But disabling the wireguard command worked, and the service started automatically as described above.

So, in conclusion, this solves the problem I have had with boot failures when using 14+ WG tunnels, some of which are configured with peer end-point addresses using FQDNs.

Actions #7

Updated by Jim Pingle 2 days ago

  • Subject changed from Long boot time when using wireguard VPN to Long boot time when using FQDN for WireGuard VPN endpoint

Changing when WireGuard starts ends up being a chicken-and-egg situation. Certain other things may require connectivity through WireGuard to start, so it needs to be early. If you move it later, you could break those other things that rely on it starting early, not late.

For example if someone's internal DNS servers are only reachable over WireGuard and the firewall is restricted to only those DNS servers, it would have to be started before Unbound.

So while changing it may be viable in some specific cases such as this, it is not a general solution that could be changed in the package for others.

Starting slower due to having to fall back to other DNS servers is more desirable than starting fast, but non-functional.

Actions #8

Updated by Patrik Stahlman 2 days ago

Thanks Jim, yes, I can see the multitude of scenarios that needs to be catered for. I'm just happy I found a way forward for my case.

I also notice there are settings for wireguard endpoint hostname resolve interval, defaulting to 300 seconds. Perhaps reducing that value is another way forward. Although, it looks like the start of the wireguard service is blocking the whole boot sequence, perhaps it is just another dead end. Egg. Chicken. Happy Easter!

Actions #9

Updated by Patrik Stahlman 2 days ago

Unfortunately, wireguard, in its wisdom... seems to forcefully reinstall the shell commands every time I reboot. And it fails to do so correctly, so we end up with multiple entries of <earlyshellcmd> in config.xml, in different parts of the file.

Under <system> there's this one without any status

<earlyshellcmd>service wireguardd start</earlyshellcmd>

and then under <shellcmdsettings>

                <shellcmdsettings>
                        <config>
                                <cmd>service wireguardd start</cmd>
                                <cmdtype>disabled</cmdtype>
                                <description><![CDATA[WireGuard earlyshellcmd (DO NOT EDIT/DELETE!)]]></description>
                        </config>

So, even if I disable the early shell command for wireguard, it gets added and run anyway after the next boot.

Should I raise another redmine on this inability to disable Wireguard running early?

Actions #10

Updated by Patrik Stahlman 2 days ago

I know of others who experience the same issue, so I'll document my local patch that disables Wireguard from installing the early command. Using System / Patches

Description: Prevent Wireguard from adding early shell commands
Patch content:
--- wg_install.inc      2025-04-18 00:13:03.733614000 +0200
+++ wg_install.inc.new  2025-04-18 00:24:50.397783000 +0200
@@ -259,6 +259,7 @@

        wg_earlyshellcmd_deinstall(true);

+/* disable adding early shell command for wireguard
        $a_earlyshellcmd = config_get_path('system/earlyshellcmd', []);

        $a_shellcmdsettings = config_get_path('installedpackages/shellcmdsettings/config', []);
@@ -291,6 +292,7 @@

                wg_write_config('Installed earlyshellcmd(s).', false);
        }
+*/
 }

 /*
Path Strip Count: 0
Base Directory: /usr/local/pkg/wireguard/includes/
Ignore Whitespace: tick
Auto Apply: tick

updated patch: keep wg_earlyshellcmd_deinstall()

Actions

Also available in: Atom PDF