Project

General

Profile

Actions

Bug #15435

open

Long boot time when using wireguard VPN

Added by Patrik Stahlman 10 days ago. Updated 8 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
WireGuard
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Affected Version:
Affected Plus Version:
24.03
Affected Architecture:
All

Description

(Additional details in https://forum.netgate.com/post/1163707)

My reason for investigating this was boot times of more than five minutes, seemingly not having a real cause. The syslog didn't provide any answers as it is tuned off for a long period of time, and subsequent dumping of buffered log items have incorrect time stamps:

Apr 19 22:29:33     kernel         pflog0: promiscuous mode disabled
Apr 19 22:29:33     syslogd         kernel boot file is /boot/kernel/kernel
Apr 19 22:25:04     syslogd         exiting on signal 15
Apr 19 22:25:04     reboot     54300     rebooted by root

I took a video of the console as pfSense was booting and the only thing going on was the slow establishment of wireguard tinnels (screenshot in the forum post). Video available on demand (not Oscars' material)

I subsequently hacked /etc/rc.bootup to get my bearings on what might be taking up the time. system_do_shell_commands(1) stood out as being the culprit to investigate:


24/04/2024 11:09:19 L248: system_do_shell_commands(1);
24/04/2024 11:12:48 L252: load_thermal_hardware();

Full trace:

Logging /etc/rc.bootup_with_trace
24/04/2024 11:09:16 L75: led_normalize();
24/04/2024 11:09:16 L122: notices_setup();
24/04/2024 11:09:16 L138: system_dmesg_save();
24/04/2024 11:09:16 L141: system_check_reset_button();
24/04/2024 11:09:16 L167: mute_kernel_msgs();
24/04/2024 11:09:16 L168: start_devd();
24/04/2024 11:09:18 L169: set_device_perms();
24/04/2024 11:09:18 L170: unmute_kernel_msgs();
24/04/2024 11:09:18 L174: parse_config_bootup();
24/04/2024 11:09:18 L177: mwexec("/usr/sbin/gnid > {$g['vardb_path']}/uniqueid 2>/dev/null");
24/04/2024 11:09:18 L194: convert_config();
24/04/2024 11:09:19 L230: load_crypto();
24/04/2024 11:09:19 L235: system_setup_sysctl();
24/04/2024 11:09:19 L248: system_do_shell_commands(1);
24/04/2024 11:12:48 L252: load_thermal_hardware();
24/04/2024 11:12:48 L255: system_timezone_configure();
24/04/2024 11:12:48 L258: system_hostname_configure();
24/04/2024 11:12:50 L261: system_hosts_generate();
24/04/2024 11:12:55 L264: enable_carp(false);
24/04/2024 11:12:55 L267: interfaces_loopback_configure();
24/04/2024 11:12:55 L270: ca_setup_trust_store();
24/04/2024 11:12:55 L277:       exec("/sbin/zfs get -H compression /var/log | /usr/bin/awk '{print $3;}'", $varlog_compression_state);
24/04/2024 11:12:55 L285: system_syslogd_start();
24/04/2024 11:12:57 L288: set_pam_auth();
24/04/2024 11:12:57 L296: switches_configure();
24/04/2024 11:12:57 L300: setup_microcode();
24/04/2024 11:12:58 L321: interfaces_configure();
24/04/2024 11:13:07 L322: interfaces_sync_setup();
24/04/2024 11:13:07 L328: system_hosts_generate();
24/04/2024 11:13:07 L334:       send_event("service reload sshd");
24/04/2024 11:13:07 L340: openvpn_resync_all();
24/04/2024 11:13:07 L344: system_resolvconf_generate();
24/04/2024 11:13:07 L347: filter_pflow_configure();
24/04/2024 11:13:07 L350: filter_configure_sync();
24/04/2024 11:13:09 L354: filter_pflog_start();
24/04/2024 11:13:09 L358: setup_gateways_monitor();
24/04/2024 11:13:11 L362: system_routing_configure();
24/04/2024 11:13:12 L371:       services_dnsmasq_configure();
24/04/2024 11:13:12 L374:       services_unbound_configure();
24/04/2024 11:13:22 L378: local_reset_accounts();
24/04/2024 11:13:22 L383: configure_cron();
24/04/2024 11:13:22 L387: system_routing_enable();
24/04/2024 11:13:22 L424:       mwexec("/usr/bin/timeout -k 45 30 /usr/local/sbin/ntpd -g -q -c /dev/null {$ntp_boot_time_servers}");
24/04/2024 11:13:29 L429: system_ntp_configure();
24/04/2024 11:13:29 L437:       system_webgui_start();
24/04/2024 11:13:31 L441: console_configure();
24/04/2024 11:13:31 L444: services_dhcpd_configure();
24/04/2024 11:13:31 L447: system_dhcpleases_configure();
24/04/2024 11:13:31 L450: services_dhcrelay_configure();
24/04/2024 11:13:31 L453: services_dhcrelay6_configure();
24/04/2024 11:13:31 L456: send_event("service reload dyndnsall");
24/04/2024 11:13:31 L459: filter_configure_sync();
24/04/2024 11:13:32 L462: vpn_setup();
24/04/2024 11:13:32 L465: captiveportal_configure();
24/04/2024 11:13:32 L468: voucher_configure();
24/04/2024 11:13:32 L471: system_do_shell_commands();
24/04/2024 11:13:32 L474: $ipsec_dynamic_hosts = ipsec_configure();
24/04/2024 11:13:34 L477: services_snmpd_configure();
24/04/2024 11:13:36 L480: system_set_harddisk_standby();
24/04/2024 11:13:36 L483: reload_ttys();
24/04/2024 11:13:36 L486: enable_rrd_graphing();
24/04/2024 11:13:37 L489: enable_watchdog();
24/04/2024 11:13:37 L519: services_igmpproxy_configure();
24/04/2024 11:13:37 L522: upnp_start();
24/04/2024 11:13:37 L525: activate_powerd();
24/04/2024 11:13:37 L528: prefer_ipv4_or_ipv6();
24/04/2024 11:13:37 L536:       register_all_installed_packages();
24/04/2024 11:13:37 L541: system_syslogd_start();
24/04/2024 11:13:39 L548: enable_carp();
24/04/2024 11:13:39 L572:       filter_configure();
24/04/2024 11:13:39 L576: led_normalize();
24/04/2024 11:13:39 L578: notify_all_remote("Bootup complete");

Apart from System Patches, there was only item in the earlyshellcmd list:

<earlyshellcmd>service wireguardd start</earlyshellcmd>

The journey continued into wireguard land.

I hacked /usr/local/pkg/wireguard/includes/wg_service.inc to provide to trace logs (attached).

The trace logs from wireguardd start during boot shows timestamps indicating significant delays, or sleep periods prolonging the boot.There are three calls into wireguard/wg_service.inc of which the first is taking the longest.

'service wireguardd start' 
(earlyshellcmd) runs
/usr/local/etc/rc.d/wireguardd

which calls

/usr/local/bin/php_wg -f /usr/local/pkg/wireguard/includes/wg_service.inc start

and the tracings are from my hacked /usr/local/pkg/wireguard/includes/wg_service.inc (attached)

Overview of timing in the boot trace file

The timestamps below are the unique timestamps in the log, when execution is taking place wg_service.inc. They were extracted by:

awk '{ print $2 }' wireguardd-service-start-bootup-trace.txt| sort | uniq

After that it is quite easy to find the instruction that is blocking the execution by finding the last of of any particular timestamp,
17:37:28 for example:


24/04/2024 17:37:28 /usr/local/pkg/wireguard/includes/wg_service.inc:515        fclose($h_lock);
24/04/2024 17:37:28 /usr/local/pkg/wireguard/includes/wg_service.inc:518        $sig_test = posix_kill($pid, 0);
24/04/2024 17:37:32 /usr/local/pkg/wireguard/includes/wg_service.inc:501        global $wgg;


My interpretation here is this call to posix_kill() adds a four second wait/block to the execution. Follow the trace and a pattern starts to emerge...
'service wireguardd start'
17:37:13 wg_service.inc starts
17:37:17 posix_kill() => 7s
17:37:24 posix_kill() => 4s
17:37:28 posix_kill() => 4s
17:37:32 posix_kill() => 4s
17:37:36 posix_kill() => 4s
17:37:40 posix_kill() => 4s
17:37:44 posix_kill() => 5s
17:37:49 posix_kill() => 8s
17:37:57 posix_kill() => 11s
17:38:08 posix_kill() => 15s
17:38:23 posix_kill() => 17s
17:38:40 posix_kill() => 21s
17:39:01 posix_kill() => 24s
17:39:25 posix_kill() => 28s
17:39:53 posix_kill() => 27s
17:40:20 posix_kill() => 25s
17:40:45 wg_service.inc completes 3.5 min later!

Not sure what triggeres the last two execeutions of wg_service.inc. One is "package start" I believe. But as they take hardly any time to execute they are of little interest.

'?' 
17:41:38
17:41:42

'?'
17:41:44
17:41:48

Additional thoughts:
  • This might not be specific to 24.03, it was just that I noticed it during the testing of the 24.03-RC, as reboots aren't generally needed with pfSense :)
  • my pfSense is configured with 15 wireguard tunnels, 14 clients and one server.
  • a manual 'services wireguardd start' takes about 1m20

Files

wireguardd-service-start-bootup-trace.txt (16.7 KB) wireguardd-service-start-bootup-trace.txt trace out from hacked wg_service.inc. Patrik Stahlman, 04/24/2024 07:51 PM
wg_service.inc (11.5 KB) wg_service.inc hack to enable tracing of PHP code with time stamping Patrik Stahlman, 04/24/2024 07:53 PM
rc.bootup_with_trace (17.2 KB) rc.bootup_with_trace patched /etc/rc.bootup with built-in trace system Patrik Stahlman, 04/24/2024 08:09 PM
console-log-pfsense-boot.txt (34 KB) console-log-pfsense-boot.txt Patrik Stahlman, 04/26/2024 10:27 AM
Actions #1

Updated by Patrik Stahlman 10 days ago

Just to document what I did, completely. To generate the first general timing trace of /etc/rc.bootup:

1. apply this patch


--- /etc/pfSense-rc.24.03       2024-04-19 01:52:38.000000000 +0200
+++ /etc/pfSense-rc     2024-04-24 11:07:05.824092000 +0200
@@ -760,7 +760,12 @@
        fi
 fi

-/etc/rc.bootup
+if [ -f /root/run_rc_bootup_with_trace ]; then
+    /bin/rm /root/run_rc_bootup_with_trace
+    /etc/rc.bootup_with_trace
+else
+    /etc/rc.bootup
+fi

 # /etc/rc.bootup unset $g['booting'], and removes file
 # Be sure the file is removed to not create troubles after

2. install the patched /etc/rc.bootup_with_trace (attached)

3. trigger the trace during the next boot only, one-shot

touch /root/run_rc_bootup_with_trace

4. reboot

After the reboot the /etc/rc.boot timing can be found in /tmp/rc.bootup.log

Notes:
  • to trace any php statement, just add "rcbLog();" before it on the same line. rcbLog(); must occur first on the line, followed by any PHP statement you want to trace.
Actions #2

Updated by Patrik Stahlman 8 days ago

I bought a console cable and captured the reboot output (attached). It doesn't really show anything new. Most of the boot time is spent here

Executing early shell commands...Configuring WireGuard tunnels
:
<snip>
:
tun_wg14: link state changed to UP
done.
done.
coretemp0: <CPU On-Die Thermal Sensors> on cpu0

Actions

Also available in: Atom PDF