Bug #15435
openLong boot time when using wireguard VPN
0%
Description
(Additional details in https://forum.netgate.com/post/1163707)
My reason for investigating this was boot times of more than five minutes, seemingly not having a real cause. The syslog didn't provide any answers as it is tuned off for a long period of time, and subsequent dumping of buffered log items have incorrect time stamps:
Apr 19 22:29:33 kernel pflog0: promiscuous mode disabled Apr 19 22:29:33 syslogd kernel boot file is /boot/kernel/kernel Apr 19 22:25:04 syslogd exiting on signal 15 Apr 19 22:25:04 reboot 54300 rebooted by root
I took a video of the console as pfSense was booting and the only thing going on was the slow establishment of wireguard tinnels (screenshot in the forum post). Video available on demand (not Oscars' material)
I subsequently hacked /etc/rc.bootup to get my bearings on what might be taking up the time. system_do_shell_commands(1) stood out as being the culprit to investigate:
24/04/2024 11:09:19 L248: system_do_shell_commands(1); 24/04/2024 11:12:48 L252: load_thermal_hardware();
Full trace:
Logging /etc/rc.bootup_with_trace 24/04/2024 11:09:16 L75: led_normalize(); 24/04/2024 11:09:16 L122: notices_setup(); 24/04/2024 11:09:16 L138: system_dmesg_save(); 24/04/2024 11:09:16 L141: system_check_reset_button(); 24/04/2024 11:09:16 L167: mute_kernel_msgs(); 24/04/2024 11:09:16 L168: start_devd(); 24/04/2024 11:09:18 L169: set_device_perms(); 24/04/2024 11:09:18 L170: unmute_kernel_msgs(); 24/04/2024 11:09:18 L174: parse_config_bootup(); 24/04/2024 11:09:18 L177: mwexec("/usr/sbin/gnid > {$g['vardb_path']}/uniqueid 2>/dev/null"); 24/04/2024 11:09:18 L194: convert_config(); 24/04/2024 11:09:19 L230: load_crypto(); 24/04/2024 11:09:19 L235: system_setup_sysctl(); 24/04/2024 11:09:19 L248: system_do_shell_commands(1); 24/04/2024 11:12:48 L252: load_thermal_hardware(); 24/04/2024 11:12:48 L255: system_timezone_configure(); 24/04/2024 11:12:48 L258: system_hostname_configure(); 24/04/2024 11:12:50 L261: system_hosts_generate(); 24/04/2024 11:12:55 L264: enable_carp(false); 24/04/2024 11:12:55 L267: interfaces_loopback_configure(); 24/04/2024 11:12:55 L270: ca_setup_trust_store(); 24/04/2024 11:12:55 L277: exec("/sbin/zfs get -H compression /var/log | /usr/bin/awk '{print $3;}'", $varlog_compression_state); 24/04/2024 11:12:55 L285: system_syslogd_start(); 24/04/2024 11:12:57 L288: set_pam_auth(); 24/04/2024 11:12:57 L296: switches_configure(); 24/04/2024 11:12:57 L300: setup_microcode(); 24/04/2024 11:12:58 L321: interfaces_configure(); 24/04/2024 11:13:07 L322: interfaces_sync_setup(); 24/04/2024 11:13:07 L328: system_hosts_generate(); 24/04/2024 11:13:07 L334: send_event("service reload sshd"); 24/04/2024 11:13:07 L340: openvpn_resync_all(); 24/04/2024 11:13:07 L344: system_resolvconf_generate(); 24/04/2024 11:13:07 L347: filter_pflow_configure(); 24/04/2024 11:13:07 L350: filter_configure_sync(); 24/04/2024 11:13:09 L354: filter_pflog_start(); 24/04/2024 11:13:09 L358: setup_gateways_monitor(); 24/04/2024 11:13:11 L362: system_routing_configure(); 24/04/2024 11:13:12 L371: services_dnsmasq_configure(); 24/04/2024 11:13:12 L374: services_unbound_configure(); 24/04/2024 11:13:22 L378: local_reset_accounts(); 24/04/2024 11:13:22 L383: configure_cron(); 24/04/2024 11:13:22 L387: system_routing_enable(); 24/04/2024 11:13:22 L424: mwexec("/usr/bin/timeout -k 45 30 /usr/local/sbin/ntpd -g -q -c /dev/null {$ntp_boot_time_servers}"); 24/04/2024 11:13:29 L429: system_ntp_configure(); 24/04/2024 11:13:29 L437: system_webgui_start(); 24/04/2024 11:13:31 L441: console_configure(); 24/04/2024 11:13:31 L444: services_dhcpd_configure(); 24/04/2024 11:13:31 L447: system_dhcpleases_configure(); 24/04/2024 11:13:31 L450: services_dhcrelay_configure(); 24/04/2024 11:13:31 L453: services_dhcrelay6_configure(); 24/04/2024 11:13:31 L456: send_event("service reload dyndnsall"); 24/04/2024 11:13:31 L459: filter_configure_sync(); 24/04/2024 11:13:32 L462: vpn_setup(); 24/04/2024 11:13:32 L465: captiveportal_configure(); 24/04/2024 11:13:32 L468: voucher_configure(); 24/04/2024 11:13:32 L471: system_do_shell_commands(); 24/04/2024 11:13:32 L474: $ipsec_dynamic_hosts = ipsec_configure(); 24/04/2024 11:13:34 L477: services_snmpd_configure(); 24/04/2024 11:13:36 L480: system_set_harddisk_standby(); 24/04/2024 11:13:36 L483: reload_ttys(); 24/04/2024 11:13:36 L486: enable_rrd_graphing(); 24/04/2024 11:13:37 L489: enable_watchdog(); 24/04/2024 11:13:37 L519: services_igmpproxy_configure(); 24/04/2024 11:13:37 L522: upnp_start(); 24/04/2024 11:13:37 L525: activate_powerd(); 24/04/2024 11:13:37 L528: prefer_ipv4_or_ipv6(); 24/04/2024 11:13:37 L536: register_all_installed_packages(); 24/04/2024 11:13:37 L541: system_syslogd_start(); 24/04/2024 11:13:39 L548: enable_carp(); 24/04/2024 11:13:39 L572: filter_configure(); 24/04/2024 11:13:39 L576: led_normalize(); 24/04/2024 11:13:39 L578: notify_all_remote("Bootup complete");
Apart from System Patches, there was only item in the earlyshellcmd list:
<earlyshellcmd>service wireguardd start</earlyshellcmd>
The journey continued into wireguard land.
I hacked /usr/local/pkg/wireguard/includes/wg_service.inc to provide to trace logs (attached).
The trace logs from wireguardd start during boot shows timestamps indicating significant delays, or sleep periods prolonging the boot.There are three calls into wireguard/wg_service.inc of which the first is taking the longest.
'service wireguardd start'
(earlyshellcmd) runs
/usr/local/etc/rc.d/wireguardd
which calls
/usr/local/bin/php_wg -f /usr/local/pkg/wireguard/includes/wg_service.inc start
and the tracings are from my hacked /usr/local/pkg/wireguard/includes/wg_service.inc (attached)
Overview of timing in the boot trace file
The timestamps below are the unique timestamps in the log, when execution is taking place wg_service.inc. They were extracted by:
awk '{ print $2 }' wireguardd-service-start-bootup-trace.txt| sort | uniq
After that it is quite easy to find the instruction that is blocking the execution by finding the last of of any particular timestamp,
17:37:28 for example:
24/04/2024 17:37:28 /usr/local/pkg/wireguard/includes/wg_service.inc:515 fclose($h_lock); 24/04/2024 17:37:28 /usr/local/pkg/wireguard/includes/wg_service.inc:518 $sig_test = posix_kill($pid, 0); 24/04/2024 17:37:32 /usr/local/pkg/wireguard/includes/wg_service.inc:501 global $wgg;
My interpretation here is this call to posix_kill() adds a four second wait/block to the execution. Follow the trace and a pattern starts to emerge...
'service wireguardd start'
17:37:13 wg_service.inc starts 17:37:17 posix_kill() => 7s 17:37:24 posix_kill() => 4s 17:37:28 posix_kill() => 4s 17:37:32 posix_kill() => 4s 17:37:36 posix_kill() => 4s 17:37:40 posix_kill() => 4s 17:37:44 posix_kill() => 5s 17:37:49 posix_kill() => 8s 17:37:57 posix_kill() => 11s 17:38:08 posix_kill() => 15s 17:38:23 posix_kill() => 17s 17:38:40 posix_kill() => 21s 17:39:01 posix_kill() => 24s 17:39:25 posix_kill() => 28s 17:39:53 posix_kill() => 27s 17:40:20 posix_kill() => 25s 17:40:45 wg_service.inc completes 3.5 min later!
Not sure what triggeres the last two execeutions of wg_service.inc. One is "package start" I believe. But as they take hardly any time to execute they are of little interest.
'?' 17:41:38 17:41:42 '?' 17:41:44 17:41:48Additional thoughts:
- This might not be specific to 24.03, it was just that I noticed it during the testing of the 24.03-RC, as reboots aren't generally needed with pfSense :)
- my pfSense is configured with 15 wireguard tunnels, 14 clients and one server.
- a manual 'services wireguardd start' takes about 1m20
Files