Bug #16678
[open] Automatic Boot verification performs rollback to previous Boot Environment when system is rebooted without internet connectivity
Description
While pre-configuring a Netgate 6100 appliance for deployment at a customer site, we encountered unexpected behavior with the Automatic Boot Verification mechanism that resulted in an automatic rollback to a previous Boot Environment when the system was rebooted without Internet connectivity.
The rollback also reverted the system to the previous pfSense Plus software version, even though no software failure occurred.
Configuration steps to reproduce:
- Unbox a new Netgate 6100 and boot the system. Configure WAN for DHCP, connect it to a network with Internet access, and complete the firewall configuration for the site (interfaces, subnets, firewall rules, etc.).
- Upgrade pfSense Plus from 25.07.1 (factory version) to 25.11.1 (WAN remains configured for DHCP at this stage).
- Verify that the system is fully functional and everything works as expected.
- As the final preparation step before deployment: Change the WAN interface from DHCP to static IPv4 and select the corresponding gateway as the system's default gateway.
- Reboot the firewall one last time while still in the office environment to ensure everything works. (Of course, this time the system lacks Internet connectivity because the WAN interface is already pre-configured with a static IPv4 address for deployment at the customer site.)
- The WebGUI displays the banner “Automatic Boot Verification is still running, please wait...”, which seems to remain indefinitely after the reboot. After some time the WebGUI becomes heavily unresponsive, and some time later the firewall also stops responding to ICMP ping requests.
- The system then reboots and rolls back to the previous Boot Environment (which was created automatically before the upgrade to 25.11.1). The WebGUI indicates this with the banner "Boot verification failed for default. Netgate pfSense Plus was automatically rebooted back into default_202601XXXXXXX".
- The rollback to the previous Boot Environment restores both the previous software version (25.07.1) and the previous WAN configuration, which in turn restores Internet connectivity.
- The system should not perform a rollback to a previous Boot Environment solely because Internet connectivity is unavailable after a reboot.
- An intentional configuration change that (temporarily) removes Internet connectivity should not inevitably cause "Automatic Boot Verification" to fail.
- Because no manual Boot Environment was created after upgrading to 25.11.1 and before/after changing the WAN configuration, the only available "previous" BE was the one that was automatically created by the upgrade from 25.07.1 to 25.11.1.
- Because the "Automatic Boot Verification" process / the BE feature operates on ZFS snapshots rather than pfSense config versions, it cannot distinguish between the software upgrade from 25.07.1 to 25.11.1 and the subsequent WAN configuration change. Consequently, the rollback to the previous Boot Environment also reverts the pfSense Plus software version.
- According to this YouTube video [1] by one of Netgate's engineers, "a watchdog timer is started that simply reboots the system after a fixed period of time" into the previous Boot Environment if boot verification does not complete.
- It appears that the system is not able to stop the watchdog timer (or the watchdog timer is intentionally not stopped?) if the system lacks Internet connectivity.
- Is Internet connectivity a requirement for Automatic Boot Verification to complete successfully? For example, is Internet connectivity needed within the first 5 minutes (I read somewhere that the watchdog has a timeout of 300 s) to successfully stop the watchdog?
- What steps or countermeasures are required to prevent the system from automatically rolling back to a previous Boot Environment if Internet connectivity is unavailable during, or shortly after, boot?
[1] "Deep Dive into the NEW ZFS Boot Environments feature in pfSense Plus v24.03": https://www.youtube.com/watch?v=LKtE0zxnF4I
We would like to fully understand this behavior. Thank you for looking into this.
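To make the reported behavior easier to discuss, here is a minimal toy model in Python of what the description above suggests happens. It is only a sketch based on the report, not Netgate's implementation: the `BootEnvironment` class, the `boot` function, and the 300 s value are assumptions for illustration (the timeout is the figure the reporter "read somewhere"). It shows that a snapshot-level rollback restores the software version and the WAN configuration together.

```python
from dataclasses import dataclass

# Assumed value: the report mentions a 300 s watchdog timeout but does not confirm it.
WATCHDOG_TIMEOUT_S = 300


@dataclass(frozen=True)
class BootEnvironment:
    """Hypothetical stand-in for a ZFS Boot Environment snapshot."""
    name: str
    version: str   # software version captured in the snapshot
    wan_mode: str  # configuration captured in the same snapshot


def boot(current, previous, verification_seconds, timeout=WATCHDOG_TIMEOUT_S):
    """Return the BE that is active once the watchdog window has passed.

    verification_seconds is None when verification never completes.
    """
    if verification_seconds is not None and verification_seconds <= timeout:
        return current  # verification finished in time, watchdog is stopped
    return previous     # watchdog fired: the whole snapshot is restored


# BE created automatically by the upgrade (name as shown in the banner).
before_upgrade = BootEnvironment("default_202601XXXXXXX", "25.07.1", "dhcp")
# State after the upgrade and the later WAN change; no BE was created in between.
after_changes = BootEnvironment("default", "25.11.1", "static")

active = boot(after_changes, before_upgrade, verification_seconds=None)
print(active.version, active.wan_mode)
```

In this model, a manually created BE after the upgrade and before the WAN change would become the `previous` argument, so a rollback would keep 25.11.1 and only undo the WAN change.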
Updated by Victor Coss 15 days ago
I had issues too upgrading from 25.07.1 to 25.11.1 but on an XG-1541 instead of the 6100. It automatically rolled back to the previous BE which also rolled back my configuration changes I had made. I'm not sure why verification failed, nor does it say the reason it did. It would be nice if the notification we get saying it failed would give us a reason as to why. Do you have any packages installed? I'm kinda curious if we have some overlap.
Updated by Christian McDonald 15 days ago
I have some ideas here, I’d like to get a fix for this into 26.03.
Updated by Christian McDonald 13 days ago
What additional packages (if any) do you have installed?
At least with a vanilla installation, I'm not able to reproduce this.
Updated by Victor Coss 12 days ago
I know I'm running a different system than OP (XG-1541 instead of 6100), but in case they don't reply. The packages I have installed are: arping, nmap, Status_Traffic_Totals. Nothing crazy and honestly I wouldn't think would cause any problems. There is also WireGuard but I didn't add that nor use it.
Updated by Florian Harbecke 4 days ago
Thanks for looking into this, and sorry for the late reply - I was tied up with other tasks.
Christian McDonald wrote in #note-4:
What additional packages (if any) do you have installed?
Besides the packages included in a vanilla/default installation (ipsec-profile-wizard, Netgate_Firmware_Upgrade, Nexus), we also have the following additional packages installed:
- nmap
- System_Patches
- zabbix-agent7
- zabbix-proxy7
At least with a vanilla installation, I'm not able to reproduce this.
Thank you for testing. I will try to reproduce the issue again with a vanilla installation (without the additional packages listed above). However, it will likely take until mid-next week until I can perform this test.
Updated by Christian McDonald 4 days ago
Thanks for the list.
Do you have any custom early shell commands? Another customer with similar symptoms reported that an early shell command was causing this.
Updated by Florian Harbecke 4 days ago
Christian McDonald wrote in #note-8:
Thanks for the list.
Do you have any custom early shell commands? Another customer with similar symptoms reported that an early shell command was causing this.
Yes we do, that's a very good hint. In fact, I just noticed in the config where I initially observed this, we also have the WireGuard package installed.
Therefore, the following two early shell commands are defined in the config:
<earlyshellcmd>service wireguardd start</earlyshellcmd>
<earlyshellcmd>/usr/local/bin/php-cgi -f /usr/local/bin/apply_patches.php</earlyshellcmd>
I will repeat the test using a configuration without WireGuard and without the service wireguardd start early shell command, to see if the issue still occurs.
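For anyone who wants to check their own configuration for such entries, the following is a small sketch in Python that lists every `<earlyshellcmd>` and `<shellcmd>` element in a config export. The sample XML is a hypothetical, heavily trimmed fragment: only the two `<earlyshellcmd>` values come from this report, and the surrounding `<pfsense>`/`<system>` nesting is an assumption. The helper name `list_shell_commands` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed config fragment; only the two
# <earlyshellcmd> values are taken from the report above.
SAMPLE_CONFIG = """
<pfsense>
  <system>
    <earlyshellcmd>service wireguardd start</earlyshellcmd>
    <earlyshellcmd>/usr/local/bin/php-cgi -f /usr/local/bin/apply_patches.php</earlyshellcmd>
  </system>
</pfsense>
"""


def list_shell_commands(config_xml):
    """Return (tag, command) pairs for every shellcmd/earlyshellcmd element."""
    root = ET.fromstring(config_xml)
    found = []
    # iter() walks the whole tree, so the result does not depend on
    # where exactly the elements are nested.
    for element in root.iter():
        if element.tag in ("earlyshellcmd", "shellcmd"):
            found.append((element.tag, (element.text or "").strip()))
    return found


for tag, command in list_shell_commands(SAMPLE_CONFIG):
    print(f"{tag}: {command}")
```

To run it against a real export, read the file contents into a string and pass that to `list_shell_commands` instead of `SAMPLE_CONFIG`.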
Updated by Victor Coss 4 days ago
Mine was a <shellcmd>, not an <earlyshellcmd>. I also have WireGuard, which I did not put on there, but it should just show WireGuard disabled at the console on bootup.
Updated by Christian McDonald 3 days ago
Victor Coss wrote in #note-10:
Mine was a <shellcmd>, not an <earlyshellcmd>. I also have WireGuard, which I did not put on there, but it should just show WireGuard disabled at the console on bootup.
Ah yes thanks for the clarification.
The {early}shellcmd code probably wants to have some UI knob that can toggle between synchronous vs. asynchronous execution ... with a warning that synchronous execution could impact system startup time and boot verification. Async would probably be the default execution mode.
We also need to expose the verification timeout as a tunable in the UI.
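The interaction between the two proposals (execution mode and a tunable timeout) can be sketched as simple arithmetic in Python. This is an illustration under stated assumptions, not the pfSense code: the function names, the 90 s base boot time, the 600 s command duration, and the 300 s default are all hypothetical, and the thread does not confirm that a hanging command was the cause in the original report.

```python
# Assumed default; the thread proposes exposing this as a tunable.
VERIFICATION_TIMEOUT_S = 300


def seconds_until_verified(base_boot_s, command_durations_s, synchronous):
    """Estimate when boot verification could complete.

    Synchronous commands run one after another and hold up the boot;
    asynchronous commands are started in the background and add no delay.
    """
    if synchronous:
        return base_boot_s + sum(command_durations_s)
    return base_boot_s


def verification_passes(base_boot_s, command_durations_s, synchronous,
                        timeout_s=VERIFICATION_TIMEOUT_S):
    """True if verification would finish before the timeout expires."""
    elapsed = seconds_until_verified(base_boot_s, command_durations_s, synchronous)
    return elapsed <= timeout_s


# Hypothetical numbers: a 90 s boot plus one command that blocks for 10 minutes
# (for example, because it waits for a network that is not reachable).
durations = [600]
print(verification_passes(90, durations, synchronous=True))                 # False
print(verification_passes(90, durations, synchronous=False))                # True
print(verification_passes(90, durations, synchronous=True, timeout_s=900))  # True
```

In this sketch either knob avoids the rollback: asynchronous execution removes the command from the critical path, and a longer timeout tolerates it.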