Service Watchdog - Impacts Reboots and Package Updates
All - I wasn't quite sure where to attribute this since it's a package, but it is impacting standard operation.
Synopsis:
- When upgrading a package where the upgrade must stop the service, the Service Watchdog restarts the service before the package upgrade completes. This appears to completely stall some updates when the update process takes some time to run with the service stopped.
- Upon reboot, syslog shows the Service Watchdog starting services before pfSense [itself] would normally start them. I suspect this could cause services to start in an abnormal order and potentially create dependency issues.
I noticed this while assessing a recent issue and watching syslog: upon reboot, virtually every process was started first by the Service Watchdog, and when the system's own startup of that same process occurred, the system-initiated startup failed.
#1 Updated by Jim Pingle about 2 months ago
- Project changed from pfSense to pfSense Packages
- Category changed from Services to Service Watchdog
- Priority changed from Normal to Very Low
This is a problem only with the package, and not one that is likely to be easily solvable.
The package could maybe check if the firewall is booting and skip doing anything, but because it's launched from cron there isn't a way to pause it temporarily.
Most of the time these kinds of issues pop up because services which are not well-suited to the watchdog are added to it, not because of a problem with the package itself.
The package is a kludge. If something is dying, that problem should be fixed rather than relying on this package as a crutch.
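For illustration only, the "check if the firewall is booting and skip doing anything" idea could look roughly like the sketch below. It is written in Python rather than the package's own code, and the marker path, function names, and callbacks are all hypothetical assumptions, not the package's actual implementation or a confirmed pfSense location. Because the watchdog is launched from cron, the sketch does not pause anything; it simply does nothing on a run that happens during boot and leaves the work to the next run.

```python
import os

# Hypothetical marker path: an assumption for illustration only,
# not a confirmed pfSense file location.
BOOT_MARKER = "/var/run/booting"


def firewall_is_booting(marker_path=BOOT_MARKER):
    """Return True while the (assumed) boot marker file exists."""
    return os.path.exists(marker_path)


def watchdog_pass(services, is_running, restart, marker_path=BOOT_MARKER):
    """One cron-triggered pass over the watched services.

    services    -- iterable of service names to check
    is_running  -- callable(name) -> bool, supplied by the caller
    restart     -- callable(name), supplied by the caller
    Returns the list of service names that were restarted.
    """
    if firewall_is_booting(marker_path):
        # Skip the whole pass; the next cron run will check again.
        return []
    restarted = []
    for name in services:
        if not is_running(name):
            restart(name)
            restarted.append(name)
    return restarted
```

A guard like this would only address the reboot case; it does nothing for services that are unsuited to the watchdog in the first place.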
#2 Updated by A S about 2 months ago
All fair points.
I have run into a couple of occasions where something 'died' (such as Snort, Suricata, lldpd, haproxy) and I was unaware that the service had failed. While the watchdog cannot handle anything with multiple instances, perhaps the better option would be a "Service Watchdog Monitor" mode where it doesn't attempt to forcibly restart, but rather only provides notifications? Particularly if it opened the door to monitoring status for things such as Snort/Suricata, where there may be multiple instances on a given firewall, and reporting that a specific instance is not in a running state. That really is the more desirable functionality. E.g., instead of a single check box (notification), add a second check box (take action) that could be left unchecked (or checked to help keep something running until the issue/error can be resolved properly).
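As a rough sketch of the two-check-box idea (in Python, purely for illustration; the `WatchEntry` fields, the `check` function, and the callbacks are hypothetical names, not anything in the existing package), each watched entry could carry separate "notify" and "take action" flags plus an optional instance name, so that a stopped instance is always reported individually but only restarted when asked:

```python
from dataclasses import dataclass


@dataclass
class WatchEntry:
    service: str
    instance: str = ""         # e.g. an interface name for a per-interface instance
    notify: bool = True        # first check box: send a notification
    take_action: bool = False  # second check box: attempt a restart


def check(entries, is_running, notify, restart):
    """Report each stopped instance; restart only where take_action is set.

    is_running -- callable(service, instance) -> bool, supplied by the caller
    notify     -- callable(message), supplied by the caller
    restart    -- callable(service, instance), supplied by the caller
    Returns the labels of everything found not running.
    """
    down = []
    for e in entries:
        if is_running(e.service, e.instance):
            continue
        label = f"{e.service}[{e.instance}]" if e.instance else e.service
        down.append(label)
        if e.notify:
            notify(f"{label} is not running")
        if e.take_action:
            restart(e.service, e.instance)
    return down
```

With `take_action` left unset, an entry behaves as a pure monitor; setting it gives the current restart behavior as an opt-in stopgap.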