Bug #7326
closedUnbound fails to start during rc.wanipchange when using large enough dns lists
100%
Description
Current pfsense code manually kills unbound, waits just one second and then starts it again, however if unbound doesnt shut down quick enough following the kill command it will fail to start with an error it cannot bind the port because it is already in use. Stopping unbound with a kill command is not good practice anyway and should be a last resort, by using the unbound-control stop command instead, pfsense will actually wait for the shutdown process to complete before starting unbound again and it will succeed. I have patched my local system and can share the patch here, but I am not a php developer so my changes may not be in the best syntax and for some reason I can see from the logs when the rc.newwanip script runs it tries to shutdown unbound 3 times quickly so the log will show it failing twice (first time works), to keep things tidy its probably also a good idea to look at what is triggering the unbound reconfigure code 3 times.
With this bug (undesirable behaviour), the following situations occur.
1 - On bootup unbound will be offline as when the wan comes online it will trigger the rc.newwanip script and it will fail (without my patch).
2 - If WAN is cycled, unbound will also be left in a offline state.
Files
Updated by Chris Collins over 7 years ago
- File unbound.patch unbound.patch added
attaching file again, first one is wrong
Updated by BBcan177 . over 7 years ago
The issue with the way pfSense stops Unbound, is that the Unbound service takes longer to shut down when there are many local-zone's defined. In limited testing, it can take 10-20seconds.
sigkillbypid("{$g['varrun_path']}/unbound.pid", "TERM");
The code will then try to re-start Unbound and fail due to "could not open ports".
To fix this issue, the restart of Unbound can only be done, when the Unbound service has been fully stopped.
The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1489349484] unbound[90597:0] error: bind: address already in use [1489349484] unbound[90597:0] fatal error: could not open ports'
Here is a draft change to the code that seems to fix this issue. The loop might need to be increased to account for slow systems, or increase the timeout to 2 seconds.
In /etc/inc/services.inc
function services_unbound_configure()
// kill any running Unbound instance if (file_exists("{$g['varrun_path']}/unbound.pid")) { sigkillbypid("{$g['varrun_path']}/unbound.pid", "TERM"); for ($i=1; $i <= 30; $i++) { exec("/bin/pgrep unbound 2>&1", $output, $retval); if ($retval != 1) { log_error("DEBUG: Unbound - waiting 1 second to shutdown"); sleep(1); } else { break; } } }
However, it seems that there are other places in the pfSense code (ie: OpenVPN) that also need to be patched for this issue... So maybe this code should be placed elsewhere to cover all other instances of restarting Unbound.
Can the devs chime in on a full solution to this issue please ...
Updated by Chris Collins over 7 years ago
note on my system it needs a lot more than one second to shutdown, probably around 10 seconds due to the over 1 million hosts in the pfblockerng file. So sleep will be unreliable.
Updated by BBcan177 . over 7 years ago
Its a 30x loop in 1 sec increments and breaks on Unbound being fully shutdown.
for ($i=1; $i <= 30; $i++) {
Updated by Kill Bill over 7 years ago
Could probably use is_process_running()
from util.inc instead of the exec()
. That debug stuff should certainly be wrapper into if ($g['debug']) {}
, noone wants tens of logs entries about this normally.
Updated by BBcan177 . over 7 years ago
I left the log message just for testing... Shouldn't be included in final code...
The function is_process_running() uses pgrep, so it will work fine.
Updated by → luckman212 over 7 years ago
I pushed a commit for this (not a PR yet because I think there's more to be done) but in case anyone wants to use this for System Patches:
https://github.com/luckman212/pfsense/commit/7d01b12a991e084576d286a784528b788745f905
thx BBcan177
Updated by BBcan177 . over 7 years ago
Can the devs chime in on this issue please?
Are there other functions in the code that also need to be patched to fully resolve this issue?
Updated by Jim Thompson over 7 years ago
- Assignee changed from Jim Pingle to Anonymous
Updated by Steve Y over 7 years ago
To "me too" this report from a different direction, I realized we were seeing this issue on our HA setup during the once daily Suricata rules update. The master server was unaffected but the backup router would end up with unbound not running. With the code change BBcan177's fix's loop triggered 18 times over 24 seconds but unbound does get started. We have both routers functioning as DNS servers so I would think that unbound isn't running dramatically differently on the backup router.
We didn't have the behavior of it stopping multiple times, just that it couldn't start. Also worth noting is that manual Suricata updates or firewall syncs don't cause this issue, and that we have no local-zones defined in unbound, just one "private-domain" entry. WAN IPs are all static, though we do get "pfSense package system has detected an IP change or dynamic WAN reconnection" from OpenVPN at any sync since OpenVPN apparently assigns itself an address when that package starts.
Also worth noting is this is running on a 4 CPU VM on a Xeon E5-2609 v3 so I would think CPU speed isn't an issue.
Updated by Jim Pingle over 7 years ago
- Category set to DNS Resolver
- Assignee changed from Anonymous to Jim Pingle
- Target version set to 2.3.4-p1
- Affected Version set to All
Back to me, I have some changes ready that appear to fix this.
Updated by Jim Pingle over 7 years ago
- Status changed from New to Feedback
- % Done changed from 0 to 100
Applied in changeset 38d110824c87ff60c6289c0432d55009586ceee4.
Updated by Jim Pingle over 7 years ago
Be aware that if you attempt to add a patch for this manually, you also need to apply the patch from #7667 first.
Updated by Chris Collins over 7 years ago
ok thank you for fixing this, I will test the fix as soon as possible on my firewall.
Updated by Jim Pingle about 7 years ago
- Status changed from Feedback to Assigned
Several complaints of unbound not starting/stopping correctly after that last patch was added, so I backed it out for now to avoid further problems while another solution is located.
Updated by Jim Pingle about 7 years ago
- Status changed from Assigned to Feedback
Applied in changeset 6e094e04f7d9634c7151bd9aa68ad93c71584d80.
Updated by Jim Pingle about 7 years ago
- Status changed from Feedback to Assigned
Updated by BBcan177 . about 7 years ago
Instead of using this stop command
mwexec("echo '/usr/local/sbin/unbound-control stop' | /usr/bin/su -m unbound", true);
I would recommend the following command (And also for the other Unbound start/reload etc...)
mwexec("/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c {$g['unbound_chroot_path']}/unbound.conf stop", true);
At a minimum, please consider just adding the for/loop to check if Unbound has stopped before restarting, as there is no negative issue with that section of the recent changes.
Updated by Jim Pingle about 7 years ago
BBcan177 . wrote:
I would recommend the following command (And also for the other Unbound start/reload etc...)
mwexec("/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c {$g['unbound_chroot_path']}/unbound.conf stop", true);
Unless I'm missing something here, why would that be any better? Doing a chroot to / would essentially a no-op, all it's doing is changing the user which "su -m" will do as well.
At a minimum, please consider just adding the for/loop to check if Unbound has stopped before restarting, as there is no negative issue with that section of the recent changes.
I'm not sure it's seen enough testing to justify that claim. I'd like to see some of the users who had problems with the full set of changes applied try a patch with just the loop.
Updated by Jim Pingle about 7 years ago
- File unbound-stop.diff unbound-stop.diff added
Here is a patch to try, for those who had problems with the previous commit.
Updated by Jim Pingle about 7 years ago
- Status changed from Assigned to Feedback
Applied in changeset 782453b4dbb77e5bc97a43f56b95a006c5434d65.
Updated by Jim Pingle about 7 years ago
- Status changed from Feedback to Resolved
I still haven't been able to replicate the original problem here, but unbound appears to be restarting OK on WAN failure here when I test it with the new changes, so we'll close it out for now. It can always be reopened again if additional problems surface.