Project

General

Profile

Bug #7326

Unbound fails to start during rc.wanipchange when using large enough dns lists

Added by Chris Collins about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
DNS Resolver
Target version:
Start date:
02/27/2017
Due date:
% Done:

100%

Estimated time:
Affected Version:
All
Affected Architecture:

Description

Current pfsense code manually kills unbound, waits just one second and then starts it again, however if unbound doesnt shut down quick enough following the kill command it will fail to start with an error it cannot bind the port because it is already in use. Stopping unbound with a kill command is not good practice anyway and should be a last resort, by using the unbound-control stop command instead, pfsense will actually wait for the shutdown process to complete before starting unbound again and it will succeed. I have patched my local system and can share the patch here, but I am not a php developer so my changes may not be in the best syntax and for some reason I can see from the logs when the rc.newwanip script runs it tries to shutdown unbound 3 times quickly so the log will show it failing twice (first time works), to keep things tidy its probably also a good idea to look at what is triggering the unbound reconfigure code 3 times.

With this bug (undesirable behaviour), the following situations occur.

1 - On bootup unbound will be offline as when the wan comes online it will trigger the rc.newwanip script and it will fail (without my patch).
2 - If WAN is cycled, unbound will also be left in a offline state.

unbound.patch (739 Bytes) unbound.patch Chris Collins, 02/27/2017 11:23 AM
unbound.patch (744 Bytes) unbound.patch Chris Collins, 02/27/2017 11:27 AM
unbound-stop.diff (1.47 KB) unbound-stop.diff Jim Pingle, 07/07/2017 01:42 PM

Associated revisions

Revision 38d11082 (diff)
Added by Jim Pingle almost 2 years ago

Change the way unbound is stopped when the process is being restarted, to give the old process enough time to exit cleanly. Fixes #7326

Revision 0577d9df (diff)
Added by Jim Pingle almost 2 years ago

Change the way unbound is stopped when the process is being restarted, to give the old process enough time to exit cleanly. Fixes #7326

(cherry picked from commit 38d110824c87ff60c6289c0432d55009586ceee4)

Revision 863804a9 (diff)
Added by Jim Pingle almost 2 years ago

Change the way unbound is stopped when the process is being restarted, to give the old process enough time to exit cleanly. Fixes #7326

(cherry picked from commit 38d110824c87ff60c6289c0432d55009586ceee4)

Revision 6e094e04 (diff)
Added by Jim Pingle almost 2 years ago

Several complaints of unbound problems after commiting, so back this out. Revert "Change the way unbound is stopped when the process is being restarted, to give the old process enough time to exit cleanly. Fixes #7326"

This reverts commit 38d110824c87ff60c6289c0432d55009586ceee4.

Revision c59b2bc3 (diff)
Added by Jim Pingle almost 2 years ago

Several complaints of unbound problems after commiting, so back this out. Revert "Change the way unbound is stopped when the process is being restarted, to give the old process enough time to exit cleanly. Fixes #7326"

This reverts commit 0577d9df462063bb3d26f0805c1e06fbdb359157.

Revision 53e138c0 (diff)
Added by Jim Pingle almost 2 years ago

Several complaints of unbound problems after commiting, so back this out. Revert "Change the way unbound is stopped when the process is being restarted, to give the old process enough time to exit cleanly. Fixes #7326"

This reverts commit 863804a917987ea10993433c84399b5711c3c352.

Revision 782453b4 (diff)
Added by Jim Pingle almost 2 years ago

Use an alternate method to stop unbound and fix #7326

Revision 67fea036 (diff)
Added by Jim Pingle almost 2 years ago

Use an alternate method to stop unbound and fix #7326

(cherry picked from commit 782453b4dbb77e5bc97a43f56b95a006c5434d65)

Revision db3b3afb (diff)
Added by Jim Pingle almost 2 years ago

Use an alternate method to stop unbound and fix #7326

(cherry picked from commit 782453b4dbb77e5bc97a43f56b95a006c5434d65)
(cherry picked from commit 67fea036201f5af338338f6113033574ec25777e)

History

#1 Updated by Chris Collins about 2 years ago

attaching file again, first one is wrong

#2 Updated by BBcan177 . about 2 years ago

The issue with the way pfSense stops Unbound, is that the Unbound service takes longer to shut down when there are many local-zone's defined. In limited testing, it can take 10-20seconds.

sigkillbypid("{$g['varrun_path']}/unbound.pid", "TERM");

The code will then try to re-start Unbound and fail due to "could not open ports".
To fix this issue, the restart of Unbound can only be done, when the Unbound service has been fully stopped.

The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1489349484] unbound[90597:0] error: bind: address already in use [1489349484] unbound[90597:0] fatal error: could not open ports'

Here is a draft change to the code that seems to fix this issue. The loop might need to be increased to account for slow systems, or increase the timeout to 2 seconds.

In /etc/inc/services.inc

function services_unbound_configure()
        // kill any running Unbound instance
        if (file_exists("{$g['varrun_path']}/unbound.pid")) {
                sigkillbypid("{$g['varrun_path']}/unbound.pid", "TERM");

                for ($i=1; $i <= 30; $i++) {
                        exec("/bin/pgrep unbound 2>&1", $output, $retval);

                        if ($retval != 1) {
                                log_error("DEBUG: Unbound - waiting 1 second to shutdown");
                                sleep(1);
                        } else {
                                break;
                        }
                }
        }

However, it seems that there are other places in the pfSense code (ie: OpenVPN) that also need to be patched for this issue... So maybe this code should be placed elsewhere to cover all other instances of restarting Unbound.

Can the devs chime in on a full solution to this issue please ...

#3 Updated by Jim Thompson about 2 years ago

  • Assignee set to Jim Pingle

#4 Updated by Chris Collins about 2 years ago

note on my system it needs a lot more than one second to shutdown, probably around 10 seconds due to the over 1 million hosts in the pfblockerng file. So sleep will be unreliable.

#5 Updated by BBcan177 . about 2 years ago

Its a 30x loop in 1 sec increments and breaks on Unbound being fully shutdown.

for ($i=1; $i <= 30; $i++) {

#6 Updated by Kill Bill about 2 years ago

Could probably use is_process_running() from util.inc instead of the exec(). That debug stuff should certainly be wrapper into if ($g['debug']) {}, noone wants tens of logs entries about this normally.

#7 Updated by BBcan177 . about 2 years ago

I left the log message just for testing... Shouldn't be included in final code...
The function is_process_running() uses pgrep, so it will work fine.

#8 Updated by Luke Hamburg about 2 years ago

I pushed a commit for this (not a PR yet because I think there's more to be done) but in case anyone wants to use this for System Patches:
https://github.com/luckman212/pfsense/commit/7d01b12a991e084576d286a784528b788745f905

thx BBcan177

#9 Updated by BBcan177 . about 2 years ago

Can the devs chime in on this issue please?

Are there other functions in the code that also need to be patched to fully resolve this issue?

#10 Updated by Jim Thompson about 2 years ago

  • Assignee changed from Jim Pingle to Steve Beaver

#11 Updated by Steve Yates about 2 years ago

To "me too" this report from a different direction, I realized we were seeing this issue on our HA setup during the once daily Suricata rules update. The master server was unaffected but the backup router would end up with unbound not running. With the code change BBcan177's fix's loop triggered 18 times over 24 seconds but unbound does get started. We have both routers functioning as DNS servers so I would think that unbound isn't running dramatically differently on the backup router.

We didn't have the behavior of it stopping multiple times, just that it couldn't start. Also worth noting is that manual Suricata updates or firewall syncs don't cause this issue, and that we have no local-zones defined in unbound, just one "private-domain" entry. WAN IPs are all static, though we do get "pfSense package system has detected an IP change or dynamic WAN reconnection" from OpenVPN at any sync since OpenVPN apparently assigns itself an address when that package starts.

Also worth noting is this is running on a 4 CPU VM on a Xeon E5-2609 v3 so I would think CPU speed isn't an issue.

#12 Updated by Jim Pingle almost 2 years ago

  • Category set to DNS Resolver
  • Assignee changed from Steve Beaver to Jim Pingle
  • Target version set to 2.3.4-p1
  • Affected Version set to All

Back to me, I have some changes ready that appear to fix this.

#13 Updated by Jim Pingle almost 2 years ago

  • Status changed from New to Feedback
  • % Done changed from 0 to 100

#14 Updated by Jim Pingle almost 2 years ago

Be aware that if you attempt to add a patch for this manually, you also need to apply the patch from #7667 first.

#15 Updated by Chris Collins almost 2 years ago

ok thank you for fixing this, I will test the fix as soon as possible on my firewall.

#16 Updated by Jim Pingle almost 2 years ago

  • Status changed from Feedback to Assigned

Several complaints of unbound not starting/stopping correctly after that last patch was added, so I backed it out for now to avoid further problems while another solution is located.

#17 Updated by Jim Pingle almost 2 years ago

  • Status changed from Assigned to Feedback

#18 Updated by Jim Pingle almost 2 years ago

  • Status changed from Feedback to Assigned

#19 Updated by BBcan177 . almost 2 years ago

Instead of using this stop command

mwexec("echo '/usr/local/sbin/unbound-control stop' | /usr/bin/su -m unbound", true);

I would recommend the following command (And also for the other Unbound start/reload etc...)

mwexec("/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c {$g['unbound_chroot_path']}/unbound.conf stop", true);

At a minimum, please consider just adding the for/loop to check if Unbound has stopped before restarting, as there is no negative issue with that section of the recent changes.

#20 Updated by Jim Pingle almost 2 years ago

BBcan177 . wrote:

I would recommend the following command (And also for the other Unbound start/reload etc...)

mwexec("/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c {$g['unbound_chroot_path']}/unbound.conf stop", true);

Unless I'm missing something here, why would that be any better? Doing a chroot to / would essentially a no-op, all it's doing is changing the user which "su -m" will do as well.

At a minimum, please consider just adding the for/loop to check if Unbound has stopped before restarting, as there is no negative issue with that section of the recent changes.

I'm not sure it's seen enough testing to justify that claim. I'd like to see some of the users who had problems with the full set of changes applied try a patch with just the loop.

#21 Updated by Jim Pingle almost 2 years ago

Here is a patch to try, for those who had problems with the previous commit.

#22 Updated by Jim Pingle almost 2 years ago

  • Status changed from Assigned to Feedback

#23 Updated by Jim Pingle almost 2 years ago

  • Status changed from Feedback to Resolved

I still haven't been able to replicate the original problem here, but unbound appears to be restarting OK on WAN failure here when I test it with the new changes, so we'll close it out for now. It can always be reopened again if additional problems surface.

Also available in: Atom PDF