Bug #7326: Unbound fails to start during rc.wanipchange when using large enough dns lists - pfSense - pfSense bugtracker

Actions

Copy link

Bug #7326

closed

Unbound fails to start during rc.wanipchange when using large enough dns lists

Added by Chris Collins about 7 years ago. Updated over 6 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Jim Pingle

Category:

DNS Resolver

Target version:

2.3.4-p1

Start date:

02/27/2017

Due date:

% Done:

100%

Estimated time:

Plus Target Version:

Release Notes:

Affected Version:

All

Affected Architecture:

Description

Current pfsense code manually kills unbound, waits just one second and then starts it again, however if unbound doesnt shut down quick enough following the kill command it will fail to start with an error it cannot bind the port because it is already in use. Stopping unbound with a kill command is not good practice anyway and should be a last resort, by using the unbound-control stop command instead, pfsense will actually wait for the shutdown process to complete before starting unbound again and it will succeed. I have patched my local system and can share the patch here, but I am not a php developer so my changes may not be in the best syntax and for some reason I can see from the logs when the rc.newwanip script runs it tries to shutdown unbound 3 times quickly so the log will show it failing twice (first time works), to keep things tidy its probably also a good idea to look at what is triggering the unbound reconfigure code 3 times.

With this bug (undesirable behaviour), the following situations occur.

1 - On bootup unbound will be offline as when the wan comes online it will trigger the rc.newwanip script and it will fail (without my patch).
2 - If WAN is cycled, unbound will also be left in a offline state.

Files

Download all files

unbound.patch (739 Bytes) unbound.patch		Chris Collins, 02/27/2017 11:23 AM
unbound.patch (744 Bytes) unbound.patch		Chris Collins, 02/27/2017 11:27 AM
unbound-stop.diff (1.47 KB) unbound-stop.diff		Jim Pingle, 07/07/2017 01:42 PM

Actions

Copy link

Updated by Chris Collins about 7 years ago

File unbound.patch unbound.patch added

attaching file again, first one is wrong

Actions

Copy link

Updated by BBcan177 . about 7 years ago

The issue with the way pfSense stops Unbound, is that the Unbound service takes longer to shut down when there are many local-zone's defined. In limited testing, it can take 10-20seconds.

sigkillbypid("{$g['varrun_path']}/unbound.pid", "TERM");

The code will then try to re-start Unbound and fail due to "could not open ports".
To fix this issue, the restart of Unbound can only be done, when the Unbound service has been fully stopped.

The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1489349484] unbound[90597:0] error: bind: address already in use [1489349484] unbound[90597:0] fatal error: could not open ports'

Here is a draft change to the code that seems to fix this issue. The loop might need to be increased to account for slow systems, or increase the timeout to 2 seconds.

In /etc/inc/services.inc

function services_unbound_configure()

        // kill any running Unbound instance
        if (file_exists("{$g['varrun_path']}/unbound.pid")) {
                sigkillbypid("{$g['varrun_path']}/unbound.pid", "TERM");

                for ($i=1; $i <= 30; $i++) {
                        exec("/bin/pgrep unbound 2>&1", $output, $retval);

                        if ($retval != 1) {
                                log_error("DEBUG: Unbound - waiting 1 second to shutdown");
                                sleep(1);
                        } else {
                                break;
                        }
                }
        }

However, it seems that there are other places in the pfSense code (ie: OpenVPN) that also need to be patched for this issue... So maybe this code should be placed elsewhere to cover all other instances of restarting Unbound.

Can the devs chime in on a full solution to this issue please ...

Actions

Copy link

Updated by Jim Thompson about 7 years ago

Assignee set to Jim Pingle

Actions

Copy link

Updated by Chris Collins about 7 years ago

note on my system it needs a lot more than one second to shutdown, probably around 10 seconds due to the over 1 million hosts in the pfblockerng file. So sleep will be unreliable.

Actions

Copy link

Updated by BBcan177 . about 7 years ago

Its a 30x loop in 1 sec increments and breaks on Unbound being fully shutdown.

for ($i=1; $i <= 30; $i++) {

Actions

Copy link

Updated by Kill Bill about 7 years ago

Could probably use is_process_running() from util.inc instead of the exec(). That debug stuff should certainly be wrapper into if ($g['debug']) {}, noone wants tens of logs entries about this normally.

Actions

Copy link

Updated by BBcan177 . about 7 years ago

I left the log message just for testing... Shouldn't be included in final code...
The function is_process_running() uses pgrep, so it will work fine.

Actions

Copy link

Updated by → luckman212 about 7 years ago

I pushed a commit for this (not a PR yet because I think there's more to be done) but in case anyone wants to use this for System Patches:
https://github.com/luckman212/pfsense/commit/7d01b12a991e084576d286a784528b788745f905

thx BBcan177

Actions

Copy link

Updated by BBcan177 . about 7 years ago

Can the devs chime in on this issue please?

Are there other functions in the code that also need to be patched to fully resolve this issue?

Actions

Copy link

#10

Updated by Jim Thompson about 7 years ago

Assignee changed from Jim Pingle to Anonymous

Actions

Copy link

#11

Updated by Steve Y almost 7 years ago

To "me too" this report from a different direction, I realized we were seeing this issue on our HA setup during the once daily Suricata rules update. The master server was unaffected but the backup router would end up with unbound not running. With the code change BBcan177's fix's loop triggered 18 times over 24 seconds but unbound does get started. We have both routers functioning as DNS servers so I would think that unbound isn't running dramatically differently on the backup router.

We didn't have the behavior of it stopping multiple times, just that it couldn't start. Also worth noting is that manual Suricata updates or firewall syncs don't cause this issue, and that we have no local-zones defined in unbound, just one "private-domain" entry. WAN IPs are all static, though we do get "pfSense package system has detected an IP change or dynamic WAN reconnection" from OpenVPN at any sync since OpenVPN apparently assigns itself an address when that package starts.

Also worth noting is this is running on a 4 CPU VM on a Xeon E5-2609 v3 so I would think CPU speed isn't an issue.

Actions

Copy link

#12

Updated by Jim Pingle almost 7 years ago

Category set to DNS Resolver
Assignee changed from Anonymous to Jim Pingle
Target version set to 2.3.4-p1
Affected Version set to All

Back to me, I have some changes ready that appear to fix this.

Actions

Copy link

#13

Updated by Jim Pingle almost 7 years ago

Status changed from New to Feedback
% Done changed from 0 to 100

Applied in changeset 38d110824c87ff60c6289c0432d55009586ceee4.

Actions

Copy link

#14

Updated by Jim Pingle almost 7 years ago

Be aware that if you attempt to add a patch for this manually, you also need to apply the patch from #7667 first.

Actions

Copy link

#15

Updated by Chris Collins almost 7 years ago

ok thank you for fixing this, I will test the fix as soon as possible on my firewall.

Actions

Copy link

#16

Updated by Jim Pingle almost 7 years ago

Status changed from Feedback to Assigned

Several complaints of unbound not starting/stopping correctly after that last patch was added, so I backed it out for now to avoid further problems while another solution is located.

Actions

Copy link

#17

Updated by Jim Pingle almost 7 years ago

Status changed from Assigned to Feedback

Applied in changeset 6e094e04f7d9634c7151bd9aa68ad93c71584d80.

Actions

Copy link

#18

Updated by Jim Pingle almost 7 years ago

Status changed from Feedback to Assigned

Actions

Copy link

#19

Updated by BBcan177 . almost 7 years ago

Instead of using this stop command

mwexec("echo '/usr/local/sbin/unbound-control stop' | /usr/bin/su -m unbound", true);

I would recommend the following command (And also for the other Unbound start/reload etc...)

mwexec("/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c {$g['unbound_chroot_path']}/unbound.conf stop", true);

At a minimum, please consider just adding the for/loop to check if Unbound has stopped before restarting, as there is no negative issue with that section of the recent changes.

Actions

Copy link

#20

Updated by Jim Pingle almost 7 years ago

BBcan177 . wrote:

I would recommend the following command (And also for the other Unbound start/reload etc...)

mwexec("/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c {$g['unbound_chroot_path']}/unbound.conf stop", true);

Unless I'm missing something here, why would that be any better? Doing a chroot to / would essentially a no-op, all it's doing is changing the user which "su -m" will do as well.

At a minimum, please consider just adding the for/loop to check if Unbound has stopped before restarting, as there is no negative issue with that section of the recent changes.

I'm not sure it's seen enough testing to justify that claim. I'd like to see some of the users who had problems with the full set of changes applied try a patch with just the loop.

Actions

Copy link

#21

Updated by Jim Pingle almost 7 years ago

File unbound-stop.diff unbound-stop.diff added

Here is a patch to try, for those who had problems with the previous commit.

Actions

Copy link

#22

Updated by Jim Pingle almost 7 years ago

Status changed from Assigned to Feedback

Applied in changeset 782453b4dbb77e5bc97a43f56b95a006c5434d65.

Actions

Copy link

#23

Updated by Jim Pingle over 6 years ago

Status changed from Feedback to Resolved

I still haven't been able to replicate the original problem here, but unbound appears to be restarting OK on WAN failure here when I test it with the new changes, so we'll close it out for now. It can always be reopened again if additional problems surface.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

pfSense

Custom queries

Bug #7326

Unbound fails to start during rc.wanipchange when using large enough dns lists

Updated by Chris Collins about 7 years ago

Updated by BBcan177 . about 7 years ago

Updated by Jim Thompson about 7 years ago

Updated by Chris Collins about 7 years ago

Updated by BBcan177 . about 7 years ago

Updated by Kill Bill about 7 years ago

Updated by BBcan177 . about 7 years ago

Updated by → luckman212 about 7 years ago

Updated by BBcan177 . about 7 years ago

Updated by Jim Thompson about 7 years ago

Updated by Steve Y almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by Chris Collins almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by BBcan177 . almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by Jim Pingle almost 7 years ago

Updated by Jim Pingle over 6 years ago