Regression #11316

Unbound crashing with signal 11

Added by Martin Müller 3 months ago. Updated 4 days ago.

Status: Feedback
Priority: Normal
Category: DNS Resolver
Target version:
Start date: 01/26/2021
Due date:
% Done: 0%
Estimated time:
Affected Version: 2.5.x
Affected Architecture:
Release Notes: Default

Description

Seems to be the same as here...
https://forum.opnsense.org/index.php?topic=20516.0

My workaround: I have moved to DNS forwarder.

  • Why has the command opnsense-revert not been adopted for pfSense?
resolver.log.0.7z (1.06 MB) Christian Borchert, 03/09/2021 08:55 PM
resolver.log.0.7z (750 KB) Christian Borchert, 03/10/2021 05:08 AM

Related issues

Related to Bug #5413: Incorrect Handling of Unbound Resolver [service restarts, cache loss, DNS service interruption] (Confirmed, 2015-11-10)

History

#1 Updated by Jim Pingle 3 months ago

  • Status changed from New to Rejected

There is not nearly enough information here to constitute a proper bug report, and I cannot reproduce the problem as stated. It's perfectly stable here. This site is not for support or diagnostic discussion.

For assistance in solving problems, please post on the Netgate Forum or the pfSense Subreddit.

See Reporting Issues with pfSense Software for more information.

#3 Updated by Daniel Keller 2 months ago

I have the same problem. It happens only when the option "Register DHCP leases in the DNS Resolver" is set.

It looks like dhcpleases tries to restart the unbound service because of a new DHCP lease, but unbound does not start again.

Jan 29 13:45:07 heimdall1 dhcpleases15066: Could not deliver signal HUP to process 33623: No such process.
Jan 29 13:45:05 heimdall1 dhcpleases15066: Could not deliver signal HUP to process 33623: No such process.
Jan 29 13:45:04 heimdall1 dhcpleases15066: Could not deliver signal HUP to process 33623: No such process.
Jan 29 13:45:02 heimdall1 kernel: pid 33623 (unbound), jid 0, uid 59: exited on signal 11
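
For anyone comparing notes, a minimal sketch of a filter that counts kernel lines like the last one above. The function name `count_unbound_segv` is hypothetical, and the pattern assumes the kernel message format shown in this comment.

```sh
#!/bin/sh
# Hypothetical helper: count "unbound exited on signal 11" kernel lines
# read from standard input. The pattern is taken from the log excerpt
# in this thread; other message formats are not matched.
count_unbound_segv() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the function's status clean.
    grep -c '(unbound).*exited on signal 11' || true
}

# Example (assumes the kernel message buffer still holds the crash lines):
#   dmesg | count_unbound_segv
```

A count of 0 does not rule out a crash; it only means no matching line was present in the input that was supplied.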

#4 Updated by Jim Pingle 2 months ago

Keep the discussion on the forum. If it's still happening, there is no evidence there. Last post was over a week ago and that last response implies their problem was resolved. No indication it's ongoing on current snapshots.

#5 Updated by Martin Müller 2 months ago

In the "competitor's" forum, there are several pages of error descriptions and error analyses for Unbound 1.13.0. A bug report has also been filed with FreeBSD.
Ignoring is a great bug fix.

PS: I can confirm Daniel's observation. It seems to be related to the option "Register DHCP leases in the DNS Resolver".

#6 Updated by Jim Pingle 2 months ago

Behavior on other systems (even FreeBSD) isn't directly relevant to pfSense software. They may be similar, but it's not 100% the same. We need more information about how it happens.

I run my edge, and several others, with DHCP registration on and have zero problems on snapshots with the DNS Resolver. If there is a problem, there is much more to it. Keep discussing it on the forum for now.

If we can't reproduce it, we have no way to debug it or test if it's fixed.

Furthermore, the fix committed to FreeBSD is already present on the latest snapshots and has been for the last three weeks:

https://github.com/pfsense/FreeBSD-ports/commit/9510cfe4c453c2f589b2b065b9f42a85e7f3c5ba

If it's still happening, it may not be the same root cause, which still means we need more information. On the forum thread. Not here.

#7 Updated by Jim Pingle about 2 months ago

  • Tracker changed from Bug to Regression
  • Status changed from Rejected to New
  • Assignee set to Renato Botelho
  • Target version set to CE-Next

Now that there have been responses from several others on the forum post with info, it does appear there is a problem in unbound even though none of us here can reproduce it. It may be triggered by the -HUP restart from dhcpleases in some cases, but I still can't reproduce it here that way. There must be some other contributing factor, but thus far it hasn't been identified. It happens on systems with and without pfBlocker as well. I've killed it hundreds of times in a loop and it is still running, responding properly, and not crashing here in my lab.
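
A rough sketch of that kind of loop test, for anyone who wants to try it on their own system. This is not the exact procedure used in the lab; the function name `hup_loop` and its arguments are hypothetical, and it assumes the daemon reloads in place on SIGHUP and keeps the same PID.

```sh
#!/bin/sh
# Hypothetical helper: send SIGHUP to a process repeatedly and report
# whether it is still alive afterwards.
# Usage: hup_loop PID COUNT [DELAY_SECONDS]
hup_loop() {
    pid=$1
    count=$2
    delay=${3:-0}
    i=0
    while [ "$i" -lt "$count" ]; do
        # kill fails once the process no longer exists
        if ! kill -HUP "$pid" 2>/dev/null; then
            echo "process $pid gone after $i signals"
            return 1
        fi
        i=$((i + 1))
        # Optional pause between signals
        [ "$delay" = 0 ] || sleep "$delay"
    done
    echo "process $pid survived $count signals"
    return 0
}

# Example (assumes pgrep is available and a single unbound process):
#   hup_loop "$(pgrep -o unbound)" 200 1
```

The check only shows that the process still accepted signals during the loop; a crash shortly after the final HUP would need to be confirmed from the kernel log.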

Unbound 1.13.1 was just released a few days ago and is now in FreeBSD ports, so we may want to bring that into snapshots for testing. There are a few bugs fixed there which could be relevant, including the one we already have a patch for.

#8 Updated by Renato Botelho about 2 months ago

  • Target version changed from CE-Next to 2.5.1

#9 Updated by Renato Botelho about 2 months ago

1.13.1 cherry-picked to 2.5.0 branch

#10 Updated by Renato Botelho about 2 months ago

  • Status changed from New to Feedback

#11 Updated by Pim Janssen about 2 months ago

I never had any problem with the core system of pfSense in production. Today my unbound died (about 5 hours after upgrading to 2.5.0-Release). Just before it happened, a new Wi-Fi client registered.
I disabled "Register DHCP leases in the DNS Resolver" and "Register DHCP static mappings in the DNS Resolver" (those were both enabled on my setup).
I'll report back if it happens again.

#12 Updated by Renato Botelho about 2 months ago

Pim Janssen wrote:

I never had any problem with the core system of pfSense in production. Today my unbound died (about 5 hours after upgrading to 2.5.0-Release). Just before it happened, a new Wi-Fi client registered.
I disabled "Register DHCP leases in the DNS Resolver" and "Register DHCP static mappings in the DNS Resolver" (those were both enabled on my setup).
I'll report back if it happens again.

You will need to wait until version 1.13.1 is available and installed on your system to make sure your test is valid.

#13 Updated by Jim Pingle about 2 months ago

The forum thread linked above has instructions for installing the updated version manually from the snapshot repository:
https://forum.netgate.com/topic/160005/unbound-crashes-periodically-with-signal-11/57

Eventually we'll also have a build available in the 21.02/2.5.x repository for manual upgrades.

#14 Updated by Jim Pingle about 2 months ago

This is now in the 2.5.0 repository. To upgrade manually, run the following from an ssh or console shell prompt (not the GUI):

pkg upgrade -fy unbound; pfSsh.php playback svc restart unbound
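
To confirm afterwards that the installed package is at least 1.13.1, a comparison along these lines may help. It is only a sketch: `version_at_least` is a hypothetical helper, and it assumes a `sort` that supports `-V` (version sort).

```sh
#!/bin/sh
# Hypothetical helper: succeed if ACTUAL is the same as or newer than WANTED.
# Usage: version_at_least WANTED ACTUAL
version_at_least() {
    # Version-sort both strings; if WANTED sorts first (or they are equal),
    # ACTUAL is at least WANTED.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$1" ]
}

# Example on the firewall (pkg query prints the installed version):
#   version_at_least 1.13.1 "$(pkg query %v unbound)" && echo "unbound is up to date"
```

Package versions that carry a port revision suffix have not been checked against this comparison, so treat the result as a quick sanity check only.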

#15 Updated by Tim Gagnon about 2 months ago

Will the update be made available to 21.02 soon? My 2.5.0 box finds it, but my 21.02 box does not.

Thanks!

#16 Updated by Marcos Mendoza about 2 months ago

On 21.02, in the meantime, the following will upgrade unbound:

pkg add -f https://files01.netgate.com/pfSense_v2_5_0_amd64-pfSense_v2_5_0/All/unbound-1.13.1.txz; pfSsh.php playback svc restart unbound

#17 Updated by Jim Pingle about 2 months ago

No need for that now, it's live in the 21.02 repository now that 21.02-p1 has been released to address SG-3100 stability.

It's not a part of 21.02-p1, but it's in the pkg repository, so you might happen to pick it up as a part of the upgrade like any other available package update post-release.

#18 Updated by Scott B about 2 months ago

I was seeing unbound simply die about once a day since upgrading to 2.5.0-RELEASE. No info as to why in the service's logs available via the web UI, but by running (dmesg | grep bound) I found it was due to segmentation violations (i.e., pid 65353 (unbound), jid 0, uid 59: exited on signal 11). I therefore ran (pkg upgrade -fy unbound) yesterday morning.

The service is not dying with SEGV any longer, at least not yet, but it is restarting regularly. Not having monitored it before, I don't know if that is expected. I suspect not, so I'm reporting it.

Feb 25 14:43:34     unbound     4750     [4750:0] info: start of service (unbound 1.13.1).
Feb 25 14:43:34     unbound     4750     [4750:0] notice: init module 0: iterator
Feb 25 14:43:34     unbound     4750     [4750:0] notice: Restart of unbound 1.13.1.
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 2: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 0: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 25 14:43:34     unbound     4750     [4750:0] info: service stopped (unbound 1.13.1).
Feb 25 14:43:34     unbound     4750     [4750:0] info: start of service (unbound 1.13.1).
Feb 25 14:43:34     unbound     4750     [4750:0] notice: init module 0: iterator
Feb 25 14:43:34     unbound     4750     [4750:0] notice: Restart of unbound 1.13.1.
Feb 25 14:43:34     unbound     4750     [4750:0] info: 0.262144 0.524288 4
Feb 25 14:43:34     unbound     4750     [4750:0] info: 0.131072 0.262144 8
Feb 25 14:43:34     unbound     4750     [4750:0] info: 0.065536 0.131072 19

#19 Updated by Vaidotas Butkus about 2 months ago

Registered just to add to this, as DNS is quite an important part of the network and this needs to be fixed.
I too am having problems with unbound randomly stopping and not starting again after the upgrade to 2.5.0.
It happened once with the older version of unbound that shipped with 2.5.0.
Then I manually updated unbound using the suggested method:
pkg upgrade -fy unbound; pfSsh.php playback svc restart unbound

But today it crashed again.
Also, in the system logs I see it restarting at intervals.

Feb 26 09:54:33 unbound 51482 [51482:0] info: start of service (unbound 1.13.1).
Feb 26 09:54:33 unbound 51482 [51482:0] notice: init module 1: iterator
Feb 26 09:54:33 unbound 51482 [51482:0] notice: init module 0: validator
Feb 26 09:54:33 unbound 51482 [51482:0] notice: Restart of unbound 1.13.1.

and so on, until the next time it dies and does not restart.

I am running the DNS Resolver with the DHCP Registration and Static DHCP settings enabled, otherwise default settings.

#20 Updated by Jim Pingle about 2 months ago

It is normal for Unbound to restart often when DHCP hostname registration is on. This bug is only for the actual crash (segfault), not other behavior.

#22 Updated by Jim Pingle about 2 months ago

  • Related to Bug #5413: Incorrect Handling of Unbound Resolver [service restarts, cache loss, DNS service interruption] added

#23 Updated by Vöggur Guðmundsson about 1 month ago

I have the same issue. After updating two of my pfSense boxes, I see about 4 to 5 messages per hour from each:

"Service Watchdog detected service unbound stopped. Restarting unbound (DNS Resolver)"

#24 Updated by Mike Farmwald about 1 month ago

I'm losing DNS every day or so with pfsense 2.5. I'm using the latest from "pkg update".
If there's anything I can do to help - logs, try a special version, etc. I'm happy to help.
I'm running 2.5 on a non-critical firewall, so I'm willing to try things to get rid of this bug.

#25 Updated by Jim Pingle about 1 month ago

Assuming this is the same segfault others are hitting with Unbound, it is still being investigated upstream: https://github.com/NLnetLabs/unbound/issues/411

There is a patch to have Unbound log more detailed debugging information if it does crash in the place mentioned in that issue:
https://github.com/NLnetLabs/unbound/commit/269c168f7e58dc3a18ff0148fd8cce959f71bad7

We can look into adding that to snapshots to at least help gather more information for upstream.

#26 Updated by Christian Borchert about 1 month ago

Here's a Level 5 log (attached and forum link) from a signal 11 crash on unbound (1.13.1):

https://forum.netgate.com/topic/161372/2-5-0-unbound-1-13-1-exited-on-signal-8-sigfpe-floating-point-exception/4?_=1615344368096

Mar 9 20:30:37 router kernel: pid 32517 (unbound), jid 0, uid 59: exited on signal 11

#27 Updated by Christian Borchert about 1 month ago

Here are the logs from a second signal 11 crash a few hours later:

Mar 10 03:44:09 router kernel: pid 87756 (unbound), jid 0, uid 59: exited on signal 11

#28 Updated by Jim Pingle about 1 month ago

Christian Borchert wrote:

Here are the logs from a second signal 11 crash a few hours later:

Mar 10 03:44:09 router kernel: pid 87756 (unbound), jid 0, uid 59: exited on signal 11

There isn't anything relevant in those. You can keep these on the forum if you like, but they aren't going to be useful here and clutter up the issue.

#29 Updated by Jim Pingle about 1 month ago

  • Subject changed from Unbound 1.13.0 is routinely stopping/crashing to Unbound crashing with signal 11

Updating subject for release notes.

If Unbound doesn't find/fix the issue in 1.13.1 soon we may consider rolling Unbound back to 1.12.0 if it's viable. The only CVE addressed since then isn't a major concern on pfSense, and though some of the bug fixes in 1.13.x are beneficial, the instability is a bigger problem.

#30 Updated by Vaidotas Butkus 27 days ago

Jim Pingle wrote:

Updating subject for release notes.

If Unbound doesn't find/fix the issue in 1.13.1 soon we may consider rolling Unbound back to 1.12.0 if it's viable. The only CVE addressed since then isn't a major concern on pfSense, and though some of the bug fixes in 1.13.x are beneficial, the instability is a bigger problem.

Updating manually to 1.13.1 did not solve the issue completely. It may have decreased the crashes to 1-4 times per week, but it's still crashing. A downgrade would be an acceptable solution, at least in my opinion, as I never had problems with unbound crashing in previous versions of pfSense.

#31 Updated by Chris Collins 27 days ago

I hope the decision is not made to roll back unbound, as it's just going back to old code, when the better decision might be to disable Register DHCP leases by default. I am not sure of the benefits of that option for most people's configurations.

As long as you give people a heads-up and the option remains available for the few who need to enable it again, I think it would be accepted by the community. Everyone on the forum whom I advised to disable the option reported that their issues were fixed.

#32 Updated by Jim Pingle 27 days ago

Chris Collins wrote:

I hope the decision is not made to roll back unbound, as it's just going back to old code, when the better decision might be to disable Register DHCP leases by default. I am not sure of the benefits of that option for most people's configurations.

Sure, Unbound 1.12.0 is "old code", but if you look at the Changelog, there isn't anything life-changing from 1.12.0 to 1.13.x that would hurt us significantly. I'd rather roll back to code we know is stable than disable a feature used by tens of thousands of users because a handful of people still encounter instability that may (or may not) be related to that feature.

#33 Updated by Vaidotas Butkus 23 days ago

Chris Collins wrote:

I hope the decision is not made to roll back unbound, as it's just going back to old code, when the better decision might be to disable Register DHCP leases by default. I am not sure of the benefits of that option for most people's configurations.

As long as you give people a heads-up and the option remains available for the few who need to enable it again, I think it would be accepted by the community. Everyone on the forum whom I advised to disable the option reported that their issues were fixed.

Maybe in your case it is not important, but for me it is. I would gladly downgrade to unbound 1.12.0 if it solves my problem (is there a supported way to do it manually?).
I rely heavily on hostnames for my internal network management (static IP hosts and dynamic IP hosts), and disabling the registration of DNS names for hosts would break my workflows.
I am even considering moving to separate DNS/DHCP servers. For now, having Service Watchdog restart unbound when it has crashed is a temporary fix for me. I set up mail notifications for Service Watchdog restarting unbound, and I am getting an email almost every single day saying that unbound has crashed.

#34 Updated by Chris Collins 23 days ago

Vaidotas, static DHCP should probably be used if you rely on hostnames so much. The feature in general has been the cause of many problems; I have been seeing reports of problems caused by it for several years.

However, I didn't propose removing the feature, just toggling the default.

#35 Updated by S P 5 days ago

Can confirm the same happening on my system. Unbound crashed with an interval of one week and always at night. And it only happened twice since I updated to 2.5.0 a long time ago.

Apr 1 01:02:14 pfSense kernel: pid 42679 (unbound), jid 0, uid 59: exited on signal 11
Apr 8 07:03:27 pfSense kernel: pid 43848 (unbound), jid 0, uid 59: exited on signal 11

Looking at it, the crashes seem to be aligned with the pfBlockerNG update schedule, although I am not sure whether this is a coincidence. Is there anything else I could provide to make troubleshooting easier? Some more logs?

Update: Just did some digging in the logs to see what happened just before and after the crash.

Apr 1 01:00:00 pfSense php5698: [pfBlockerNG] Starting cron process.
Apr 1 01:01:00 pfSense php64638: rc.dyndns.update: phpDynDNS (vpn.szymon.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
Apr 1 01:02:14 pfSense kernel: pid 42679 (unbound), jid 0, uid 59: exited on signal 11
Apr 1 01:02:32 pfSense php5698: [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload

Apr 8 07:00:00 pfSense php14261: [pfBlockerNG] Starting cron process.
Apr 8 07:02:35 pfSense php14261: [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload
Apr 8 07:02:35 pfSense php14261:
Apr 8 07:03:27 pfSense kernel: pid 43848 (unbound), jid 0, uid 59: exited on signal 11
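
To check the timing against a longer stretch of log, something like the following could pair each crash with the preceding pfBlockerNG cron start. It is a sketch only: `crash_context` is a hypothetical name, and the patterns assume the line formats shown above.

```sh
#!/bin/sh
# Hypothetical helper: read syslog-style lines on standard input and, for
# each "unbound exited on signal 11" line, print the timestamp of the most
# recent "[pfBlockerNG] Starting cron process" line seen before it.
crash_context() {
    awk '
        # Remember the timestamp (month, day, time) of the latest cron start
        /\[pfBlockerNG\] Starting cron process/ { last = $1 " " $2 " " $3 }
        # On each crash line, report it together with that timestamp
        /\(unbound\).*exited on signal 11/ {
            printf "crash %s %s %s; last pfBlockerNG cron start: %s\n", \
                $1, $2, $3, (last == "" ? "none" : last)
        }
    '
}

# Example (the log path is an assumption; adjust to where the system log lives):
#   crash_context < /var/log/system.log
```

A cron start that always precedes a crash by a few minutes would support the correlation, but it would not by itself show that pfBlockerNG is the cause.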

#36 Updated by Jim Pingle 4 days ago

  • Target version changed from 2.5.1 to CE-Next

There is a new commit on Unbound which may help but it's likely too late for 21.02.2/2.5.1, though we can do an out-of-band update to the package again if need be. I'd prefer to see a new release from Unbound first, but it is a one-line change we'd only need temporarily until they do a new release.
