Project

General

Profile

Actions

Regression #11316

closed

Unbound crashes with signal 11 when reloading

Added by Martin Müller almost 4 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Category:
DNS Resolver
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
Plus Target Version:
23.01
Release Notes:
Default
Affected Version:
2.5.x
Affected Architecture:

Description

Seems to be the same as here...
https://forum.opnsense.org/index.php?topic=20516.0

My workaround: I have moved to DNS forwarder.

  • Why has the command opnsense-revert not been adopted for pfSense?

Files

resolver.log.0.7z (1.06 MB) resolver.log.0.7z Christian Borchert, 03/09/2021 08:55 PM
resolver.log.0.7z (750 KB) resolver.log.0.7z Christian Borchert, 03/10/2021 05:08 AM
Screenshot from 2021-06-05 14-37-06.png (42.8 KB) Screenshot from 2021-06-05 14-37-06.png Kris Phillips, 06/05/2021 03:41 PM
Screenshot from 2021-06-05 14-37-24.png (41 KB) Screenshot from 2021-06-05 14-37-24.png Kris Phillips, 06/05/2021 03:41 PM
unbound-crash-2021-08-11.log (3.33 KB) unbound-crash-2021-08-11.log Akom Benevolent, 08/11/2021 02:12 PM
unbound-crash-2021-08-11-corrected.log (7.36 KB) unbound-crash-2021-08-11-corrected.log Akom Benevolent, 08/11/2021 03:19 PM

Related issues

Related to Bug #5413: Reduce disruptions when changing DNS records from DHCP leases in UnboundResolvedChristian McDonald11/10/2015

Actions
Related to Todo #11915: Temporarily move back to Unbound 1.12.x due to instability on Unbound 1.13.xResolvedRenato Botelho05/12/2021

Actions
Related to Bug #10624: Memory leak in Unbound with Python module and DHCP lease registration activeResolvedChristian McDonald

Actions
Actions #1

Updated by Jim Pingle almost 4 years ago

  • Status changed from New to Rejected

There is not nearly enough information here to constitute a proper bug report, and I cannot reproduce the problem as stated. It's perfectly stable here. This site is not for support or diagnostic discussion.

For assistance in solving problems, please post on the Netgate Forum or the pfSense Subreddit .

See Reporting Issues with pfSense Software for more information.

Actions #3

Updated by Daniel Keller almost 4 years ago

I have the same problem. it happens only when the option "Register DHCP leases in the DNS Resolver" is set.

it looks like dhcp tries to restart the unbound service because of a new dhcp lease but unbound does not start again.

Jan 29 13:45:07 heimdall1 dhcpleases15066: Could not deliver signal HUP to process 33623: No such process.
Jan 29 13:45:05 heimdall1 dhcpleases15066: Could not deliver signal HUP to process 33623: No such process.
Jan 29 13:45:04 heimdall1 dhcpleases15066: Could not deliver signal HUP to process 33623: No such process.
Jan 29 13:45:02 heimdall1 kernel: pid 33623 (unbound), jid 0, uid 59: exited on signal 11

Actions #4

Updated by Jim Pingle almost 4 years ago

Keep the discussion on the forum. If it's still happening, there is no evidence there. Last post was over a week ago and that last response implies their problem was resolved. No indication it's ongoing on current snapshots.

Actions #5

Updated by Martin Müller almost 4 years ago

In the "competitor's" forum, there are several pages of error descriptions and error analyses for Unbound 1.13.0. Also a bug report is available at freebsd.
Ignoring is a great bug fix.

PS: I can confirm Daniel's observation. It seems to be related to the option "Register DHCP leases in the DNS Resolver".

Actions #6

Updated by Jim Pingle almost 4 years ago

Behavior on other systems (even FreeBSD) isn't directly relevant to pfSense software. They may be similar, but it's not 100% the same. We need more information about how it happens.

I run my edge, and several others, with DHCP registration on and have zero problems on snapshots with the DNS Resolver. If there is a problem, there is much more to it. Keep discussing it on the forum for now.

If we can't reproduce it, we have no way to debug it or test if it's fixed.

Furthermore, the fix committed to FreeBSD is already present on the latest snapshots and has been for the last three weeks:

https://github.com/pfsense/FreeBSD-ports/commit/9510cfe4c453c2f589b2b065b9f42a85e7f3c5ba

If it's still happening, it may not be the same root cause, which still means we need more information. On the forum thread. Not here.

Actions #7

Updated by Jim Pingle almost 4 years ago

  • Tracker changed from Bug to Regression
  • Status changed from Rejected to New
  • Assignee set to Renato Botelho
  • Target version set to CE-Next

Now that there have been responses from several others on the forum post with info, it does appear there is a problem in unbound even though none of us here can reproduce it. It may be triggered by the -HUP restart from dhcpleases in some cases but I still can't reproduce it here that way. There must be some other mitigating factor but thus far it hasn't been identified. It happens on systems with and without pfBlocker as well. I've killed it hundreds of times in a loop and it is still running and responding properly and not crashing here in my lab.

Unbound 1.13.1 was just released a few days ago and is now in FreeBSD ports, so we may want to bring that into snapshots for testing. There are a few bugs fixed there which could be relevant, including the one we already have a patch for.

Actions #8

Updated by Renato Botelho almost 4 years ago

  • Target version changed from CE-Next to 2.5.1
Actions #9

Updated by Renato Botelho almost 4 years ago

1.13.1 cherry-picked to 2.5.0 branch

Actions #10

Updated by Renato Botelho almost 4 years ago

  • Status changed from New to Feedback
Actions #11

Updated by Pim Janssen almost 4 years ago

I never had any problem with the core system of pfSense on production. Today my unbound died. (about 5 hours after upgrading to 2.5.0-Release). Just before it happened some new Wifi client registered.
I disabled "Register DHCP leases in the DNS Resolver" and "Register DHCP static mappings in the DNS Resolver" (those where both enabled on my setup).
Ill report back if it happens again.

Actions #12

Updated by Renato Botelho almost 4 years ago

Pim Janssen wrote:

I never had any problem with the core system of pfSense on production. Today my unbound died. (about 5 hours after upgrading to 2.5.0-Release). Just before it happened some new Wifi client registered.
I disabled "Register DHCP leases in the DNS Resolver" and "Register DHCP static mappings in the DNS Resolver" (those where both enabled on my setup).
Ill report back if it happens again.

You will need to wait until version 1.13.1 is available and installed on your system to make sure your test is valid.

Actions #13

Updated by Jim Pingle almost 4 years ago

The forum thread linked above has instructions for installing the updated version manually from the snapshot repository:
https://forum.netgate.com/topic/160005/unbound-crashes-periodically-with-signal-11/57

Eventually we'll also have a build available in the 21.02/2.5.x repository for manual upgrades.

Actions #14

Updated by Jim Pingle almost 4 years ago

This is now in the 2.5.0 repository. To upgrade manually, run the following from an ssh or console shell prompt (not the GUI):

pkg upgrade -fy unbound; pfSsh.php playback svc restart unbound
Actions #15

Updated by Tim Gagnon almost 4 years ago

Will the update be made available to 21.02 soon? My 2.5.0 box finds it, but my 21.02 box does not.

Thanks!

Actions #16

Updated by Marcos M almost 4 years ago

On 21.02, in the meantime, the following will upgrade unbound:

pkg add -f https://files01.netgate.com/pfSense_v2_5_0_amd64-pfSense_v2_5_0/All/unbound-1.13.1.txz; pfSsh.php playback svc restart unbound

Actions #17

Updated by Jim Pingle almost 4 years ago

No need for that now, it's live in the 21.02 repository now that 21.02-p1 has been released to address SG-3100 stability.

It's not a part of 21.02-p1, but it's in the pkg repository, so you might happen to pick it up as a part of the upgrade like any other available package update post-release.

Actions #18

Updated by Scott B almost 4 years ago

I was seeing unbound simply die about once a day since upgrading to 2.5.0-RELEASE. No info as to why in the service's logs available via the web UI, but by running (dmesg | grep bound) I found it was due to segmentation violations (i.e., pid 65353 (unbound), jid 0, uid 59: exited on signal 11). I therefore ran (pkg upgrade -fy unbound) yesterday morning.

The service is not dying with SEGV any longer, at least not yet, but it is restarting regularly. Not having monitored it before, I don't know if that is expected. I suspect not, so I'm reporting it.

Feb 25 14:43:34     unbound     4750     [4750:0] info: start of service (unbound 1.13.1).
Feb 25 14:43:34     unbound     4750     [4750:0] notice: init module 0: iterator
Feb 25 14:43:34     unbound     4750     [4750:0] notice: Restart of unbound 1.13.1.
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 2: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
Feb 25 14:43:34     unbound     4750     [4750:0] info: server stats for thread 0: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 25 14:43:34     unbound     4750     [4750:0] info: service stopped (unbound 1.13.1).
Feb 25 14:43:34     unbound     4750     [4750:0] info: start of service (unbound 1.13.1).
Feb 25 14:43:34     unbound     4750     [4750:0] notice: init module 0: iterator
Feb 25 14:43:34     unbound     4750     [4750:0] notice: Restart of unbound 1.13.1.
Feb 25 14:43:34     unbound     4750     [4750:0] info: 0.262144 0.524288 4
Feb 25 14:43:34     unbound     4750     [4750:0] info: 0.131072 0.262144 8
Feb 25 14:43:34     unbound     4750     [4750:0] info: 0.065536 0.131072 19
Actions #19

Updated by Vaidotas Butkus almost 4 years ago

Registered just to add to this as DNS is quite important part of the network and needs to be fixed.
I am too having problems with unbound randomly just stopping and not starting after upgrade to 2.5.0
once it happened with older version of unbound that was pushed with 2.5.0
then I manually updated unbound using suggested method
pkg upgrade -fy unbound; pfSsh.php playback svc restart unbound

but today I have it crash again
also in system logs I see it restarting at some intervals.

Feb 26 09:54:33 unbound 51482 [51482:0] info: start of service (unbound 1.13.1).
Feb 26 09:54:33 unbound 51482 [51482:0] notice: init module 1: iterator
Feb 26 09:54:33 unbound 51482 [51482:0] notice: init module 0: validator
Feb 26 09:54:33 unbound 51482 [51482:0] notice: Restart of unbound 1.13.1.

and so on...

until next time it will die and not restart.

I am running dns resolver with
DHCP Registration
and
Static DHCP
settings enabled
otherwise
default settings.

Actions #20

Updated by Jim Pingle almost 4 years ago

It is normal for Unbound to restart often when DHCP hostname registration is on. This bug is only for the actual crash (segfault), not other behavior.

Actions #22

Updated by Jim Pingle almost 4 years ago

  • Related to Bug #5413: Reduce disruptions when changing DNS records from DHCP leases in Unbound added
Actions #23

Updated by Vöggur Guðmundsson almost 4 years ago

I have the same issue, after updating two of my pfsense boxes I see abut 4 to 5 messages from each per hour

"Service Watchdog detected service unbound stopped. Restarting unbound (DNS Resolver)"

Actions #24

Updated by Mike Farmwald almost 4 years ago

I'm losing DNS every day or so with pfsense 2.5. I'm using the latest from "pkg update".
If there's anything I can do to help - logs, try a special version, etc. I'm happy to help.
I'm running 2.5 on a non-critical firewall, so I'm willing to try things to get rid of this bug.

Actions #25

Updated by Jim Pingle almost 4 years ago

Assuming this is the same segfault others are hitting with Unbound they are still investigating it upstream: https://github.com/NLnetLabs/unbound/issues/411

There is a patch to have Unbound log more detailed debugging information if it does crash in the place mentioned in that issue:
https://github.com/NLnetLabs/unbound/commit/269c168f7e58dc3a18ff0148fd8cce959f71bad7

We can look into adding that to snapshots to at least help gather more information for upstream.

Actions #26

Updated by Christian Borchert almost 4 years ago

here's a Level 5 log (attached and forum link) from a signal 11 crash on unbound (1.13.1):

https://forum.netgate.com/topic/161372/2-5-0-unbound-1-13-1-exited-on-signal-8-sigfpe-floating-point-exception/4?_=1615344368096

Mar 9 20:30:37 router kernel: pid 32517 (unbound), jid 0, uid 59: exited on signal 11

Actions #27

Updated by Christian Borchert almost 4 years ago

Here's the logs from a second signal 11 crash a few hours later

Mar 10 03:44:09 router kernel: pid 87756 (unbound), jid 0, uid 59: exited on signal 11

Actions #28

Updated by Jim Pingle almost 4 years ago

Christian Borchert wrote:

Here's the logs from a second signal 11 crash a few hours later

Mar 10 03:44:09 router kernel: pid 87756 (unbound), jid 0, uid 59: exited on signal 11

There isn't anything relevant in those. You can keep these on the forum if you like, but they aren't going to be useful here and clutter up the issue.

Actions #29

Updated by Jim Pingle almost 4 years ago

  • Subject changed from Unbound 1.13.0 is routinely stopping/crashing to Unbound crashing with signal 11

Updating subject for release notes.

If Unbound doesn't find/fix the issue in 1.13.1 soon we may consider rolling Unbound back to 1.12.0 if it's viable. The only CVE addressed since then isn't a major concern on pfSense, and though some of the bug fixes in 1.13.x are beneficial, the instability is a bigger problem.

Actions #30

Updated by Vaidotas Butkus almost 4 years ago

Jim Pingle wrote:

Updating subject for release notes.

If Unbound doesn't find/fix the issue in 1.13.1 soon we may consider rolling Unbound back to 1.12.0 if it's viable. The only CVE addressed since then isn't a major concern on pfSense, and though some of the bug fixes in 1.13.x are beneficial, the instability is a bigger problem.

Updating manually to 1.13.1 did not solve issue completely it may have decreased crashes to 1-4 times per week but it's still crashing, downgrade would be acceptable solution at least in my opinion as I never had problems with unbound crashing in previous versions of pfsense.

Actions #31

Updated by Chris Collins almost 4 years ago

I hope the decision is not made to roll back unbound, as its just going back to old code, when the better decision might be to disable register dhcp leases by default, I am not sure of the benefits of that option for most people's configuration.

I think as long as you give people a heads up and the option remains for the few that need the option to enable it again I think it would be accepted by the community, everyone on the forum who I advised to disable the option reported their issues been fixed.

Actions #32

Updated by Jim Pingle almost 4 years ago

Chris Collins wrote:

I hope the decision is not made to roll back unbound, as its just going back to old code, when the better decision might be to disable register dhcp leases by default, I am not sure of the benefits of that option for most people's configuration.

Sure Unbound 1.12.0 is "old code" but if you look at the Changelog, there isn't anything life changing from 1.12.0 to 1.13.x that would hurt us significantly. I'd rather roll back to code we know is stable than disable a feature used by tens of thousands of users because a handful of people still encounter instability that may (or may not) be related to that feature.

Actions #33

Updated by Vaidotas Butkus almost 4 years ago

Chris Collins wrote:

I hope the decision is not made to roll back unbound, as its just going back to old code, when the better decision might be to disable register dhcp leases by default, I am not sure of the benefits of that option for most people's configuration.

I think as long as you give people a heads up and the option remains for the few that need the option to enable it again I think it would be accepted by the community, everyone on the forum who I advised to disable the option reported their issues been fixed.

Maybe in your case it is not important but for my it is I would gladly downgrade to unbound 1.12.0 if it solves my problem (is there a supported way to manually do it?)
I rely on hostnames for my internal network management (static ip hosts and dynamic ip hosts) heavily and disabling registering dns names for hosts would break my workflows.
I even consider moving to separate DNS/DHCP servers. for now watchguard rebooting unbound when it's crashed is temporary fix for me. I set up mail notifications for watchgourd rebooting unbound and getting email almost every single day that unbound has crashed.

Actions #34

Updated by Chris Collins almost 4 years ago

Vaidotas, static DHCP should probably be used if you rely on hostnames so much. The feature in general has been the cause of so many problems, I have for several years been seeing reports of problems caused by it.

However I didnt propose removing the feature, just toggling the default.

Actions #35

Updated by S P over 3 years ago

Can confirm the same happening on my system. Unbound crashed with an interval of one week and always at night. And it only happened twice since I updated to 2.5.0 a long time ago.

Apr 1 01:02:14 pfSense kernel: pid 42679 (unbound), jid 0, uid 59: exited on signal 11
Apr 8 07:03:27 pfSense kernel: pid 43848 (unbound), jid 0, uid 59: exited on signal 11

Looking at it, it seems to be aligned with the pfBlockerNG update schedule although not sure if this is not a coincidence. Is there anything else I could provide to make troubleshooting easier? Some more logs?

Update: Just did some digging in the logs so see what happened just after and before the crash.

Apr 1 01:00:00 pfSense php5698: [pfBlockerNG] Starting cron process.
Apr 1 01:01:00 pfSense php64638: rc.dyndns.update: phpDynDNS (vpn.szymon.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
Apr 1 01:02:14 pfSense kernel: pid 42679 (unbound), jid 0, uid 59: exited on signal 11
Apr 1 01:02:32 pfSense php5698: [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload

Apr 8 07:00:00 pfSense php14261: [pfBlockerNG] Starting cron process.
Apr 8 07:02:35 pfSense php14261: [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload
Apr 8 07:02:35 pfSense php14261:
Apr 8 07:03:27 pfSense kernel: pid 43848 (unbound), jid 0, uid 59: exited on signal 11

Actions #36

Updated by Jim Pingle over 3 years ago

  • Target version changed from 2.5.1 to CE-Next

There is a new commit on Unbound which may help but it's likely too late for 21.02.2/2.5.1, though we can do an out-of-band update to the package again if need be. I'd prefer to see a new release from Unbound first, but it is a one-line change we'd only need temporarily until they do a new release.

Actions #37

Updated by Andrew Counterman over 3 years ago

I'm experiencing this issue as well.

It seems to be preceded by unbound going 100% cpu for several minutes, during which 'unbound-control stop' will time-out, the stop and restart buttons in the webui just spin for 30 second then return without doing anything, and even doing a kill on the unbound process does nothing. DNS queries will still go through for a while but it seems to eventually dig itself a hole it can't escape, after which it will stop processing queries. The only thing I've been able to do is reboot the entire box (XG-7100U), which is... not ideal.

Watchdog-ing unbound doesn't seem to help, as the process never seems to exit, it just stops responding.

Actions #38

Updated by Mike Farmwald over 3 years ago

Is there any hope for a fix? It's been quite a while and I don't see any progress.
I'm happy to try to help, but not sure what to do. I'm thinking of moving to another DNS approach.
Thanks.

Actions #39

Updated by Jim Pingle over 3 years ago

  • Status changed from Feedback to In Progress

Mike Farmwald wrote:

Is there any hope for a fix? It's been quite a while and I don't see any progress.
I'm happy to try to help, but not sure what to do. I'm thinking of moving to another DNS approach.

There is some movement on the Unbound issue (see my previous comment) but from the comments on their issue it isn't quite solved yet.

Actions #40

Updated by Christian Bourque over 3 years ago

This really is a show stopper for us, I've upgraded from 2.4.5 2 days ago and now our unbound server is crashing constantly!

And the proposed workaround is not really an option for us as we rely on this feature to manage our workstations remotely...

Actions #41

Updated by Scott Long over 3 years ago

We're reverting to unbound 1.12 in order to restore stability. We have to backport at least one CVE for it, so it's taking extra time before we can put out an update. Preliminary testing looks like the problem is improved.

Actions #42

Updated by Mike Farmwald over 3 years ago

Is there a release of the reverted unbound to try? I'm willing to try it.
I'm now crashing 3 or 4 times a day.

Actions #43

Updated by Vaidotas Butkus over 3 years ago

Had crash that even watchdog did not recover. tried manually restarting unbound via web GUI, status was showing as ok but could not resolve any host names, but IP addresses were no problem. Had to reboot device. Thank you for looking into it as it is starting to drive me crazy.

Actions #44

Updated by Jim Pingle over 3 years ago

Snapshots for 2.6.0 and 21.05 have had Unbound 1.12.x for a couple weeks now.

I created #11915 to note that since that is a temporary measure and this root issue is not yet solved.

Actions #45

Updated by Jim Pingle over 3 years ago

  • Related to Todo #11915: Temporarily move back to Unbound 1.12.x due to instability on Unbound 1.13.x added
Actions #46

Updated by Mike Farmwald over 3 years ago

Jim Pingle wrote:

Snapshots for 2.6.0 and 21.05 have had Unbound 1.12.x for a couple weeks now.

I created #11915 to note that since that is a temporary measure and this root issue is not yet solved.

Is there a way to install the reverted Unbound in 2.5.1?

Actions #47

Updated by Jim Pingle over 3 years ago

Mike Farmwald wrote:

Jim Pingle wrote:

Snapshots for 2.6.0 and 21.05 have had Unbound 1.12.x for a couple weeks now.

I created #11915 to note that since that is a temporary measure and this root issue is not yet solved.

Is there a way to install the reverted Unbound in 2.5.1?

The unbound112 build is available in the pkg repository but we're still working on a good set of pkg commands to swap it out safely. Since it's such a critical core dependency it's not quite as simple as removing one and adding the other. But feel free to try it out (hopefully safely in a lab first).

Actions #48

Updated by Renato Botelho over 3 years ago

  • Status changed from In Progress to Feedback

I've imported https://github.com/NLnetLabs/unbound/commit/ff6b527184b33ffe1e2b643db8a32fae8061fc5a into our devel branches. Unbound 1.13.1_2 will have it and must be tested

Actions #49

Updated by Alexis Mestag over 3 years ago

Jim Pingle wrote:

The unbound112 build is available in the pkg repository but we're still working on a good set of pkg commands to swap it out safely. Since it's such a critical core dependency it's not quite as simple as removing one and adding the other. But feel free to try it out (hopefully safely in a lab first).

Is the "good set of pkg commands" available somewhere already ? I couldn't find it, nor find any information on how to swap unbound for unbound112 safely.
If not, do you know where such procedure will be made available ?

I can see that pfSense 21.05 was released with unbound 1.12 instead of 1.13, but nothing for pfSense CE yet.

Actions #50

Updated by Kris Phillips over 3 years ago

The DHCP service doesn't appear to be reliably updating the DNS server either. Tested on 21.09 Jun 5th build, I did the following:

1. Set my LAN DHCP to 192.168.1.100-245
2. Pulled a DHCP lease. Checked DNS and the entry for my laptop at .100 was there
3. Reset the LAN DHCP to 192.168.1.150-245 and renewed DHCP. Laptop received .150.
4. Tried a DNS Lookup DNS resolves to the old .100 address still.
5. Tried clearing all DHCP leases and forcing a DHCP renew on my laptop. DNS record remains at .100

However, after doing this a few dozen times I see the DNS server restarting, but not crashing in 21.09. I don't have enough data yet to say definitively whether it's improved or not, though.

See attached screenshots.

Build Info:
21.09-DEVELOPMENT (arm)
built on Sat Jun 05 01:08:58 EDT 2021
FreeBSD 12.2-STABLE

Actions #51

Updated by Akom Benevolent over 3 years ago

As an ugly workaround, I'm using "Service Watchdog" package to restart unbound when it crashes. This happens every few minutes but seems to work alright.

I'm on CE 2.5.1 with unbound 1.13.1

Actions #52

Updated by Eduard Rozenberg over 3 years ago

For those commenting about restarting the service - that didn't help me. Had to restart the firewalls.

This 5 month old issue with no resolution and no suggested workaround should be getting a lot more exposure such as on the Netgate Blog because it's a dangerous landmine that can strike at any time and you have zero warning or clue when it hits you.

Some weeks after updating to 2.5.1 it suddently caused all our site's services to become unusably slow and luckily I thought to ping some servers via LAN IP's and noticed that was working fine which made me guess it was some kind of DNS issue, and maybe restarting the service would help (it didn't) and then went to the old fallback of rebooting the firewalls which fixed the issue (until the next time). You have to be lucky and thinking pretty clearly and calmly to figure this out while all of your stuff isn't working and alerts and emails are coming in.

The only "good" part is this happened at 9pm on a Sunday and not 3am on a Monday. Things can always be worse.

Actions #53

Updated by Remo Wylliams over 3 years ago

This problem is very much interfering with my network operations. I have watchdog restarting unbound but
it can take several seconds or more before I can resolve again. This is not a small problem.
Please provide some status update.

Many Thanks.

Actions #54

Updated by Jim Pingle over 3 years ago

There is no change since the last updates already covered above in previous comments:

  • Unbound still hasn't put out a fixed release
  • We added the latest relevant commits from the Unbound project to Unbound 1.13.1 on CE 2.6.0 and Plus 21.09 snapshots for testing
  • We moved back to Unbound 1.12.x for Plus 21.05 and for CE 2.5.2 which is in RC stage now, and will be coming out soon. See #11915

The workaround is the same as it has been, disable DHCP lease registration in the DNS resolver or try a development snapshot.

Actions #55

Updated by Remo Wylliams over 3 years ago

Jim Pingle wrote:

There is no change since the last updates already covered above in previous comments:

  • Unbound still hasn't put out a fixed release
  • We added the latest relevant commits from the Unbound project to Unbound 1.13.1 on CE 2.6.0 and Plus 21.09 snapshots for testing
  • We moved back to Unbound 1.12.x for Plus 21.05 and for CE 2.5.2 which is in RC stage now, and will be coming out soon. See #11915

The workaround is the same as it has been, disable DHCP lease registration in the DNS resolver or try a development snapshot.

Hi Jim,
I updated to CE 2.6.xx and the unbound failures seem to be fewer but still a problem.

Jul 1 09:02:56 unbound 80302 [80302:0] info: 512.000000 1024.000000 6
Jul 1 09:02:56 unbound 80302 [80302:0] notice: Restart of unbound 1.13.1.
Jul 1 09:02:56 unbound 80302 [80302:0] notice: init module 0: validator
Jul 1 09:02:56 unbound 80302 [80302:0] notice: init module 1: iterator
Jul 1 09:02:56 unbound 80302 [80302:0] info: start of service (unbound 1.13.1).
Jul 1 09:02:56 unbound 80302 [80302:0] info: service stopped (unbound 1.13.1).
Jul 1 09:02:56 unbound 80302 [80302:0] info: server stats for thread 0: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Jul 1 09:02:56 unbound 80302 [80302:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
Jul 1 09:02:56 unbound 80302 [80302:0] info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Jul 1 09:02:56 unbound 80302 [80302:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
Jul 1 09:02:56 unbound 80302 [80302:0] notice: Restart of unbound 1.13.1.
Jul 1 09:02:57 unbound 80302 [80302:0] notice: init module 0: validator
Jul 1 09:02:57 unbound 80302 [80302:0] notice: init module 1: iterator
Jul 1 09:02:57 unbound 80302 [80302:0] info: start of service (unbound 1.13.1).
Jul 1 09:02:57 unbound 80302 [80302:0] info: generate keytag query _ta-4f66. NULL IN
Jul 1 09:02:58 unbound 80302 [80302:1] info: generate keytag query _ta-4f66. NULL IN

I look forward to the unbound devs resolving this problem. Having to use an ip address just to get to the router to restart things when even watchdog fails
is disheartening.

Cheers

Actions #56

Updated by Jim Pingle over 3 years ago

Remo Wylliams wrote:

I updated to CE 2.6.xx and the unbound failures seem to be fewer but still a problem.

Are you still seeing a crash/core dump from unbound? Check the main system log. The resolver log only shows it restarting (which is normal when it updates data from DHCP leases), that log snippet doesn't show the crash.

Actions #57

Updated by Remo Wylliams over 3 years ago

No I'm not seeing any crashes on unbound. Didn't know about the DHCP leases.
Thanks again.

Actions #58

Updated by Remo Wylliams over 3 years ago

I take that back. no sooner do I say it's working then.. BANG..
No crash in the log but completely hung DNS and restarting resolver causes unbound to fail.
Jul 1 13:34:09 php-fpm 1720 /status_services.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1625164449] unbound[43151:0] error: bind: address already in use [1625164449] unbound[43151:0] fatal error: could not open ports'

so it was stuck and wouldn't respond to the signal.
I had to reboot my router to post this.
So less crashy more stubborn.

Actions #59

Updated by Morten Mathiasen over 3 years ago

Martin Müller wrote:

"Unbound 1.13.1 was just released a few days ago

I had the same problem with pfsense 2.5.1 running unbound version 1.13.1 that randomly crashes.
I upgraded to pfsense 2.5.2 which runs unbound 1.12.0 and that fixes the issue ;-)

Actions #60

Updated by Remo Wylliams over 3 years ago

The most recent 2.5.2 version of pfsense seems to work great. or at least in a way I haven't found a problem.
Thanks for the update.

Actions #61

Updated by Kris Phillips over 3 years ago

I have been running 21.05 for over a month and haven't seen any unbound crashes at all on 1.12. We will want to retest if we move up to a newer version of unbound but this version appears stable for now.

Actions #62

Updated by Jim Pingle over 3 years ago

Kris Phillips wrote in #note-61:

I have been running 21.05 for over a month and haven't seen any unbound crashes at all on 1.12. We will want to retest if we move up to a newer version of unbound but this version appears stable for now.

Plus 21.09 and CE 2.6.0 snapshots have a patched version of Unbound 1.13.x for testing. If anyone can reliably reproduce the crash, that's the next thing to test.

Actions #63

Updated by Renato Botelho over 3 years ago

I've imported more patches from upstream and bumped unbound to 1.13.1_3. This version will be available with next round of snapshots and must be tested to make sure it fixes all issues

Actions #64

Updated by Akom Benevolent over 3 years ago

I just had an unbound 1.12.0 crash on CE 2.5.2
It was up for about a month (with 1.13.0 on 2.5.1 it crashed every few hours).

Log attached.

UPDATE: I can't tell what I'm running after updating to 2.5.2.
Logs say 1.13.1, but:

[2.5.2-RELEASE][admin@router]/root: unbound -V
Version 1.12.0

Actions #65

Updated by Jim Pingle over 3 years ago

That is odd, the log also didn't show a crash, but a clean stop and start. That is also a very old log, maybe you uploaded the start of the log instead of the end? It appears to be in reverse order which is difficult to follow.

Actions #66

Updated by Akom Benevolent over 3 years ago

Jim Pingle wrote in #note-65:

That is odd, the log also didn't show a crash, but a clean stop and start. That is also a very old log, maybe you uploaded the start of the log instead of the end? It appears to be in reverse order which is difficult to follow.

You're quite right, I had it upside down in the UI and grabbed the start. The recent entries do show 1.12.0.

No obvious crash in the logs though. I reinstalled Service Watchdog and will hopefully get an email if it happens again.

Actions #67

Updated by Jim Pingle over 3 years ago

If it was a similar crash to the previous issues, you would see the crash in the main system log. You can filter that for messages containing unbound to narrow it down. You'd see messages like those in note 3, 18, and so on above.

Actions #68

Updated by Akom Benevolent over 3 years ago

Jim Pingle wrote in #note-67:

If it was a similar crash to the previous issues, you would see the crash in the main system log. You can filter that for messages containing unbound to narrow it down. You'd see messages like those in note 3, 18, and so on above.

Thanks. Turns I did have crash messages, but in syslog, not unbound log.


Aug 11 14:53:49 router kernel: pid 47761 (unbound), jid 0, uid 59, was killed: out of swap space
Aug 11 14:53:50 router kernel: pid 17663 (pcscd), jid 0, uid 0, was killed: out of swap space

That could be unrelated
(Though I have no swap (it's on flash) and memory shows 41% used. No idea what that's about)

Actions #69

Updated by Jim Pingle over 3 years ago

OK that is unrelated to this bug. It is #12095 which can be mitigated by the patch on #11933.

Actions #70

Updated by Renato Botelho over 3 years ago

  • Target version changed from CE-Next to 2.6.0
  • Plus Target Version set to 21.09

Unbound 1.13.2 is now imported on 2.6.0 and 21.09 and will be available on tomorrow's snapshots

Actions #71

Updated by Jim Pingle over 3 years ago

  • Subject changed from Unbound crashing with signal 11 to Unbound crashes with signal 11 when reloading

Updating subject for release notes.

Actions #72

Updated by Jim Pingle about 3 years ago

  • Plus Target Version changed from 21.09 to 22.01
Actions #73

Updated by Kris Phillips about 3 years ago

I've tested this a bit in 1.13.2 on 22.01 and have been unable to reproduce DHCP/DNS crashes with the latest unbound running 1.13.2. We'll want more expansive testing, but in a few hours of trying to make it crash with repeated DHCP lease requests and DNS registrations I don't see any issues.

Actions #74

Updated by Ben Ito almost 3 years ago

There are two problems here.

1. There is no watchdog restarting the unbound service
2. An invalid unbound configuration is being written by dhcpleases to /var/unbound/

Jan 17 12:00:04 unbound[98083]: [98083:0] notice: Restart of unbound 1.12.0.                                                                                                        
Jan 17 12:00:04 unbound[98083]: [98083:0] fatal error: Could not read config file: /unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-
checkconf                                                                                                        

Notice that dhcpleases HUP's unbound several times in a row. It's very likely that `dhcpleases` is doing non-atomic writes to /var/unbound/dhcpleases_entries.conf, and unbound happens to read the file mid-write.

PfSense: 2.5.2-RELEASE (amd64)
Unbound: 1.12.0

Actions #75

Updated by Jim Pingle almost 3 years ago

  • Status changed from Feedback to Resolved

No recent reports of the crash happening on 22.01/2.6.0, anything happening on older versions wouldn't be a valid metric at this point.

Actions #76

Updated by BBcan177 . almost 3 years ago

Jim Pingle, this is still an open issue.

Actions #77

Updated by Kevin Grelling almost 3 years ago

I hate to bring up a sore point especially in a closed ticket, but this is still happening for me on two up-to-date pfSense 2.6.0 / 22.01 installations. I upgraded from 2.5.2 / 21.05.2 and was very hopeful, but while 22.01 / 2.6.0 may have made the issue occur less frequently, it did not fix the underlying problem.

I have a NAGIOS instance which monitors this and I get a notification whenever nslookups of statically defined DHCP entries results in a NXDOMAIN. I have written a quick script which disables DHCP, deletes all the current DHCP reservations, restarts DHCP then restarts unbound on both installations, and the problem goes away for a few days, but it does come back.

For my pfSense instances, I have noticed that it seems to be with resources that have static DHCP entries that are currently offline. I have a laptop with a statically defined DHCP entry, and it seems the DNS failure occurs when my laptop is NOT connected to the network and its lease is shown as 'offline'. To be clear: NORMALLY, I can nslookup 'laptop.example.com' and get its statically defined IP address whether or not the laptop is online. In the failure mode, I get an NXDOMAIN for the same lookup whether or not my laptop is online. However, I have never seen the failure mode START when my laptop is online, only when it has been off for a period of time, and the lease is shown as 'offline'.

In any case, I do not want to be that person who only complains, so please let me know what I need to do to help further.

Actions #78

Updated by mururoa mururoa over 2 years ago

Why is this closed ??
All was ok for my pfsense until a power outage.
I have pfsense 2.6 up to date and it has been running for months without any problem.
After restart I get this again and again.
So I desactivated lease registration but, what can I do to enable it back ?

Actions #79

Updated by Jim Pingle almost 2 years ago

  • Status changed from Resolved to Feedback
  • Assignee changed from Renato Botelho to Christian McDonald
  • Target version changed from 2.6.0 to 2.7.0
  • Start date deleted (01/26/2021)
  • Plus Target Version changed from 22.01 to 23.01

Christian tracked this down to a problem in the Unbound Python module when the Maxmind library is enabled, as it is with pfBlockerNG. The fixes have been upstreamed to Maxmind and the FreeBSD ports repo (net/py-maxminddb) and pulled back into our repository.

Relevant commits (Adding our local patch, then removing it once it was available upstream):

There is still a potential for a memory leak from the Unbound Python module, but we have a separate fix for that coming (See #10624). There are also some related changes coming to improve the Maxmind Python library behavior further, though those changes may not make this release.

Actions #80

Updated by Jim Pingle almost 2 years ago

  • Related to Bug #10624: Memory leak in Unbound with Python module and DHCP lease registration active added
Actions #81

Updated by BBcan177 . almost 2 years ago

This is great news! Thanks Christian McDonald for diving into this issue. Even when it was almost a complete rewrite of the MaxMindDB Python Module and outside of the base pfSense code! It's very much appreciated!

Will be rolling back the safety belts that was added to pfBlockerNG to prevent unbound python mode from crashing which will allow for users to re-enable the dhcp leases option for the pfSense versions that have these recent changes.

Actions #82

Updated by Jim Pingle almost 2 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100

No reports of problems that I've seen since the fix is in place. If it recurs we can always reopen or start a fresh issue as needed.

Actions

Also available in: Atom PDF