Bug #10833

unbound exits on configuration error when link status flaps on LAN interface

Added by John Hood 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Configuration Backend
Target version: -
Start date: 08/13/2020
Due date:
% Done: 0%
Estimated time:
Affected Version: 2.4.5-p1
Affected Architecture:

Description

I have pfSense installed at home on a small, old, core2duo-based machine. It does pretty typical home-router duty; the most obvious-to-me unusual parts of the configuration are that the internal IPv4 network is 198.206.215.0/24 instead of an RFC1918 network address, and I have an IPv6 tunnel to Hurricane Electric.

This week, the 11-year-old unmanaged GbE switch attached to the LAN port got flaky and started failing in a way that made it blink all of its front-panel lights and stop passing traffic. Logs show link status flapping on the LAN interface. Power-cycling the switch got it working again, but DNS service stayed down until I restarted unbound from Status/Services. I found this in resolver.log:

Aug 13 20:28:22 router unbound: [27434:0] fatal error: Could not read config file: /unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-checkconf

I wrote a little monitoring script that does 'pgrep unbound' and 'ifconfig em1' every 10 seconds. That seems to show link flapping between normal:

        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active

and no link:

        media: Ethernet autoselect
        status: no carrier

It also showed two copies of dhcpleases running after the link starts flapping.
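For anyone who wants to watch for the same symptom, a monitoring loop along these lines should be enough. This is a minimal sketch, not the attached watch-em1.sh: the `sample` and `watch_link` function names, the `IFACE` and `INTERVAL` variables, and the `run` argument are made up for illustration, and `em1` is simply the LAN interface on this machine.

```shell
#!/bin/sh
# Sketch of a link/unbound watcher. Illustrative only; this is NOT the
# attached watch-em1.sh. All names below are assumptions.
IFACE="${IFACE:-em1}"        # LAN interface to watch
INTERVAL="${INTERVAL:-10}"   # seconds between samples

sample() {
    # One timestamped sample: unbound PIDs (or a marker if none),
    # followed by the media and status lines of the interface.
    date '+%b %d %H:%M:%S'
    pgrep unbound || echo "unbound: not running"
    ifconfig "$IFACE" | grep -E 'media:|status:'
}

watch_link() {
    # Sample forever; stop with Ctrl-C.
    while :; do
        sample
        sleep "$INTERVAL"
    done
}

# Only start looping when explicitly asked, so the file can also be sourced.
if [ "${1:-}" = "run" ]; then
    watch_link
fi
```

Running it as `sh watch.sh run > watch.log` (file names hypothetical) leaves a log that can be lined up against resolver.log and system.log by timestamp.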

Edited/excerpted logs and the monitoring script are attached. In the logs, the switch starts flapping at Aug 13 20:27:57, and I power-cycled the switch at about 20:28:45. I restarted unbound at 20:30:36.

I tried to reproduce the problem by manually plugging and unplugging the patch cable involved, without success. Alas, I destroyed the switch by plugging the wrong power supply in, so it's no longer helpful either. So I have no repro. I suspect that connecting a FreeBSD box to the LAN port and running a little script that did things with 'ifconfig down', 'ifconfig up', and 'ifconfig mediaopt <blah>', combined with some randomized short delays, would eventually knock unbound over.
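A rough sketch of what such a flapping script might look like is below. It is untested against pfSense and is not known to trigger the failure. The `rand_delay` and `flap_once` function names, the `IFACE` and `MAX_MS` variables, the `em0` default, and the 3-second delay ceiling are all assumptions. It only toggles the interface down and up, since I don't know which media options would matter. It also relies on `sleep` accepting fractional seconds, which FreeBSD's does.

```shell
#!/bin/sh
# Hypothetical link-flap generator for a test box cabled to the LAN port.
# Run as root on the TEST box, not on the router. All names are assumptions.
IFACE="${IFACE:-em0}"      # interface on the test box (assumed name)
MAX_MS="${MAX_MS:-3000}"   # each delay is random, below this many milliseconds

rand_delay() {
    # Print a random delay in seconds with millisecond resolution,
    # derived from two bytes of /dev/urandom.
    n=$(od -An -N2 -tu2 /dev/urandom | tr -dc '0-9')
    ms=$(( n % MAX_MS ))
    printf '%d.%03d\n' $(( ms / 1000 )) $(( ms % 1000 ))
}

flap_once() {
    # Drop the link, wait a random short time, raise it, wait again.
    ifconfig "$IFACE" down
    sleep "$(rand_delay)"
    ifconfig "$IFACE" up
    sleep "$(rand_delay)"
}

# Only start flapping when explicitly asked, so the file can also be sourced.
if [ "${1:-}" = "run" ]; then
    while :; do
        flap_once
    done
fi
```

Running this alongside a watcher on the router that records whether unbound is still alive would show whether flapping alone is enough to trigger the exit.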

I haven't investigated the code at all, but it smells like some kind of race condition in the link-configuration scripts to me.

dhcpd.log (6.81 KB), John Hood, 08/13/2020 11:44 PM
system.log (10.4 KB), John Hood, 08/13/2020 11:44 PM
resolver.log (16.1 KB), John Hood, 08/13/2020 11:44 PM
watch-em1.sh.15499.log (16.4 KB), John Hood, 08/13/2020 11:44 PM
watch-em1.sh (110 Bytes), John Hood, 08/13/2020 11:44 PM
