Bug #13707
Unbound not binding to LAN on startup when explicitly set
Added by Simon Byrnand about 2 years ago. Updated about 1 year ago.
Description
Hi,
This is related to the following forum thread:
https://forum.netgate.com/topic/176155/unbound-not-responding-on-all-chosen-interfaces-after-reboot
To summarise the thread: if I configure unbound to bind only to Localhost and the LAN interface, on restart it does not bind to the LAN, so it will not respond to queries from the LAN. If I restart the service manually it starts working, and binding to "All" is a workaround. On other hardware I have tried, the problem does not occur, so this seems to be triggered by a specific timing relationship between the startup scripts and the ethernet links going up during boot. In other words, a race condition.
I note from the system.log that the unbound service starts after igb0 (WAN) goes up, but before igb1 (LAN) goes up, and there is no attempt to restart unbound anywhere in the log after the LAN interface goes up:
Nov 29 09:18:48 pfSense-Home check_reload_status[409]: Linkup starting igb0
Nov 29 09:18:48 pfSense-Home kernel:
Nov 29 09:18:48 pfSense-Home kernel: igb0: link state changed to UP
Nov 29 09:18:48 pfSense-Home check_reload_status[409]: rc.newwanip starting igb0
Nov 29 09:18:48 pfSense-Home php[431]: rc.bootup: Resyncing OpenVPN instances.
Nov 29 09:18:48 pfSense-Home kernel: done.
Nov 29 09:18:48 pfSense-Home php[431]: rc.bootup: [squid] Installed but disabled. Not installing 'nat' rules.
Nov 29 09:18:48 pfSense-Home kernel: pflog0: promiscuous mode enabled
Nov 29 09:18:48 pfSense-Home php[431]: rc.bootup: [squid] Installed but disabled. Not installing 'pfearly' rules.
Nov 29 09:18:48 pfSense-Home kernel: .
Nov 29 09:18:48 pfSense-Home php[431]: rc.bootup: [squid] Installed but disabled. Not installing 'filter' rules.
Nov 29 09:18:48 pfSense-Home kernel: ..
Nov 29 09:18:48 pfSense-Home kernel: .done.
Nov 29 09:18:49 pfSense-Home php[431]: rc.bootup: Default gateway setting Interface WAN_DHCP Gateway as default.
Nov 29 09:18:49 pfSense-Home php[431]: rc.bootup: Gateway, none 'available' for inet6, use the first one configured. ''
Nov 29 09:18:49 pfSense-Home kernel: done.
Nov 29 09:18:49 pfSense-Home php-fpm[371]: /rc.newwanip: rc.newwanip: Info: starting on igb0.
Nov 29 09:18:49 pfSense-Home php-fpm[371]: /rc.newwanip: rc.newwanip: on (IP address: x.x.x.x) (interface: WAN[wan]) (real interface: igb0).
Nov 29 09:18:50 pfSense-Home php[431]: rc.bootup: sync unbound done.
Nov 29 09:18:50 pfSense-Home kernel: done.
Nov 29 09:18:50 pfSense-Home kernel: done.
Nov 29 09:18:53 pfSense-Home check_reload_status[409]: Linkup starting igb1
Nov 29 09:18:53 pfSense-Home kernel:
Nov 29 09:18:53 pfSense-Home kernel: igb1: link state changed to UP
In the unbound config file the LAN interface IP is missing but it is there after the service is manually restarted. Presumably the script which generated the config file saw the LAN interface was down at the time it first launched unbound and did not include it.
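For anyone wanting to check this without reading the file by hand, here is a minimal Python sketch that lists the addresses on `interface:` lines in a generated config and reports whether a given address is among them. It assumes the config uses Unbound's standard `interface: <address>` directive; the sample config text and the address `192.168.1.1` are hypothetical, and the file location on a given install (commonly `/var/unbound/unbound.conf`) should be treated as an assumption.

```python
import re


def bound_addresses(conf_text):
    """Return the addresses listed on `interface:` lines, ignoring comments."""
    addresses = []
    for line in conf_text.splitlines():
        # Drop anything after a '#' so commented-out directives are skipped.
        line = line.split("#", 1)[0].strip()
        # Anchored at line start, so 'outgoing-interface:' and
        # 'interface-automatic:' do not match.
        match = re.match(r"interface:\s*(\S+)", line)
        if match:
            addresses.append(match.group(1))
    return addresses


def is_bound(conf_text, address):
    """True if `address` appears on an `interface:` line of the config."""
    return address in bound_addresses(conf_text)


# Hypothetical config text resembling the state after a bad boot:
# localhost is listed, the LAN address is not.
sample = """\
server:
# Interface IP addresses to bind to
interface: 127.0.0.1
interface: ::1
"""

print(bound_addresses(sample))
print(is_bound(sample, "192.168.1.1"))  # hypothetical LAN address
```

Running the same check before and after a manual restart of the service would show whether the LAN address only appears once the config is regenerated.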
I have been down the rabbit hole of studying redmine tickets 12613, 13254 and several other related tickets, however they relate to different versions which have different code for rc.linkup than 2.6.0, and I also see there is a significant rewrite of this code underway for 2.7.0.
When I look at the code for rc.linkup shipped in 2.6.0 I can clearly see there is code intended to restart unbound when an interface goes up, however this does not seem to trigger when the interface goes up during initial bootup, as a result after a full reboot or cold boot it does not bind to the LAN interface. It does work after a "reroot reboot" presumably because the interface remains up the whole time.
Files
PFSense-2.7.0-error.png (3.58 MB) | Simon Byrnand, 12/03/2022 11:52 AM
clipboard-202212031920-ksvxu.png (16.8 KB) | interfaces_unbound_pfStartup | Jordan G, 12/03/2022 07:20 PM
boot-sequence.png (911 KB) | screenshot of boot sequence | Simon Byrnand, 12/05/2022 02:12 PM
Updated by Jim Pingle about 2 years ago
- Status changed from New to Feedback
The fix for #13254 may have addressed this already. That fix won't apply to older versions, however, so you will need to try a current development snapshot to test it.
Updated by Simon Byrnand about 2 years ago
Hi Jim,
Thanks for the reply.
If I take a backup of my current config, is it possible to do an in-place upgrade to a development snapshot which includes the fix in #13254 ? If so I could try that in a few days, verify whether I still see the same issue or not, then reinstall back to 2.6.0 and restore my backup, as I don't want to run a development version on this box long term.
Updated by Danilo Zrenjanin about 2 years ago
I tested against the:
23.01-DEVELOPMENT (amd64) built on Thu Dec 01 06:04:55 UTC 2022 FreeBSD 14.0-CURRENT
Even though #13254 has been fixed in this release, I partially replicated this issue. My hardware always finishes configuring interfaces before it starts the DNS resolver. So, to replicate the issue you're facing, I tried keeping the LAN interface disconnected until the system started the DNS resolver. If I plug the cable back in immediately after the DNS resolver gets started during boot, the LAN interface won't be listed in unbound.conf under #Interface IP addresses to bind to. But if I let the boot process finish completely and then plug the cable into the LAN interface, unbound will restart and update the config file accordingly.
It would be helpful if you could test the latest dev release on your hardware and share the results here.
Updated by Simon Byrnand about 2 years ago
Hi Danilo,
Yes, I'll try the latest development snapshot on the affected box sometime in the next few days and report back.
It's just a matter of finding a good time to take the internet in the house down for an extended period of time without eliciting too many complaints. :)
Updated by Simon Byrnand about 2 years ago
- File PFSense-2.7.0-error.png added
Well that was an ordeal updating to the development snapshot. :(
My first attempt at an in-place upgrade from 2.6.0 failed, leaving me with a mostly unusable system with no network interfaces and no normal boot menu. See attached screenshot. (Excuse the messy room in the reflection of the TV!)
Thinking it might be related to the Unifi controller software I'm also running on there (which causes additional FreeBSD packages to be installed) I formatted, reinstalled 2.6.0 without the Unifi software, restored a backup of my configuration, updated and got exactly the same problem shown in the screenshot.
The third time (lucky ?) I formatted and did the upgrade without restoring my configuration first, which seemed to go through. I then restored my 2.6.0 configuration on 2.7.0 and that seems to have been successful as well. So something is really broken about the upgrade process when my full configuration is already in place.
In any case I'm running 2.7.0.a.20221202.0600 and unfortunately the problem is still there - with the resolver configured to use only LAN and Localhost it does not bind to LAN on boot. I see the same things in the system.log as before: unbound completes starting up just before the LAN interface comes up, and nothing restarts it to regenerate the configuration and bind it to the newly up LAN interface.
If I physically unplug the LAN cable and plug it back in after boot unbound is restarted and it works fine.
In https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/31c37082cad1ca068fc22d93fe3dc3c6a8005144 it says "Do not restart unbound on linkup as other mechanisms already do that which only leads to it being restarted multiple times."
What are these other mechanisms ? Is there something that inhibits these other mechanisms while the system is still booting ?
Updated by Jordan G about 2 years ago
Simon Byrnand wrote in #note-5:
Thinking it might be related to the Unifi controller software I'm also running on there (which causes additional FreeBSD packages to be installed)
gozoinks?
my interfaces all start before the resolver on 23.01.a.20221202.0600 - if the system is rebooted and the only interface(s) selected for unbound are physically disconnected, unbound remains halted until one is enabled or connected, which seems expected.
Updated by Simon Byrnand about 2 years ago
Jordan Greene wrote in #note-6:
Simon Byrnand wrote in #note-5:
Thinking it might be related to the Unifi controller software I'm also running on there (which causes additional FreeBSD packages to be installed)
gozoinks?
Forget about that - I retested with it not installed, no difference.
my interfaces all start before the resolver on 23.01.a.20221202.0600 - if the system is rebooted and the only interface(s) selected for unbound are physically disconnected, unbound remains halted until one is enabled or connected, which seems expected.
Judging by your screenshot not being a photo, I'm assuming you're running PFSense in a virtual machine of some kind ? (Unless there is some built in way to screenshot the console I'm not aware of)
If so you're unlikely to be able to reproduce this problem. Also did you check the system.log for the order of unbound starting and interface going up rather than just looking at the boot screen ?
Updated by Jim Pingle about 2 years ago
- Status changed from Feedback to New
Simon Byrnand wrote in #note-5:
In https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/31c37082cad1ca068fc22d93fe3dc3c6a8005144 it says "Do not restart unbound on linkup as other mechanisms already do that which only leads to it being restarted multiple times."
What are these other mechanisms ? Is there something that inhibits these other mechanisms while the system is still booting ?
It gets started/restarted by rc.bootup, rc.newwanip, and/or rc.newwanipv6. There are a small number of cases where rc.newwanip would not run at boot, but they aren't likely to be your LAN (PPP type interfaces). It's possible the latter are not running the case you have when it's a static LAN, but it looks like it should if the interface was never up at boot at that point.
I tried several systems here in my lab, virtual and physical, and I cannot reproduce this anywhere. They all boot up as expected with unbound configured as described.
You probably have something in your local gear causing the delay which is contributing to the problem. For example, your switch may need to be set for portfast/client/access port, or some other means of informing STP that it's an endpoint/device and not a switch. The exact method varies by vendor/model.
That's not to say there isn't a problem with this particular edge case, just that it's rare and unlikely to be hit in the wild with a properly configured environment.
Updated by Simon Byrnand about 2 years ago
Jim Pingle wrote in #note-8:
It gets started/restarted by rc.bootup, rc.newwanip, and/or rc.newwanipv6. There are a small number of cases where rc.newwanip would not run at boot, but they aren't likely to be your LAN (PPP type interfaces). It's possible the latter are not running the case you have when it's a static LAN, but it looks like it should if the interface was never up at boot at that point.
There seems to be something missing in your description here - rc.bootup only runs during startup and is starting unbound before the LAN interface comes up. So unbound is initially configured without the LAN IP address.
rc.newwanip runs when the WAN interface comes up, but this is happening before the LAN interface comes up, so if it restarted unbound it would still be configured without the LAN IP at this point. I don't have ipv6 enabled so rc.newwanipv6 can't help either.
So which script is restarting unbound if I physically unplug and reconnect the LAN cable after boot is finished ? And why does whatever that is not run when an interface comes up before the boot process has finished ?
I tried several systems here in my lab, virtual and physical, and I cannot reproduce this anywhere. They all boot up as expected with unbound configured as described.
Not uncommon with boot time race conditions... I've dealt with enough of them myself and they can be hard to reproduce. Which is why I'm here offering to help debug the problem on hardware that readily reproduces it.
You probably have something in your local gear causing the delay which is contributing to the problem. For example, your switch may need to be set for portfast/client/access port, or some other means of informing STP that it's an endpoint/device and not a switch. The exact method varies by vendor/model.
No, there's nothing unusual about my network. Virgin Media cable modem on the WAN side (DHCP enabled) and a 24 port D-Link switch on the LAN side. The switch doesn't have any settings like portfast to tinker with.
That's not to say there isn't a problem with this particular edge case, just that it's rare and unlikely to be hit in the wild with a properly configured environment.
I'm not sure what you think is not properly configured about my environment ?
Knowing this bug is lurking means that I won't be trusting setting the binding to anything other than "All" even on other devices where I don't seem to be experiencing an issue.
Is there any further information I can provide or testing I can do ?
Updated by Jim Pingle about 2 years ago
Simon Byrnand wrote in #note-9:
Jim Pingle wrote in #note-8:
There seems to be something missing in your description here - rc.bootup only runs during startup and is starting unbound before the LAN interface comes up. So unbound is initially configured without the LAN IP address.
No it's not -- the LAN is configured well before then.
A typical boot looks something like this (lots of lines trimmed)
[...]
Configuring VLAN interfaces...done.
Configuring WG_WAYNE interface...done.
Configuring LAN1 interface...done.
Configuring LAN2 interface...done.
Configuring OPTX interface...done.
Configuring LAN4 interface...done.
Configuring WAN1 interface...done.
Configuring WAN2 interface...done.
Configuring IPsec VTI interfaces...done.
Configuring CARP settings...done.
Syncing OpenVPN settings...done.
Configuring firewall......done.
Starting PFLOG...done.
Setting up gateway monitors...done.
Setting up static routes...done.
Setting up DNSs...
Starting DNS Resolver...done.
[...]
As you can see, LAN is configured well before the resolver is started initially. If your LAN is down at that time, perhaps it didn't take the config or would have been skipped, but what you are seeing is not typical.
rc.newwanip runs when the WAN interface comes up, but this is happening before the LAN interface comes up, so if it restarted unbound it would still be configured without the LAN IP at this point. I don't have ipv6 enabled so rc.newwanipv6 can't help either.
rc.newwanip gets triggered on any interface event on any interface, despite its name. So it would happen on LAN as well.
So which script is restarting unbound if I physically unplug and reconnect the LAN cable after boot is finished ? And why does whatever that is not run when an interface comes up before the boot process has finished ?
That would be from rc.newwanip most likely.
I'm not sure what you think is not properly configured about my environment ?
That's not a subject for Redmine -- it's something you should take to the forum to discuss more in-depth. It has to be specific to your environment, because it does not happen for anyone else.
Knowing this bug is lurking means that I won't be trusting setting the binding to anything other than "All" even on other devices where I don't seem to be experiencing an issue.
Is there any further information I can provide or testing I can do ?
Start a forum thread to discuss it more and isolate potential causes. Until we have a better idea of why it happens just to you and apparently nobody else, there isn't much we can do. Something in your setup is causing your LAN to not be configured when it should, and this is just a side effect of that. Fixing your real root cause is likely the solution here, but this isn't the place to diagnose things of that nature.
Updated by Simon Byrnand about 2 years ago
- File boot-sequence.png added
Jim Pingle wrote in #note-10:
Simon Byrnand wrote in #note-9:
Jim Pingle wrote in #note-8:
There seems to be something missing in your description here - rc.bootup only runs during startup and is starting unbound before the LAN interface comes up. So unbound is initially configured without the LAN IP address.
No it's not -- the LAN is configured well before then.
A typical boot looks something like this (lots of lines trimmed)
[...]
As you can see, LAN is configured well before the resolver is started initially. If your LAN is down at that time, perhaps it didn't take the config or would have been skipped, but what you are seeing is not typical.
I've done some more testing on this tonight and have a better idea of what is happening and unfortunately your boot screen log is misleading, as I found out in tonight's testing.
In short, it seems to be configuring the LAN interface asynchronously, and when it says "done" on the boot screen it is actually NOT done and the interface is NOT up at the time it says done, so it is proceeding immediately with the rest of the boot process while the interface is still not back up.
In fact when it starts to configure the interface the link light on the adaptor (which was already up prior to this) goes down for about 2 seconds before coming back up again. So something in the configuration process is taking the ethernet link down completely at the link level and causing it to re-sync with the switch unnecessarily.
This is happening on both the WAN interface and the LAN interface, one after the other. The difference between the two is it actually waits for the WAN interface to come back up before it proceeds with the boot process, (possibly because it's configured for DHCP) whereas with the LAN interface it says "done" the moment the link goes down and then races ahead with the boot process completing unbound before the link comes back up, hence the problem.
Please see the attached screenshot showing the same boot ordering as what you posted - with LAN supposedly configured many steps before Unbound is started. But this is extremely misleading as I can visually see that the link light goes down when it starts configuring LAN and doesn't come back up until after unbound is configured, and this is confirmed by a system.log snippet showing unbound configuration complete before the kernel reports igb1 has come up: (screenshot and log are from the same startup session)
Dec 5 19:13:46 pfSense-Home kernel: lo0: link state changed to UP
Dec 5 19:13:47 pfSense-Home sshd[11153]: Server listening on :: port 22.
Dec 5 19:13:47 pfSense-Home sshd[11153]: Server listening on 0.0.0.0 port 22.
Dec 5 19:13:47 pfSense-Home sshguard[11459]: Now monitoring attacks.
Dec 5 19:13:50 pfSense-Home kernel:
Dec 5 19:13:50 pfSense-Home kernel: igb0: link state changed to UP
Dec 5 19:13:50 pfSense-Home check_reload_status[395]: Linkup starting igb0
Dec 5 19:13:51 pfSense-Home check_reload_status[395]: rc.newwanip starting igb0
Dec 5 19:13:51 pfSense-Home php-cgi[418]: rc.bootup: Resyncing OpenVPN instances.
Dec 5 19:13:51 pfSense-Home kernel: done.
Dec 5 19:13:51 pfSense-Home kernel: pflog0: promiscuous mode enabled
Dec 5 19:13:51 pfSense-Home php-cgi[418]: rc.bootup: [squid] Installed but disabled. Not installing 'nat' rules.
Dec 5 19:13:51 pfSense-Home kernel: .
Dec 5 19:13:51 pfSense-Home php-cgi[418]: rc.bootup: [squid] Installed but disabled. Not installing 'pfearly' rules.
Dec 5 19:13:51 pfSense-Home kernel: .
Dec 5 19:13:52 pfSense-Home php-fpm[356]: /rc.newwanip: rc.newwanip: Info: starting on igb0.
Dec 5 19:13:52 pfSense-Home php-fpm[356]: /rc.newwanip: rc.newwanip: on (IP address: x.x.x.x) (interface: WAN[wan]) (real interface: igb0).
Dec 5 19:13:52 pfSense-Home kernel: ..
Dec 5 19:13:52 pfSense-Home php-cgi[418]: rc.bootup: [squid] Installed but disabled. Not installing 'filter' rules.
Dec 5 19:13:53 pfSense-Home rc.gateway_alarm[38422]: >>> Gateway alarm: WAN_DHCP (Addr:x.x.x.x Alarm:1 RTT:0ms RTTsd:0ms Loss:100%)
Dec 5 19:13:53 pfSense-Home check_reload_status[395]: updating dyndns WAN_DHCP
Dec 5 19:13:53 pfSense-Home check_reload_status[395]: Restarting IPsec tunnels
Dec 5 19:13:53 pfSense-Home check_reload_status[395]: Restarting OpenVPN tunnels/interfaces
Dec 5 19:13:53 pfSense-Home check_reload_status[395]: Reloading filter
Dec 5 19:13:53 pfSense-Home kernel: .done.
Dec 5 19:13:53 pfSense-Home kernel: done.
Dec 5 19:13:54 pfSense-Home php-cgi[418]: rc.bootup: Gateway, NONE AVAILABLE
Dec 5 19:13:54 pfSense-Home php-cgi[418]: rc.bootup: Gateway, NONE AVAILABLE
Dec 5 19:13:54 pfSense-Home kernel: done.
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: Dynamic DNS: updatedns() starting
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: Dynamic DNS noip-free (-.ddns.net): _checkIP() starting.
Dec 5 19:13:54 pfSense-Home php-fpm[357]: /rc.filter_configure_sync: [squid] Installed but disabled. Not installing 'nat' rules.
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: Dynamic DNS noip-free (-.ddns.net): x.x.x.x extracted from local system.
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: Dynamic DNS (-.ddns.net): running get_failover_interface for wan. found igb0
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: Dynamic DNS noip-free (-.ddns.net): _detectChange() starting.
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: Dynamic DNS noip-free (-.ddns.net): _checkIP() starting.
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: Dynamic DNS noip-free (-.ddns.net): x.x.x.x extracted from local system.
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: Dynamic Dns (-.ddns.net): Current WAN IP: x.x.x.x Cached IP: x.x.x.x
Dec 5 19:13:54 pfSense-Home php-fpm[356]: /rc.dyndns.update: phpDynDNS (-.ddns.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
Dec 5 19:13:54 pfSense-Home php-fpm[357]: /rc.filter_configure_sync: [squid] Installed but disabled. Not installing 'pfearly' rules.
Dec 5 19:13:55 pfSense-Home php-cgi[418]: rc.bootup: sync unbound done.
Dec 5 19:13:55 pfSense-Home kernel: done.
Dec 5 19:13:55 pfSense-Home kernel: done.
Dec 5 19:13:55 pfSense-Home php-fpm[357]: /rc.filter_configure_sync: [squid] Installed but disabled. Not installing 'filter' rules.
Dec 5 19:13:55 pfSense-Home check_reload_status[395]: Linkup starting igb1
Dec 5 19:13:55 pfSense-Home kernel:
Dec 5 19:13:55 pfSense-Home kernel: igb1: link state changed to UP
Dec 5 19:14:01 pfSense-Home php-cgi[418]: rc.bootup: NTPD is starting up.
Dec 5 19:14:01 pfSense-Home kernel: done.
Dec 5 19:14:03 pfSense-Home kernel: done.
Dec 5 19:14:03 pfSense-Home check_reload_status[395]: Updating all dyndns
Dec 5 19:14:03 pfSense-Home kernel: done.
Dec 5 19:14:03 pfSense-Home php-cgi[418]: rc.bootup: [squid] Installed but disabled. Not installing 'nat' rules.
Dec 5 19:14:03 pfSense-Home kernel: .
Dec 5 19:14:03 pfSense-Home php-cgi[418]: rc.bootup: [squid] Installed but disabled. Not installing 'pfearly' rules.
Dec 5 19:14:03 pfSense-Home kernel: .
Dec 5 19:14:03 pfSense-Home php-cgi[418]: rc.bootup: [squid] Installed but disabled. Not installing 'filter' rules.
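Since the timestamps only have one-second resolution, the ordering has to be read from line order rather than from the times. A rough Python sketch of that check follows, so others can run it against their own system.log. The two message strings it searches for are copied from the log above; the function name and the idea of passing in a single boot's worth of lines are assumptions made for illustration.

```python
def resolver_started_before_link(log_lines, ifname):
    """Compare the first 'sync unbound done' line with the first link-up line.

    Returns True if unbound finished syncing before `ifname` came up,
    False if the link came up first, and None if either event is missing.
    """
    unbound_idx = None
    linkup_idx = None
    linkup_marker = f"{ifname}: link state changed to UP"
    for idx, line in enumerate(log_lines):
        if unbound_idx is None and "sync unbound done" in line:
            unbound_idx = idx
        if linkup_idx is None and linkup_marker in line:
            linkup_idx = idx
    if unbound_idx is None or linkup_idx is None:
        return None  # cannot conclude anything from this excerpt
    return unbound_idx < linkup_idx


# Three lines taken from the log excerpt above.
sample_log = [
    "Dec 5 19:13:50 pfSense-Home kernel: igb0: link state changed to UP",
    "Dec 5 19:13:55 pfSense-Home php-cgi[418]: rc.bootup: sync unbound done.",
    "Dec 5 19:13:55 pfSense-Home kernel: igb1: link state changed to UP",
]

print(resolver_started_before_link(sample_log, "igb1"))
print(resolver_started_before_link(sample_log, "igb0"))
```

Only the first occurrence of each message is considered, so the input should be trimmed to a single boot session before running it.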
This poses two questions:
1) Why does configuring an IP address on the network interface or enabling the network interface actually take the already established link layer to the switch DOWN for 2 seconds ? Is it doing a full reset of the adaptor hardware or something ? (If so where is this done from?) Is it normal for this link down then back up to occur during the interface configuration during boot ?
2) Why does it claim the configuration of the interface is "done" and immediately race ahead with the rest of the boot process before the interface has a chance to come back up ? Surely it should wait for the interface to finish configuring ? The fact that you're showing me the boot screen ordering as "proof" that the LAN is configured before unbound loads when that's not actually happening on my system suggests that it should in fact be waiting.
rc.newwanip runs when the WAN interface comes up, but this is happening before the LAN interface comes up, so if it restarted unbound it would still be configured without the LAN IP at this point. I don't have ipv6 enabled so rc.newwanipv6 can't help either.
rc.newwanip gets triggered on any interface event on any interface, despite its name. So it would happen on LAN as well.
So which script is restarting unbound if I physically unplug and reconnect the LAN cable after boot is finished ? And why does whatever that is not run when an interface comes up before the boot process has finished ?
That would be from rc.newwanip most likely.
So there is more than one script that may be trying to restart unbound when a link goes up ? If I comment out:
if (platform_booting()) { return; }
from /etc/rc.linkup, I can confirm that unbound does get re-started after the LAN interface comes up 2 seconds later and gets configured properly, however I'm sure that code is there for a reason and there could be side effects from commenting it out...
I'm not sure what you think is not properly configured about my environment ?
That's not a subject for Redmine -- it's something you should take to the forum to discuss more in-depth. It has to be specific to your environment, because it does not happen for anyone else.
Nobody else has ever struck this problem ? I find that hard to believe. There's an obvious race condition in the start-up and network configuration code; if 99% of people don't hit it due to lucky timing then lucky them. That doesn't mean there isn't an underlying bug here. I'm not familiar enough with the code to delve more than superficially into it at the moment though.
Is there any further information I can provide or testing I can do ?
Start a forum thread to discuss it more and isolate potential causes. Until we have a better idea of why it happens just to you and apparently nobody else, there isn't much we can do. Something in your setup is causing your LAN to not be configured when it should, and this is just a side effect of that. Fixing your real root cause is likely the solution here, but this isn't the place to diagnose things of that nature.
There is already a forum thread - I linked it as the first post in this ticket...
I would say that the real root cause here is the scripting or kernel forcing the ethernet interface to drop the link completely when it's being configured, and then not waiting long enough for it to recover before proceeding with the rest of the boot process.
Unbound is only beating the interface coming back up by a fraction of a second - on hardware where the ethernet interface came back up half a second quicker the problem wouldn't exhibit. Maybe the precise delay depends somewhat on the model of switch as well. Or perhaps on some network adaptors the configuration process doesn't cause the ethernet link layer to go down at all.
Continuing to boot before a network interface finishes configuring and (re)establishing an ethernet link seems fraught with potential side effects, with unbound being only one potential pitfall. I would have thought there would be code in place to poll the link state of the interface, say every half second after configuring it, waiting up to a maximum of 5-10 seconds before assuming the link is not physically connected and moving on, or continuing as soon as it comes up. The effect on total boot time would be very small and it would certainly be a lot more robust in edge cases like this.
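To make the polling idea concrete, here is a small Python sketch of a bounded wait. It is an illustration of the approach only, not how the pfSense boot scripts work: `wait_for_link` and its `link_is_up` callable are hypothetical names, and how the link state would actually be queried on the firewall (for example by inspecting the interface status) is left to the caller.

```python
import time


def wait_for_link(link_is_up, timeout=10.0, interval=0.5):
    """Poll link_is_up() until it returns True or `timeout` seconds elapse.

    Returns True as soon as the link is reported up, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if link_is_up():
            return True  # continue booting immediately
        if time.monotonic() >= deadline:
            return False  # assume the cable is not connected and move on
        time.sleep(interval)


# Simulated link that is reported up on the third poll (hypothetical).
polls = iter([False, False, True])
came_up = wait_for_link(lambda: next(polls), timeout=5.0, interval=0.1)
print("link up" if came_up else "gave up waiting")
```

With a half-second interval and a link that recovers in about 2 seconds, the added boot delay would be roughly the recovery time itself; the full timeout is only paid when the port really is unplugged.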
Updated by Jim Pingle about 2 years ago
Please continue the discussion on the forum, this isn't the place to diagnose your situation in that kind of detail -- but what you are seeing is not typical behavior and something in your specific setup is triggering that, it does not happen in any case I have here locally (igb or otherwise). The link doesn't drop, and the log says it's up ~15 seconds before unbound starts. Either it's something in your LAN/NIC configuration, the switch, or the hardware itself.
Updated by Simon Byrnand about 2 years ago
Thanks Jim, but if I'm just going to be shunted back to the forum with "it must be something wrong with your hardware, or switch, or configuration, (or anything except the software) and nobody else ever had this problem" when I've already volunteered many hours of my time looking into the problem so far and believe I've provided pretty compelling evidence that there is a race condition present in the start-up scripting which does not properly wait for interfaces to be configured, then I'm afraid I've done all I can.
Short of spending dozens of hours trying to familiarise myself with the PHP codebase in PFSense, (which I really don't have the spare time to do, even though I know PHP) I don't see what else can be done on the forum that couldn't have been done here with the evidence provided already.
I'm not asking for help with a "configuration problem", I'm trying to offer assistance to diagnose a bug in your software in the name of trying to help make PFSense - which I like very much - better and more robust, and I recognised that this type of problem might not be easy for others to reproduce if they didn't have the right hardware, hence taking the time to do the testing I've done so far.
However I don't seem to have been successful in convincing anyone that there is a bug to be looked at so with precious little free time available to me to push a barrow up hill I'm afraid I'm out - I'm just going to format and re-install 2.6.0 and simply set unbound to bind to "All" interfaces which is a perfectly acceptable workaround for my use case.
Thanks for your time, you can close the ticket now.
Updated by Jim Pingle about 2 years ago
This isn't a discussion platform, the forum is. Simple as that. To find the root cause, this needs more discussion, and that discussion can't happen here.
Updated by robotox sysadmin almost 2 years ago
Hi,
I have the same problem but with OpenVPN interfaces, as described here https://forum.netgate.com/topic/176155/unbound-not-responding-on-all-chosen-interfaces-after-reboot.
That happens to me in 2.6.0-RELEASE with custom hardware.
It does not happen in 22.05-RELEASE with SG-1100.
I believe it is the same as in https://forum.netgate.com/topic/161790/dns-resolver-unbound-fails-after-reboot-unless-manually-restarted.
Thank you for your time.
Updated by robotox sysadmin over 1 year ago
Hi,
I now have an SG-2100 with 23.05.1 for the same setup and still the same problem.
Unbound fails to start as I have OpenVPNs as Outgoing Network Interfaces.
Thank you for your time.
Updated by robotox sysadmin over 1 year ago
Now testing the SG-2100 with 23.05.1 on a similar setup, but with multiple Wireguards instead of multiple OpenVPNs.
Unbound starts correctly.
I am guessing that Wireguard is faster than OpenVPN starting at boot.
Thanks again.
Updated by Anthony Gentile about 1 year ago
I am seeing this same issue on a typical setup with a Netgate 4100 (pfSense 23.09) and a Comcast Business modem with a block of static IPs. DNS requests will fail indefinitely after a reboot until Unbound is restarted or another service triggers it to restart.
Updated by Mike Moore about 1 year ago
I would like to help troubleshoot this issue, but not here. Use the forums.
As already stated, this isn't the place for this, and logging in just to say you have the same problem won't get any devs to review this ticket.