Bug #6186
open
race conditions in service startup
Added by Chris Buechler over 8 years ago.
Updated over 2 years ago.
Description
There have always been a variety of possibilities for race conditions in service startup because of the nature of how multiple different things can call the functions that do the startup. That's a larger architectural issue which we're discussing options for properly addressing in the future.
The more immediate issue is after removing the "exit if booting" check from rc.newwanip(v6) in 2.3, which fixed a variety of edge case bugs with interfaces that are slow to come online during boot, some systems end up running certain things twice at almost exactly the same time. For instance, #6160, and probably #6132.
Adding locks in vpn_ipsec_configure was fine for strongswan in #6160. It might be fine in other areas too, though adding locking like that risks breaking things that work now if some of those functions end up recursing.
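To illustrate the recursion risk mentioned above, here is a minimal Python sketch (not pfSense code; `configure_service` and the lock names are hypothetical). A plain lock refuses a second acquire from the caller that already holds it, which is what would hang a recursing configure function, while a reentrant lock tolerates it.

```python
import threading

# Hypothetical stand-in for a service configure function; not pfSense code.
_configure_lock = threading.RLock()
calls = []


def configure_service(depth=0):
    """Configure a service, recursing twice to mimic a function that calls itself."""
    # RLock lets the same thread re-enter, so the recursive call does not deadlock.
    with _configure_lock:
        calls.append(depth)
        if depth < 2:
            configure_service(depth + 1)


def plain_lock_would_block():
    """Return True if a non-reentrant Lock refuses a second acquire by its holder."""
    lock = threading.Lock()
    lock.acquire()
    try:
        # Non-blocking acquire fails because the lock is already held.
        return not lock.acquire(blocking=False)
    finally:
        lock.release()
```

A reentrant lock only helps when the recursion stays in the same thread or process; it does nothing for two separate scripts racing each other, which still need an ordinary exclusive lock.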
- Description updated (diff)
The same applies to services that start up in the wrong order. If a VPN client interface is slow to start up, and an unbound DNS forwarder has the VPN client interface as its outgoing interface, unbound will sometimes start before the VPN client interface has come up. The unbound server then permanently returns "SERVFAIL", because it reports a configuration error since said interface didn't exist at start.
Same applies to routing.
I think the whole architecture needs recoding so that it always first brings up the interfaces, including starting services related to interfaces such as VPN clients/servers (a blocking operation), then starts any services not related to interfaces (also a blocking operation), and then applies any firewall rules, custom routes, default gateway and NAT.
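A rough Python sketch of the ordering proposed above, assuming each step is a blocking call that returns only when it is done. All function and stage names here are hypothetical placeholders, not existing pfSense functions.

```python
def run_staged_startup(stages):
    """Run every step of a stage to completion before moving to the next stage."""
    completed = []
    for stage_name, steps in stages:
        for step in steps:
            step()  # blocking: returns only once the step has finished
            completed.append((stage_name, step.__name__))
    return completed


events = []


# Hypothetical placeholder steps that just record the order they ran in.
def bring_up_interfaces():
    events.append("interfaces")


def start_vpn_clients():
    events.append("vpn")


def start_other_services():
    events.append("services")


def apply_firewall_and_routes():
    events.append("firewall")


STAGES = [
    ("interfaces", [bring_up_interfaces, start_vpn_clients]),
    ("services", [start_other_services]),
    ("rules", [apply_firewall_and_routes]),
]
```

Calling `run_staged_startup(STAGES)` runs the interface stage, then the service stage, then the rules stage, strictly in that order.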
VPN and DNS is not that clear a solution. You have a chicken-and-egg scenario there. In plenty of cases you need working DNS before the VPN can be brought up, especially if you are using a hostname for the VPN peer. In that case you'd have to start DNS, then the VPN, then restart DNS if it doesn't (re)attach to the VPN interface.
Yes, that would be a good idea: a forced restart of unbound after VPN success. However, it may be worth delaying the restart of unbound for a few seconds after VPN success, to ensure the interface has "settled" before attempting to attach the DNS server to it.
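The sequence discussed in the last two comments (start DNS, bring up the VPN, wait briefly, restart DNS) could be sketched like this in Python. The callables and the 3-second default are assumptions for illustration only; nothing here reflects how pfSense actually does it.

```python
import time


def bring_up_vpn_then_rebind_dns(start_dns, start_vpn, restart_dns,
                                 settle_seconds=3.0, sleep=time.sleep):
    """Start DNS, then the VPN, then restart DNS after a short settle delay.

    All three callables are hypothetical hooks supplied by the caller.
    start_vpn must return True on success. Returns True if DNS was restarted.
    """
    start_dns()  # DNS first, so a hostname VPN peer can be resolved
    if not start_vpn():
        return False  # VPN failed: leave DNS running as it is
    sleep(settle_seconds)  # give the new interface time to settle
    restart_dns()  # let the resolver attach to the VPN interface
    return True
```

Passing `sleep` as a parameter keeps the delay easy to shorten or stub out when testing.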
- Target version changed from 2.3.1 to 2.3.2
What I committed takes things back to 2.2.x and earlier behavior, while retaining the fix for #5952. That's confirmed to fix/avoid the "pf wedged" issues and things like unbound and dhcpd starting twice at almost exactly the same time (though those don't hurt anything, they just cause ugly log spam), among other things.
Virtually all the race conditions people have encountered are from that change in rc.newwanip during boot. So we're at 2.2.6 and a bit better now for 2.3.1.
But there's a larger architectural issue to be addressed. This is a hack to avoid these kinds of issues.
- Target version changed from 2.3.2 to 2.4.0
- Assignee deleted (Marc Dye)
- Assignee set to Renato Botelho
I've run into this issue as well on my pfSense machines that have ovpn client interfaces set as the outgoing interfaces for unbound. Although in my case, I don't see unbound fail to start, but rather unbound.conf reverts to its default of using all interfaces as outgoing interfaces. I don't know if this is helpful information or not, but I started a thread on the forums in which I include more detailed information:
https://forum.pfsense.org/index.php?topic=126925.0
Also, I understand the point about the DNS/VPN chicken and egg scenario, although I just use raw IPs for my VPN client connections. I feel that's a reasonable expectation, especially since most people using VPN client connections likely want all DNS traffic flowing through them as well. That said, I am by no means trying to evangelize here, just throwing in my two cents :)
- Target version changed from 2.4.0 to 2.4.1
- Target version changed from 2.4.1 to 2.4.2
- Target version changed from 2.4.2 to 2.4.3
The more immediate issue is after removing the "exit if booting" check from rc.newwanip(v6) in 2.3, which fixed a variety of edge case bugs with interfaces that are slow to come online during boot, some systems end up running certain things twice at almost exactly the same time. For instance, #6160, and probably #6132.
Would this fix #5999?
- Target version changed from 2.4.3 to 2.4.4
- Status changed from Confirmed to New
- Status changed from New to 13
- Status changed from 13 to New
- Target version changed from 2.4.4 to 48
- Target version changed from 48 to 2.5.0
- Target version changed from 2.5.0 to Future
- Assignee deleted (Renato Botelho)