Bug #6186

race conditions in service startup

Added by Chris Buechler over 3 years ago. Updated 4 months ago.

Status: New
Priority: Normal
Category: -
Target version:
Start date: 04/17/2016
Due date:
% Done: 0%
Estimated time:
Affected Version: All
Affected Architecture:

Description

Service startup has always been open to a variety of race conditions, because multiple different code paths can call the functions that do the startup. That's a larger architectural issue, and we're discussing options for properly addressing it in the future.

The more immediate issue is that after the "exit if booting" check was removed from rc.newwanip(v6) in 2.3, which fixed a variety of edge-case bugs with interfaces that are slow to come online during boot, some systems end up running certain things twice at almost exactly the same time. For instance, #6160, and probably #6132.

Adding locks in vpn_ipsec_configure was fine for strongswan in #6160. It might be fine in other areas too, though adding locking like that risks breaking things that work now if some of those functions end up recursing.
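To illustrate the recursion risk, here is a minimal Python sketch (Unix only), not pfSense code. It assumes a hypothetical `configure_service` function guarded by a non-blocking `flock` on a hypothetical lock file. A nested call from inside the guarded function cannot re-acquire the lock: with a non-blocking lock it is silently skipped, and with a blocking lock it would hang.

```python
import fcntl
import os
import tempfile

# Hypothetical lock file path, for illustration only.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "example_service.lock")


def try_lock(path=LOCK_PATH):
    """Return an exclusively locked file handle, or None if the lock is held."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Someone else (or an outer call in this process) holds the lock.
        handle.close()
        return None
    return handle


def unlock(handle):
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


def configure_service(log, recurse=False):
    """Hypothetical configure function; skips if another run is in progress."""
    handle = try_lock()
    if handle is None:
        log.append("skipped: already running")
        return False
    try:
        log.append("configuring")
        if recurse:
            # The nested call opens the lock file again and is refused,
            # because flock locks belong to the open file description.
            configure_service(log)
        return True
    finally:
        unlock(handle)
```

Two near-simultaneous callers are reduced to one run, which is the desired effect. The cost is that any code path where the function legitimately calls itself now quietly does less work than before.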

Associated revisions

Revision c4b5c8be (diff)
Added by Chris Buechler about 3 years ago

Setup gateway monitors and exit in rc.newwanip(v6) if system is booting. Ticket #6186

Revision d239edd1 (diff)
Added by Chris Buechler about 3 years ago

Setup gateway monitors and exit in rc.newwanip(v6) if system is booting. Ticket #6186

Revision 6d4fd80b (diff)
Added by Chris Buechler about 3 years ago

Don't start unbound in track6 config if system is booting. Add dnsmasq here as well. Based on PR 2943. Ticket #6186

Revision b460c43b (diff)
Added by Chris Buechler about 3 years ago

Don't start unbound in track6 config if system is booting. Add dnsmasq here as well. Based on PR 2943. Ticket #6186

History

#1 Updated by Chris Buechler over 3 years ago

  • Description updated (diff)

#2 Updated by sebastian nielsen about 3 years ago

The same applies to services that start up in the wrong order. If a VPN client interface is slow to start up, and an unbound DNS forwarder has that VPN client interface as its outgoing interface, unbound will sometimes start before the VPN client interface has come up. The unbound server then permanently returns "SERVFAIL", because it reports a configuration error: the interface didn't exist at start.

Same applies to routing.

I think the whole architecture needs recoding so that it always first brings up the interfaces, including starting services related to interfaces such as VPN clients/servers (a blocking operation), then starts any services not related to interfaces (also a blocking operation), and then applies any firewall rules, custom routes, default gateway and NAT.
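The phased ordering proposed above could be sketched roughly as follows. This is an illustrative Python sketch, not pfSense code; the phase and task names are hypothetical placeholders. Tasks inside a phase may run in parallel, but each phase blocks until all of its tasks have finished before the next one starts.

```python
from concurrent.futures import ThreadPoolExecutor


def run_phases(phases):
    """Run phases in order; each phase blocks until all its tasks finish."""
    completed = []
    for phase_name, tasks in phases:
        with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
            # list() forces every task in this phase to complete
            # before the loop moves on to the next phase.
            results = list(pool.map(lambda task: task(), tasks))
        completed.extend((phase_name, result) for result in results)
    return completed


def make_task(name):
    """Build a placeholder task that just reports its own name."""
    return lambda: name


# Hypothetical phases mirroring the proposed order.
PHASES = [
    ("interfaces", [make_task("wan"), make_task("vpn_client")]),
    ("services", [make_task("unbound"), make_task("dhcpd")]),
    ("filter", [make_task("firewall_rules"), make_task("routes"), make_task("nat")]),
]
```

Calling `run_phases(PHASES)` returns the tasks grouped by phase in the order they were declared.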

#3 Updated by Jim Pingle about 3 years ago

VPN and DNS is not that clear a solution. You have a chicken-and-egg scenario there. In plenty of cases you need working DNS before the VPN can be brought up, especially if you are using a hostname for the VPN peer. In that case you'd have to start DNS, then the VPN, then restart DNS if it doesn't (re)attach to the VPN interface.

#4 Updated by sebastian nielsen about 3 years ago

Yes, that would be a good idea: a forced restart of unbound after VPN success. However, it may be a good idea to delay the restart of unbound a few seconds after VPN success, to ensure the interface has "settled" before attempting to attach the DNS server to it.
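The sequence discussed in the last two comments (start DNS, bring up the VPN, wait for the interface to settle, then restart DNS) might look something like this in outline. It is a Python sketch only; the callables and the `settle_seconds` default are assumptions for illustration, not existing pfSense functions or settings.

```python
import time


def bring_up_dns_and_vpn(start_dns, start_vpn, restart_dns,
                         settle_seconds=3.0, sleep=time.sleep):
    """Start DNS, then the VPN; on VPN success, wait and restart DNS.

    All three callables are hypothetical hooks. start_vpn must return
    True when the VPN came up. Returns True if DNS was restarted.
    """
    # DNS first, so a VPN peer given as a hostname can be resolved.
    start_dns()
    if not start_vpn():
        # VPN failed: leave DNS running as it was.
        return False
    # Give the new interface a moment to settle before re-attaching DNS.
    sleep(settle_seconds)
    restart_dns()
    return True
```

Passing `sleep` as a parameter keeps the delay testable without actually waiting.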

#5 Updated by Chris Buechler about 3 years ago

  • Target version changed from 2.3.1 to 2.3.2

What I committed takes things back to the 2.2.x and earlier behavior, while retaining the fix for #5952. That's confirmed to fix/avoid the "pf wedged" issues and things like unbound and dhcpd starting twice at almost exactly the same time (though those don't hurt anything beyond ugly log spam), among other things.

Virtually all the race conditions people have encountered are from that change in rc.newwanip during boot. So we're at 2.2.6 and a bit better now for 2.3.1.

But there's a larger architectural issue to be addressed. This is a hack to avoid these kinds of issues.

#6 Updated by Chris Buechler about 3 years ago

  • Target version changed from 2.3.2 to 2.4.0

#7 Updated by Renato Botelho over 2 years ago

  • Assignee deleted (Marc Dye)

#8 Updated by Jim Thompson over 2 years ago

  • Assignee set to Renato Botelho

#9 Updated by John Cairns over 2 years ago

I've run into this issue as well on my pfSense machines that have ovpn client interfaces set as the outgoing interfaces for unbound. Although in my case, I don't see unbound fail to start, but rather unbound.conf reverts to its default of using all interfaces as outgoing interfaces. I don't know if this is helpful information or not, but I started a thread on the forums in which I include more detailed information:
https://forum.pfsense.org/index.php?topic=126925.0

Also, I understand the point about the DNS/VPN chicken and egg scenario, although I just use raw IPs for my VPN client connections. I feel that's a reasonable expectation, especially since most people using VPN client connections likely want all DNS traffic flowing through them as well. That said, I am by no means trying to evangelize here, just throwing in my two cents :)

#10 Updated by Renato Botelho almost 2 years ago

  • Target version changed from 2.4.0 to 2.4.1

#11 Updated by Jim Pingle almost 2 years ago

  • Target version changed from 2.4.1 to 2.4.2

#12 Updated by Jim Pingle over 1 year ago

  • Target version changed from 2.4.2 to 2.4.3

#13 Updated by Abuzer Rafey over 1 year ago

The more immediate issue is after removing the "exit if booting" check from rc.newwanip(v6) in 2.3, which fixed a variety of edge case bugs with interfaces that are slow to come online during boot, some systems end up running certain things twice at almost exactly the same time. For instance, #6160, and probably #6132.

Would this fix #5999?

#14 Updated by Jim Pingle over 1 year ago

  • Target version changed from 2.4.3 to 2.4.4

#15 Updated by Steve Beaver 12 months ago

  • Status changed from Confirmed to New

#16 Updated by Steve Beaver 12 months ago

  • Status changed from New to This Sprint

#17 Updated by Steve Beaver 12 months ago

  • Status changed from This Sprint to New

#18 Updated by Steve Beaver 11 months ago

  • Target version changed from 2.4.4 to 48

#19 Updated by Jim Pingle 4 months ago

  • Target version changed from 48 to 2.5.0
