Project

General

Profile

Bug #4155

ntpd crashes on 32 bit with dynamic WAN reconnections and OpenVPN client configured

Added by Kill Bill almost 5 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Category:
NTPD
Target version:
Start date:
12/28/2014
Due date:
% Done:

0%

Estimated time:
Affected Version:
2.2.x
Affected Architecture:
i386

Description

Seeing this on tons of Alix boxes. (Frankly had to install Service Watchdog package to keep ntpd running, it crashes many times every day, sometimes it crashes many times in succession, then it keeps running for many hours, then crashes again randomly.)

Related forum thread: https://forum.pfsense.org/index.php?topic=84690.5 (also seen crashing on boot)

(Also see tons of that "setsockopt IPV6_MULTICAST_IF 0 failed" noise in NTP logs, pretty annoying.)

Associated revisions

Revision 0aa8caec (diff)
Added by Chris Buechler about 4 years ago

Kicking ntpd in rc.newwanip(v6) dates back to OpenNTPD which didn't have dynamic interface scanning support. This seems to be the source of crashes in ntpd as described in Ticket #4155, and appears to be unnecessary.

History

#1 Updated by Phillip Davis almost 5 years ago

My Alix boxes do it also. Here is an example:
$ clog /var/log/system.log | grep signal
Dec 24 13:16:36 skt-rt-01 kernel: pid 35322 (ntpd), uid 0: exited on signal 11
Dec 27 04:18:24 skt-rt-01 kernel: pid 60021 (ntpd), uid 0: exited on signal 11
Dec 27 04:19:01 skt-rt-01 kernel: pid 60888 (ntpd), uid 0: exited on signal 11
Dec 27 04:20:01 skt-rt-01 kernel: pid 79252 (ntpd), uid 0: exited on signal 11
Dec 27 04:25:21 skt-rt-01 kernel: pid 92364 (ntpd), uid 0: exited on signal 11
Dec 27 04:26:01 skt-rt-01 kernel: pid 93587 (ntpd), uid 0: exited on signal 11
Dec 27 17:35:15 skt-rt-01 kernel: pid 14530 (ntpd), uid 0: exited on signal 11
Dec 27 17:36:01 skt-rt-01 kernel: pid 15946 (ntpd), uid 0: exited on signal 11
Dec 27 17:37:01 skt-rt-01 kernel: pid 32966 (ntpd), uid 0: exited on signal 11
Dec 28 02:30:07 skt-rt-01 kernel: pid 48811 (ntpd), uid 0: exited on signal 11
Dec 28 02:31:01 skt-rt-01 kernel: pid 63587 (ntpd), uid 0: exited on signal 11
Dec 28 03:46:02 skt-rt-01 kernel: pid 2961 (ntpd), uid 0: exited on signal 11
Dec 28 03:46:35 skt-rt-01 kernel: pid 59757 (ntpd), uid 0: exited on signal 11
Dec 28 03:47:01 skt-rt-01 kernel: pid 61027 (ntpd), uid 0: exited on signal 11
Dec 28 03:48:02 skt-rt-01 kernel: pid 74877 (ntpd), uid 0: exited on signal 11
Dec 28 14:16:02 skt-rt-01 kernel: pid 74659 (ntpd), uid 0: exited on signal 11
Dec 28 14:16:21 skt-rt-01 kernel: pid 97905 (ntpd), uid 0: exited on signal 11
Dec 28 14:17:01 skt-rt-01 kernel: pid 1412 (ntpd), uid 0: exited on signal 11
Dec 28 14:18:02 skt-rt-01 kernel: pid 16969 (ntpd), uid 0: exited on signal 11

This has Service Watchdog monitoring ntpd - so it gets restarted when it dies.
This location has just single WAN and LAN.
Alix at other locations are like this also - some with similar number of entries, others with just 1 or 2 exits, some have no occurrences in system.log. They are all simple sites with a single WAN and LAN and 2 OpenVPN site-to-site clients going out to our 2 main offices.
The 2 APUs I have do not have any ntpd exited messages. That could be because the 64-bit ntpd does not have this issue, or it might be because the APU site has better internet (static public IP...) - WAN "events" do not happen often where the APUs are.

#2 Updated by Kill Bill almost 5 years ago

Phillip Davis wrote:

The 2 APUs I have do not have any ntpd exited messages. That could be because the 64-bit ntpd does not have this issue, or it might be because the APU site has better internet (static public IP...) - WAN "events" do not happen often where the APUs are.

I can definitely exclude instable WAN connection. Things are pretty stable here, also no relation to that in logs. Two boxes I am seeing this on have a static IP, the other has a static IP assigned over DHCP (weird ISP, cable modem).

#3 Updated by Bipin Chandra almost 5 years ago

i have a stable internet connection on the alix and i just noticed the same behaviour

#4 Updated by Phillip Davis almost 5 years ago

It happens after some WAN event that has potential IP change, and the OpenVPN clients are restarted. About 20 seconds later NTPD dies. Service Watchdog sometimes has to kick it into life a couple of times (at 1 minute intervals) and then ntpd seems to stay up for a while...
Happy to send system log output to someone if it will help.

#5 Updated by Kill Bill almost 5 years ago

Phillip Davis wrote:

It happens after some WAN event that has potential IP change, and the OpenVPN clients are restarted. About 20 seconds later NTPD dies. Service Watchdog sometimes has to kick it into life a couple of times (at 1 minute intervals) and then ntpd seems to stay up for a while...

I can see one instance where it's related to some IP "change" (the IPsec event itself makes no sense to me...)

Dec 28 08:33:01    php: servicewatchdog_cron.php: NTPD is starting up.
Dec 28 08:33:01    php: servicewatchdog_cron.php: Service Watchdog detected service ntpd stopped. Restarting ntpd (NTP clock sync)
Dec 28 08:32:02    kernel: pid 78218 (ntpd), uid 0: exited on signal 11
Dec 28 08:32:01    php: servicewatchdog_cron.php: NTPD is starting up.
Dec 28 08:32:01    php: servicewatchdog_cron.php: Service Watchdog detected service ntpd stopped. Restarting ntpd (NTP clock sync)
Dec 28 08:32:00    kernel: pid 76576 (ntpd), uid 0: exited on signal 11
Dec 28 08:31:56    php-fpm[49534]: /rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.

However, there are many many other crashes where there's nothing going on - but ntpd crashing. Frankly, the ISC's NTPd inability to bind to configured interfaces only is a disaster in itself, completely broken design. If someone packaged openntpd, I'd install it ASAP and never look back. No need for fancy stuff like GPS cards or similar here.

#6 Updated by Chris Buechler almost 5 years ago

  • Subject changed from ntpd crashes a lot - (ntpd), uid 0: exited on signal 11 to ntpd crashes on 32 bit with dynamic WAN reconnections - ntpd: exited on signal 11

openntpd is a buggy mess, we've been there, done that, and got rid of it for many reasons. Why people think it's better I don't know, their experience is obviously limited.

This looks to be specific to dynamic WANs and 32 bit versions judging by a search of our logging servers.

#7 Updated by Kill Bill almost 5 years ago

Chris Buechler wrote:

Why people think it's better I don't know

Probably because it does not try to bind to unwanted interfaces, so that it does not try to accommodate for subsequent "changes" to WAN etc., where's it not supposed to listen in the first place? Not really sure what to say here, I want reliable timekeeping, not features that 99,999999% of people never use, yet they cause critical security issues because they cannot be disabled.

#8 Updated by Jim Pingle almost 5 years ago

OpenNTPD had its own share of crashes and other bad behavior (See #2423 for one major example). The only thing it handles better is interface binding.

#9 Updated by Chris Buechler almost 5 years ago

doesn't seem to be every WAN reconnection, I can't replicate it at will. I only see one instance of it happening in months' worth of syslogs from quite a few systems (granted, most of those are 64 bit). The one instance I see was when my 4G card on one system lost connectivity and took a few minutes before it got an answer via PPP. It died on sig11 three times while PPP was reconnecting. I'm not in a position at the moment to force a failure on it to see what happens though.

Anyone have a specific set of steps to replicate?

#10 Updated by Chris Buechler almost 5 years ago

  • Subject changed from ntpd crashes on 32 bit with dynamic WAN reconnections - ntpd: exited on signal 11 to ntpd crashes on 32 bit with dynamic WAN reconnections and OpenVPN client configured
  • Status changed from New to Feedback
  • Target version changed from 2.2 to 2.2.1

This seems to be limited impact. 32 bit is minority, and it's an unusual circumstance. I can't seem to replicate it even as described here. Destroying the tun interface while ntpd is bound to it just results in something like:

Dec 31 01:43:49 alix ntpd[58816]: Deleting interface #13 ovpns2, fe80::20d:b9ff:fe1f:c55c%12#123, interface stats: received=0, sent=0, dropped=0, active_time=136 secs
Dec 31 01:43:49 alix ntpd[58816]: Deleting interface #14 ovpns2, 100.90.90.1#123, interface stats: received=0, sent=0, dropped=0, active_time=136 secs

and continued to work fine.

It recovered on its own on my system the one time it happened. Watchdog can work around where it doesn't. Fixing things along these lines this late in a release process is risky. Moving to target 2.2.1.

#11 Updated by Jim Thompson almost 5 years ago

  • Assignee set to Chris Buechler

#12 Updated by Chris Buechler almost 5 years ago

  • Target version changed from 2.2.1 to 2.2.2

#13 Updated by Charlie m almost 5 years ago

Why not just remove system_ntp_configure and ntp_sync_once.sh from rc.newwanip and rc.newwanipv6 (https://forum.pfsense.org/index.php?topic=88359.msg496016#msg496016)? ntpd has the much-discussed property of dynamic interface scanning by watching the routing socket for interface changes, so why not use it?

#14 Updated by Chris Buechler over 4 years ago

  • Target version changed from 2.2.2 to 2.2.3

#15 Updated by Chris Buechler over 4 years ago

  • Target version changed from 2.2.3 to 2.3

#16 Updated by Chris Buechler about 4 years ago

  • Affected Version changed from 2.2 to 2.2.x

I removed the NTP bits from rc.newwanip(v6) as they should be unnecessary and appear to be the source of the issues here. Leaving for feedback.

#17 Updated by Jim Thompson almost 4 years ago

Bump for feedback.

#18 Updated by Chris Buechler almost 4 years ago

  • Status changed from Feedback to Resolved

The change has been reported as working by a couple people who were seeing the issue.

Also available in: Atom PDF