Project

General

Profile

Bug #4642

OpenVPN process status stopped... but its running

Added by Alejandro Olivan over 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Category:
OpenVPN
Target version:
Start date:
04/20/2015
Due date:
% Done:

0%

Estimated time:
Affected Version:
All
Affected Architecture:

Description

Hi again...

This is something i suffer consistently on my nanobsd installs since... don't remember, maybe from the very begining.
I have found several entries on forums related to this, stating this is something that happens to some people (I'm not the only one!), and the solution points to a console restart of services in order to align GUI with currently services status...So, we got pid file content mess:

While reading the forums, I could confirm that pid files included pid numbers that were not the ones corresponding to the actually running service.
The service runs normally, and I dont know whether it restarted sometime and the pid file get outdated, or the pid file is somehow overwritten with a wrong pid number.... dont know.
The fact is that the GUI marks the service as down, and it cant be controlled anymore.

Affected services include specially OpenVpn.
I remember this also happening to Radius server.
I remeber this happened with other services but I cant reproduce or remember this at this time.

I have not seen this on non nanobsd installs (most I have) but just on nanobsd installs.

Hope this could someday be workarounded!
Best regards!

rc.carpbackup (5.21 KB) rc.carpbackup Cullen Trey, 11/01/2015 04:39 AM
rc.carpmaster (4.99 KB) rc.carpmaster Cullen Trey, 11/01/2015 04:39 AM

History

#1 Updated by Alejandro Olivan over 4 years ago

Hi... I got some time to play a little bit with a pair of pfsense boxes updated to 2.2.1, nanobsd installs, suffering this issue.

In both boxes,openvpn service was up and running, and in both web GUI marked it as down, and therefore, uncontrollable.
I SSH at both of them and -9 killed running processes.
After this I attempted to start the services... to no avail.

The issue solves but, when I entered the VPN setup screen on the GUI and clicked SAVE.
By clicking SAVE button and having the service down, the sistem regains apparent usability:
Clicking start do starts the service.
Clicking start/stop buttons act normally, and service do start/stop.

Now the question is why is necessary to click SAVE button while having the service down.
It appears to me that some how, nanobsd install with CF may hit some pid mess, from which GUI cant recover.

Hope it could help and this issue could be included in the reoadmap to be dealt with...

Best regards.

#2 Updated by Chris Buechler over 4 years ago

  • Subject changed from Nanobsd: process status stopped... but its running to OpenVPN process status stopped... but its running
  • Category set to OpenVPN

it's not a general service status problem. The issue described here is with OpenVPN. An issue there with FreeRADIUS, if you can replicate at some point in the future, is in the package itself and unrelated. If you can replicate that, start a new ticket with its specifics.

is this OpenVPN circumstance always replicable with your system/config?

#3 Updated by Alejandro Olivan over 4 years ago

OK... I will track those routers behaviour, and report on them.
I updated here talking about openvpn just because it is what I have found on them.
Probably, issues, provide they exist may effectively be unrelated, as you point.

I understand that you want, you need precisse data, and I'm shooting in the darkness.
So, let me observe those routers behaviour and i will report (I know issues will arise soon or latter) and I will report.

Regarding your question: I do not know how to reproduce/trigger the issue... it just happens with time. but it allways ends happing.

I will help reporting best I can...

Best regards

#4 Updated by Alejandro Olivan over 4 years ago

OK... ntpd NTP clock sync is stopped.

NTP clock sync stops after hours or just few days up... this is something well known here.
This occurs on these installs very often... I do not know wheter it has relation or not... it just happens.

This service but, KEEPS CONTROLLABLE on the GUI, it starts normally on start click
So I will look for a similar bug report or if absent, start a new one.

I thought but, that this issue could be somehow related with openvpn/freeradius issue... I report this here just for that reason

The rest of services keep working normally....

#5 Updated by Kill Bill over 4 years ago

Install the Service Watchdog package to keep your ntpd running.

#6 Updated by Phillip Davis over 4 years ago

This also happens to me on "random" systems. I have an example on an APU 64-bit nanoBSD 2.2.2 system now. This is the client end of a site-to-site OpenVPN link. I am accessing the APU across the link, so obviously it is working!

On the dashboard it says:
------------
Client Instance Statistics
Banke-Simpani UDP Unable to contact daemon
0 Service not running?
------------

The relevant running-working process is:
------------
root 97637 0.1 0.3 21732 5848 - Ss 20May15 330:10.06 /usr/local/sbin/openvpn --config /var/etc/openvpn/client1.conf
------------

But in /var/run/openvpn_client1.pid is:
71585

It should be 97637.

The system in question has been up ever since 2.2.2 upgrade, and the problem happened some time weeks ago, so I do not have any logs of the events around when it might have occurred. The site has dual-WAN that are reasonably reliable but do go up-and-down at times. The OpenVPN client is in a gateway group, so it will be moved between WANs and I guess the issue happens somewhere in the timing of restarting the OpenVPN process and writing the PID file...

A complicating factor is that Service Watchdog package has that OpenVPN client in its watch list, so there could be a timing issue there also if Service Watchdog happens to do its checks when a WAN failover is also in progress. I will get rid of Service Watchdog, reboot and then see if it happens again.

I had thought this issue only happened on slow and/or low-memory hardware like Alix. But now it see it on an APU.

@Alejandro - do you have Service Watchdog package installed and monitoring the OpenVPN instance?

#7 Updated by Alejandro Olivan over 4 years ago

Hi guys.

No... I do not have watchdog on any system... overall pfsense stabitlity as a router is superb.
BUt you're right... In the meanwhile I may install watchdog in order to have ntp up.

The issue but, in my opinion, is to deal with the pid - pidfile file mismatch cause.

Best regards

#8 Updated by Alejandro Olivan over 4 years ago

EDIT:

15 days passed by since I installed watchdog and set it to keep ntp up (ntp crashed all the time).
Since then All suspicious services like OpenVPN or FreeRadius are up and running on the panel.
This is the preiod of gratest stability of all times on nanobsd install...
I'm wondering whether all mess was in fact a derivative of ntp crashing....

will keep on reporting...

#9 Updated by Alejandro Olivan over 4 years ago

Updating:

Ok, 21 days passed, and there it goes... today I have discovered on the monitored nanobsd installs finally caught this issue.
It has 2 openVPN services enabled: a remote access server, and a site to site client: the remote access server is shown as stoped on the GUI but it is running normally.
At this point, this service is out of control.

Provide some info is needed feel free to ask!
Best regards!

#10 Updated by Chris Buechler about 4 years ago

  • Status changed from New to Confirmed
  • Priority changed from Normal to High
  • Affected Version set to All

Someone that can replicate this, could you provide OpenVPN logs from the time it lost track of the process? Also system log contents if anything is relevant or correlates.

#11 Updated by Alejandro Olivan about 4 years ago

Hi Chris... unfortunately I can't replicate, because I do not know how to trigger it... it simply occurs.
What I can do is to provide you OpenVPN logs as soon as I discover a failing GUI... but I guess that this may be probably unusable if a failing GUI is not detected early after failure, so logs on openvpn could be analyzed at time of failure.

I can but state one thing to consider:
Stability has been dramatically increased... so now it is dificult to get a failing GUI (right now I have none!)
I do not know whether it is related to GUI constant engine upgrading along with pfsense versions (I keep strictly all devices uptodate), the usage of watchdog on ntp, or a combination of both.

Anyhow... as soon as I got a failing GUI on a device I will report, count on that!

Thank you for your incredible work.
Best regards!

#12 Updated by Anonymous about 4 years ago

I have a similar, if not the same, problem running a 2.2.4 cluster in Carp Mode w/ Gateway Group. My OpenVPN instance shows the same behaviour where the GUI displays no instance running, but in fact there is one started.

To make things even more horrible often the OpenVPN client runs on both - master AND backup - instance which of course leads to massive problems with tunnel stability since both instances compete to connect to the same server (so lots of TLS out of sync messages on the server side).

The workaround is to kill OpenVPN on the master and the backup node and restart it on the master. This mitigates the problem for some unspecified time until something 'decides' to start OpenVPN on both nodes.

IMHO it is important to keep rouge OpenVPN processes from running. You might just kill them if the do not correspond to the PID file. Anyhow precautions must be taken by the controlling software that the same OpenVPN client is not running on both nodes. And obviously it is not reliable to only check for the PID files.

#13 Updated by Chris Buechler about 4 years ago

  • Target version set to 2.3

#14 Updated by Cullen Trey about 4 years ago

A similar but not identical case is described in the bug tracker of OpenVPN. From my understanding, even we don't use --management-hold, the behavour is exactly the same.

https://community.openvpn.net/openvpn/ticket/563

#15 Updated by Cullen Trey about 4 years ago

If it is the bug, it can even be fixed for 2.2.5 with an update of OpenVPN to 2.3.8.

#16 Updated by Chris Buechler about 4 years ago

2.2.5 already has OpenVPN 2.3.8. If someone who can replicate can try 2.2.5 and report back, that'd be appreciated. It's not directly related to the original report in that OpenVPN bug ticket but the fix could be relevant.

#17 Updated by Grzegorz Sliwa about 4 years ago

hi. I report bug as https://redmine.pfsense.org/issues/5344#change-21849 ,but it looks like this same.

Today i installed latest version of pfsense pfSense-Full-Update-2.2.5-DEVELOPMENT-amd64--latest.tgz

The problem is still exist :/

The true is ,that problem is wrong PID.

This is my quick diagnostic:


[2.2.5-DEVELOPMENT][root@FW-NODE1]/root: ps -aux |grep openvpn
root 49369 0.0 0.0 21732 5252 - Ss 12:06PM 0:00.01 /usr/local/sbin/openvpn --config /var/etc/openvpn/server3.conf
root 51089 0.0 0.0 21732 5496 - Ss 12:06PM 0:00.10 /usr/local/sbin/openvpn --config /var/etc/openvpn/server1.conf
root 53715 0.0 0.0 21732 5476 - Ss 12:06PM 0:00.05 /usr/local/sbin/openvpn --config /var/etc/openvpn/server4.conf
root 56039 0.0 0.0 18876 2384 0 S+ 12:08PM 0:00.00 grep openvpn
[2.2.5-DEVELOPMENT][root@FW-NODE1]/root: cd /var/run
[2.2.5-DEVELOPMENT][root@FW-NODE1]/var/run: cat open
openvpn_server1.pid openvpn_server3.pid openvpn_server4.pid
[2.2.5-DEVELOPMENT][root@FW-NODE1]/var/run: cat openvpn_server1.pid
51089
[2.2.5-DEVELOPMENT][root@FW-NODE1]/var/run: cat openvpn_server3.pid
49369
[2.2.5-DEVELOPMENT][root@FW-NODE1]/var/run: cat openvpn_server4.pid
58709

In this situation I don't see one of my PeerToPeer Server Instance as running. (but it is ,tunnel is established)

#18 Updated by Cullen Trey about 4 years ago

Hi all,

I can only confirm this issue with CARP IPs and OpenVPN. The PIDs for nanobsd or full install without CARP IPs are working for me. When I restart one of my boxes, i get the error from the beginning. There I have a lot of CARP IPs. Here is the system log:

Oct 28 11:51:48 php-fpm34005: /rc.carpmaster: Carp cluster member "192.168.255.3 - WAN LTE CARP IP (5@re0_vlan93)" has resumed the state "MASTER" for vhid 5@re0_vlan93
Oct 28 11:51:48 php-fpm34005: /rc.carpmaster: Starting OpenVPN client instance on 192.168.255.3 - WAN LTE CARP IP because of transition to CARP master.
Oct 28 11:51:48 php-fpm283: /rc.carpmaster: Carp cluster member "10.4.71.3 - WLAN Work CARP IP (6@re1_vlan71)" has resumed the state "MASTER" for vhid 6@re1_vlan71
Oct 28 11:51:48 php-fpm283: /rc.carpmaster: Starting OpenVPN client instance on 10.4.71.3 - WLAN Work CARP IP because of transition to CARP master.
Oct 28 11:51:48 php-fpm283: /rc.carpmaster: Starting OpenVPN client instance on 10.4.71.3 - WLAN Work CARP IP because of transition to CARP master.
Oct 28 11:51:48 php-fpm283: /rc.carpmaster: Starting OpenVPN client instance on 10.4.71.3 - WLAN Work CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm34005: /rc.carpmaster: Starting OpenVPN client instance on 192.168.255.3 - WAN LTE CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm50739: /rc.carpmaster: Carp cluster member "172.16.0.3 - WANSDSL CARP IP (2@re0_vlan91)" has resumed the state "MASTER" for vhid 2@re0_vlan91
Oct 28 11:51:49 php-fpm50739: /rc.carpmaster: Starting OpenVPN client instance on 172.16.0.3 - WANSDSL CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm67737: /rc.carpmaster: Carp cluster member "10.4.73.3 - WLAN Unsecure CARP IP (8@re1_vlan73)" has resumed the state "MASTER" for vhid 8@re1_vlan73
Oct 28 11:51:49 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 10.4.73.3 - WLAN Unsecure CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 10.4.73.3 - WLAN Unsecure CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm34005: /rc.carpmaster: Starting OpenVPN client instance on 192.168.255.3 - WAN LTE CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 10.4.73.3 - WLAN Unsecure CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm67737: /rc.carpmaster: Carp cluster member "172.16.1.3 - WAN DSL O2 1 CARP IP (3@re0_vlan92)" has resumed the state "MASTER" for vhid 3@re0_vlan92
Oct 28 11:51:49 php-fpm50739: /rc.carpmaster: Starting OpenVPN client instance on 172.16.0.3 - WANSDSL CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 172.16.1.3 - WAN DSL O2 1 CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm50739: /rc.carpmaster: Starting OpenVPN client instance on 172.16.0.3 - WANSDSL CARP IP because of transition to CARP master.
Oct 28 11:51:49 php-fpm283: /rc.carpmaster: Carp cluster member "10.4.70.3 - WLAN Control CARP IP (9@re1_vlan70)" has resumed the state "MASTER" for vhid 9@re1_vlan70
Oct 28 11:51:49 php-fpm283: /rc.carpmaster: Starting OpenVPN client instance on 10.4.70.3 - WLAN Control CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm34005: /rc.carpmaster: Carp cluster member "172.16.2.3 - WAN DSL O2 2 CARP IP (4@re0_vlan94)" has resumed the state "MASTER" for vhid 4@re0_vlan94
Oct 28 11:51:50 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 172.16.1.3 - WAN DSL O2 1 CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm34005: /rc.carpmaster: Starting OpenVPN client instance on 172.16.2.3 - WAN DSL O2 2 CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 172.16.1.3 - WAN DSL O2 1 CARP IP because of transition to CARP master.
Oct 28 11:51:50 kernel: done.
Oct 28 11:51:50 php-fpm34005: /rc.carpmaster: Starting OpenVPN client instance on 172.16.2.3 - WAN DSL O2 2 CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm283: /rc.carpmaster: Starting OpenVPN client instance on 10.4.70.3 - WLAN Control CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm50739: /rc.carpmaster: Carp cluster member "10.4.72.3 - WLAN Guest CARP IP (10@re1_vlan72)" has resumed the state "MASTER" for vhid 10@re1_vlan72
Oct 28 11:51:50 php-fpm50739: /rc.carpmaster: Starting OpenVPN client instance on 10.4.72.3 - WLAN Guest CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm34005: /rc.carpmaster: Starting OpenVPN client instance on 172.16.2.3 - WAN DSL O2 2 CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm283: /rc.carpmaster: Starting OpenVPN client instance on 10.4.70.3 - WLAN Control CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm67737: /rc.carpmaster: Carp cluster member "192.168.4.3 - LAN CARP IP (1@re1)" has resumed the state "MASTER" for vhid 1@re1
Oct 28 11:51:50 php-fpm50739: /rc.carpmaster: Starting OpenVPN client instance on 10.4.72.3 - WLAN Guest CARP IP because of transition to CARP master.
Oct 28 11:51:50 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 192.168.4.3 - LAN CARP IP because of transition to CARP master.
Oct 28 11:51:51 php-fpm50739: /rc.carpmaster: Starting OpenVPN client instance on 10.4.72.3 - WLAN Guest CARP IP because of transition to CARP master.
Oct 28 11:51:51 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 192.168.4.3 - LAN CARP IP because of transition to CARP master.
Oct 28 11:51:51 php-fpm67737: /rc.carpmaster: Starting OpenVPN client instance on 192.168.4.3 - LAN CARP IP because of transition to CARP master.
Oct 28 11:51:51 php-fpm67737: /rc.carpmaster: Starting OpenVPN instance on 192.168.4.3 - LAN CARP IP because of transition to CARP master.

As for every CARP IP the OpenVPN Client is restarted and I think this is done simultaneously, we have multiple OpenVPN Clients with the SAME CONFIG running/starting simultaneously. Now multiple instances try to write to the same PID file. HERE I think the problem is. The first ones writes the file, the next one can't write, but will be executed longer and gets first in connecting to the server (or opening the tun or tap interface and therefore wins the race) without writing its PID to the file.

Therefore I would be greate to make the function restart_openvpn and perhaps rc.carpmaster and rc.carpbackup not executeable simultaneously with flock (http://us3.php.net/flock). For restart_openvpn it would be possible to flock the config file. And execute restart_openvpn first after acquiring the lock or sleep until you get it.

#19 Updated by Cullen Trey about 4 years ago

Hi,

I could fix all my problems with the attached changes in rc.carpmaster and rc.carpbackup.

1. Added more detailed output of the openvpn instance that is restarted. I added the description. E.g. :
log_error("Stopping OpenVPN client instance \"{$settings['description']}\" on {$friendly_descr} because of transition to CARP backup.");

2. Added an flock in rc.carpmaster and rc.carpbackup so that carp events are executed after each other and not simultan.

3. Only restarted the OpenVPN client instances with a gateway group specifying vips, that really needed a restart, because the VIP got into master / backup.

Could you please add them to git and add them to 2.2.5 of pfsense?!

Kind regards

#20 Updated by Phillip Davis about 4 years ago

The CARP stuff is one source of this problem. But it also happens on systems without any CARP. I have it happen occasionally and none of my systems have CARP. I suspect there might be some issue related to multi-WAN and possibly when there is a tier1-WAN failure then it comes back up quickly, before the failover to the tier2 WAN was really finished being processed or...
It seems to happen to me less these days. But then our internet providers are getting noticeably better at providing continuous service - a good thing.

#21 Updated by Cullen Trey about 4 years ago

Perhaps we can apply the same flock in the file for switching the tier1-Wan tier2-Wan connection. Then, only one instance of the script can run. I think it has to do with apinger. Somebody knows, which file is called / executed?

#22 Updated by Alejandro Olivan about 4 years ago

Hi again guys...

I can confirm that fact... I didn't put attention on that...
I use multiwan... and I'm also using tiered failover.

...this could be a clue...

Regards

#23 Updated by Jim Thompson about 4 years ago

  • Priority changed from High to Normal

#24 Updated by Cullen Trey about 4 years ago

Hi,

the rc.carpbackup and rc.carpmaster files, accidentally, fixed another bug.

If you receive quite a lot of duplicate mails explaining that a gateway is down/loss/etc... This is because the function return_gateway_groups_array(); is called multiple times in the upstream version. The function is called multiple times, because it is in a loop...

In the uploaded version, the function is called only once.

$a_groups = return_gateway_groups_array();

Kind regards

#25 Updated by Cullen Trey about 4 years ago

Furthermore, with the uploaded files and pfsense 2.2.5 I'm not expiriencing the bug any more. Not on nanobsd and not on a full install. I have running the two files in production for around 40 days without problems.

#26 Updated by Alejandro Olivan almost 4 years ago

HI again...

I would agree with Cullen: I got also no single problem in this aspect since I upgraded everything to 2.2.5
I will keep an eye on this just in case, but I got nanobsd systems running 2.2.5 with almost a month already running perfect!

#27 Updated by Jim Thompson almost 4 years ago

  • Assignee set to Chris Buechler

Bump. assigned to CMB for eval and possible closure.

#28 Updated by Chris Buechler almost 4 years ago

  • Status changed from Confirmed to Resolved

OpenVPN 2.3.9 seems to have fixed this, "write pid file immediately after daemonizing" from their change log. The CARP master/backup still has issues in 2.3, that's covered elsewhere though.

Also available in: Atom PDF