Project

General

Profile

Bug #6132

race condition in OpenVPN startup

Added by NOYB NOYB about 1 year ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Category:
OpenVPN
Target version:
Start date:
04/13/2016
Due date:
% Done:

100%

Affected version:
All
Affected Architecture:

Description

On a fresh install the first OpenVPN server instance won't start. Create a 2nd instance an it runs fine. Or import just the OpenVPN section from a backup works fine too.

Noticed that the first instance that does not work has a few of the fields in confg as for ex: <ipaddr></ipaddr> rather than <ipaddr/> like the 2nd instance and restored instances that work.

Several of the empty fields are different like this.

2016-04-13 16_24_30- Status_ Dashboard.png (7.41 KB) Juha Kirsi, 04/13/2016 09:23 AM

2016-04-13 16_24_43- Status_ Services.png (3.3 KB) Juha Kirsi, 04/13/2016 09:23 AM

2016-04-13 16_24_55-Status_ OpenVPN.png (12.1 KB) Juha Kirsi, 04/13/2016 09:23 AM

Widget doesn_t show OpenVpn connection.png - Widget doesn't show OpenVpn connection (26.8 KB) Igor Kanesev, 05/15/2016 11:48 AM

openvpn_pid.png - openvpn_pid (8.89 KB) Igor Kanesev, 05/15/2016 11:51 AM

History

#1 Updated by Jim Pingle about 1 year ago

  • Status changed from New to Feedback

I can't replicate this as stated. Is there a forum thread with more info? Or do you have some more detail about the config of that server?

Have you checked with ps to see if it's actually running but not reflecting that in the GUI? (logs would also indicate the port is already in use, for example). If that is the case that would fall under #4642 instead.

#2 Updated by Juha Kirsi about 1 year ago

NOYB NOYB wrote:

On a fresh install the first OpenVPN server instance won't start. Create a 2nd instance an it runs fine. Or import just the OpenVPN section from a backup works fine too.

Noticed that the first instance that does not work has a few of the fields in confg as for ex: <ipaddr></ipaddr> rather than <ipaddr/> like the 2nd instance and restored instances that work.

Several of the empty fields are different like this.

I have the same issue.
Networks behind openvpn client 1 are reachable but gui dashboard, service status show that it's not working. I did update from 2.2.4 -> 2.3
After disabling openvpn client 1 and creating a second client configuration with identical settings the GUI dashboard show everything as normal...

Update, during 15 min the 2nd openvpn client has been unstable...

Apr 13 17:37:33 openvpn 42510 Initialization Sequence Completed
Apr 13 17:37:32 openvpn 42510 Peer Connection Initiated with [AF_INET]x.x.x.x:1194
Apr 13 17:37:31 openvpn 42510 UDPv4 link remote: [AF_INET]x.x.x.x:1194
Apr 13 17:37:31 openvpn 42510 UDPv4 link local (bound): [AF_INET]y.y.y.y
Apr 13 17:37:31 openvpn 42510 Preserving previous TUN/TAP instance: ovpnc2
Apr 13 17:37:31 openvpn 42510 Re-using pre-shared static key
Apr 13 17:37:31 openvpn 42510 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
Apr 13 17:37:29 openvpn 42510 SIGUSR1[soft,ping-restart] received, process restarting
Apr 13 17:37:29 openvpn 42510 Inactivity timeout (--ping-restart), restarting
Apr 13 17:36:31 openvpn 61410 Initialization Sequence Completed
Apr 13 17:36:30 openvpn 61410 Peer Connection Initiated with [AF_INET]x.x.x.x:1194
Apr 13 17:36:30 openvpn 61410 UDPv4 link remote: [AF_INET]x.x.x.x:1194
Apr 13 17:36:30 openvpn 61410 UDPv4 link local (bound): [AF_INET]y.y.y.y
Apr 13 17:36:30 openvpn 61410 Preserving previous TUN/TAP instance: ovpnc1
Apr 13 17:36:30 openvpn 61410 Socket Buffers: R=[42080->42080] S=[57344->57344]
Apr 13 17:36:30 openvpn 61410 Re-using pre-shared static key
Apr 13 17:36:30 openvpn 61410 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
Apr 13 17:36:28 openvpn 61410 Restart pause, 2 second(s)
Apr 13 17:36:28 openvpn 61410 SIGUSR1[soft,ping-restart] received, process restarting
Apr 13 17:36:28 openvpn 61410 Inactivity timeout (--ping-restart), restarting
Apr 13 17:35:30 openvpn 42510 Initialization Sequence Completed
Apr 13 17:35:30 openvpn 42510 Peer Connection Initiated with [AF_INET]x.x.x.x:1194
Apr 13 17:35:28 openvpn 42510 UDPv4 link remote: [AF_INET]x.x.x.x:1194
Apr 13 17:35:28 openvpn 42510 UDPv4 link local (bound): [AF_INET]y.y.y.y
Apr 13 17:35:28 openvpn 42510 Preserving previous TUN/TAP instance: ovpnc2
Apr 13 17:35:28 openvpn 42510 Re-using pre-shared static key
Apr 13 17:35:28 openvpn 42510 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
Apr 13 17:35:26 openvpn 42510 SIGUSR1[soft,ping-restart] received, process restarting
Apr 13 17:35:26 openvpn 42510 Inactivity timeout (--ping-restart), restarting
Apr 13 17:34:31 openvpn 61410 Initialization Sequence Completed
Apr 13 17:34:30 openvpn 61410 Peer Connection Initiated with [AF_INET]x.x.x.x:1194
Apr 13 17:34:30 openvpn 61410 UDPv4 link remote: [AF_INET]x.x.x.x:1194
Apr 13 17:34:30 openvpn 61410 UDPv4 link local (bound): [AF_INET]y.y.y.y
Apr 13 17:34:30 openvpn 61410 Preserving previous TUN/TAP instance: ovpnc1
Apr 13 17:34:30 openvpn 61410 Socket Buffers: R=[42080->42080] S=[57344->57344]
Apr 13 17:34:30 openvpn 61410 Re-using pre-shared static key
Apr 13 17:34:30 openvpn 61410 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
Apr 13 17:34:28 openvpn 61410 Restart pause, 2 second(s)
Apr 13 17:34:28 openvpn 61410 SIGUSR1[soft,ping-restart] received, process restarting
Apr 13 17:34:28 openvpn 61410 Inactivity timeout (--ping-restart), restarting
Apr 13 17:34:15 openvpn 42510 Authenticate/Decrypt packet error: bad packet ID (may be a replay): [ #35237 / time = (1460551950) Wed Apr 13 15:52:30 2016 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Apr 13 17:34:15 openvpn 42510 Authenticate/Decrypt packet error: bad packet ID (may be a replay): [ #35236 / time = (1460551950) Wed Apr 13 15:52:30 2016 ] -- see the man page entry for --no-replay and --replay-window for more info or silence this warning with --mute-replay-warnings
Apr 13 17:33:38 openvpn 42510 Initialization Sequence Completed
Apr 13 17:33:38 openvpn 42510 Peer Connection Initiated with [AF_INET]x.x.x.x:1194
Apr 13 17:33:30 openvpn 42510 UDPv4 link remote: [AF_INET]x.x.x.x:1194
Apr 13 17:33:30 openvpn 42510 UDPv4 link local (bound): [AF_INET]y.y.y.y
Apr 13 17:33:30 openvpn 42510 Preserving previous TUN/TAP instance: ovpnc2
Apr 13 17:33:30 openvpn 42510 Re-using pre-shared static key
Apr 13 17:33:30 openvpn 42510 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
Apr 13 17:33:28 openvpn 42510 SIGUSR1[soft,ping-restart] received, process restarting

UPDATE2:
After I removed the client1 configuration and rebooted the fw the GUI is once again broken for the 2nd client confguration. Gui shows as OpenVPN not working / running . The service is running based on ps and networks routable for the moment.

ps aux| grep open

root 40757 0.0 0.1 12620 4332 - Ss 6:22PM 0:00.28 /usr/local/sbin/openvpn --config /var/etc/openvpn/client2.conf

UPDATE3:
[2.3-RELEASE][]/root: cat /var/run/openvpn_client2.pid
77332
[2.3-RELEASE][]/root:
[2.3-RELEASE][]/root: kill -9 40757
[2.3-RELEASE][]/root: rm /var/run/openvpn_client2.pid
[2.3-RELEASE][]/root: ps aux| grep open
root 52810 0.0 0.1 10260 1900 0 S+ 6:31PM 0:00.00 grep open
[2.3-RELEASE][]/root: ps aux | grep open
root 53236 0.0 0.1 12620 4588 - Ss 6:31PM 0:00.01 /usr/local/sbin/openvpn --config /var/etc/openvpn/client2.conf
root 68127 0.0 0.1 10260 1900 0 S+ 6:32PM 0:00.00 grep open
[2.3-RELEASE][]/root: cat /var/run/openvpn_client2.pid
53236
[2.3-RELEASE][]/root:

After this the gui is back. Please, howto fix this?

"If that is the case that would fall under #4642 instead."
That shows as fixed?

What about the reconnection issues as seen in my logs..

#3 Updated by NOYB NOYB about 1 year ago

Haven't checked with ps to see if it is actually running. I took the WebGUI's word for it because I didn't know there had been some issue with that.

Took a look at that other ticket and this just seem a little different to me. But anyway as Juha mentioned that ticket is marked as resolved. OpenVPN Log does report address already in use. Wouldn't it do that for all OpenVPN instances? Not just the first one?

Looking in the config some of the empty parameters such as <ipaddr> are represented differently between working and non working instances:
Working instances represent the empty parameters such as <ipaddr/>
Non-working is represented such as <ipaddr></ipaddr>

Restored the OpenVPN section from my VM dev machine and it uses the <ipaddr/> representation for both instances and works fine.

#4 Updated by Jim Pingle about 1 year ago

An empty tag with open and close is the same as a single tag that closes itself. I've tried adding OpenVPN servers and shuffling things around and between a dozen test systems I can't reproduce this. Do you happen to have a whole config that reproduces it (even when restored) that you can share?

If the process is running but the GUI can't "see" it, that means somehow the pid file and management socket are not quite right. We've seen that before (like in the other ticket I linked) but the cause may not be quite the same here. It's a significant difference than it not starting at all, however. Completely different potential issues each way.

#5 Updated by Chris Buechler about 1 year ago

  • Subject changed from OpenVPN 1st Instance Won't Start to race condition in OpenVPN startup
  • Category set to OpenVPN
  • Assignee set to Chris Buechler
  • Affected version changed from 2.3 to All

I think the issue is subject, similar to #6160 for strongswan. If you launch OpenVPN twice really close together you get into a PID mess, which breaks the status page. That's been going on for a while in some difficult to replicate edge cases.

NOYB and Juha, can you try this change and report back please:

--- a/src/etc/inc/openvpn.inc
+++ b/src/etc/inc/openvpn.inc
@@ -1298,8 +1298,10 @@ function openvpn_delete_csc(& $settings) {

 // Resync the configuration and restart the VPN
 function openvpn_resync($mode, $settings) {
+       $openvpn_resync_lock = lock('openvpnresync', LOCK_EX);
        openvpn_reconfigure($mode, $settings);
        openvpn_restart($mode, $settings);
+       unlock($openvpn_resync_lock);
 }

#6 Updated by Jim Pingle about 1 year ago

That should probably append mode+vpnid (which I think is in $settings at that point) to the lock name since two different openvpn instances wouldn't need to be locked from each other.

#7 Updated by Chris Buechler about 1 year ago

Jim Pingle wrote:

That should probably append mode+vpnid (which I think is in $settings at that point) to the lock name since two different openvpn instances wouldn't need to be locked from each other.

Yeah that's why I didn't just commit it. :) Quick and simpler proof of concept as to whether that's definitely the root issue.

#8 Updated by NOYB NOYB about 1 year ago

The system that was having the problem was fixed by restoring the openvpn section from vm dev machine that had the same settings. The only difference I noticed between them was that the first instance in the dev machine used the xml short form for empty parameters (<xxxx/>).

Anyway I currently don't have a machine exhibiting the problem to test with at the moment. I may try to reproduce it again on that machine in the coming days/weeks. But since it's currently working its not on my short list at the moment.

#9 Updated by NOYB NOYB about 1 year ago

Out of the blue this just occurred again.

openvpn_server1.pid: 53960
openvpn_server2.pid: 57333

ps:
50721 - Ss 0:00.01 /usr/local/sbin/openvpn --config /var/etc/openvpn/serve
57333 - Ss 0:00.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/serve

Applied the two line patch above. Hit the start openvpn button. Still failed to start. Don't know if that requires a reboot, but if it does it's going to be really difficult because at this point it would appear this is very infrequent. Maybe 1/10 or less.

Removed the patch and rebooted and both openvpn instances started fine.

#10 Updated by Chris Buechler about 1 year ago

When you get into this scenario, the GUI will fail to start OpenVPN because it's already running, under a PID that doesn't match the one in its PID file. It probably would have been fine if you manually killed openvpn processes, then started in the GUI.

I'm not sure whether subject is the issue here, it's a working theory at the moment. That patch must be applied before the problem occurs, not after or once it's happened. Reboot after applying to get a clean start with it in place.

OpenVPN recently fixed an issue with not updating its PID in some circumstances, but we've had the version with that fix for months. It's possible the root cause here is another OpenVPN PID bug.

NOYB: if you can go back through your OpenVPN logs, and see what you have for all 4 of the PIDs in question there, that might be telling.

#11 Updated by NOYB NOYB about 1 year ago

System has ramdisk enabled, so those logs are gone (rebooted).

re: the patch; Was afraid that might be the case that it needs to be there prior. Since this is not reproducible on demand it becomes like trying to prove a negative. Rebooted about a dozen times today, testing ramdisk alias tables restore, and just that one time had this issue pop up.

#12 Updated by Igor Kanesev 11 months ago

After i upgraded 2.3 -> 2.3.1, i have same issue.
After Pfsense boot, Widget doesn't show OpenVpn connection, but tunnel is working

I kill OpenVpn process and start again through web interface, after this steps widget work

#13 Updated by Igor Kanesev 11 months ago

Sorry for my english

#14 Updated by Chris Buechler 11 months ago

  • Target version changed from 2.3.1 to 2.3.2

if someone can get me access to a box that exhibits this issue, that would help greatly (email cmb at pfsense dot org and we can arrange something). This isn't easy to replicate.

#15 Updated by NOYB NOYB 11 months ago

Haven't seen this yet on 2.3.1 but because it is so inconsistent/sporadic, it is too soon to tell really. But so far so good.

If I see it again, I'll post an update.

#16 Updated by Chris Buechler 10 months ago

  • Status changed from Feedback to Confirmed
  • Assignee deleted (Chris Buechler)
  • Target version changed from 2.3.2 to 2.4.0

The root issue can still be a problem. Probably ought to put a lock around it, but long-term needs a better service management framework to properly handle things.

#17 Updated by Pi Ba 5 months ago

Just found this one issue, looks i made a duplicate https://redmine.pfsense.org/issues/6940 , i did implemented the 'lock fix' for openvpn..

#18 Updated by Renato Botelho 4 months ago

  • Assignee set to Renato Botelho

#19 Updated by Renato Botelho 4 months ago

  • Status changed from Confirmed to Feedback
  • % Done changed from 0 to 100

PR has been merged

#20 Updated by Malcolm Hussain-Gambles 3 months ago

Fresh install and one OpenVPN server seems to work fine for me with fresh install from Dec and no probs with latest betas

#21 Updated by Renato Botelho 3 months ago

  • Status changed from Feedback to Resolved

Also available in: Atom PDF