Project

General

Profile

Actions

Regression #12382

open

Certificate Depth checking creates OpenVPN micro-outages every time a user authenticates after 2.5.2 upgrade

Added by Brett Keller about 3 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OpenVPN
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Default
Affected Version:
2.5.2
Affected Architecture:

Description

We're running several OpenVPN servers on a single pfSense box at our office, and ever since upgrading from 2.4.5p1 to 2.5.2, with no changes to our OpenVPN configuration that has otherwise been working well for a long time, we've suddenly been experiencing crippling performance problems. The symptoms are that every time a user authenticates with the server, all traffic stops flowing on that OpenVPN instance for 1 to 2 seconds for every single user at the same time. On our busiest server instance with 50-60 users, we're seeing micro-outages causing packet loss and latency spikes as frequently as 4 times per minute, making the VPN almost unusable.

To illustrate the problem, I have attached a graph that shows the micro-outages. I performed a packet capture for just over one minute on our WAN interface, filtered for one of the UDP ports we use for an OpenVPN server (this one with 50-60 simultaneous users). The graph shows number of packets per time unit from that packet capture, and for a busy VPN server like this one, the traffic should always be flowing. Note, however, that the graph dips to the zero line four times during the capture. Each of these events lasted for somewhere between 1 to 2 seconds, and each time one of these micro-outages occurred, it directly corresponded to a time that a user attempted to authenticate (as recorded in /var/log/openvpn.log). If I inspect the raw packet capture at these points in time, I can see that the clients keep sending OpenVPN UDP packets to the server, but the server does not send any traffic to any client for the duration of the dip in the graph.

After a ton of troubleshooting and testing, I was able to find the root cause of this problem. When certificate depth checking is enabled for an OpenVPN instance in pfSense (and it is enabled by default), it creates these micro-outages. Changing the "Certificate Depth" setting to "Do Not Check" completely eliminates the outages, and clients no longer experience horrific latency spikes and packet loss.

When certificate depth checking is enabled, pfSense adds a tls-verify hook script to the generated OpenVPN config file. During client authentication, OpenVPN executes this shell script synchronously, and because OpenVPN is single-threaded, the entire server instance blocks execution until the shell script returns with its result. During this time it will not be able to respond to any incoming packets from any users (not just the authenticating user), and if this blocking call takes long enough, it will start to manifest as latency spikes (delayed responses) or packet loss (missed responses). Because of this, it's critical that all of these script hooks are efficient and return results very quickly.

The shell script called in the tls-verify hook is /usr/local/sbin/ovpn_auth_verify, and that shell script in turn synchronously calls a PHP script /etc/inc/openvpn.tls-verify.php. When "Certificate Depth" is set to the default of "One (Client+Server)", that PHP script gets executed 4 times for every user authentication attempt: once for each verification level (the tls_serial_{n} environment variables) multiplied by once for each certificate in the chain. On our production pfSense hardware, it takes about 1.2 to 1.5 seconds total for the four rounds of openvpn.tls-verify.php to execute, depending on current load. Obviously, a blocking script in a critical path like this should be returning in milliseconds, not seconds.

To measure the time spent running openvpn.tls-verify.php for each user authentication event, I hacked up this gnarly shell one-liner that provides sample typical inputs for the four runs that OpenVPN would normally invoke via the tls-verify hook:

[2.5.2-RELEASE][root@pfSense.mycompany.com]/usr/local/sbin: time sh -c "\
/usr/local/bin/php-cgi -q /etc/inc/openvpn.tls-verify.php 'servercn=CVP-RA2-VPN&depth=1&certdepth=1&certsubject=CN=CA-CVP-RA2, C=US, ST=IL, L=MyCity, O=My Company&serial=0&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/bin/php-cgi -q /etc/inc/openvpn.tls-verify.php 'servercn=CVP-RA2-VPN&depth=1&certdepth=1&certsubject=CN=CA-CVP-RA2, C=US, ST=IL, L=MyCity, O=My Company&serial=&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/bin/php-cgi -q /etc/inc/openvpn.tls-verify.php 'servercn=CVP-RA2-VPN&depth=1&certdepth=0&certsubject=C=US, ST=IL, L=MyCity, O=My Company, CN=myuser-RA2&serial=0&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/bin/php-cgi -q /etc/inc/openvpn.tls-verify.php 'servercn=CVP-RA2-VPN&depth=1&certdepth=0&certsubject=C=US, ST=IL, L=MyCity, O=My Company, CN=myuser-RA2&serial=3&config=/var/etc/openvpn/server1/config.ovpn'; \
echo" 
OKOKOKOK
1.357u 0.172s 0:01.53 99.3%     3730+323k 171+0io 0pf+0w

That particular test completed in a super slow 1.357 seconds.

I believe the reason for this regression is most likely due to commit 1bfdb794cb2a06932da0029ca37f9727c3f74274, which landed in pfSense 2.5.2:
https://github.com/pfsense/pfsense/commit/1bfdb794cb2a06932da0029ca37f9727c3f74274

That commit changed the PHP interpreter used to execute openvpn.tls-verify.php. If I take my same timing test command from above, and I adapt it to use the old interpreter, I get the following results:

[2.5.2-RELEASE][root@pfSense.mycompany.com]/usr/local/sbin: time sh -c "\
/usr/local/sbin/fcgicli -f /etc/inc/openvpn.tls-verify.php -d 'servercn=CVP-RA2-VPN&depth=1&certdepth=1&certsubject=CN=CA-CVP-RA2, C=US, ST=IL, L=MyCity, O=My Company&serial=0&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/sbin/fcgicli -f /etc/inc/openvpn.tls-verify.php -d 'servercn=CVP-RA2-VPN&depth=1&certdepth=1&certsubject=CN=CA-CVP-RA2, C=US, ST=IL, L=MyCity, O=My Company&serial=&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/sbin/fcgicli -f /etc/inc/openvpn.tls-verify.php -d 'servercn=CVP-RA2-VPN&depth=1&certdepth=0&certsubject=C=US, ST=IL, L=MyCity, O=My Company, CN=myuser-RA2&serial=0&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/sbin/fcgicli -f /etc/inc/openvpn.tls-verify.php -d 'servercn=CVP-RA2-VPN&depth=1&certdepth=0&certsubject=C=US, ST=IL, L=MyCity, O=My Company, CN=myuser-RA2&serial=3&config=/var/etc/openvpn/server1/config.ovpn'; \
echo" 
OKOKOKOK
0.000u 0.012s 0:00.12 8.3%      0+0k 5+0io 0pf+0w

Those four rounds completed in less than a millisecond -- so fast, that the time command doesn't offer enough precision to measure a value greater than zero.

It looks like the interpreter was changed in order to offer a temporary fix for Bug #4521. Unfortunately, that patch has caused this painful regression. Given the impact of the regression, coupled with the fact that the switch in interpreter that caused it was only intended as a temporary fix until a more permanent fix using fcgicli could be written, would you all be willing to consider reverting that patch? Alternatively, fast-tracking a new patch that solves Bug #4521 using the original fcgicli interpreter would be the most ideal solution, but I don't have a sense of how feasible that would be.

For reference, here's a thread on the pfSense subreddit where it looks like I'm not the only one impacted by this regression:
https://old.reddit.com/r/PFSENSE/comments/pffgj8/latency_spikes_and_packet_loss_on_openvpn_after/


Files

pfSense_OpenVPN_Micro-Outages.png (143 KB) pfSense_OpenVPN_Micro-Outages.png Traffic graph showing micro-outages Brett Keller, 09/15/2021 05:58 PM

Related issues

Related to Bug #4521: OpenVPN authentication and certificate validation fail due to size of data passed through ``fcgicli``ClosedJim Pingle03/14/2015

Actions
Actions

Also available in: Atom PDF