Project

General

Profile

Actions

Regression #12382

open

Certificate Depth checking creates OpenVPN micro-outages every time a user authenticates after 2.5.2 upgrade

Added by Brett Keller about 3 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OpenVPN
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Default
Affected Version:
2.5.2
Affected Architecture:

Description

We're running several OpenVPN servers on a single pfSense box at our office, and ever since upgrading from 2.4.5p1 to 2.5.2, with no changes to our OpenVPN configuration that has otherwise been working well for a long time, we've suddenly been experiencing crippling performance problems. The symptoms are that every time a user authenticates with the server, all traffic stops flowing on that OpenVPN instance for 1 to 2 seconds for every single user at the same time. On our busiest server instance with 50-60 users, we're seeing micro-outages causing packet loss and latency spikes as frequently as 4 times per minute, making the VPN almost unusable.

To illustrate the problem, I have attached a graph that shows the micro-outages. I performed a packet capture for just over one minute on our WAN interface, filtered for one of the UDP ports we use for an OpenVPN server (this one with 50-60 simultaneous users). The graph shows number of packets per time unit from that packet capture, and for a busy VPN server like this one, the traffic should always be flowing. Note, however, that the graph dips to the zero line four times during the capture. Each of these events lasted for somewhere between 1 to 2 seconds, and each time one of these micro-outages occurred, it directly corresponded to a time that a user attempted to authenticate (as recorded in /var/log/openvpn.log). If I inspect the raw packet capture at these points in time, I can see that the clients keep sending OpenVPN UDP packets to the server, but the server does not send any traffic to any client for the duration of the dip in the graph.

After a ton of troubleshooting and testing, I was able to find the root cause of this problem. When certificate depth checking is enabled for an OpenVPN instance in pfSense (and it is enabled by default), it creates these micro-outages. Changing the "Certificate Depth" setting to "Do Not Check" completely eliminates the outages, and clients no longer experience horrific latency spikes and packet loss.

When certificate depth checking is enabled, pfSense adds a tls-verify hook script to the generated OpenVPN config file. During client authentication, OpenVPN executes this shell script synchronously, and because OpenVPN is single-threaded, the entire server instance blocks execution until the shell script returns with its result. During this time it will not be able to respond to any incoming packets from any users (not just the authenticating user), and if this blocking call takes long enough, it will start to manifest as latency spikes (delayed responses) or packet loss (missed responses). Because of this, it's critical that all of these script hooks are efficient and return results very quickly.

The shell script called in the tls-verify hook is /usr/local/sbin/ovpn_auth_verify, and that shell script in turn synchronously calls a PHP script /etc/inc/openvpn.tls-verify.php. When "Certificate Depth" is set to the default of "One (Client+Server)", that PHP script gets executed 4 times for every user authentication attempt: once for each verification level (the tls_serial_{n} environment variables) multiplied by once for each certificate in the chain. On our production pfSense hardware, it takes about 1.2 to 1.5 seconds total for the four rounds of openvpn.tls-verify.php to execute, depending on current load. Obviously, a blocking script in a critical path like this should be returning in milliseconds, not seconds.

To measure the time spent running openvpn.tls-verify.php for each user authentication event, I hacked up this gnarly shell one-liner that provides sample typical inputs for the four runs that OpenVPN would normally invoke via the tls-verify hook:

[2.5.2-RELEASE][root@pfSense.mycompany.com]/usr/local/sbin: time sh -c "\
/usr/local/bin/php-cgi -q /etc/inc/openvpn.tls-verify.php 'servercn=CVP-RA2-VPN&depth=1&certdepth=1&certsubject=CN=CA-CVP-RA2, C=US, ST=IL, L=MyCity, O=My Company&serial=0&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/bin/php-cgi -q /etc/inc/openvpn.tls-verify.php 'servercn=CVP-RA2-VPN&depth=1&certdepth=1&certsubject=CN=CA-CVP-RA2, C=US, ST=IL, L=MyCity, O=My Company&serial=&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/bin/php-cgi -q /etc/inc/openvpn.tls-verify.php 'servercn=CVP-RA2-VPN&depth=1&certdepth=0&certsubject=C=US, ST=IL, L=MyCity, O=My Company, CN=myuser-RA2&serial=0&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/bin/php-cgi -q /etc/inc/openvpn.tls-verify.php 'servercn=CVP-RA2-VPN&depth=1&certdepth=0&certsubject=C=US, ST=IL, L=MyCity, O=My Company, CN=myuser-RA2&serial=3&config=/var/etc/openvpn/server1/config.ovpn'; \
echo" 
OKOKOKOK
1.357u 0.172s 0:01.53 99.3%     3730+323k 171+0io 0pf+0w

That particular test completed in a super slow 1.357 seconds.

I believe the reason for this regression is most likely due to commit 1bfdb794cb2a06932da0029ca37f9727c3f74274, which landed in pfSense 2.5.2:
https://github.com/pfsense/pfsense/commit/1bfdb794cb2a06932da0029ca37f9727c3f74274

That commit changed the PHP interpreter used to execute openvpn.tls-verify.php. If I take my same timing test command from above, and I adapt it to use the old interpreter, I get the following results:

[2.5.2-RELEASE][root@pfSense.mycompany.com]/usr/local/sbin: time sh -c "\
/usr/local/sbin/fcgicli -f /etc/inc/openvpn.tls-verify.php -d 'servercn=CVP-RA2-VPN&depth=1&certdepth=1&certsubject=CN=CA-CVP-RA2, C=US, ST=IL, L=MyCity, O=My Company&serial=0&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/sbin/fcgicli -f /etc/inc/openvpn.tls-verify.php -d 'servercn=CVP-RA2-VPN&depth=1&certdepth=1&certsubject=CN=CA-CVP-RA2, C=US, ST=IL, L=MyCity, O=My Company&serial=&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/sbin/fcgicli -f /etc/inc/openvpn.tls-verify.php -d 'servercn=CVP-RA2-VPN&depth=1&certdepth=0&certsubject=C=US, ST=IL, L=MyCity, O=My Company, CN=myuser-RA2&serial=0&config=/var/etc/openvpn/server1/config.ovpn'; \
/usr/local/sbin/fcgicli -f /etc/inc/openvpn.tls-verify.php -d 'servercn=CVP-RA2-VPN&depth=1&certdepth=0&certsubject=C=US, ST=IL, L=MyCity, O=My Company, CN=myuser-RA2&serial=3&config=/var/etc/openvpn/server1/config.ovpn'; \
echo" 
OKOKOKOK
0.000u 0.012s 0:00.12 8.3%      0+0k 5+0io 0pf+0w

Those four rounds completed in less than a millisecond -- so fast, that the time command doesn't offer enough precision to measure a value greater than zero.

It looks like the interpreter was changed in order to offer a temporary fix for Bug #4521. Unfortunately, that patch has caused this painful regression. Given the impact of the regression, coupled with the fact that the switch in interpreter that caused it was only intended as a temporary fix until a more permanent fix using fcgicli could be written, would you all be willing to consider reverting that patch? Alternatively, fast-tracking a new patch that solves Bug #4521 using the original fcgicli interpreter would be the most ideal solution, but I don't have a sense of how feasible that would be.

For reference, here's a thread on the pfSense subreddit where it looks like I'm not the only one impacted by this regression:
https://old.reddit.com/r/PFSENSE/comments/pffgj8/latency_spikes_and_packet_loss_on_openvpn_after/


Files

pfSense_OpenVPN_Micro-Outages.png (143 KB) pfSense_OpenVPN_Micro-Outages.png Traffic graph showing micro-outages Brett Keller, 09/15/2021 05:58 PM

Related issues

Related to Bug #4521: OpenVPN authentication and certificate validation fail due to size of data passed through ``fcgicli``ClosedJim Pingle03/14/2015

Actions
Actions #1

Updated by Viktor Gurov about 3 years ago

  • Related to Bug #4521: OpenVPN authentication and certificate validation fail due to size of data passed through ``fcgicli`` added
Actions #2

Updated by Viktor Gurov about 3 years ago

Should be improved in #11829

Actions #3

Updated by Jim Pingle about 3 years ago

  • Priority changed from Very High to Normal

If it works for you to revert that change, by all means do so, but more people were negatively impacted by the other failures than have experienced the problem you are seeing.

It is still our intention to fix it eventually, and as Viktor said it's likely improved on 2.6.0/21.09 snapshots.

Actions #4

Updated by Brett Keller about 3 years ago

Well, obviously you know your userbase better than I do, so if more people would be negatively impacted by reverting the change than would be helped, then we'll just have to live with this bug for the moment. In the meantime, on our own production system we've disabled certificate depth checks as a workaround for this bug.

I tested the latest snapshot (2.6.0.a.20210916.0100), which includes the work for #11829, and while the situation has improved, it's still unusable in our environment with certificate depth checks enabled. The openvpn.tls-verify.php script now only runs 3 times instead of 4, skipping the run for the empty serial number. Accordingly, my tests are showing a 25% reduction in time spent in the tls-verify hook, but that still leaves over one second of blocked OpenVPN execution time for every auth, which still is plenty of time to create micro-outages that lead to latency spikes and dropped packets.

Actions #5

Updated by Denis Grilli about 3 years ago

Just want to thank you Brett for the debug of the problem that unfortunately affects us as well.

Doesn't change anything but just to notice that the certificate checks happens not just when a new user connect to the VPN but also every time TLS get renegotiated that by default is every hour.

Actions #6

Updated by Marcos M about 3 years ago

The solution here should be to implement async auth plugin. It's already being done with /usr/local/sbin/ovpn_auth_verify_async. Unclear if it's also possible to instead use auth-user-pass-verify which could be deferred now in OpenVPN 2.6.

Actions #7

Updated by Joao Assad about 3 years ago

So this bug is affecting us too. We need to route all our VPN clients traffic through the VPN and this bug is causing a huge impact and headache to us. We disabled the certificate depth check to test it and can confirm it is indeed what is causing the problem.

Is there an ETA on the fix for this ? We have a support contract.

Actions #8

Updated by Kris Phillips almost 3 years ago

Joao Assad wrote in #note-7:

So this bug is affecting us too. We need to route all our VPN clients traffic through the VPN and this bug is causing a huge impact and headache to us. We disabled the certificate depth check to test it and can confirm it is indeed what is causing the problem.

Is there an ETA on the fix for this ? We have a support contract.

#11829 is in Feedback and per the previous redmine response from Viktor the situation should be improved in that patch. This should be in 2.6.0 and 22.01 of pfSense and pfSense Plus.

Actions #9

Updated by Jens Groh almost 3 years ago

Is there an ETA on the fix for this ? We have a support contract.

#11829 is in Feedback and per the previous redmine response from Viktor the situation should be improved in that patch. This should be in 2.6.0 and 22.01 of pfSense and pfSense Plus.

I beg you pardon, but AFAIR that patch for #11829 has no direct implications for this and the test script from Brett Keller above is still running incredibly slow. If you mean the content of ee49cc683b224222ae8a33bcdc15971c8365db76 I just checked that. It just added the serial checking if-clause, but the interpreter used (php-cgi) still runs a huge time longer than the original fcgicli one. And as long as the runtime is high and blocks all VPN traffic flowing for a second or two, that will cause outages. You may correct me if I'm wrong but I don't see anything behaving better with that commit. Other then switching back to using fcgicli or rewriting the script/call in another kind of script that's way faster then PHP+php-cgi.

We are also hit with that now and that's only with ~10 users in VPN but as our certs include an intermediate level (master CA + VPN CA + certificate), that brings 2-3s blockage with every dial-in/renew for all VPN users. For me and a few bigger customers that change effectively breaks VPN connectivity as VoIP gets unusable with stuttering and standstill.

Actions #10

Updated by Viktor Gurov almost 3 years ago

Jens Groh wrote in #note-9:

Is there an ETA on the fix for this ? We have a support contract.

#11829 is in Feedback and per the previous redmine response from Viktor the situation should be improved in that patch. This should be in 2.6.0 and 22.01 of pfSense and pfSense Plus.

I beg you pardon, but AFAIR that patch for #11829 has no direct implications for this and the test script from Brett Keller above is still running incredibly slow. If you mean the content of ee49cc683b224222ae8a33bcdc15971c8365db76 I just checked that. It just added the serial checking if-clause, but the interpreter used (php-cgi) still runs a huge time longer than the original fcgicli one. And as long as the runtime is high and blocks all VPN traffic flowing for a second or two, that will cause outages. You may correct me if I'm wrong but I don't see anything behaving better with that commit. Other then switching back to using fcgicli or rewriting the script/call in another kind of script that's way faster then PHP+php-cgi.

We are also hit with that now and that's only with ~10 users in VPN but as our certs include an intermediate level (master CA + VPN CA + certificate), that brings 2-3s blockage with every dial-in/renew for all VPN users. For me and a few bigger customers that change effectively breaks VPN connectivity as VoIP gets unusable with stuttering and standstill.

As a workaround, you can manually revert https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/1bfdb794cb2a06932da0029ca37f9727c3f74274/diff changes

Actions #11

Updated by Jens Groh over 2 years ago

Viktor Gurov wrote in #note-10:

We are also hit with that now and that's only with ~10 users in VPN but as our certs include an intermediate level (master CA + VPN CA + certificate), that brings 2-3s blockage with every dial-in/renew for all VPN users. For me and a few bigger customers that change effectively breaks VPN connectivity as VoIP gets unusable with stuttering and standstill.

As a workaround, you can manually revert https://redmine.pfsense.org/projects/pfsense/repository/1/revisions/1bfdb794cb2a06932da0029ca37f9727c3f74274/diff changes

As a follow-up: Reading https://redmine.pfsense.org/issues/12407 I was asking myself, if that also mitigates this problem here with certificate chain checking being slow. Is there any intel if that also solves this problem with using the "slower" PHP parser?

Thanks!

Actions

Also available in: Atom PDF