Bug #14386
closed``openvpn.auth-user.php`` gets stuck at 100% CPU usage when RADIUS authentication times out
100%
Description
This seems to be new behavior on pfsense plus 23.01-RELEASE. I'm seeing openvpn.auth-user.php processes stuck consuming 100% cpu. They seem to be correlated to openvpn AUTH_FAILED events as well as:
May 14 16:09:32 openvpn[8028]: /openvpn.auth-user.php: Error during RADIUS authentication : Operation timed out
Process:
26834 - R 1124:57.09 /usr/local/sbin/fcgicli -f /etc/inc/openvpn.auth-user.php -d username=USERNAME&password=PASSWORD&cn=CN&strictcn=false&authcfg=AUTHCFGE=&dev=ovpns3&untrusted_port=55994&modeid=server3&nas_port=1196
truss shows the process continually running:
recvfrom(0,0x820ea2280,8,0,NULL,0x0) = 0 (0x0)
Updated by Kris Phillips over 1 year ago
Hello,
Do you see this same behavior in 23.05?
Updated by aleksei prokofiev over 1 year ago
Can not reproduce on 23.05.1
23.05.1-RELEASE (amd64)
built on Wed Jun 28 03:57:27 UTC 2023
FreeBSD 14.0-CURRENT
Updated by Orion Poplawski over 1 year ago
I just checked my 23.05 box and I see 5 of these runaways. I guess we'll need to update to 23.05.1 to test that...
Updated by Marcos M about 1 year ago
I'm not sure that the Operation timed out
log is related because the existence of the log means that the authentication process completes and the script returns, hence fcgicli
should exit. The truss output seems to indicate that fcgicli
is waiting on openvpn.auth-user.php
to finish which implies the issue is with the .php file. I did manage to reproduce this issue once (unfortunately I did not have any debugging output applied); as expected no output was returned from openvpn.attributes.php. I also noticed the following in the OpenVPN logs rather than in the system logs (reversed):
Oct 11 14:08:17 openvpn 81755 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - -> 172.25.1.1 - Restarting packages. Oct 11 14:08:17 openvpn 81755 /rc.newwanip: rc.newwanip called with empty interface. Oct 11 14:08:17 openvpn 81755 /rc.newwanip: rc.newwanip: on (IP address: 172.25.1.1) (interface: []) (real interface: ovpns1). Oct 11 14:08:17 openvpn 81755 /rc.newwanip: rc.newwanip: Info: starting on ovpns1.
Perhaps the script isn't ending due to some issue around openlog()? A quick way to test that would be to simply omit
openlog()
and closelog()
altogether.
To narrow down the issue further, it'd be useful to know if this can be reproduced after doing any or all of the following:
- Disabling
Duplicate Connection
in the OpenVPN server config, or - Renaming
/etc/inc/openvpn.attributes.php
to/etc/inc/openvpn.attributes.php.bak
(used for Cisco-AVPair rules) - Apply the following patch to omit openlog() Show
Updated by Orion Poplawski about 1 year ago
Duplicate Connection was already disabled (multiple connections from the same user are not allowed - check box not checked.
I moved /etc/inc/openvpn.attributes.php out of the way.
Just had a re-occurrence, so I will now apply the patch.
Updated by Danilo Zrenjanin about 1 year ago
The patch fixed the issue on a firewall with the same symptoms. There were no side effects after applying the patch.
23.05.1-RELEASE (amd64) built on Wed Jun 28 03:57:27 UTC 2023 FreeBSD 14.0-CURRENT
Updated by Marcos M about 1 year ago
Preferably we can get more testing/feedback on this. Given the inconsistent nature of the issue, it may simply be the issue has not appeared yet.
Updated by Orion Poplawski about 1 year ago
Just had another reproducer. However, this time it did not appear to the associated with the auth timeout message initially reported, but a different variant:
Oct 27 10:04:22 PLUGIN_CALL: POST /usr/local/lib/openvpn/plugins/openvpn-plugin-auth-script.so/PLUGIN_AUTH_USER_PASS_VERIFY status=2 Oct 27 10:04:22 TLS: Username/Password authentication deferred for username 'USER' ... Oct 27 10:05:18 PUSH: Received control message: 'PUSH_REQUEST' Oct 27 10:05:34 Delayed exit in 5 seconds Oct 27 10:05:34 SENT CONTROL [UNDEF]: 'AUTH_FAILED' (status=1) Oct 27 10:05:34 SENT CONTROL [akialoa.cora.nwra.com]: 'AUTH_FAILED' (status=1)
Although now that I check I see that the "Duplicate Connection" box is checked on this particular VPN configuration, so I need to disable that now. Sigh
Updated by Marcos M about 1 year ago
Orion Poplawski wrote in #note-8:
Just had another reproducer. However, this time it did not appear to the associated with the auth timeout message initially reported, but a different variant:
[...]
Although now that I check I see that the "Duplicate Connection" box is checked on this particular VPN configuration, so I need to disable that now. Sigh
Is that with the patch applied?
Updated by Orion Poplawski about 1 year ago
Marcos M wrote in #note-9:
Orion Poplawski wrote in #note-8:
Just had another reproducer. However, this time it did not appear to the associated with the auth timeout message initially reported, but a different variant:
[...]
Although now that I check I see that the "Duplicate Connection" box is checked on this particular VPN configuration, so I need to disable that now. SighIs that with the patch applied?
Yes - patch applied and /etc/inc/openvpn.attributes.php moved out of the way.
Updated by Marcos M about 1 year ago
In that case, it seems more likely that the issue is with fcgicli. Given this issue seems to have appeared in 23.01, it may be due to reintroducing the use of fcgicli here:
https://github.com/pfsense/pfsense/commit/758ee42ae096fee8436efc89f2c9bcc4ae7ea23d
Please try the following patch to use php-cgi instead (updated): Show
Updated by Orion Poplawski about 1 year ago
The patched version of ovpn_auth_verify seems to work okay, but if I patch ovpn_auth_verify_async I get auth failures:
SENT CONTROL [UNDEF]: 'AUTH_FAILED' (status=1)
not much else in the log to know why.
Updated by Orion Poplawski 10 months ago
FWIW - still present in 2.7.2 and 23.09.1.
Updated by Orion Poplawski 10 months ago
As I noted, patching ovpn_auth_verify_async gave me auth failures. I could try it again an see if that is still the case. Or just patch ovpn_auth_verify, but I wasn't sure if patching just that was sufficient.
Updated by Orion Poplawski 10 months ago
Okay, I have that applied and working now. And I don't seem to see a runaway with a simple auth failure. We'll see if anything else triggers it. Thanks.
Updated by Reid Linnemann 10 months ago
- Assignee set to Reid Linnemann
- Target version set to 24.03
Since fcgicli is continually calling recv() for 8 bytes, I think it's stuck in its read_packet loop. The socket's been closed by php-fpm but no fastcgi message has been received indicating the request is complete. Probably the socket is closed because php-fpm hits a terminal condition when accepting the connection, like running out of FDs, and just immediately closes the socket without sending any fastcgi messages back. fcgicli then loops on recv() because the return value is >= 0, interpreted as a short read. I'll correct this in devel and add some more foot-shooting protection. Do you have the ability to test this against a devel snapshot?
Updated by Orion Poplawski 10 months ago
I don't have a test instance at the moment, but I've thought about spinning one up for quite a while. I'll see what I can do. Thanks for taking a look at this - it sounds promising.
Updated by Reid Linnemann 10 months ago
What version(s) are you currently running and encountering this? Perhaps a test build of fcgicli would be appropriate.
Updated by Orion Poplawski 10 months ago
pfSense 2.7.2 and pfSense Plus 23.09.1
Updated by Reid Linnemann 10 months ago
- Status changed from New to Feedback
Should be fixed in https://github.com/pfsense/FreeBSD-ports/commit/c0a12f594ba2a873ffd5ec8974c5582e6283fbdf. The 0 byte read handling is definitely the culprit.
Updated by Jim Pingle 10 months ago
- Project changed from pfSense Plus to pfSense
- Subject changed from openvpn.auth-user.php gets stuck at 100% cpu usage after RADIUS auth timed out to ``openvpn.auth-user.php`` gets stuck at 100% CPU usage when RADIUS authentication times out
- Category changed from OpenVPN to OpenVPN
- Target version changed from 24.03 to 2.8.0
- % Done changed from 0 to 100
- Affected Plus Version deleted (
23.01) - Plus Target Version set to 24.03
Updated by Jim Pingle 8 months ago
- Status changed from Feedback to Resolved
No reports of failures since this went in. Can always reopen it if someone can reproduce the problem on current builds.