Bug #6318
Closed
IPsec dashboard widget causes GUI failure
Added by Rick Strangman over 8 years ago. Updated almost 7 years ago.
100%
Description
Since 2.3_1 the webConfigurator continually becomes unresponsive. When I try to access my HTTPS web interface on port 444, the page hangs and eventually responds with "504 Gateway Time-out - nginx" in both IE and Firefox. The nginx-error log file shows the following:
2016/05/05 18:27:54 [error] 30498#0: *254302 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 192.168.xxx.10, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "pf.xxxxx.biz:444"
2016/05/05 19:17:32 [alert] 30180#0: close() socket failed (9: Bad file descriptor)
2016/05/05 19:20:44 [error] 87811#0: *12 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 192.168.xxx.10, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "pf.xxxxx.biz:444"
2016/05/05 19:27:58 [error] 87811#0: *447 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 192.168.xxx.10, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "pf.xxxxx.biz:444"
Restarting the webconfigurator from the console does not resolve the issue.
Other than the web not functioning, the firewall is performing as normal.
Files
php-stuck-truss-04.txt (44.7 KB) | Jim Pingle, 12/22/2016 02:39 PM
Updated by Chris Buechler over 8 years ago
- Status changed from New to Feedback
- Affected Version deleted (2.3.1)
That's 2.3(.0)_1 rather than 2.3.1. The 2.3->2.3_1 update itself didn't cause this, since it only upgraded ntpd; it's something that would have happened on 2.3 as well. I'm guessing it's one of two things: either something related to #6177, or some kind of issue with the IPsec dashboard widget, which seems to be causing this for a few people.
If this is replicable for you and you have the IPsec dashboard widget enabled, please try removing it and see if that fixes the problem. That'll at least tell us where the issue resides.
Updated by Brent Kerlin over 8 years ago
Chris Buechler wrote:
That's 2.3(.0)_1 rather than 2.3.1. The 2.3->2.3_1 update itself didn't cause this, since it only upgraded ntpd; it's something that would have happened on 2.3 as well. I'm guessing it's one of two things: either something related to #6177, or some kind of issue with the IPsec dashboard widget, which seems to be causing this for a few people.
If this is replicable for you and you have the IPsec dashboard widget enabled, please try removing it and see if that fixes the problem. That'll at least tell us where the issue resides.
I have seen this issue frequently on clients since 2.3 rolled. I was more concerned with #6296 which was causing me many headaches, but I will try removing the IPSec widget on a few sites and report back (I have one with the webgui locked up right now who is a prime candidate). Any log dumps that would be helpful?
Updated by Brent Kerlin over 8 years ago
Restarting the webconfigurator from the console does not resolve the issue.
Other than the web not functioning, the firewall is performing as normal.
Try restarting PHP-FPM from the console. That seems to clear up the issue for me...
Updated by Brent Kerlin over 8 years ago
Brent Kerlin wrote:
I have seen this issue frequently on clients since 2.3 rolled. I was more concerned with #6296 which was causing me many headaches, but I will try removing the IPSec widget on a few sites and report back (I have one with the webgui locked up right now who is a prime candidate). Any log dumps that would be helpful?
I have removed the IPSec Widget from all the sites at which I was having this PHP-FPM issue. I'll report back in a couple days or if the problem persists.
Updated by Rick Strangman over 8 years ago
I have had no issues since removing the IPsec widget. I am now on 2.3.1 and have not seen a lockup.
Updated by Chris Buechler over 8 years ago
- Subject changed from pfsense webconfigurator to IPsec dashboard widget causes GUI failure
- Status changed from Feedback to Confirmed
- Target version set to 2.3.2
- Affected Version set to 2.3.x
- Affected Architecture added
- Affected Architecture deleted (amd64)
Updated by Anonymous over 8 years ago
I have looked through the code again and nothing really stands out.
It would be helpful to know:
- How many tunnels do people have in cases where the issue is seen?
- Does it make any difference if the widget is set to show Overview, Tunnels, or Mobile?
Thanks!
Updated by Chris Buechler over 8 years ago
Steve Beaver wrote:
I have looked through the code again and nothing really stands out.
Ditto. Heard of roughly a handful of reports of this, but never seen it myself. Additional details would be appreciated.
Updated by Chris Buechler over 8 years ago
Thanks to Alex for getting me into an affected system. It's occasionally getting stuck in pfSense_ipsec_list_sa, without triggering any of the printfs there.
PHP_FUNCTION(pfSense_ipsec_list_sa)
{
    vici_conn_t *conn;
    vici_req_t *req;
    vici_res_t *res;

    array_init(return_value);
    vici_init();
    conn = vici_connect(NULL);
    if (conn) {
        if (vici_register(conn, "list-sa", build_ipsec_sa_array, (void *) return_value) != 0) {
            php_printf("VICI registration failed: %s\n", strerror(errno));
        } else {
            req = vici_begin("list-sas");
            res = vici_submit(req, conn);
            if (res) {
                vici_free_res(res);
            }
        }
        vici_disconnect(conn);
    } else {
        php_printf("VICI connection failed: %s\n", strerror(errno));
    }
    vici_deinit();
}
What I committed on this ticket should prevent this (and many other possible failure scenarios with commands that don't return) from killing the GUI. request_terminate_timeout will kill them off after 900 seconds. It only happens once every few minutes when continually refreshing a page that uses that function, so that's been enough to keep Alex's system from killing the GUI again.
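The effect of request_terminate_timeout can be pictured as a watchdog: a supervisor gives each worker a deadline and kills it if the deadline passes. The sketch below only illustrates that idea and is not PHP-FPM's implementation. It assumes a POSIX system, and run_with_deadline, quick_request, and stuck_request are hypothetical names invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Current value of the monotonic clock, in seconds. */
static double monotonic_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: run work() in a child process and kill the child
 * if it has not finished within timeout_sec seconds.
 * Returns 0 if the child finished on its own, 1 if it had to be killed,
 * and -1 on fork()/waitpid() failure. */
static int run_with_deadline(void (*work)(void), double timeout_sec)
{
    assert(work != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {            /* child: do the work, then exit */
        work();
        _exit(0);
    }

    double deadline = monotonic_seconds() + timeout_sec;
    for (;;) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 0;          /* child ended before the deadline */
        if (r < 0)
            return -1;
        if (monotonic_seconds() >= deadline) {
            kill(pid, SIGKILL);        /* deadline passed: terminate it */
            waitpid(pid, &status, 0);  /* reap the killed child */
            return 1;
        }
        struct timespec nap = {0, 50 * 1000 * 1000};  /* poll every 50 ms */
        nanosleep(&nap, NULL);
    }
}

/* Stand-ins for a request that completes and one that never returns. */
static void quick_request(void) { }
static void stuck_request(void) { for (;;) pause(); }
```

Calling run_with_deadline(stuck_request, 1.0) returns 1 after about one second, while run_with_deadline(quick_request, 5.0) returns 0 almost immediately. As in the comment above, the watchdog frees the stuck worker but does nothing about whatever caused the hang.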
Updated by Chris Buechler over 8 years ago
- Target version changed from 2.3.2 to 2.4.0
Updated by Jim Pingle almost 8 years ago
- File php-stuck-truss-04.txt php-stuck-truss-04.txt added
This also affects Status > IPsec
We have access to a customer system that has 70 tunnels defined, and it happens every 5-20 minutes (timing varies) while a browser is left on Status > IPsec. The requests are not piling up; they only take about 300 ms to complete. Leaving a browser open on Status > IPsec with Firebug or similar running, it's easy to spot when it stops responding.
When it happens, there are always two PHP child processes:
: ps uxawww | grep '[p]hp'
root 64113 0.5 0.9 272496 38300 - S  1:43PM 0:02.26 php-fpm: pool nginx (php-fpm)
root   267 0.0 0.6 268400 25140 - Ss 3:01AM 0:02.49 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root 64043 0.0 0.9 285304 38604 - I  1:43PM 0:00.19 php-fpm: pool nginx (php-fpm)
Attempting to run truss on the top process (in state "S", sleeping) shows no output at all.
Running truss on the other process (in state "I", idle) outputs info, and then the browser gets a response. So long as the truss happens before the browser times out, everything keeps running. The truss output is attached. I have several more copies of truss output from other times I reproduced the issue, but they are all very close if not identical. I find it odd that merely attaching to the process with truss is somehow waking it up and causing it to proceed. I've tried hitting the process with other signals like kill -HUP, but so far nothing brings it back to life except touching it with truss or killing/restarting PHP-FPM.
There isn't much that happens in the AJAX request being made for Status > IPsec or the IPsec widget, so it could be getting stuck in the vici interaction.
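If a caller does block forever waiting for a reply on a local socket, one general defensive pattern is to bound the wait with poll() before reading. The sketch below shows that pattern on a plain file descriptor. It does not use libvici, and this thread does not say whether libvici offers a timeout or exposes its socket, so treat read_with_timeout as a hypothetical helper rather than a proposed fix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of libvici or pfSense: wait up to
 * timeout_ms for fd to become readable, then perform a single read().
 * Returns the number of bytes read (0 at end of file), or -1 with
 * errno set: ETIMEDOUT if nothing arrived in time, otherwise the
 * error reported by poll() or read(). */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    assert(buf != NULL);

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;

    do {
        /* If a signal interrupts the wait, retry. Note that this
         * simple loop restarts the full timeout on each retry. */
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;             /* poll() itself failed */
    if (ready == 0) {
        errno = ETIMEDOUT;     /* peer stayed silent for timeout_ms */
        return -1;
    }
    return read(fd, buf, len);
}
```

With this in place, a silent peer produces an error the caller can report after timeout_ms instead of an indefinite hang.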
Updated by Nick Wenos almost 8 years ago
We are also having what appears to be the same issue running on version 2.3.2. As a side effect of php-fpm going down, our OpenVPN clients also lose the ability to connect until we restart php-fpm and openvpn. I don't know if this would affect all OpenVPN setups or just those using SSL certificate authentication, as is the case with our setup.
Updated by Eric Machabert over 7 years ago
Nick Wenos wrote:
We are also having what appears to be the same issue running on version 2.3.2. As a side effect of php-fpm going down, our OpenVPN clients also lose the ability to connect until we restart php-fpm and openvpn. I don't know if this would affect all OpenVPN setups or just those using SSL certificate authentication, as is the case with our setup.
We are also seeing this on 2.3.3.
Running netstat -an shows requests filling up the Recv-Q for the IPC connection to /var/run/php-fpm.socket.
Updated by Chris Baker over 7 years ago
I am also seeing this on 2.3.3. Is there any known workaround other than removing the IPsec widget? Maybe changing the polling frequency?
Updated by Marcio Merlone over 7 years ago
I think this bug's priority should be raised since it also breaks openvpn functionality.
Updated by Anonymous over 7 years ago
- Target version changed from 2.4.0 to 2.4.1
Updated by Anonymous over 7 years ago
- Status changed from Confirmed to Feedback
- Target version changed from 2.4.1 to 2.4.0
I have done a LOT of research into this. I believe that since most dashboard widgets have their own timer, their own buffer and their own AJAX calling functions, they are from time to time stepping on each other and causing havoc on the server side.
As an experiment (for now) I have removed all of the individual refresh stuff from the widgets and replaced them with a single, central refresh service that loops through the dashboard updating each widget one at a time.
So far, the results appear to be dramatically better. I can't guarantee that this will solve the IPSec widget issue, but I think it might. I note that the time taken to refresh the IPSec widget has reduced from 5 seconds to about 10 ms so that has got to help.
The changes will be in 2.4-BETA later today.
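The scheduling idea described above, one central loop instead of a timer per widget, can be sketched as a round-robin dispatcher. The actual change lives in the dashboard's own code, which this thread does not show. The C below only illustrates the idea, and every name in it (struct dashboard, dashboard_register, dashboard_tick, count_refresh) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WIDGETS 16

/* A widget is a name plus a refresh callback and its private state. */
typedef void (*refresh_fn)(void *state);

struct widget {
    const char *name;
    refresh_fn  refresh;
    void       *state;
};

/* The central service owns the widget list and remembers whose turn it is. */
struct dashboard {
    struct widget widgets[MAX_WIDGETS];
    size_t        count;
    size_t        next;
};

/* Add a widget to the rotation. Returns 0 on success, -1 if the list is full. */
static int dashboard_register(struct dashboard *d, const char *name,
                              refresh_fn fn, void *state)
{
    assert(d != NULL && fn != NULL);
    if (d->count == MAX_WIDGETS)
        return -1;
    d->widgets[d->count++] = (struct widget){ name, fn, state };
    return 0;
}

/* One tick of the single timer: refresh exactly one widget, then advance.
 * Because only this function invokes refresh callbacks, two widgets can
 * never be refreshing at the same time.
 * Returns the name of the widget refreshed, or NULL if none are registered. */
static const char *dashboard_tick(struct dashboard *d)
{
    assert(d != NULL);
    if (d->count == 0)
        return NULL;
    struct widget *w = &d->widgets[d->next];
    w->refresh(w->state);
    d->next = (d->next + 1) % d->count;
    return w->name;
}

/* Example callback: counts how often its widget was refreshed. */
static void count_refresh(void *state)
{
    ++*(int *)state;
}
```

With two widgets registered, three ticks refresh the first widget twice and the second once. The trade-off is that a slow widget delays every widget behind it, in exchange for never issuing overlapping requests.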
Updated by → luckman212 over 7 years ago
Sounds like a fantastic change. Thanks Steve
Updated by Jim Pingle about 7 years ago
- Status changed from Feedback to New
- Target version changed from 2.4.0 to 2.4.1
I still see this, but it seems less common than it did in the past. With bad timing, or after sitting on the dashboard too long with the IPsec widget, it still wedges.
Kicking it forward since it isn't critical.
Updated by Jim Pingle about 7 years ago
- Target version changed from 2.4.1 to 2.4.2
There have been some IPsec widget fixes here which may be relevant, but since this is so difficult to reproduce, it is difficult to know whether it has been fully resolved. Moving forward.
Updated by Jim Pingle about 7 years ago
- Target version changed from 2.4.2 to 2.4.3
Updated by Jim Pingle almost 7 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
- Affected Architecture All added
- Affected Architecture deleted ()
This appears to be fixed by other changes to the IPsec status code in recent versions. No new reports of this being caused by IPsec in some time.