Bug #6318

IPsec dashboard widget causes GUI failure

Added by Rick Strangman over 1 year ago. Updated 5 days ago.

Status: New
Priority: Normal
Assignee:
Category: Web Interface
Target version:
Start date: 05/05/2016
Due date:
% Done: 0%
Affected version: 2.3.x
Affected Architecture:
Description

Since 2.3_1, the webConfigurator has repeatedly become unresponsive. When I try to access it over HTTPS on port 444, the page hangs and eventually returns "504 Gateway Time-out" from nginx in both IE and Firefox. The nginx error log shows the following:

2016/05/05 18:27:54 [error] 30498#0: *254302 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 192.168.xxx.10, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "pf.xxxxx.biz:444"
2016/05/05 19:17:32 [alert] 30180#0: close() socket failed (9: Bad file descriptor)
2016/05/05 19:20:44 [error] 87811#0: *12 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 192.168.xxx.10, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "pf.xxxxx.biz:444"
2016/05/05 19:27:58 [error] 87811#0: *447 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 192.168.xxx.10, server: , request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "pf.xxxxx.biz:444"

Restarting the webconfigurator from the console does not resolve the issue.
Other than the web interface not working, the firewall is performing normally.

php-stuck-truss-04.txt (44.7 KB) Jim Pingle, 12/22/2016 02:39 PM

Associated revisions

Revision fa01d062
Added by Chris Buechler over 1 year ago

Set request_terminate_timeout to the same as max_execution_time in case something executed externally doesn't respond, to avoid hanging up all of php-fpm eventually. Ticket #6318 among other similar potential issues.

Revision 42d2f11a
Added by Chris Buechler over 1 year ago

Set request_terminate_timeout to the same as max_execution_time in case something executed externally doesn't respond, to avoid hanging up all of php-fpm eventually. Ticket #6318 among other similar potential issues.

Revision 062a5434
Added by Chris Buechler over 1 year ago

Set request_terminate_timeout to the same as max_execution_time in case something executed externally doesn't respond, to avoid hanging up all of php-fpm eventually. Ticket #6318 among other similar potential issues.

History

#1 Updated by Chris Buechler over 1 year ago

  • Status changed from New to Feedback
  • Affected version deleted (2.3.1)

That's 2.3(.0)_1 rather than 2.3.1. The 2.3 -> 2.3_1 update didn't cause this, since it only upgraded ntpd; it's something that would have happened on 2.3 as well. I'm guessing it's one of two things: either something related to #6177, or some kind of issue with the IPsec dashboard widget, which seems to cause this for a few people.

If this is replicable for you and you have the IPsec dashboard widget enabled, please try removing it and see whether that fixes the problem. That will at least tell us where the issue resides.

#2 Updated by Brent Kerlin over 1 year ago

Chris Buechler wrote:

> That's 2.3(.0)_1 rather than 2.3.1. The 2.3 -> 2.3_1 update didn't cause this, since it only upgraded ntpd; it's something that would have happened on 2.3 as well. I'm guessing it's one of two things: either something related to #6177, or some kind of issue with the IPsec dashboard widget, which seems to cause this for a few people.
>
> If this is replicable for you and you have the IPsec dashboard widget enabled, please try removing it and see whether that fixes the problem. That will at least tell us where the issue resides.

I have seen this issue frequently on clients since 2.3 was released. I was more concerned with #6296, which was causing me many headaches, but I will try removing the IPsec widget on a few sites and report back (I have one site with the web GUI locked up right now that is a prime candidate). Are there any log dumps that would be helpful?

#3 Updated by Brent Kerlin over 1 year ago

> Restarting the webconfigurator from the console does not resolve the issue.
> Other than the web interface not working, the firewall is performing normally.

Try restarting PHP-FPM from the console. That seems to clear up the issue for me...

#4 Updated by Brent Kerlin over 1 year ago

Brent Kerlin wrote:

> I have seen this issue frequently on clients since 2.3 was released. I was more concerned with #6296, which was causing me many headaches, but I will try removing the IPsec widget on a few sites and report back (I have one site with the web GUI locked up right now that is a prime candidate). Are there any log dumps that would be helpful?

I have removed the IPsec widget from all the sites where I was having this PHP-FPM issue. I'll report back in a couple of days, or sooner if the problem persists.

#5 Updated by Rick Strangman over 1 year ago

I have had no issues since removing the IPsec widget. I'm now on 2.3.1 and have not seen a lockup.

#6 Updated by Chris Buechler over 1 year ago

  • Subject changed from pfsense webconfigurator to IPsec dashboard widget causes GUI failure
  • Status changed from Feedback to Confirmed
  • Target version set to 2.3.2
  • Affected version set to 2.3.x
  • Affected Architecture deleted (amd64)

#7 Updated by Steve Beaver over 1 year ago

I have looked through the code again and nothing really stands out.

It would be helpful to know:

  • How many tunnels do people have in cases where the issue is seen?
  • Does it make any difference if the widget is set to show Overview, Tunnels, or Mobile?

Thanks!

#8 Updated by Chris Buechler over 1 year ago

Steve Beaver wrote:

> I have looked through the code again and nothing really stands out.

Ditto. I've heard roughly a handful of reports of this, but have never seen it myself. Additional details would be appreciated.

#9 Updated by Chris Buechler over 1 year ago

Thanks to Alex for getting me into an affected system. It's occasionally getting stuck in pfSense_ipsec_list_sa, without triggering any of the printfs there.

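PHP_FUNCTION(pfSense_ipsec_list_sa) {

        vici_conn_t *conn;
        vici_req_t *req;
        vici_res_t *res;

        /* The result is a PHP array, filled in by the event callback below. */
        array_init(return_value);

        vici_init();
        /* NULL: connect to charon's default VICI socket. */
        conn = vici_connect(NULL);
        if (conn) {
                /* Subscribe to "list-sa" events; charon streams one event per IKE_SA
                 * and build_ipsec_sa_array() adds each one to return_value. */
                if (vici_register(conn, "list-sa", build_ipsec_sa_array, (void *) return_value) != 0) {
                        php_printf("VICI registration failed: %s\n", strerror(errno));
                } else {
                        req = vici_begin("list-sas");
                        /* Waits for charon's reply to the command; no timeout is set here. */
                        res = vici_submit(req, conn);
                        /* A NULL result (request failed) is silently ignored. */
                        if (res) {
                                vici_free_res(res);
                        }
                }
                vici_disconnect(conn);
        } else {
                php_printf("VICI connection failed: %s\n", strerror(errno));
        }

        vici_deinit();

}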

What I committed on this ticket should prevent this (and many other possible failure scenarios involving commands that don't return) from killing the GUI: request_terminate_timeout will kill such requests after 900 seconds. The hang only happens once every few minutes when continually refreshing a page that uses that function, so that has been enough to keep Alex's system from killing the GUI again.

#10 Updated by Chris Buechler about 1 year ago

  • Target version changed from 2.3.2 to 2.4.0

#11 Updated by Jim Pingle 9 months ago

This also affects Status > IPsec

We have access to a customer system that has 70 tunnels defined, and it happens every 5-20 minutes (timing varies) while a browser is left on Status > IPsec. The requests are not piling up; each takes only about 300 ms to complete. With a browser left open on Status > IPsec and Firebug or a similar tool running, it's easy to spot when it stops responding.

When it happens, there are always two PHP child processes:

: ps uxawww | grep '[p]hp'
root   64113   0.5  0.9 272496 38300  -  S     1:43PM    0:02.26 php-fpm: pool nginx (php-fpm)
root     267   0.0  0.6 268400 25140  -  Ss    3:01AM    0:02.49 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root   64043   0.0  0.9 285304 38604  -  I     1:43PM    0:00.19 php-fpm: pool nginx (php-fpm)

Running truss on the top process (in state "S", sleeping) shows no output at all.

Running truss on the other process (in state "I", idle) outputs information, and then the browser gets a response. As long as the truss happens before the browser times out, everything keeps running. The truss output is attached. I have several more copies of truss output from other times I reproduced the issue, but they are all very close if not identical. I find it odd that merely attaching truss to the process somehow wakes it up and lets it proceed. I've tried sending the process other signals, such as kill -HUP, but so far nothing brings it back to life except attaching truss or killing/restarting PHP-FPM.

Not much happens in the AJAX request made for Status > IPsec or the IPsec widget, so it could be getting stuck in the VICI interaction.

#12 Updated by Jim Thompson 9 months ago

  • Assignee set to Steve Beaver

#13 Updated by Nick Wenos 7 months ago

We are also having what appears to be the same issue on version 2.3.2. As a side effect of php-fpm going down, our OpenVPN clients also lose the ability to connect until we restart php-fpm and OpenVPN. I don't know whether this affects all OpenVPN setups or just those using SSL certificate authentication, as ours does.

#14 Updated by Eric Machabert 7 months ago

Nick Wenos wrote:

> We are also having what appears to be the same issue on version 2.3.2. As a side effect of php-fpm going down, our OpenVPN clients also lose the ability to connect until we restart php-fpm and OpenVPN. I don't know whether this affects all OpenVPN setups or just those using SSL certificate authentication, as ours does.

We are also seeing this on 2.3.3. Running netstat -an shows requests filling up the Recv-Q of the IPC connection to /var/run/php-fpm.socket.

#15 Updated by Chris Baker 5 months ago

I am also seeing this on 2.3.3. Is there any known workaround other than removing the IPsec widget? Maybe changing the polling frequency?

#16 Updated by Marcio Merlone 2 months ago

I think this bug's priority should be raised, since it also breaks OpenVPN functionality.

#17 Updated by Steve Beaver about 2 months ago

  • Target version changed from 2.4.0 to 2.4.1

#18 Updated by Steve Beaver about 1 month ago

  • Status changed from Confirmed to Feedback
  • Target version changed from 2.4.1 to 2.4.0

I have done a LOT of research into this. I believe that since most dashboard widgets have their own timer, their own buffer and their own AJAX calling functions, they are from time to time stepping on each other and causing havoc on the server side.

As an experiment (for now), I have removed all of the individual refresh code from the widgets and replaced it with a single, central refresh service that loops through the dashboard, updating each widget one at a time.

So far, the results appear to be dramatically better. I can't guarantee that this will solve the IPsec widget issue, but I think it might. I note that the time taken to refresh the IPsec widget has dropped from 5 seconds to about 10 ms, so that has got to help.

The changes will be in 2.4-BETA later today.

#19 Updated by Luke Hamburg about 1 month ago

Sounds like a fantastic change. Thanks, Steve!

#20 Updated by Jim Pingle 5 days ago

  • Status changed from Feedback to New
  • Target version changed from 2.4.0 to 2.4.1

I still see this, but it seems less common than it was in the past. With bad timing, or by sitting on the dashboard too long with the IPsec widget, it still wedges.

Kicking it forward since it isn't critical.
