Bug #6406

Web process becomes unresponsive producing 502 Bad Gateway nginx

Added by Alex Vergilis over 1 year ago. Updated about 8 hours ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Web Interface
Target version:
Start date:
05/26/2016
Due date:
% Done:

0%

Affected version:
2.3.x
Affected Architecture:

Description

Eventually the web process becomes unresponsive and produces

502 Bad Gateway

nginx

A restart of PHP-FPM addresses the issue, until it happens again about 12 hours later.

Screenshot_36.png (4.97 KB) IT IGP, 06/15/2017 06:07 AM

History

#1 Updated by Kill Bill over 1 year ago

+1; seems pretty replicable here when you leave the dashboard page open in a browser for a couple of hours. (Not 2.3.1 specific, was there with 2.3.0 as well.) Note that restarting the webserver alone (console option 11) does not help at all.

#2 Updated by Chris Buechler over 1 year ago

  • Affected version changed from 2.3.1 to 2.3.x

Kill Bill: 2.3.1_1 fixed the bulk of remaining things there that 2.3.1 didn't. There's still something to this on occasion, but upgrade.

#3 Updated by Kill Bill over 1 year ago

Well, while the original issue with the dashboard seems indeed gone, I managed to make the GUI completely unresponsive when upgrading pfBlockerNG package on two different boxes (2.3.1_1 full install, amd64). After the upgrade completed, the GUI did not return until PHP-FPM restart.

#4 Updated by Xander Venterus over 1 year ago

I am now experiencing this issue on 2.3.1-RELEASE-p1 (i386).

I've been having intermittent Layer 7 DDoS attacks for a day or more now, and it seems that after each wave of the attack the web configurator only returns 502s, from inside or outside.

Restarting the web configurator does nothing; I have to restart php-fpm to fix the issue.

The flood causing this is the recent WordPress pingback amplification attack. I have now mitigated the flooding nodes via Cloudflare, but it would be nice if the web configurator didn't return 502 errors every time a new set of IPs tries the same attack.

#5 Updated by BBcan177 . over 1 year ago

Kill Bill wrote:

Well, while the original issue with the dashboard seems indeed gone, I managed to make the GUI completely unresponsive when upgrading pfBlockerNG package on two different boxes (2.3.1_1 full install, amd64). After the upgrade completed, the GUI did not return until PHP-FPM restart.

By any chance, did you use the "view" button in the Update Tab? Something has recently changed that is affecting that button's "End View" function...

#6 Updated by Kill Bill over 1 year ago

BBcan177 . wrote:

By any chance, did you use the "view" button in the Update Tab? Something has recently changed that is affecting that button's "End View" function...

Yeah, not sure it's related to this issue but I noticed that button got broken as well.

#7 Updated by Xander Venterus over 1 year ago

Confirming this has happened again on my unit, and this time without any attacks having hit us. I just had to restart the FPM again.

#8 Updated by Kill Bill over 1 year ago

Xander Venterus wrote:

Confirming this has happened again on my unit, and this time without any attacks having hit us. I just had to restart the FPM again.

Yeah, seen again multiple times on multiple 2.3.1_1 boxes.

#9 Updated by Chris Buechler about 1 year ago

  • Target version changed from 2.3.2 to 2.4.0

#10 Updated by Jim Thompson 11 months ago

  • Assignee set to Steve Beaver

#11 Updated by Alex Vergilis 11 months ago

FYI - Still happening on 2.3.2-RELEASE-p1 systems.

#12 Updated by Steve Beaver 11 months ago

Sorry to re-hash this, but since it has just been assigned to me I need an update.

Some of the above responses would indicate this issue is PfBlockerNG specific. Is that the case? Is the problem present if pfB is not installed/active?

#13 Updated by Jim Pingle 11 months ago

There is no known consistent single cause. Some have it with nothing else installed, some with pfBlocker, some with the IPsec widget, and others hit it during HA XMLRPC sync. It's possible there is one root cause, or several, but so far it has not been simple to reproduce reliably under controlled conditions.

#14 Updated by Michele Di Maria 9 months ago

Well, for me it started to happen when I re-added the "Traffic Graphs" widget. It never happened before without that.

#15 Updated by Romain Cabassot 8 months ago

We upgraded 2 days ago from 2.2.x to 2.3.1p1.
Same issue, and no pfB installed.
We have upgraded only one of our two pfSense boxes (the 2.2.x one is halted), so we have many sync errors.

List of installed packages:
- Cron
- freeradius2
- Lightsquid
- nmap
- nrpe
- openvpn-client-export
- snort
- squid
- squidGuard

#16 Updated by Bryan Fehl 7 months ago

Steve Beaver wrote:

Sorry to re-hash this, but since it has just been assigned to me I need an update.

Some of the above responses would indicate this issue is PfBlockerNG specific. Is that the case? Is the problem present if pfB is not installed/active?

I just ran into this myself. Strangely, this issue causes all clients who try to connect with OpenVPN to just hang indefinitely. Restarting Web Configurator & PHP fixes the issue. It seems to only happen when I leave the pfSense web GUI open in my browser for an extended period of time, like if I leave the Dashboard tab open overnight. The only package I have installed is openvpn-client-export. I hope this helps.

Edit: This is on version 2.3.2-RELEASE

#17 Updated by Jim Pingle 7 months ago

Bryan Fehl wrote:

I just ran into this myself. Strangely, this issue causes all clients who try to connect with OpenVPN to just hang indefinitely.

That's normal, OpenVPN uses PHP scripts for authentication and some certificate verification. So if PHP is wedged, then OpenVPN can't authenticate.

Restarting Web Configurator & PHP fixes the issue. It seems to only happen when I leave the pfSense web GUI open in my browser for an extended period of time, like if I leave the Dashboard tab open overnight. The only package I have installed is openvpn-client-export. I hope this helps.

Edit: This is on version 2.3.2-RELEASE

Which dashboard widgets do you have visible?

#18 Updated by Bryan Fehl 7 months ago

Jim Pingle wrote:

Which dashboard widgets do you have visible?

Right now I have the following widgets open:
  • System Information
  • Picture
  • Interfaces
  • S.M.A.R.T. Status
  • Gateways
  • Thermal Sensors
  • CARP Status
  • NTP Status
  • Services Status
  • Traffic Graphs
  • IPSec

I'm removing the IPsec widget based on recommendations I've seen in the forum where people had similar issues. Hopefully that prevents this from reoccurring.

#19 Updated by Alex Vergilis 7 months ago

Just Restarted PHP-FPM on a system with the following (no pfblocker installed):

  • System Information
  • Traffic Graphs
  • Interfaces
  • Gateways
  • IPsec
  • Interface Statistics

#20 Updated by John Silva 7 months ago

I've seen this symptom frequently with pfBlockerNG and large lists. I also don't run the IPsec widget.

The common thread I noticed is that there is a php process consuming 100% CPU for an extended length of time. It seems like there is some resource blocking somewhere.

What I've done that appears to help is increase the webUI process limit from 2 to 4. It's not perfect, but the instances of the webUI becoming totally unresponsive (and returning the 502 gateway error) have been fewer since making this change.

#21 Updated by IT IGP 3 months ago

We are also getting this randomly every few days, and have been for a few months now. We always run the latest stable release.
Reproduction: leave the dashboard page open. We have widgets added for Gateways, Interfaces, Interface Statistics, Traffic Graphs, and IPsec.
Workaround: console/SSH option "Restart PHP-FPM".
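
To tell a wedged PHP-FPM apart from an nginx problem before restarting anything, one option is to probe the socket directly. The following is only a minimal sketch: it assumes a Python 3 interpreter is available on the box (nothing in this thread confirms that), and the function name `php_fpm_accepts` is made up for illustration. The socket path is the one that appears in the nginx and system logs.

```python
import socket


def php_fpm_accepts(socket_path="/var/run/php-fpm.socket", timeout=2.0):
    """Return True if a connection to the Unix socket succeeds within `timeout` seconds.

    A refused or timed-out connect is what nginx reports as a 502.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(socket_path)
        return True
    except OSError:
        # Covers a missing socket file, "Connection refused", and timeouts.
        return False
    finally:
        s.close()


if __name__ == "__main__":
    if php_fpm_accepts():
        print("php-fpm socket is accepting connections")
    else:
        print("php-fpm socket is NOT accepting connections")
```

A successful connect only shows that the listen queue has room; it does not prove that a worker is free to answer a request, so treat a positive result as a weak signal.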

Not sure when it started this time, or whether there is a specific/different initial message when it starts, but the following is what repeats in the logs while in the "bad gateway" state:

/var/log/nginx.log

...
Jun 15 12:12:48 pfs1 pfs1.xxx nginx: 2017/06/15 12:12:48 [error] 27069#100118: *322688 connect() to unix:/var/run/php-fpm.socket failed (61: Connection refused) while connecting to upstream, client: 192.168.0.5, server: , request: "GET /getstats.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php-fpm.socket:", host: "192.168.0.19", referrer: "https://192.168.0.19/" 
Jun 15 12:12:48 pfs1 pfs1.xxx nginx: 192.168.0.5 - - [15/Jun/2017:12:12:48 +0200] "GET /getstats.php HTTP/1.1" 502 568 "https://192.168.0.19/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" 
...

/var/log/system.log

...
Jun 15 12:46:39 pfs1 check_reload_status: Could not connect to /var/run/php-fpm.socket
Jun 15 12:46:40 pfs1 check_reload_status: Could not connect to /var/run/php-fpm.socket
Jun 15 12:46:40 pfs1 kernel: sonewconn: pcb 0xfffff800d61ccc30: Listen queue overflow: 193 already in queue awaiting acceptance (480 occurrences)
Jun 15 12:46:40 pfs1 check_reload_status: Could not connect to /var/run/php-fpm.socket
...
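
For anyone collecting data across several boxes, a rough way to quantify these two symptoms is to count the refused upstream connections and pull out the listen-queue figures. This is a sketch only: the regular expressions are written against the excerpts above and may not match other log formats, and the names `summarize`, `REFUSED` and `OVERFLOW` are invented for this example.

```python
import re
import sys

# Patterns derived from the log excerpts in this comment.
REFUSED = re.compile(r"connect\(\) to unix:(\S+) failed \(\d+: Connection refused\)")
OVERFLOW = re.compile(
    r"Listen queue overflow: (\d+) already in queue awaiting acceptance "
    r"\((\d+) occurrences\)"
)


def summarize(lines):
    """Count refused upstream connects and collect listen-queue overflow figures."""
    refused = 0
    max_queue = 0
    overflow_occurrences = 0
    for line in lines:
        if REFUSED.search(line):
            refused += 1
        match = OVERFLOW.search(line)
        if match:
            max_queue = max(max_queue, int(match.group(1)))
            overflow_occurrences += int(match.group(2))
    return {
        "refused": refused,
        "max_queue": max_queue,
        "overflow_occurrences": overflow_occurrences,
    }


if __name__ == "__main__":
    # Usage: python summarize_logs.py /var/log/nginx.log /var/log/system.log
    collected = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            collected.extend(handle)
    print(summarize(collected))
```

A large queue depth together with many refused connects would be consistent with all PHP-FPM workers being busy, although the logs alone do not show what they are busy with.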

#22 Updated by Christoffer Öhman 3 months ago

I can not even use it before it locks.

As soon as I try to change something, it loads a really long time before it locks.

#23 Updated by Bryan Fehl 3 months ago

Christoffer Öhman wrote:

I can not even use it before it locks.

As soon as I try to change something, it loads a really long time before it locks.

Do you have the IPsec widget on your dashboard? I removed that widget months ago and I haven't had this issue pop up since.

#24 Updated by Christoffer Öhman 3 months ago

Bryan Fehl wrote:

Christoffer Öhman wrote:

I can not even use it before it locks.

As soon as I try to change something, it loads a really long time before it locks.

Do you have the IPsec widget on your dashboard? I removed that widget months ago and I haven't had this issue pop up since.

I'm sure I do not have IPSec up in the dashboard.

#25 Updated by Steve Beaver about 2 months ago

  • Target version changed from 2.4.0 to 2.4.1

#26 Updated by Alex Vergilis about 2 months ago

pfsense team:

Why is this bug being pushed back yet again, to another release whose date has not been determined? This issue causes an outage every day. Lots of people have been reporting it here and in the forums for over a year now.

I will be more than happy to volunteer my time to assist you to get to the bottom of this.

Please let me know how I can help.

#27 Updated by Steve Beaver about 2 months ago

Thanks for your offer. I have been working on this issue all week, sadly without getting very far because each diagnostic step takes so long.

What I am doing is cranking the polling frequency up to maximum (General Setup page) and adding widgets one at a time. I'm trying to learn whether this issue is caused by a particular widget, a combination of widgets, the polling frequency, or something else. I'm also watching the memory state, CPU load, etc. as I make each test.

If you have the ability to do that type of testing, or have any other ideas on the subject, I would love to hear them.

Thanks!

#28 Updated by Alex Vergilis about 2 months ago

All I have to do to cause this is leave the dashboard web page open. The problem then appears anywhere from 1 hour to a day or so later, across about 75 firewalls. I have started closing the web page to minimize the burden of the outages.

I have 3 columns with 5-second updates. The following widgets are on the dashboard for the majority of the systems: System Information, Interfaces, Gateways, Interface Statistics, IPsec, NTP Status, Traffic Graphs (1-second updates).

#29 Updated by Chris Collins 26 days ago

Having a fair amount of experience managing PHP hosting systems myself, I can offer some thoughts.

I have seen this behaviour on my own pfSense unit. Watching the system while it was happening, I observed that it was caused by background PHP scripts being busy, perhaps tying up the php-fpm server processes.

As an experiment I manually adjusted the php-fpm server configuration so that more children are running, and the problem has not come back since.

It can be adjusted in /usr/local/etc/php-fpm.conf if anyone wants to experiment.
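
Before changing anything, it can help to record which process-manager settings a box is currently running with. The sketch below pulls the `pm` and `pm.*` directives out of the file's text; the directive names are standard PHP-FPM ones, but the function name `pm_settings`, the sample pool name and the sample values are invented for illustration and are not taken from any pfSense release.

```python
import re

# Matches active (uncommented) lines such as "pm = static" or "pm.max_children = 2".
PM_LINE = re.compile(r"^\s*(pm(?:\.[a-z_]+)?)\s*=\s*(\S+)", re.MULTILINE)


def pm_settings(conf_text):
    """Return the process-manager directives found in php-fpm configuration text."""
    return {key: value for key, value in PM_LINE.findall(conf_text)}


# Hypothetical configuration text, used only to show the output shape.
SAMPLE = """
[pool]
listen = /var/run/php-fpm.socket
pm = static
pm.max_children = 2
;pm.max_requests = 500
"""

if __name__ == "__main__":
    # On a real system, read the file mentioned above instead of SAMPLE:
    #   with open("/usr/local/etc/php-fpm.conf") as handle:
    #       print(pm_settings(handle.read()))
    print(pm_settings(SAMPLE))
```

Lines starting with `;` are comments in this file format, so they are skipped. Note that the firewall may regenerate this file, in which case manual edits would not persist; that is an assumption worth checking before relying on a change.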

Given that 2.4.1 won't support nano-type systems anymore, I expect the restrictive resource-saving settings can be loosened up a bit in terms of memory usage.

#30 Updated by Kill Bill about 8 hours ago

Chris Collins wrote:

As an experiment I manually adjusted the php-fpm server configuration so that more children are running, and the problem has not come back since.
It can be adjusted in /usr/local/etc/php-fpm.conf if anyone wants to experiment.
Given that 2.4.1 won't support nano-type systems anymore, I expect the restrictive resource-saving settings can be loosened up a bit in terms of memory usage.

The low number of processes/children apparently also is an issue with busy captive portals: https://forum.pfsense.org/index.php?topic=136847

So yeah, this should be relaxed by default since nano + i386 are gone, plus a GUI knob to have this configurable would be useful.
