Bug #1534

rc.newwanip issues (CARP slave problems, package issues)

Added by Jim Pingle over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Interfaces
Target version:
Start date:
05/17/2011
Due date:
% Done:

0%

Estimated time:
Affected Version:
2.0
Affected Architecture:
All

Description

Quoted from the forum thread at http://forum.pfsense.org/index.php/topic,36802.0.html:

I mentioned some rc.d problems in another thread but decided to create a new one, as I've debugged the problem a little more. Feel free to merge the threads if you think they should be merged :-)

The bugs are...

1. The restart_packages() function in rc.newwanip doesn't restart packages properly
2. Packages aren't stopped on system reboot/halt
3. The backup router forks lots of php processes, causing OOM errors and system failure

Now... rc.newwanip has a function, restart_packages(), which calls rc.start_packages (in the background). Restarting packages should first stop them, wait for their processes to finish, and then start them again. Unfortunately, the "stop" call was removed from rc.start_packages a long time ago and placed in rc.stop_packages. That's fine, but rc.stop_packages is never called, or grep is cheating me :-) Because of that, a few things don't work as they should.

1. User scripts may misbehave. For example, I placed a simple script in /usr/local/etc/rc.d that launches tcpdump in the background to log some traffic. The script handles the start, stop and restart commands properly. After a reboot I have lots and lots of tcpdump instances running, because stop wasn't called before start when executed from restart_packages().
2. System services aren't restarted. Calling just "start" to restart a service won't work if the service checks whether it's already running. Thus no restart ever occurs for such services.
3. Services aren't properly stopped on system reboot/halt. That means processes are simply killed at some point. This may lead to data corruption, depending on the services used and how they handle signals.
4. No service-related operation (start/stop/restart) should be launched in the background. If there are many interfaces in the system, rc.start_packages is called in the background many times. Even with a proper implementation of start/stop it won't work, because the script processes will interfere with each other (one stopping, a second starting, a third stopping and so on).
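
The start/stop behaviour described in the list above can be modelled with a small sketch. This is not pfSense code: the Service class, its guarded flag, and the restart() helper are hypothetical names used only for illustration. It assumes a service either checks whether it is already running before starting (point 2) or does not (point 1).

```python
class Service:
    """Toy model of an rc.d script; hypothetical, not pfSense code."""

    def __init__(self, name, guarded):
        self.name = name
        self.guarded = guarded   # True: "start" refuses to run if already running
        self.instances = 0       # number of running processes

    def start(self):
        if self.guarded and self.instances > 0:
            return               # already running, so "start" is a no-op
        self.instances += 1

    def stop(self):
        self.instances = 0       # stop every running instance


def restart(service):
    """Stop first, then start, sequentially in the foreground."""
    service.stop()
    service.start()


# Calling only "start" repeatedly, with no "stop" in between:
tcpdump = Service("tcpdump", guarded=False)
for _ in range(3):
    tcpdump.start()
print(tcpdump.instances)         # instances pile up

# A stop-then-start restart leaves exactly one instance running:
restart(tcpdump)
print(tcpdump.instances)
```

With a guarded service the opposite failure appears: repeated start() calls never replace the running instance, so only an explicit stop followed by start actually restarts it.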

Here is a little patch that fixes bugs #1 and #2: http://pastebin.com/ARVfDvLs

Finally, bug #3, which I wrote about in the other thread; back then I didn't know the reason for this behavior. Now, after a little research, I know exactly what happens. My setup has a total of 51 network interfaces reported by ifconfig. After each click on "Apply changes" in any part of the pfSense web configurator on my master router, the following things happen on the backup router:

1. /usr/local/sbin/check_reload_status is run
2. For each network interface, /etc/rc.newwanip is called; by now we have 51 php processes spawned
3. Each copy of /etc/rc.newwanip calls the /etc/rc.start_packages shell script; we have 51 of those too, but they quickly disappear because...
4. Each copy of /etc/rc.start_packages executes each *.sh file from /usr/local/etc/rc.d in the background, so each *.sh script is spawned 51 times almost simultaneously

Let's say I'm working on adding aliases, rules, configuring VPNs etc., and let's say I click "Apply changes" 5 times in 5 minutes. I'll have 255 php processes, which will launch each script from /usr/local/etc/rc.d 255 times. Usually that's enough to hog a C2D CPU and eat 4 GB of RAM and 8 GB of swap, effectively killing the machine.

Are the sources of check_reload_status available somewhere? In the worst case (no fix in mainstream pfSense) I'll be able to fix this problem on my own.
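
The arithmetic behind those numbers can be written out as a short sketch. The function name and the rc_scripts value of 4 are assumptions for illustration only; the post does not say how many scripts are in /usr/local/etc/rc.d.

```python
def spawned_processes(interfaces, applies, rc_scripts):
    """Estimate the fan-out described above (hypothetical helper)."""
    php = interfaces * applies    # one rc.newwanip per interface per "Apply changes"
    launches = php * rc_scripts   # each rc.newwanip ends up running every *.sh script
    return php, launches


# 51 interfaces and 5 clicks are from the post; 4 rc.d scripts is an assumed value.
php, launches = spawned_processes(interfaces=51, applies=5, rc_scripts=4)
print(php, launches)
```

The php count grows with interfaces times clicks, and the total number of script launches multiplies that again by the number of rc.d scripts installed.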

pf.diff (1.26 KB): patch fixing start/stop issues with rc scripts. Marcin Krol, 05/27/2011 02:43 AM

Associated revisions

Revision 9fce470b (diff)
Added by Chris Buechler over 11 years ago

change default kernel on upgrade to SMP. Virtually all installs are running the SMP kernel, defaulting to uniprocessor broke several systems.

related to Ticket #1534

Revision 67d78c87 (diff)
Added by Ermal Luçi over 8 years ago

Ticket #1534. Serialize all the xmlrpc requests coming to the firewall. Seems such request can stomp into each other and create either corruption of xmlrpc request or other issues.

Revision 098820e2 (diff)
Added by Ermal Luçi over 8 years ago

Ticket #1534. Check if a rc file exists before trying to run it. Also return if we execute a stop command through rc file to be consistent with the start_service function.

Revision aed6fc72 (diff)
Added by Ermal Luçi over 8 years ago

Ticket #1534. Change rc.start_packages and rc.stop_packages to php scripts so they do a proper job at start/stop packages, rather than assume every package has a .sh script which is not true. It mostly reuses code from rc.packages which is not used anywhere as of now!

Revision 51611440 (diff)
Added by Ermal Luçi about 8 years ago

Ticket #1534, #1433. Properly merge carp interfaces and do not reload carp interfaces that have not change any configuration parameter. Also make merge_config_section_xmlrpc() an alias for restore_config_section_xmlrpc() since that what it is.

Revision f51d4f98 (diff)
Added by Ermal Luçi about 8 years ago

Ticket #1534, #1433. Remove custom sync code for vip, since it array_merge() replaces same keys data when merging. But make the code for reloading only changed vips after merge better and some more checks.

Revision dfb30a89 (diff)
Added by Ermal Luçi about 8 years ago

Trigger reloading of packages through check_reload_status so it can serialize the calls to not DoS the OS with processes triggered from this. Ticket #1534
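
The idea of funnelling reload requests through a single dispatcher can be illustrated with a minimal sketch. This is not the check_reload_status implementation, whose internals are not shown in this ticket; ReloadQueue and the request names are hypothetical. It assumes that identical pending requests can be coalesced and that jobs run one at a time.

```python
from collections import OrderedDict


class ReloadQueue:
    """Hypothetical dispatcher: coalesces duplicate requests, runs them one by one."""

    def __init__(self):
        self.pending = OrderedDict()   # request name -> times requested
        self.executed = []             # names actually run, in order

    def request(self, name):
        # A request that is already pending is counted, not queued again.
        self.pending[name] = self.pending.get(name, 0) + 1

    def drain(self):
        # Run each distinct pending request once, sequentially.
        while self.pending:
            name, _count = self.pending.popitem(last=False)
            self.executed.append(name)


queue = ReloadQueue()
for _ in range(51):                    # one request per interface
    queue.request("restart_packages")
queue.request("filter_reload")         # assumed second job type
queue.drain()
print(queue.executed)
```

In this model, 51 near-simultaneous requests for the same job collapse into a single run instead of 51 parallel processes.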

Revision a1b86994 (diff)
Added by Ermal Luçi about 8 years ago

Ticket #1534. Try to stop packages during reboot of system.

History

#1 Updated by Ermal Luçi over 8 years ago

Can you please test with the latest snapshot from tomorrow and let me know?
I have done some fixes that should prevent this.

#2 Updated by Ermal Luçi over 8 years ago

  • Status changed from New to Feedback

#3 Updated by Marcin Krol about 8 years ago

Can you please test with the latest snapshot from tomorrow and let me know?
I have done some fixes that should prevent this.

Sure. I did update today, but the changes from the diffs were not applied on my install, probably because in the meantime I switched to the IPv6 build. I'm currently running:

2.0-RC1-IPv6 (amd64)
built on Thu May 19 22:21:03 EDT 2011

Will these changes be available for IPv6 builds too or do I need to add them by hand?

#4 Updated by Marcin Krol about 8 years ago

Sure. I did update today

Err... before the weekend :-)

#5 Updated by Jim Pingle about 8 years ago

Did you adjust your gitsync URL to point at the github location? The IPv6 tree is up to date with 2.0 mainline right now; if you are behind, your gitsync URL is probably still pointing at the old gitweb.pfsense.org/rcs.pfsense.org location.

#6 Updated by Marcin Krol about 8 years ago

Indeed, I was using the old gitsync URL. I performed a few tests yesterday on updated systems. Unfortunately the applied changes have not fixed the issues, but it's a little bit better. I'm now running 2.0-RC2-IPv6 (amd64) built on Tue May 24 04:45:10 EDT 2011

First thing - rc scripts:

1. There is still no "stop" issued when the system is going down.
2. There is also no "stop" issued before "start" in rc.start_packages.
3. rc.newwanip still launches "start" in the background.

As before, I had to create a little patch to get things starting/stopping correctly (attached).

The problem with the backup router also still exists. I clicked "edit" -> "save" -> "apply changes" on 10 firewall rules (without modifying them). It took me less than 1 minute. That caused over 200 php processes on the backup router. All of the RAM was eaten, and about half of the swap. The router recovered to normal CPU/memory/swap usage in about 20 minutes, but during that time it was hardly usable (web configurator timing out, laggy shell).

#7 Updated by Marcin Krol about 8 years ago

Disabling and re-enabling HAVP on the master router causes an even more severe load on the backup router. It also causes another issue, but I'll report it in a separate ticket.

#8 Updated by Ermal Luçi about 8 years ago

Can you try with the latest snapshots?

#9 Updated by Marcin Krol about 8 years ago

I just updated my backup router, but commits from 5 days ago are not included yet. I'm now running:

2.0-RC2-IPv6 (amd64)
built on Tue May 31 12:13:03 EDT 2011

and my gitsync URL points to github.

#10 Updated by Jim Pingle about 8 years ago

That is because you are on the IPv6 branch, which hadn't been merged in a while. I just synced it back up with mainline, so it should be up to date now; after either a gitsync or your next firmware update it should be OK.

#11 Updated by Marcin Krol about 8 years ago

I updated yesterday and am now running 2.0-RC2-IPv6 (amd64) built on Wed Jun 1 18:03:37 EDT 2011, and the latest commits are included in this version. It seems that the package start/stop problems are all gone; however, the slave router is still being killed by forking dozens of "/usr/local/bin/php /etc/rc.filter_configure_sync" processes on every "Apply changes" I do in firewall aliases or rules on the master router.

#12 Updated by Jim Pingle about 8 years ago

How many interfaces / VIPs do you have on that box? When I apply changes on my VM, I only see 1-2 of those and they immediately exit. Something about your config must be triggering more syncs, but without more information it's hard to say what.

#13 Updated by Marcin Krol about 8 years ago

19 VIPs, 20 permanent OpenVPN tunnels, 1 OpenVPN for users, 2 gigabit NICs aggregated into 1 lagg0 interface, 14 VLANs defined on lagg0 for 2 WAN links and 12 LAN segments, 1 dedicated 100Mbit NIC for CARP link between master and slave router. Grand total is 52 network interfaces reported with ifconfig -a.

#14 Updated by Ermal Luçi about 8 years ago

The latest snapshot has more fixes in this regard.

#15 Updated by Marcin Krol about 8 years ago

After the upgrade I'm now on 2.0-RC2-IPv6 (amd64) built on Fri Jun 10 01:43:14 EDT 2011 and it's even more broken. Whatever changes I make on the master router, they are not synchronized to the backup router. I tried a forced config sync on the master and it finished successfully, but the changes are still not visible on the backup router. Even after a reboot the backup router still doesn't see the changes.

#16 Updated by Jim Pingle about 8 years ago

That is a known issue, fixed in the next snapshot (building now). Try again with that.

#17 Updated by Marcin Krol about 8 years ago

I've upgraded to 2.0-RC2-IPv6 (amd64) built on Sat Jun 11 23:14:29 EDT 2011. I had severe problems with this update. After the reboot my slave router claimed to be master and I ended up with two masters fighting for VIPs. I finally managed to get it working, but some other issues have emerged in the meantime. For example, after adding a host alias with a single IP I'm getting an error on both routers saying that the firewall was unable to load filter rules due to "invalid port 192.168.1.15". HAVP stopped working on the backup router and doesn't start at all, with no errors in the logs, and dhcpd started logging the following error many, many times on both routers: "dhcpd: failover: listener: no matching state"...

So far the network seems to be working, and when I change something on the master router the backup one is no longer flooded with php processes, but I'm not sure whether the bug was fixed or sync simply doesn't occur correctly due to the errors mentioned above. I also hope that the next upgrade won't totally break my installation :)

#18 Updated by Ermal Luçi about 8 years ago

I do not know the state of the IPv6 code, but can you try the latest snapshot of pfSense? There were some binary updates in base, and I am unsure of their status in the IPv6 snapshots.

#19 Updated by Marcin Krol about 8 years ago

Unfortunately I can't drop IPv6 support on my machines. I'm currently running 2.0-RC3-IPv6 (amd64) built on Mon Jun 27 06:02:37 EDT 2011 and it seems that the php forking problem is gone. Only a few php processes appear on the backup router and they quickly disappear.

#20 Updated by Chris Buechler about 8 years ago

  • Status changed from Feedback to Resolved
