rc.newwanip issues (CARP slave problems, package issues)
Quoted from here http://forum.pfsense.org/index.php/topic,36802.0.html
I've mentioned some rc.d problems in another thread but decided to create a new one as I've debugged the problem a little more. Feel free to merge the threads if you think they should be merged :-)
The bugs are...
1. Function restart_packages() from rc.newwanip doesn't restart packages properly
2. Packages aren't stopped on system reboot/halt
3. Backup router forks lots of php processes, causing an OOM error and system failure
Now... rc.newwanip has a function restart_packages() which calls rc.start_packages (in the background). Restarting packages should first stop them, waiting for processes to finish, and then start them again. Unfortunately, the "stop" call was removed from rc.start_packages a long, long time ago and placed in rc.stop_packages. That's fine, but rc.stop_packages is never called, or grep is cheating me :-) Because of that, a few things don't work as they should.
1. User scripts may misbehave, e.g. I've placed in /usr/local/etc/rc.d a simple script that launches tcpdump in the background to log some traffic. The script properly handles the start, stop and restart commands. After a reboot I have lots and lots of tcpdump instances running, because stop wasn't called before start when executed from restart_packages().
2. System services aren't restarted. Calling just "start" to restart a service won't work if it checks whether it's already running. Thus no restart ever occurs for such services.
3. Services aren't properly stopped on system reboot/halt. That means processes are simply killed at some point. This may lead to data corruption depending on services used and how they handle signals.
4. Any service-related operation (start/stop/restart) shouldn't be launched in the background. If there are many interfaces in the system, rc.start_packages is called in the background many times. Even with a proper implementation of start/stop it won't work, because the script processes will interfere with each other (one stopping, a second starting, a third stopping and so on).
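The stop-then-start ordering argued for above can be sketched in a few lines of shell. This is only an illustration, not the code shipped in pfSense: the function name restart_rc_scripts is made up, and it assumes every executable *.sh script in the directory accepts "stop" and "start" arguments.

```shell
#!/bin/sh
# Hypothetical helper, not part of pfSense.
# RC_DIR defaults to the directory named in the post.
RC_DIR="${RC_DIR:-/usr/local/etc/rc.d}"

restart_rc_scripts() {
    dir="$1"
    for script in "$dir"/*.sh; do
        # Skip an unexpanded glob and files that are not executable.
        [ -x "$script" ] || continue
        # Both calls run in the foreground, so "stop" has returned
        # before "start" is issued, and scripts run one after another.
        "$script" stop
        "$script" start
    done
}
```

It would be called as restart_rc_scripts "$RC_DIR". Because nothing is sent to the background, two invocations of the same script can never overlap within a single run.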
Finally, bug #3, about which I wrote in the other thread, but back then I didn't know the reason for such behavior. Now, after a little research, I know exactly what happens. My setup has a total of 51 network interfaces reported by ifconfig. After each click on "Apply changes" in any part of the pfSense web configurator on my master router, the following things happen on the backup router:
1. /usr/local/sbin/check_reload_status is run
2. for each network interface /etc/rc.newwanip is called, by now we have 51 php processes spawned
3. each copy of /etc/rc.newwanip calls /etc/rc.start_packages shell script, we have 51 of those too but they quickly disappear because...
4. each copy of /etc/rc.start_packages executes each *.sh file from /usr/local/etc/rc.d in background so each *.sh script is spawned 51 times almost simultaneously
Let's say I'm working on adding aliases, rules, configuring VPNs etc., and let's say I click "Apply changes" 5 times in 5 minutes. I'll have 255 php processes, which will launch each script from /usr/local/etc/rc.d 255 times. Usually that's enough to hog a C2D CPU and eat 4 gigs of RAM and 8 gigs of swap, effectively killing the machine.
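The arithmetic behind those numbers is easy to check. In the sketch below, the interface count and the number of clicks come from the post; the number of *.sh files in /usr/local/etc/rc.d is not stated anywhere in the thread, so the value 3 is purely an assumption for illustration.

```shell
#!/bin/sh
interfaces=51   # from the post: interfaces reported by ifconfig
applies=5       # from the post: clicks on "Apply changes"
scripts=3       # ASSUMPTION: number of *.sh files in /usr/local/etc/rc.d

# One rc.newwanip (php) process per interface per click.
php_processes=$((interfaces * applies))
# Every one of those launches every rc.d script once.
script_launches=$((php_processes * scripts))

echo "php processes: $php_processes"
echo "rc.d script launches: $script_launches"
```

The first figure reproduces the 255 php processes mentioned above; the second scales linearly with however many scripts are actually installed.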
I'm wondering whether the sources of check_reload_status are available somewhere? In the worst case (no fix in mainstream pfSense) I'll be able to fix this problem on my own.
change default kernel on upgrade to SMP. Virtually all installs are running the SMP kernel, defaulting to uniprocessor broke several systems.
related to Ticket #1534
Ticket #1534. Serialize all the xmlrpc requests coming to the firewall. It seems such requests can stomp on each other and cause either corruption of the xmlrpc request or other issues.
Ticket #1534. Check if an rc file exists before trying to run it. Also return if we execute a stop command through the rc file, to be consistent with the start_service function.
Ticket #1534. Change rc.start_packages and rc.stop_packages to php scripts so they do a proper job at start/stop packages, rather than assume every package has a .sh script which is not true. It mostly reuses code from rc.packages which is not used anywhere as of now!
Trigger reloading of packages through check_reload_status so it can serialize the calls to not DoS the OS with processes triggered from this. Ticket #1534
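The general idea of serializing such calls can be illustrated with a simple lock. This is a hypothetical sketch and not how check_reload_status does it: the function name run_serialized and the lock path are invented for the example, and a production version would also need to deal with a stale lock left behind by a killed process.

```shell
#!/bin/sh
# ASSUMPTION: lock location chosen for illustration only.
LOCK_DIR="${LOCK_DIR:-/tmp/restart_packages.lock}"

run_serialized() {
    # mkdir is atomic: exactly one caller can create the directory,
    # everyone else waits and retries.
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        sleep 1
    done
    # Run the requested command while holding the lock.
    "$@"
    status=$?
    # Release the lock and pass the command's exit status through.
    rmdir "$LOCK_DIR"
    return "$status"
}
```

With this in place, something like run_serialized /etc/rc.start_packages could be started many times at once and the invocations would still execute one at a time instead of all together.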
#3 Updated by Marcin Krol about 8 years ago
Can you please test with the latest snapshot from tomorrow and let me know?
I have done some fixes that should prevent this.
Sure. I did update today, but the changes from the diffs were not applied on my install, probably because in the meantime I switched to the IPv6 build. I'm currently running:
built on Thu May 19 22:21:03 EDT 2011
Will these changes be available for IPv6 builds too or do I need to add them by hand?
#6 Updated by Marcin Krol almost 8 years ago
Indeed, I was using the old gitsync URL. I performed a few tests yesterday on the updated systems. Unfortunately, the applied changes have not fixed the issues, but it's a little bit better. I'm now running 2.0-RC2-IPv6 (amd64) built on Tue May 24 04:45:10 EDT 2011.
First thing - rc scripts:
1. There is still no "stop" issued when the system is going down.
2. There is also no "stop" issued before "start" in rc.start_packages.
3. rc.newwanip still launches "start" in the background.
As before, I had to create a little patch to get things starting/stopping correctly (attached).
Also, the problem with the backup router still exists. I clicked "edit" -> "save" -> "apply changes" on 10 firewall rules (without modifying them). It took me less than 1 minute. That caused over 200 php processes on the backup router. All of the RAM was eaten and about half of the swap. The router recovered to normal CPU/memory/swap usage in about 20 minutes, but during that time it was hardly usable (web configurator timing out, laggy shell).
#11 Updated by Marcin Krol almost 8 years ago
I updated yesterday and am now running 2.0-RC2-IPv6 (amd64) built on Wed Jun 1 18:03:37 EDT 2011, and the last commits are included in this version. It seems that the package start/stop problems are all gone; however, the slave router is still being killed by forking dozens of "/usr/local/bin/php /etc/rc.filter_configure_sync" processes on every "Apply changes" I do in firewall aliases or rules on the master router.
#13 Updated by Marcin Krol almost 8 years ago
19 VIPs, 20 permanent OpenVPN tunnels, 1 OpenVPN for users, 2 gigabit NICs aggregated into 1 lagg0 interface, 14 VLANs defined on lagg0 for 2 WAN links and 12 LAN segments, 1 dedicated 100Mbit NIC for CARP link between master and slave router. Grand total is 52 network interfaces reported with ifconfig -a.
#15 Updated by Marcin Krol almost 8 years ago
After the upgrade I'm now on 2.0-RC2-IPv6 (amd64) built on Fri Jun 10 01:43:14 EDT 2011 and it's even more broken. Whatever changes I make on the master router, they are not synchronized to the backup router. I tried a forced config sync on the master and it finished successfully, but the changes are still not visible on the backup router. Even after a reboot the backup router still doesn't see the changes.
#17 Updated by Marcin Krol almost 8 years ago
I've upgraded to 2.0-RC2-IPv6 (amd64) built on Sat Jun 11 23:14:29 EDT 2011. I had severe problems with this update. After the reboot my slave router claimed to be master and I ended up with two masters fighting for VIPs. I finally managed to get it working, but some other issues have emerged in the meantime. For example, after adding a host alias with a single IP I'm getting an error on both routers saying that the firewall was unable to load filter rules due to "invalid port 192.168.1.15". HAVP stopped working on the backup router and doesn't start at all, with no errors in the logs, and dhcpd started logging the following error many, many times on both routers: "dhcpd: failover: listener: no matching state"...
So far the network seems to be working, and when I change something on the master router the backup one is no longer flooded with php processes, but I'm not sure whether the bug was fixed or the sync simply doesn't occur correctly due to the errors mentioned above. I also hope that the next upgrade won't totally break my installation :)
#19 Updated by Marcin Krol almost 8 years ago
Unfortunately I can't drop IPv6 support on my machines. I'm currently running 2.0-RC3-IPv6 (amd64) built on Mon Jun 27 06:02:37 EDT 2011 and it seems that the php forking problem is gone. Only a few php processes appear on the backup router and they quickly disappear.