Regression #12961
closedCARP event storm when leaving persistent CARP maintenance mode
100%
Description
Hi,
this is a very weird issue so I will try my best to describe it. I think this is a regression that we are seeing after converting our box to pfSense-Plus but I am not to sure (could also be that it is simply a bug and not a regression).
We run a HA setup with many CARP IPs (basically one for every interface and we have 17 interfaces. Onto those CARP interfaces we do stack virtual IPs, around O(10) per CARP IP. I have verified that the CARP and VIP subnet masks do match the parent interface as per https://docs.netgate.com/pfsense/en/latest/firewall/virtual-ip-address-comparison.html .
When I tried to leave persistent CARP maintenance mode on the primary instance (which would then take over master status again) the system log started to fill with:
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event Mar 18 15:37:41 pfSense1 kernel: carp: 8@vtnet2.511: BACKUP -> INIT (hardware interface up) Mar 18 15:37:41 pfSense1 kernel: carp: 8@vtnet2.511: INIT -> BACKUP (initialization complete) Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event Mar 18 15:37:41 pfSense1 kernel: carp: 18@vtnet2.711: BACKUP -> INIT (hardware interface up) Mar 18 15:37:41 pfSense1 kernel: carp: 18@vtnet2.711: INIT -> BACKUP (initialization complete) ... long list of carp backup events and multiple message for the same "8@vtnet2.511" identifier (probably for the VIPs as well) Mar 18 15:37:41 pfSense1 kernel: carp: 13@vtnet2.720: BACKUP -> MASTER (preempting a slower master) Mar 18 15:37:41 pfSense1 kernel: carp: 200@vtnet2.708: BACKUP -> MASTER (preempting a slower master) ... list of all interfaces becoming master Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp master event Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp master event ... many carp master events
All in all this log shows:
➜ ~ rg 'Carp backup event' carp3.txt|wc 148 1184 10212 ➜ ~ rg 'Carp master event' carp3.txt|wc 17 136 1173
Here the 17 master events match the number of "physical" interfaces we have. I am not sure about the backup events. Taking 8@vtnet2.511 as an example which is a CARP interface with 4 IPs I get 24 entries. So it went 6 times for each VIP between BACKUP -> INIT -> BACKUP before it actually become MASTER. But the counts are not consistent…
Continuing with the logs:
... Interleaved with the carp master events it seems that carp handling kicks in: Mar 18 15:37:43 pfSense1 php-fpm[31225]: /rc.carpbackup: HA cluster member "(10.7.11.1@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18 Mar 18 15:37:43 pfSense1 php-fpm[87138]: /rc.carpbackup: HA cluster member "(10.7.11.1@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18 Mar 18 15:37:43 pfSense1 php-fpm[87138]: /rc.carpbackup: HA cluster member "(10.7.11.129@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18 Mar 18 15:37:43 pfSense1 php-fpm[31225]: /rc.carpbackup: HA cluster member "(10.7.11.129@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18 ... we see pages of those (275 entries), the last one roughly 2 minutes after the first one!
At this point we also see entries like this:
Mar 18 15:38:08 pfSense1 php-fpm[87138]: /rc.carpbackup: Stopping OpenVPN server instance on DMZ because of transition to CARP backup. Mar 18 15:38:08 pfSense1 php-fpm[31225]: /rc.carpbackup: Stopping OpenVPN server instance on DMZ because of transition to CARP backup. Mar 18 15:38:08 pfSense1 check_reload_status[408]: Reloading filter Mar 18 15:38:08 pfSense1 kernel: ovpns1: link state changed to UP Mar 18 15:38:08 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns1 Mar 18 15:38:08 pfSense1 check_reload_status[408]: Reloading filter Mar 18 15:38:08 pfSense1 kernel: ovpns3: link state changed to UP Mar 18 15:38:08 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns3 Mar 18 15:38:10 pfSense1 kernel: ovpns1: link state changed to DOWN Mar 18 15:38:10 pfSense1 check_reload_status[408]: Reloading filter Mar 18 15:38:10 pfSense1 kernel: ovpns1: link state changed to UP Mar 18 15:38:10 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns1 Mar 18 15:38:13 pfSense1 kernel: ovpns3: link state changed to DOWN Mar 18 15:38:13 pfSense1 kernel: ovpns1: link state changed to DOWN Mar 18 15:38:13 pfSense1 kernel: ovpns3: link state changed to UP Mar 18 15:38:13 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns3
At roughly 4 minutes after leaving persistent carp maintenance mode the system log becomes quiet again and the web interface is reachable again. According to the hypervisor and pfsense itself the CPU usage was relatively high due to all the restarts of services (?). While CARP had transitioned successfully to MASTER (as confirmed by the second firewall) pfsense did require minutes to handle the events. Routing etc during that time worked without any issues but the webiface was down etc…
I'll happily provide more information; just let me know what you need.
Related issues