Project

General

Profile

Actions

Regression #12961

closed

CARP event storm when leaving persistent CARP maintenance mode.

Added by Florian Apolloner 5 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Viktor Gurov
Category:
CARP
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
Plus Target Version:
22.05
Release Notes:
Default
Affected Version:
2.6.0
Affected Architecture:

Description

Hi,

this is a very weird issue so I will try my best to describe it. I think this is a regression that we are seeing after converting our box to pfSense-Plus but I am not to sure (could also be that it is simply a bug and not a regression).

We run a HA setup with many CARP IPs (basically one for every interface and we have 17 interfaces. Onto those CARP interfaces we do stack virtual IPs, around O(10) per CARP IP. I have verified that the CARP and VIP subnet masks do match the parent interface as per https://docs.netgate.com/pfsense/en/latest/firewall/virtual-ip-address-comparison.html .

When I tried to leave persistent CARP maintenance mode on the primary instance (which would then take over master status again) the system log started to fill with:

Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 kernel: carp: 8@vtnet2.511: BACKUP -> INIT (hardware interface up)
Mar 18 15:37:41 pfSense1 kernel: carp: 8@vtnet2.511: INIT -> BACKUP (initialization complete)
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 kernel: carp: 18@vtnet2.711: BACKUP -> INIT (hardware interface up)
Mar 18 15:37:41 pfSense1 kernel: carp: 18@vtnet2.711: INIT -> BACKUP (initialization complete)
... long list of carp backup events and multiple message for the same "8@vtnet2.511" identifier (probably for the VIPs as well)
Mar 18 15:37:41 pfSense1 kernel: carp: 13@vtnet2.720: BACKUP -> MASTER (preempting a slower master)
Mar 18 15:37:41 pfSense1 kernel: carp: 200@vtnet2.708: BACKUP -> MASTER (preempting a slower master)
... list of all interfaces becoming master
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp master event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp master event
... many carp master events

All in all this log shows:

➜  ~ rg 'Carp backup event' carp3.txt|wc
    148    1184   10212
➜  ~ rg 'Carp master event' carp3.txt|wc
     17     136    1173

Here the 17 master events match the number of "physical" interfaces we have. I am not sure about the backup events. Taking as an example which is a CARP interface with 4 IPs I get 24 entries. So it went 6 times for each VIP between BACKUP -> INIT -> BACKUP before it actually become MASTER. But the counts are not consistent…

Continuing with the logs:

... Interleaved with the carp master events it seems that carp handling kicks in:
Mar 18 15:37:43 pfSense1 php-fpm[31225]: /rc.carpbackup: HA cluster member "(10.7.11.1@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18
Mar 18 15:37:43 pfSense1 php-fpm[87138]: /rc.carpbackup: HA cluster member "(10.7.11.1@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18
Mar 18 15:37:43 pfSense1 php-fpm[87138]: /rc.carpbackup: HA cluster member "(10.7.11.129@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18
Mar 18 15:37:43 pfSense1 php-fpm[31225]: /rc.carpbackup: HA cluster member "(10.7.11.129@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18
... we see pages of those (275 entries), the last one roughly 2 minutes after the first one!

At this point we also see entries like this:
Mar 18 15:38:08 pfSense1 php-fpm[87138]: /rc.carpbackup: Stopping OpenVPN server instance on DMZ because of transition to CARP backup.
Mar 18 15:38:08 pfSense1 php-fpm[31225]: /rc.carpbackup: Stopping OpenVPN server instance on DMZ because of transition to CARP backup.
Mar 18 15:38:08 pfSense1 check_reload_status[408]: Reloading filter
Mar 18 15:38:08 pfSense1 kernel: ovpns1: link state changed to UP
Mar 18 15:38:08 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns1
Mar 18 15:38:08 pfSense1 check_reload_status[408]: Reloading filter
Mar 18 15:38:08 pfSense1 kernel: ovpns3: link state changed to UP
Mar 18 15:38:08 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns3
Mar 18 15:38:10 pfSense1 kernel: ovpns1: link state changed to DOWN
Mar 18 15:38:10 pfSense1 check_reload_status[408]: Reloading filter
Mar 18 15:38:10 pfSense1 kernel: ovpns1: link state changed to UP
Mar 18 15:38:10 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns1
Mar 18 15:38:13 pfSense1 kernel: ovpns3: link state changed to DOWN
Mar 18 15:38:13 pfSense1 kernel: ovpns1: link state changed to DOWN
Mar 18 15:38:13 pfSense1 kernel: ovpns3: link state changed to UP
Mar 18 15:38:13 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns3

At roughly 4 minutes after leaving persistent carp maintenance mode the system log becomes quiet again and the web interface is reachable again. According to the hypervisor and pfsense itself the CPU usage was relatively high due to all the restarts of services (?). While CARP had transitioned successfully to MASTER (as confirmed by the second firewall) pfsense did require minutes to handle the events. Routing etc during that time worked without any issues but the webiface was down etc…

I'll happily provide more information; just let me know what you need.


Related issues

Related to Bug #12227: Changing VHID on CARP VIP does not update VHID of related IP Alias VIPsResolvedViktor Gurov

Actions
Actions

Also available in: Atom PDF