Regression #12961: CARP event storm when leaving persistent CARP maintenance mode - pfSense - pfSense bugtracker

Actions

Copy link

Regression #12961

closed

CARP event storm when leaving persistent CARP maintenance mode

Added by Florian Apolloner over 3 years ago. Updated about 2 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Viktor Gurov

Category:

CARP

Target version:

2.7.0

Start date:

Due date:

% Done:

100%

Estimated time:

Plus Target Version:

22.05

Release Notes:

Default

Affected Version:

2.6.0

Affected Architecture:

Description

Hi,

this is a very weird issue so I will try my best to describe it. I think this is a regression that we are seeing after converting our box to pfSense-Plus but I am not to sure (could also be that it is simply a bug and not a regression).

We run a HA setup with many CARP IPs (basically one for every interface and we have 17 interfaces. Onto those CARP interfaces we do stack virtual IPs, around O(10) per CARP IP. I have verified that the CARP and VIP subnet masks do match the parent interface as per https://docs.netgate.com/pfsense/en/latest/firewall/virtual-ip-address-comparison.html .

When I tried to leave persistent CARP maintenance mode on the primary instance (which would then take over master status again) the system log started to fill with:

Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 kernel: carp: 8@vtnet2.511: BACKUP -> INIT (hardware interface up)
Mar 18 15:37:41 pfSense1 kernel: carp: 8@vtnet2.511: INIT -> BACKUP (initialization complete)
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp backup event
Mar 18 15:37:41 pfSense1 kernel: carp: 18@vtnet2.711: BACKUP -> INIT (hardware interface up)
Mar 18 15:37:41 pfSense1 kernel: carp: 18@vtnet2.711: INIT -> BACKUP (initialization complete)
... long list of carp backup events and multiple message for the same "8@vtnet2.511" identifier (probably for the VIPs as well)
Mar 18 15:37:41 pfSense1 kernel: carp: 13@vtnet2.720: BACKUP -> MASTER (preempting a slower master)
Mar 18 15:37:41 pfSense1 kernel: carp: 200@vtnet2.708: BACKUP -> MASTER (preempting a slower master)
... list of all interfaces becoming master
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp master event
Mar 18 15:37:41 pfSense1 check_reload_status[408]: Carp master event
... many carp master events

All in all this log shows:

➜  ~ rg 'Carp backup event' carp3.txt|wc
    148    1184   10212
➜  ~ rg 'Carp master event' carp3.txt|wc
     17     136    1173

Here the 17 master events match the number of "physical" interfaces we have. I am not sure about the backup events. Taking 8@vtnet2.511 as an example which is a CARP interface with 4 IPs I get 24 entries. So it went 6 times for each VIP between BACKUP -> INIT -> BACKUP before it actually become MASTER. But the counts are not consistent…

Continuing with the logs:

... Interleaved with the carp master events it seems that carp handling kicks in:
Mar 18 15:37:43 pfSense1 php-fpm[31225]: /rc.carpbackup: HA cluster member "(10.7.11.1@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18
Mar 18 15:37:43 pfSense1 php-fpm[87138]: /rc.carpbackup: HA cluster member "(10.7.11.1@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18
Mar 18 15:37:43 pfSense1 php-fpm[87138]: /rc.carpbackup: HA cluster member "(10.7.11.129@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18
Mar 18 15:37:43 pfSense1 php-fpm[31225]: /rc.carpbackup: HA cluster member "(10.7.11.129@vtnet2.711): (INTERN)" has resumed CARP state "BACKUP" for vhid 18
... we see pages of those (275 entries), the last one roughly 2 minutes after the first one!

At this point we also see entries like this:

Mar 18 15:38:08 pfSense1 php-fpm[87138]: /rc.carpbackup: Stopping OpenVPN server instance on DMZ because of transition to CARP backup.
Mar 18 15:38:08 pfSense1 php-fpm[31225]: /rc.carpbackup: Stopping OpenVPN server instance on DMZ because of transition to CARP backup.
Mar 18 15:38:08 pfSense1 check_reload_status[408]: Reloading filter
Mar 18 15:38:08 pfSense1 kernel: ovpns1: link state changed to UP
Mar 18 15:38:08 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns1
Mar 18 15:38:08 pfSense1 check_reload_status[408]: Reloading filter
Mar 18 15:38:08 pfSense1 kernel: ovpns3: link state changed to UP
Mar 18 15:38:08 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns3
Mar 18 15:38:10 pfSense1 kernel: ovpns1: link state changed to DOWN
Mar 18 15:38:10 pfSense1 check_reload_status[408]: Reloading filter
Mar 18 15:38:10 pfSense1 kernel: ovpns1: link state changed to UP
Mar 18 15:38:10 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns1
Mar 18 15:38:13 pfSense1 kernel: ovpns3: link state changed to DOWN
Mar 18 15:38:13 pfSense1 kernel: ovpns1: link state changed to DOWN
Mar 18 15:38:13 pfSense1 kernel: ovpns3: link state changed to UP
Mar 18 15:38:13 pfSense1 check_reload_status[408]: rc.newwanip starting ovpns3

At roughly 4 minutes after leaving persistent carp maintenance mode the system log becomes quiet again and the web interface is reachable again. According to the hypervisor and pfsense itself the CPU usage was relatively high due to all the restarts of services (?). While CARP had transitioned successfully to MASTER (as confirmed by the second firewall) pfsense did require minutes to handle the events. Routing etc during that time worked without any issues but the webiface was down etc…

I'll happily provide more information; just let me know what you need.

Related issues

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

pfSense

Custom queries

Regression #12961

CARP event storm when leaving persistent CARP maintenance mode

Updated by Steffen Wagner over 3 years ago

Updated by Florian Apolloner about 3 years ago

Updated by Florian Apolloner about 3 years ago

Updated by Florian Apolloner about 3 years ago

Updated by Viktor Gurov about 3 years ago

Updated by Viktor Gurov about 3 years ago

Updated by Viktor Gurov about 3 years ago

Updated by Jim Pingle about 3 years ago

Updated by Florian Apolloner about 3 years ago

Updated by Viktor Gurov about 3 years ago

Updated by Florian Apolloner about 3 years ago

Updated by Viktor Gurov about 3 years ago

Updated by Jim Pingle about 3 years ago

Updated by Viktor Gurov about 3 years ago

Updated by Florian Apolloner about 3 years ago

Updated by Bill Hughes about 3 years ago

Updated by Jim Pingle about 3 years ago

Updated by Jim Pingle about 2 years ago