Bug #12926: Changing LAGG type on CARP interfaces makes VIPs go to an "init" State - pfSense - pfSense bugtracker

Actions

Copy link

Bug #12926

closed

Changing LAGG type on CARP interfaces makes VIPs go to an "init" State

Added by Kris Phillips about 3 years ago. Updated over 1 year ago.

Status:

Duplicate

Priority:

Normal

Assignee:

Category:

Interfaces

Target version:

Start date:

Due date:

% Done:

Estimated time:

Plus Target Version:

Release Notes:

Default

Affected Version:

2.6.0

Affected Architecture:

All

Description

When changing a LAGG from any mode to another mode while it has child interfaces that are something like VLANs and CARP VIPs on them will cause CARP to go to an init status and never recover.

If you re-save the interface or reboot the firewall, they will go back to MASTER or BACKUP (depending on if it's primary or secondary). This can be worked around by failing over manually before making these changes and rebooting the firewall you're working on to avoid disruption, but this seems like a bug that should not present itself.

Bug was present in 2.6 of CE and 22.01 of pfSense Plus.

Files

Download all files

first_error_after_changing_LAGG_type.png (214 KB) first_error_after_changing_LAGG_type.png		Azamat Khakimyanov, 12/15/2022 02:16 AM
after_pressing_Reset_CARP_Demotion_Status.png (268 KB) after_pressing_Reset_CARP_Demotion_Status.png		Azamat Khakimyanov, 12/15/2022 02:20 AM
after_about_10_minutes.png (240 KB) after_about_10_minutes.png		Azamat Khakimyanov, 12/15/2022 02:23 AM
gateway_timeout.png (30.4 KB) gateway_timeout.png		Azamat Khakimyanov, 12/15/2022 02:33 AM

Related issues

Actions

Copy link

Updated by Viktor Gurov about 3 years ago

Status changed from New to Feedback
Affected Version set to 2.6.0

Unable to reproduce:

lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: OPT2
    options=6900bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    ether de:e1:87:63:8d:8a
    inet6 fe80::dce1:87ff:fe63:8d8a%lagg0 prefixlen 64 scopeid 0xc
    laggproto lacp lagghash l2,l3,l4
    laggport: vtnet3 flags=0<>
    groups: lagg
    media: Ethernet autoselect
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0.11: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: VLAN11
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    ether de:e1:87:63:8d:8a
    inet6 fe80::dce1:87ff:fe63:8d8a%lagg0.11 prefixlen 64 scopeid 0xd
    inet 10.22.22.100 netmask 0xffffff00 broadcast 10.22.22.255
    inet 10.22.22.22 netmask 0xffffff00 broadcast 10.22.22.255 vhid 1
    groups: vlan
    carp: MASTER vhid 1 advbase 1 advskew 0
    vlan: 11 vlanpcp: 0 parent interface: lagg0
    media: Ethernet autoselect
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

after changing the LAGG mode from LACP to ROUNDROBIN:

lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: OPT2
    options=6900bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    ether de:e1:87:63:8d:8a
    inet6 fe80::dce1:87ff:fe63:8d8a%lagg0 prefixlen 64 scopeid 0xc
    laggproto roundrobin lagghash l2,l3,l4
    laggport: vtnet3 flags=4<ACTIVE>
    groups: lagg
    media: Ethernet autoselect
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0.11: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: VLAN11
    options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
    ether de:e1:87:63:8d:8a
    inet6 fe80::dce1:87ff:fe63:8d8a%lagg0.11 prefixlen 64 scopeid 0xd
    inet 10.22.22.100 netmask 0xffffff00 broadcast 10.22.22.255
    inet 10.22.22.22 netmask 0xffffff00 broadcast 10.22.22.255 vhid 1
    groups: vlan
    carp: MASTER vhid 1 advbase 1 advskew 0
    vlan: 11 vlanpcp: 0 parent interface: lagg0
    media: Ethernet autoselect
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

Please provide more details about this issue.

Actions

Copy link

Updated by Kris Phillips about 3 years ago

Viktor Gurov wrote in #note-1:

Unable to reproduce:
[...]

after changing the LAGG mode from LACP to ROUNDROBIN:
[...]

Please provide more details about this issue.

Viktor,

When testing with a customer, they had several dozen VLAN interfaces (around 50). Likely this is needed to trigger this.

Actions

Copy link

Updated by Azamat Khakimyanov over 2 years ago

File first_error_after_changing_LAGG_type.png first_error_after_changing_LAGG_type.png added
File after_pressing_Reset_CARP_Demotion_Status.png after_pressing_Reset_CARP_Demotion_Status.png added
File after_about_10_minutes.png after_about_10_minutes.png added
File gateway_timeout.png gateway_timeout.png added
Status changed from Feedback to Confirmed

Tested on 22.01

I was able to reproduce this bug.
I've created HA cluster with LAGG interface on each node and 30 VLANs on these LAGG ports.
Changing LAGG type made HA cluster very unstable:
- after changing LAGG type 'Interface LAGG edit' page started to reload and finally I got '504 Gateway timeout'('gateway_timeout.png') but in /Status/Interfaces I saw that LAGG type were changed correctly and were changed immediately.

- I got 'CARP has detected a problem and this unit has a non-zero demotion status...' message and most of VLAN CARP VIPs were in INIT state (see attached 'first_error_after_changing_LAGG_type.png')

- after pressing 'Reset CARP Demotion Status' most of VLAN CARPs were in INIT state ('after_pressing_Reset_CARP_Demotion_Status.png')

- HA cluster were unstable - some ports of Primary were Master, then they became Backup and then Master again. I always got 'CARP has detected a problem and this unit has a non-zero demotion status...' messages. Step by step, after each pressing 'Reset CARP Demotion Status' the amount of VLAN CARP VIPs in INIT state became less and less ('after_about_10_minutes.png')

Primary node continued to switch to Backup mode and back again and again.

- and after about 20 minutes all VLAN CARP VIPs on Primary became Master and HA cluster became stable.

- pressing Save (without changing anything) for LAGG port, while HA cluster were unstable, change nothing (it didn't make HA cluster stable).

- after rebooting Primary node just after changing LAGG type on it (and getting unstable HA cluster), all VLAN CARP VIPs became Master and HA cluster became stable.

Actions

Copy link

Updated by Azamat Khakimyanov over 2 years ago

Tested on 22.05.

I restored the same HA cluster on current 22.05 and got the same result - after changing LAGG type HA cluster became unstable with lots of VLAN CARP VIPs in INIT state and then after 10-15 minutes HA cluster returned to normal.

These tests were made on Linux KVM (Intel i5-11400 with 24Gb of RAM and 512 Gb SSD, Ubuntu 22.10).
VMs had 4CPUs and 4Gb of RAM.
KVM doesn't support LACP, so I used switching between Failover and LoadBalance LAGG modes.

Actions

Copy link