Feature #11070
Design and Simplicity Changes to High Availability
Description
Currently there are several "pitfalls" to setting up an HA pair with pfSense that may benefit from some design changes to simplify configuration and maintenance.
Here are a few struggles:
1. Currently primary and secondary nodes do not sync interfaces. As such, if interfaces are created on the primary and secondary nodes in a different order (especially a problem with a lot of interfaces), the interface assignments end up out of sync, forcing users to recreate them and make sure the order matches exactly.
2. If the primary fails and the secondary becomes MASTER, users are not able to make any changes to the secondary until they restore the primary to the MASTER role. Otherwise they risk their configuration changes being overwritten when the primary comes back online or, if they added or changed interfaces, completely breaking their HA configuration.
3. HA setup is currently a long step-by-step process in which every action must be done correctly to avoid misconfiguration. End users often miss a step or try to "half bake" their setup into an unsupported configuration. Since we only support doing it one way, we should only allow that one way: rather than forcing users through all of the steps, hide any unsupported options, or provide an "advanced options (unsupported)" button so that they can feed in their own values while making it clear that this is not supported.
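To illustrate the first struggle, here is a minimal sketch of the kind of check that could flag interface ordering mismatches before they break a pair. The data shape (a mapping from an interface slot such as "opt1" to the description the admin gave it) and the function name are assumptions for illustration, not how pfSense stores or compares assignments today.

```python
def find_interface_mismatches(primary, secondary):
    """Return human-readable mismatches between two interface maps.

    Each argument maps an interface slot (e.g. "opt1") to its description
    (e.g. "DMZ"). This data shape is hypothetical.
    """
    problems = []
    for slot in sorted(set(primary) | set(secondary)):
        p = primary.get(slot)
        s = secondary.get(slot)
        if p is None:
            problems.append(f"{slot}: missing on primary (secondary has {s})")
        elif s is None:
            problems.append(f"{slot}: missing on secondary (primary has {p})")
        elif p != s:
            problems.append(f"{slot}: primary has {p}, secondary has {s}")
    return problems


# Example: SYNC and DMZ were created in the opposite order on the secondary.
primary = {"wan": "WAN", "lan": "LAN", "opt1": "SYNC", "opt2": "DMZ"}
secondary = {"wan": "WAN", "lan": "LAN", "opt1": "DMZ", "opt2": "SYNC"}
mismatches = find_interface_mismatches(primary, secondary)
for line in mismatches:
    print(line)
```

With the example data this reports opt1 and opt2 as swapped, which is exactly the situation that currently forces users to recreate interfaces by hand.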
Here is a revised step-by-step to simplify the process:
1. Add an option during initial setup in pfSense to set up a failover node with a different onboarding flow. This should create ONLY the PFSYNC interface on the secondary unit and bypass all other steps. It should include instructions to connect only the SYNC interface and no other interfaces at this time, and should provide a randomly generated, secure pfsync username and password.
2. Connect up Primary and Secondary to their shared sync interface ONLY
3. Log into Primary unit and enter the secondary unit's sync interface IP address and the randomly generated pfsync username and password.
4. Press "enable" and prompt the user to verify that the interfaces match across units. Then ask, for each interface, for a shared CARP VIP and an interface IP address for the secondary unit to use, with validation that each address is unused and that the user didn't enter the same address in two places.
5. The primary unit should copy its configuration to the secondary unit in its entirety, with the exception of changing the VIP skew and changing the interface IP addresses to the addresses defined in step 4. This should also update any DHCP pools to use the CARP VIP as the gateway, update any other relevant fields, restart the affected services, and add a flag in the config marking it as a backup unit.
6. The secondary unit should prompt for all other interfaces to be attached to the firewall and then reboot. Then, the primary unit should validate that it is able to send and receive CARP advertisements between the units, to avoid MASTER/MASTER situations, and that there are no other odd CARP/VRRP messages appearing on any interface. This verifies that things such as badly behaving ISP equipment don't interfere with CARP advertisements and that another HA device isn't broadcasting messages on the same interface. Once this is verified, the primary and secondary units can begin operating normally in their HA configuration. If there is a problem receiving CARP messages in either direction, or abnormal messages are seen, the user should be prompted at this step to verify connectivity, and HA should be disabled temporarily until the user presses a "retry" button.
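The address validation described in step 4 could look roughly like the sketch below. It is only an illustration: the function name and input shape are made up, and the `in_use` set stands in for whatever mechanism would actually detect addresses already on the network (for example the primary's own interface addresses or a probe of the segment).

```python
import ipaddress


def validate_ha_addresses(entries, in_use):
    """Check the per-interface addresses a user enters in step 4.

    entries: dict mapping interface name -> (carp_vip, secondary_ip, subnet)
    in_use:  set of address strings already known to be taken (stand-in)
    Returns a list of error strings; an empty list means the input passed.
    """
    errors = []
    seen = {}  # address -> field where it was first entered
    for ifname, (vip, sec_ip, subnet) in entries.items():
        net = ipaddress.ip_network(subnet)
        for label, raw in (("CARP VIP", vip), ("secondary IP", sec_ip)):
            addr = ipaddress.ip_address(raw)
            where = f"{ifname} {label}"
            if addr not in net:
                errors.append(f"{where} {addr} is outside {net}")
            if str(addr) in in_use:
                errors.append(f"{where} {addr} is already in use")
            if addr in seen:
                errors.append(f"{where} {addr} duplicates {seen[addr]}")
            else:
                seen[addr] = where
    return errors


# Example: the WAN secondary IP was mistakenly set to the WAN CARP VIP.
entries = {
    "lan": ("192.168.1.1", "192.168.1.3", "192.168.1.0/24"),
    "wan": ("203.0.113.10", "203.0.113.10", "203.0.113.0/24"),
}
in_use = {"192.168.1.2"}  # e.g. the primary's existing LAN address
errors = validate_ha_addresses(entries, in_use)
for e in errors:
    print(e)
```

Collecting every error before returning, rather than stopping at the first, lets the setup screen show the user all the problems in one pass.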
Additional changes:
1. Upon primary unit failure, the secondary unit should allow configuration changes, which are then re-synced back to the primary in a "one time" reverse sync before the primary re-assumes MASTER status. However, any configuration changes that would cause issues, such as changing interface assignments, should be warned against or blocked entirely. This could be achieved with an "isbackupunit" flag, or something similar, that is present in the secondary unit's configuration file and triggers PHP to present a message on the Interfaces --> Assignments page and on any interface configuration pages stating that changes here could break HA. There should be an option to go to the HA configuration screen to disable or break the HA configuration.
2. There should be an option in the HA configuration screen on the secondary unit to "break" the HA pair and switch the secondary unit to a primary or solo unit. This would change the CARP VIP skews to match what a primary would have and turn off all HA features except the CARP VIPs (so that connectivity isn't broken). Users can then go through the HA setup process again with a new secondary after they put in a replacement unit or manually migrate away from the CARP VIPs and delete them later on.
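As a rough sketch of the two changes above, the snippet below models the config as a plain dictionary. The key names (`isbackupunit`, `hasync_enabled`, `carp_vips`) are hypothetical, and the skew offset of 100 is an assumption about how far the secondary's skews sit above the primary's; a real implementation would live in the pfSense PHP code and its XML config.

```python
import copy

# Assumption: the secondary's CARP skews are the primary's plus this offset.
SECONDARY_SKEW_OFFSET = 100


def interface_page_warning(config):
    """Return a warning for interface pages when the backup flag is set."""
    if config.get("isbackupunit"):
        return ("This unit is the backup node of an HA pair. Changing "
                "interfaces here could break HA. Use the HA configuration "
                "screen to disable or break the pair first.")
    return None


def break_ha_pair(config):
    """Return a copy of the config converted to a primary/solo unit.

    CARP VIPs are kept so connectivity is not broken; only their skews
    are lowered and the HA sync features are switched off.
    """
    new = copy.deepcopy(config)
    new["isbackupunit"] = False
    new["hasync_enabled"] = False
    for vip in new["carp_vips"]:
        vip["advskew"] = max(0, vip["advskew"] - SECONDARY_SKEW_OFFSET)
    return new


backup = {
    "isbackupunit": True,
    "hasync_enabled": True,
    "carp_vips": [{"interface": "lan", "address": "192.168.1.1", "advskew": 100}],
}
solo = break_ha_pair(backup)
print(interface_page_warning(backup))
print(solo["carp_vips"])
```

Returning a modified copy instead of editing in place keeps the original config available if the user cancels the "break" action.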
It is important to note that not all of these things need to be done for this to be an improvement. Even adding warnings to various screens on the secondary unit, triggered by a flag marking it as the backup unit and telling users not to make changes until the primary is restored, would be helpful.
Updated by Jim Pingle over 4 years ago
- Tracker changed from Bug to Feature
- Category set to High Availability
- Status changed from New to Rejected
- Affected Architecture deleted (All)
I don't see most of these as being feasible. Some are more error prone than the current design, others would be quite difficult to code. A lot of effort for not a lot of gain.
I imagine a lot of this will naturally change once we have a viable API, probably get redesigned in a much more fundamental way.