Bug #16864
Kea DHCP Server in HA Mode Completely Unreliable - Fails Constantly (closed)
Description
- Netgate Support Ticket — Kea DHCP HA Lease Conflict Failures
- Environment
- pfSense Version: CE 2.8.1-RELEASE
- Hardware: Two-node HA pair
- Primary (pfs601i): 192.168.156.1 / 10.200.0.2
- Standby (pfs602i): 192.168.156.2 / 10.200.0.3
- HA Mode: Hot-standby
- DHCP Backend: Kea DHCP
- Subnets: 10.200.0.0/16, 172.16.108.0/22, 172.16.0.0/22, 192.168.200.0/24
- Client Base: Mixed — Apple devices (iOS 14+/macOS Ventura+ with MAC randomization), Android, Windows, UniFi U7 Pro APs
- Problem Description
Every morning during peak hours, users are unable to obtain IP addresses via DHCP, resulting in complete loss of connectivity. The failures are caused by `HA_LEASE_UPDATE_CONFLICT` errors with `ResourceBusy` (error code 4) on the standby node, causing the HA pair to accumulate conflicts until it reaches the `terminated` state and stops serving leases entirely.
The issue is reproducible daily and affects all client types on the primary subnet.
- Root Causes Identified
- 1. HA Resync Failure After Any Interruption
Any restart of either Kea node — whether planned or unplanned — causes the lease databases to diverge. The standby node does not automatically resync from the primary when it comes back online. Subsequent lease update attempts from the primary are rejected by the standby with `ResourceBusy` because the standby holds conflicting lease records. There is no automatic mechanism to resolve this divergence without manual intervention.
- 2. Apple MAC Address Randomization Interaction
iOS 14+ and macOS Ventura+ rotate MAC addresses per network by default. With the default 24-hour maximum lease time, old leases from rotated MACs accumulate in the standby's database overnight. When clients reconnect in the morning with new MACs, the primary attempts to issue new leases but the standby rejects updates because it still holds the old conflicting records.
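The effect of lease lifetime on this overnight accumulation can be sketched with simple arithmetic. The function below is a hypothetical illustration, not Kea code. It assumes each abandoned MAC leaves behind one lease that stays valid until its lifetime runs out, and the evening timestamps are invented example data.

```python
from datetime import datetime, timedelta

def stale_leases_at(check_time, issue_times, lease_seconds):
    """Count leases issued at issue_times that are still unexpired at check_time.

    Hypothetical model: every entry is a lease bound to a MAC address the
    client has since abandoned, so any unexpired entry is a stale record.
    """
    lifetime = timedelta(seconds=lease_seconds)
    return sum(1 for issued in issue_times if issued <= check_time < issued + lifetime)

# Invented example data: three devices last renewed the previous evening.
evening = [
    datetime(2025, 1, 6, 17, 0),
    datetime(2025, 1, 6, 19, 30),
    datetime(2025, 1, 6, 22, 0),
]
morning = datetime(2025, 1, 7, 8, 0)

stale_24h = stale_leases_at(morning, evening, 86400)  # 24-hour lease time
stale_4h = stale_leases_at(morning, evening, 14400)   # 4-hour lease time
print(stale_24h, stale_4h)
```

Under these assumptions every evening lease is still on file at 08:00 with a 24-hour lifetime, while a 4-hour lifetime lets all of them expire before the morning peak.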
- 3. wait-backup-ack Defaults to True With No GUI Exposure
The default `wait-backup-ack: true` causes the primary to hold DHCP responses until the standby acknowledges each lease update. When the standby is in a conflict state or restarting, this blocks all DHCP responses to clients — a complete outage rather than a degraded state. This setting is not exposed anywhere in the pfSense GUI and requires direct editing of `/etc/inc/services.inc` to change.
- 4. ip-reservations-unique Hardcoded to False
The pfSense-generated `kea-dhcp4.conf` hardcodes `ip-reservations-unique: false` with three identifier types (`hw-address`, `client-id`, `duid`). For networks with no static reservations this is unnecessary and actively harmful — it allows Kea to maintain multiple conflicting lease records for the same client across different identifier types, compounding the MAC rotation problem. This setting is also not exposed in the GUI.
- 5. Low Default HA Thresholds
The default `max-rejected-lease-updates` of 10 (GUI default 15) is far too low for environments with Apple devices or infrastructure that reprovisions simultaneously. On a busy morning, this threshold is reached within seconds, causing the HA pair to transition to `terminated` state and stop serving leases to all clients.
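A rough model shows why a burst of reconnecting clients exhausts a small threshold so quickly. This is a sketch under stated assumptions, not Kea's implementation: it assumes only distinct clients count toward `max-rejected-lease-updates`, that the threshold is a positive number, and that counters do not expire during the burst. The function name and the client identifiers are hypothetical.

```python
def updates_until_terminated(rejected_clients, threshold):
    """Return how many rejected updates arrive before the threshold trips.

    rejected_clients: iterable of client identifiers (e.g. MAC strings), one
    per rejected lease update, in arrival order.
    Assumption: only distinct clients count toward the threshold, and
    counters never expire within the window being modelled.
    Returns None if the threshold is never reached.
    """
    seen = set()
    for position, client in enumerate(rejected_clients, start=1):
        seen.add(client)
        if len(seen) >= threshold:
            return position
    return None

# Invented example: 40 devices, each rejected once as it reconnects.
burst = [f"client-{n}" for n in range(40)]
print(updates_until_terminated(burst, 10))   # trips on the 10th rejection
print(updates_until_terminated(burst, 100))  # None: this burst never trips it
```

In this model a threshold of 10 is consumed by the first ten distinct clients, regardless of how many more are waiting behind them.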
- 6. No Automatic Post-Restart Resync
When a node restarts and rejoins the HA pair, there is no automatic mechanism to sync the lease database from the primary to the standby before the standby begins processing lease updates. The standby immediately starts rejecting updates it cannot reconcile, rather than completing a sync first.
- Impact
- Complete DHCP outage for all clients during morning peak hours
- Manual intervention required daily (service restart or forced ha-sync)
- Restarting services to resolve conflicts triggers additional conflict storms, worsening the outage
- ISP environment with paying customers affected
- Workarounds Applied
The following changes partially mitigated the issue but did not fully resolve it:
| Change | Method | Result |
| -------- | -------- | -------- |
| Lease times reduced to 14400 (4hr) | pfSense GUI | Reduced overnight stale lease accumulation |
| Max Rejected Updates raised to 100 | pfSense GUI | Prevented premature terminated state |
| Max Unacked Clients raised to 50 | pfSense GUI | Reduced false partner-down transitions |
| `wait-backup-ack: false` | Direct edit of `/etc/inc/services.inc` | Prevented client blocking during standby issues |
| `ip-reservations-unique: true` | Direct edit of `/etc/inc/services.inc` | Reduced duplicate lease record conflicts |
| Static reservations for all infrastructure APs | pfSense GUI | Eliminated AP reprovisioning conflicts |
| Kea HA disabled entirely | pfSense GUI | Final resolution — single node now serving DHCP |
Note: The `services.inc` edits are overwritten by firmware upgrades and require reapplication after every pfSense update.
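Because these edits can be silently reverted, it may help to check the generated configuration after each upgrade. The following is a minimal sketch: `find_drift` and the two `EXPECTED_*` tables are hypothetical names, the sample configuration is invented, and the code assumes the usual Kea layout of a top-level `Dhcp4` map with the HA settings nested under `hooks-libraries`. A real `kea-dhcp4.conf` may contain comments that `json.loads` rejects, so the file may need preprocessing first.

```python
import json

# Values this ticket's workarounds rely on; adjust to local policy.
EXPECTED_HA = {"wait-backup-ack": False}
EXPECTED_GLOBAL = {"ip-reservations-unique": True}

def find_drift(config):
    """Return a list of (setting, expected, actual) tuples that differ.

    config: parsed kea-dhcp4.conf content as a dict.
    A setting that is absent is reported with actual=None, meaning the
    file does not set it explicitly and the server default applies.
    """
    dhcp4 = config.get("Dhcp4", {})
    drift = []
    for key, expected in EXPECTED_GLOBAL.items():
        actual = dhcp4.get(key)
        if actual != expected:
            drift.append((key, expected, actual))
    for hook in dhcp4.get("hooks-libraries", []):
        for ha in hook.get("parameters", {}).get("high-availability", []):
            for key, expected in EXPECTED_HA.items():
                actual = ha.get(key)
                if actual != expected:
                    drift.append((key, expected, actual))
    return drift

# Inline sample standing in for the generated file (contents invented).
sample = json.loads("""
{"Dhcp4": {"ip-reservations-unique": false,
           "hooks-libraries": [{"library": "libdhcp_ha.so",
             "parameters": {"high-availability": [{"mode": "hot-standby"}]}}]}}
""")
for setting, expected, actual in find_drift(sample):
    print(f"{setting}: expected {expected!r}, found {actual!r}")
```

Running a check like this after an update would flag when the `services.inc` edits need to be reapplied.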
- Suggested Fixes / Feature Requests
- Fix 1 — Automatic Resync on Standby Recovery
When the standby node transitions from any non-operational state back to `ready`, it should automatically perform a full lease sync from the primary before entering `load-balancing` or `hot-standby` mode. This would prevent the database divergence that causes conflict storms after any restart.
Kea config parameter: `sync-leases: true` should be enforced on standby recovery, not just at initial startup.
- Fix 2 — Expose wait-backup-ack in the GUI
`wait-backup-ack` should be a configurable option in the pfSense HA settings GUI. The default of `true` is inappropriate for most production environments — when the standby has issues, clients should not be blocked from receiving IP addresses. Defaulting to `false` or exposing the setting prominently would prevent outages caused by this behavior.
Suggested GUI location: Services > DHCP Server > Settings > High Availability > Advanced Options
- Fix 3 — Expose ip-reservations-unique in the GUI
`ip-reservations-unique` should not be hardcoded to `false` in `services.inc`. For networks without static reservations this setting actively causes harm. It should default to `true` and only be set to `false` when the operator explicitly configures multiple reservation identifier types.
Suggested GUI location: Services > DHCP Server > Settings > General or Advanced
- Fix 4 — Nightly Lease Reclamation and Resync
pfSense should provide a built-in scheduled task option to perform lease reclamation and ha-sync from primary to standby on a configurable schedule (e.g., 4am daily). This would clear stale lease records before morning peak hours and prevent the overnight accumulation that causes conflict storms.
Suggested GUI location: Services > DHCP Server > Settings > Maintenance
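As a rough illustration of what such a scheduled task might send to the standby, the sketch below builds the two control commands in order. The function name is hypothetical, the command names and argument keys are taken from the Kea command reference as best understood here and should be verified against the installed Kea version, and the peer name `pfs601i` is the primary as it appears in the logs in this ticket. The sketch only builds the JSON; how it is delivered (control socket or HTTP) is left out.

```python
import json

def nightly_maintenance_commands(primary_name, sync_timeout_seconds=60):
    """Build the commands a nightly job on the standby might send, in order.

    primary_name: HA peer name to pull leases from (caller-supplied).
    sync_timeout_seconds: upper bound passed as "max-period" to ha-sync.
    """
    reclaim = {
        "command": "leases-reclaim",
        "service": ["dhcp4"],
        # remove=True asks for expired leases to be deleted outright
        "arguments": {"remove": True},
    }
    sync = {
        "command": "ha-sync",
        "service": ["dhcp4"],
        "arguments": {"server-name": primary_name, "max-period": sync_timeout_seconds},
    }
    return [reclaim, sync]

for command in nightly_maintenance_commands("pfs601i"):
    print(json.dumps(command))
```

Reclaiming first and syncing second means the standby drops its expired records before it pulls the primary's current lease set.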
- Fix 5 — Raise Default HA Thresholds
The default `max-rejected-lease-updates` of 10 is too aggressive for networks with any significant client churn (Apple devices, BYOD, IoT). Recommend raising the default to at least 50, with clear documentation on the consequences of the terminated state.
- Fix 6 — Conflict Resolution on Resync
When `ha-sync` is executed, conflicts on the standby should be automatically resolved in favor of the primary rather than requiring manual `lease4-del` commands for each conflicting record. The primary should be authoritative and the standby should accept its state unconditionally during a sync operation.
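The manual procedure this fix would replace can be sketched as a list of commands: one `lease4-del` per conflicting address, followed by a single `ha-sync`. The function name is hypothetical and the argument keys are assumed from the Kea command reference, so they should be checked against the installed version. The address in the example is the one from the log entries in this ticket.

```python
import ipaddress
import json

def manual_cleanup_commands(conflicting_ips, primary_name):
    """Build lease4-del commands for each conflicting address, then ha-sync.

    conflicting_ips: addresses the standby refused to update (caller-supplied).
    primary_name: HA peer name to resync from.
    Addresses are validated and de-duplicated before any command is built.
    """
    unique = sorted({ipaddress.IPv4Address(ip) for ip in conflicting_ips})
    commands = [
        {"command": "lease4-del", "service": ["dhcp4"], "arguments": {"ip-address": str(ip)}}
        for ip in unique
    ]
    commands.append({
        "command": "ha-sync",
        "service": ["dhcp4"],
        "arguments": {"server-name": primary_name, "max-period": 60},
    })
    return commands

# The same address reported twice yields a single delete plus the sync.
for command in manual_cleanup_commands(["10.200.14.140", "10.200.14.140"], "pfs601i"):
    print(json.dumps(command))
```

With hundreds of conflicting records this list grows by one command per address, which is the per-record effort the fix asks Kea to make unnecessary.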
- Relevant Log Entries
- Typical morning conflict storm (pfs601i logs):
```
WARN [kea-dhcp4.ha-hooks] HA_LEASE_UPDATE_CONFLICT pfs601i: lease update
[hwtype=1 60:3e:5f:80:81:8b], cid=[01:60:3e:5f:80:81:8b] sent to pfs602i
returned conflict status code: ResourceBusy: IP address:10.200.14.140
could not be updated. (error code 4)
```
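To size a conflict storm from entries like the one above, the log can be tallied per address and per client. This is a minimal sketch that assumes the wrapped, multi-line layout shown in this ticket. The regular expression and function name are hypothetical and may need adjusting for other Kea versions or syslog prefixes.

```python
import re
from collections import Counter

# Matches one conflict entry even when it is wrapped across several lines.
CONFLICT_RE = re.compile(
    r"HA_LEASE_UPDATE_CONFLICT.*?"
    r"hwtype=\d+\s+(?P<mac>[0-9a-fA-F:]{17})\].*?"
    r"IP address:(?P<ip>\d+\.\d+\.\d+\.\d+)",
    re.DOTALL,
)

def summarize_conflicts(log_text):
    """Tally conflicts per IP and collect the distinct client MACs involved."""
    per_ip = Counter()
    macs = set()
    for match in CONFLICT_RE.finditer(log_text):
        per_ip[match["ip"]] += 1
        macs.add(match["mac"].lower())
    return per_ip, macs

sample = """WARN [kea-dhcp4.ha-hooks] HA_LEASE_UPDATE_CONFLICT pfs601i: lease update
[hwtype=1 60:3e:5f:80:81:8b], cid=[01:60:3e:5f:80:81:8b] sent to pfs602i
returned conflict status code: ResourceBusy: IP address:10.200.14.140
could not be updated. (error code 4)
"""

per_ip, macs = summarize_conflicts(sample)
print(dict(per_ip), sorted(macs))
```

The number of distinct MACs is the figure to compare against the configured rejected-update threshold.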
- Standby rejecting updates after restart (pfs602i logs):
```
WARN [kea-dhcp4.lease-cmds-hooks] LEASE_CMDS_UPDATE4_CONFLICT lease4-update
command failed due to conflict (parameters: { "hostname": "pauls-mbp",
"hw-address": "60:3e:5f:80:81:8b", "ip-address": "10.200.14.140",
"origin": "ha-partner", "valid-lft": 86400 },
reason: ResourceBusy: IP address:10.200.14.140 could not be updated.)
```
- Heartbeat failure triggering cascade:
```
WARN [kea-dhcp4.ha-hooks] HA_HEARTBEAT_COMMUNICATIONS_FAILED pfs602i:
failed to send heartbeat to pfs601i: Operation timed out
WARN [kea-dhcp4.ha-hooks] HA_COMMUNICATION_INTERRUPTED pfs602i:
communication with pfs601i is interrupted
```
- Community Reports of Same Issue
Multiple users are reporting the same Kea HA instability:
- https://forum.netgate.com/topic/197056/kea-dhcp-server-in-ha-mode-drops-50-of-dhcp-requests
- https://forum.netgate.com/topic/187408/so-many-issues-with-kea-dhcp
- https://forum.netgate.com/topic/188337/kea-dhcp-stops-working
- https://forum.pfsense.com/topic/195347/seeing-kea-dhcp-issues-after-upgrade-to-24-11/20
- https://redmine.pfsense.org/issues/15956
- https://redmine.pfsense.org/issues/15328
- Current Status
Kea HA has been disabled. pfs601i is serving DHCP as a single node. DHCP is stable but redundancy has been sacrificed. We will re-enable HA when the above issues are addressed in a future pfSense release.
We are happy to provide full debug logs if useful for diagnosis.
Updated by Jim Nitterauer 17 days ago
Why is this rejected? Zero explanation. The HA service still fails, with others experiencing similar issues. I really don't understand. There is a race condition that causes the service to quit doling out IPs and crash. The new service starts before the old one is gracefully shut down and chaos ensues. Seems like you don't want to investigate and fix the issue.
Updated by Jim Nitterauer 14 days ago
Makes it easy to just punt and avoid fixing things. Thanks for making it so easy for your end users to report issues with the platform. Guess your processes are more important than actually fixing issues within the platform. I am not the only one who has experienced these problems.