Bug #16836: IPsec daemon can crash if a peer initiates two rekeys for the same child SA - pfSense - pfSense bugtracker

Bug #16836

closed

Added by David Hiebert 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: Christian McDonald
Category: IPsec
Target version: 2.9.0
Start date:
Due date:
% Done: 100%
Estimated time:
Plus Target Version: 26.03.1
Release Notes: Default
Affected Version:
Affected Architecture:

Description

Product / version
- pfSense Plus 25.11.1-RELEASE
- strongSwan version on 25.11.1: `strongswan-6.0.3` (confirmed via `pkg info strongswan`)
- strongSwan version on 26.03: `strongswan-6.0.3_1` (confirmed by launching the Netgate pfSense Plus 26.03 AWS Marketplace AMI and querying `pkg info strongswan`)
- The `_1` is a FreeBSD port revision bump; the CPE string still identifies the package as `strongswan:6.0.3`, `port_checkout_unclean: no`, and the upstream fix is not present.

Summary
Reproducible pattern of charon crashes on a pfSense Plus 25.11.1 IPsec concentrator. The crash signature matches upstream strongSwan issue strongswan/strongswan#2945 ("Crash caused if confused peer initiates two rekeyings for the same Child SA"), which was fixed in strongSwan 6.0.4 (released 2025-12-12). The crash has now been observed at least twice on the same host.

We have independently confirmed pfSense Plus 26.03 still bundles strongSwan 6.0.3 (port revision `_1`, no relevant patches). Request is that strongSwan >= 6.0.4 be shipped in a future pfSense Plus release or backported to the 25.11.x train.
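As a rough illustration of the version check behind this request, the sketch below parses a package name as printed by `pkg info strongswan` and compares the upstream version with 6.0.4. The function names and the regular expression are illustrative assumptions, not part of pfSense or strongSwan tooling. It deliberately ignores the `_N` port revision, so it cannot detect a fix that a port revision backports as a patch.

```python
import re

# Assumption: names look like "strongswan-6.0.3" or "strongswan-6.0.3_1",
# where a trailing "_N" is the FreeBSD port revision.
PKG_RE = re.compile(r"^strongswan-(\d+(?:\.\d+)*)(?:_(\d+))?$")

# Upstream release that carries the fix, per this report.
FIXED_IN = (6, 0, 4)


def upstream_version(pkg_name):
    """Return the upstream version as a tuple of ints, ignoring the port revision."""
    match = PKG_RE.match(pkg_name.strip())
    if match is None:
        raise ValueError(f"unrecognized package name: {pkg_name!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def has_upstream_fix(pkg_name):
    """True if the upstream version is at least FIXED_IN."""
    return upstream_version(pkg_name) >= FIXED_IN


for name in ("strongswan-6.0.3", "strongswan-6.0.3_1", "strongswan-6.0.4"):
    print(name, has_upstream_fix(name))
```

Both package names quoted in this report evaluate as not fixed, which matches the manual finding that the `_1` revision does not include the upstream change.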

Evidence

Kernel-level exit
```
kernel: pid <pid> (charon), jid 0, uid 0: exited on signal 6 (core dumped)
```
Signal 6 (SIGABRT) comes from charon's own abort() call: its internal signal handler caught a critical signal (SIGBUS, signal 10 on FreeBSD) and then aborted the process.
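For anyone who wants to detect this exit reactively, a minimal sketch of parsing the kernel line above might look like the following. The regular expression, the function name, and the sample pid are assumptions made for illustration. The signal table is hardcoded with FreeBSD numbers because signal 10 means something different on Linux.

```python
import re

# FreeBSD signal numbers relevant to this report (hardcoded on purpose:
# the numbering differs between operating systems).
FREEBSD_SIGNALS = {6: "SIGABRT", 10: "SIGBUS", 11: "SIGSEGV"}

EXIT_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<name>[^)]+)\), jid \d+, uid \d+: "
    r"exited on signal (?P<signo>\d+)(?P<core> \(core dumped\))?"
)


def parse_exit_line(line):
    """Return a dict describing a kernel process-exit line, or None if no match."""
    match = EXIT_RE.search(line)
    if match is None:
        return None
    signo = int(match.group("signo"))
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "signal": FREEBSD_SIGNALS.get(signo, f"signal {signo}"),
        "core_dumped": match.group("core") is not None,
    }


# Hypothetical pid, substituted for the <pid> placeholder in the report.
sample = "kernel: pid 12345 (charon), jid 0, uid 0: exited on signal 6 (core dumped)"
print(parse_exit_line(sample))
```

Note that the kernel line only ever shows the final SIGABRT; the original SIGBUS is visible only in charon's own log output.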

charon in-process stack (from ipsec.log immediately before abort)
Fatal frame chain on the crashing worker thread:
```
child_delete_create+0x31a
<- task_manager_v2_create+0x2b22
<- delete_child_sa_job_create_id+0x103
<- processor_create
<- thread_create
```

A coredump was preserved on the host but will not be shared (process memory of an IPsec daemon — contains session key material). A sanitized symbolic backtrace can be provided on request.

Sequence at time of crash
1. ~7 minutes before the crash: CHILD_SA on a site-to-site tunnel completed a rekey cycle cleanly (SPI A → SPI B; old SA transitioned REKEYED → DELETING → DELETED).
2. A second rekey cycle on the same tunnel entered REKEYED → DELETED state.
3. A CHILD_DELETE job was dispatched on the already-rekeyed CHILD_SA.
4. Worker thread faulted inside `child_delete_create`.
5. strongSwan's signal handler caught SIGBUS, logged "killing ourself, received critical signal", dumped the stack, and called abort().

Matches the mechanism described in strongswan/strongswan#2944: a peer driving two sequential rekeys on the same CHILD_SA, leaving the original SA destroyed while a delete job still references it.

Upstream references
- https://github.com/strongswan/strongswan/issues/2945 — fixed in 6.0.4 ("Prevent a crash if a confused peer rekeys a Child SA twice before sending a delete")
- https://github.com/strongswan/strongswan/discussions/2944 — mechanism description
- 6.0.5 adds a defensive follow-on fix: "Avoid an incorrect down event if deleting a rekeyed Child SA fails"
- 6.0.6 (2026-04-22) includes several unrelated CVE fixes

Requests

1. Ship strongSwan >= 6.0.4 in a future pfSense Plus release. 6.0.5 preferred for the follow-on fix; 6.0.6 adds CVE fixes worth having.
2. Backport consideration: a targeted backport of the 6.0.4 child-rekey fix to a 25.11.x package update would let deployments on the current train avoid a major version upgrade. Is this feasible?
3. Interim mitigation: are there `charon.strongswan.conf` tuning options (rekey margins, `delete_rekeyed` behavior, related options) that would reduce exposure while awaiting a fixed version?

Impact
Production IPsec concentrator serving site-to-site VPN tunnels. A charon crash drops all tunnels on the host until the daemon is restarted, causing service interruption for every tunnel on the concentrator.

What can be provided on request
- Sanitized backtrace (`thread apply all bt`, `info locals` on the failing frame) — can be shared via a non-public channel if needed
- Timing of prior occurrence
- Peer IKE implementation / vendor (we have identified the specific peer driving the double-rekey pattern)


Updated by Christian McDonald 3 months ago

Assignee set to Christian McDonald
Plus Target Version set to 26.03.1

Updated by Christian McDonald 3 months ago

Status changed from New to Feedback

Updated by Jim Pingle 3 months ago

Subject changed from charon (strongSwan) SIGBUS crash in child_delete_create during CHILD_SA rekey-delete on pfSense Plus 25.11.1-RELEASE — matches upstream strongswan#2945 (fixed in 6.0.4) to IPsec daemon can crash if a peer initiates two rekeys for the same child SA
Target version set to 2.9.0

Updated by David Hiebert 3 months ago

Update: this issue has been recurring across our pfSense Plus VPN concentrator fleet. Several occurrences to date on more than one host; the original report covered the first occurrence we caught with a usable core dump, and we have now captured a second.

Recurrence pattern

- Multiple crashes observed across two pfSense Plus VPN concentrators in our fleet (both 25.11.1-RELEASE, both strongSwan 6.0.3). Earlier occurrences were detected reactively (tunnels down, peer reports) without a captured core; only two have full cores so far.
- Most recent captured core: 2026-05-16, on a different concentrator than the host in the original report.
- The peer involved in the most recent crash is not the same peer as in the original report (different vendor, different site). The trigger is not specific to a single peer's IKE stack.

Same crash signature in every case we've inspected: SIGBUS caught by charon's in-process signal handler, charon then sends itself SIGABRT, and the `ipsec.log` stack trace bottoms out in the CHILD_SA rekey/delete path of `libcharon`.

Core dump analysis (most recent occurrence)

Inspected the most recent core with gdb against the stripped `libcharon.so.0` from pfSense Plus 25.11.1. Sysroot assembled from base libraries off the affected host. (Local gdb 17.1 on a different platform read the FreeBSD amd64 core fine; cross-platform core inspection works for this purpose.)

Fault location: identical to the original core, byte-for-byte.

- Faulting RIP: same offset within `libcharon.so.0` as the first core (`libcharon+0x7786a`)
- Faulting instruction: indirect virtual call through the CHILD_SA at vtable slot `0x168`
- Immediately following call in the emitted code: `child_rekey_conclude_rekeying@plt`

Sanitized disassembly of the fault region (no customer-identifying state — this is just the compiled instruction sequence, identical between cores):

mov    %rbx,%rdi                  ; rdi = CHILD_SA pointer
xor    %esi,%esi
call   *0x168(%rbx)               ; *** FAULT ***  (indirect call through vtable)
mov    %r12,%rdi
mov    %rbx,%rsi
call   child_rekey_conclude_rekeying@plt

Use-after-free confirmed at the byte level. In the most recent core, the memory pointed to by `%rbx` (the CHILD_SA the fault is calling a method on) is filled with `0x5a`, the jemalloc free-poison pattern. The vtable slot at `*0x168(%rbx)` therefore reads garbage, the call jumps to unmapped memory, and SIGBUS fires. The surviving sibling object passed as the second argument to `child_rekey_conclude_rekeying` is intact and still has a normal vtable.
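The byte-level check described above can be sketched as a small helper that measures how much of a memory region consists of the `0x5a` fill byte. This is an illustrative sketch only: the function names and the 0.9 threshold are assumptions, and the input is expected to be raw bytes already extracted from a core (for example, a region dumped from gdb), not a core file itself.

```python
# Fill byte reported in the core for freed memory (jemalloc junk-on-free pattern).
JUNK_FREE_BYTE = 0x5A


def poison_fraction(buf, poison=JUNK_FREE_BYTE):
    """Return the fraction of bytes in buf equal to the poison byte."""
    if not buf:
        return 0.0
    return buf.count(poison) / len(buf)


def looks_freed(buf, threshold=0.9):
    """Heuristic: treat a region as freed if it is almost entirely poison bytes.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    return poison_fraction(buf) >= threshold


# Synthetic data standing in for an object-sized region read from a core.
freed_region = bytes([JUNK_FREE_BYTE]) * 0x170
live_region = bytes(range(64))

print(looks_freed(freed_region), looks_freed(live_region))
```

A region that is entirely `0x5a` also explains the fault address: a pointer loaded from it would read as `0x5a5a5a5a5a5a5a5a`, which is not a mapped code address.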

This is exactly the failure mode strongSwan #2944 / #2945 describes: "a peer initiates two sequential rekeyings for the same Child SA... When the first replacement gets deleted, the code attempts to finalize rekeying with an already-destroyed original SA, causing the crash." The destroyed-original SA in our most recent core is the one filled with poison bytes; the not-yet-destroyed replacement is still live next to it.
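To make the quoted failure mode concrete, here is a deliberately simplified toy model of the dangling reference. It is not strongSwan's code, every class and function name in it is hypothetical, and the `guard` flag only illustrates the general idea of checking a reference before using it; it is not a description of the upstream fix. Python cannot exhibit a real use-after-free, so a `destroyed` flag and an exception stand in for the freed object and the SIGBUS.

```python
class DestroyedError(RuntimeError):
    """Raised when a destroyed toy object is used (stands in for the SIGBUS)."""


class ToyChildSA:
    def __init__(self, name):
        self.name = name
        self.state = "INSTALLED"
        self.destroyed = False
        self.rekeyed_from = None  # the SA this one replaces, if any

    def destroy(self):
        self.destroyed = True

    def set_state(self, state):
        if self.destroyed:
            raise DestroyedError(f"{self.name} used after destroy")
        self.state = state


def rekey(original, new_name):
    """Create a replacement SA that remembers which SA it replaces."""
    replacement = ToyChildSA(new_name)
    replacement.rekeyed_from = original
    return replacement


def delete_replacement(replacement, guard=False):
    """Delete a replacement and try to conclude rekeying on its original."""
    original = replacement.rekeyed_from
    if guard and (original is None or original.destroyed):
        return "skipped"
    original.set_state("REKEYED")  # fails if the original is already gone
    return "concluded"


original = ToyChildSA("original")
first = rekey(original, "first replacement")
second = rekey(original, "second replacement")  # second rekey of the same SA
original.destroy()  # original goes away while 'first' still points at it

try:
    delete_replacement(first)
except DestroyedError as exc:
    print("crash analogue:", exc)

print(delete_replacement(first, guard=True))
```

The point of the model is only that the first replacement keeps a reference to an object whose lifetime ended elsewhere, which is the shape of the problem the core dump shows.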

What this changes for prioritization

- Multiple recurrences across more than one host and more than one peer: this is reachable from at least two independent IKEv2 implementations under normal operation.
- Will continue to fire across this fleet until strongSwan 6.0.4+ ships in the pfSense Plus packages.
- No peer-side mitigation is available. There is nothing wrong with what the peers sent; the upstream code path is incorrect.

We are operating under a Plus Target Version of 26.03.1 per Christian's earlier note, which is appreciated. If there is any opportunity to ship the strongSwan 6.0.4 (or later) bump on the 25.11.x train as well, that would close the window faster for sites that haven't moved to 26.x yet.

Available on request

Happy to provide via a non-public channel:

- Sanitized full disassembly of the fault region (both cores)
- Sanitized 17-thread backtrace from each captured core
- Approximate timing of recurring occurrences
- Other host-side state that may help, e.g. swanctl version output, kernel ring snippets, build provenance for the strongSwan package as installed

Core files themselves contain peer-identifying data (preshared keys, IP addresses, identities) and will not be attached.
