Bug #16824
closed
dpinger gateway monitoring fails after IPsec VTI reload
Description
Environment
- pfSense 2.8.1-RELEASE on FreeBSD 15.0
- dpinger 3.3 (FreeBSD port, built 2025-05-22)
- Multiple IPsec VTI tunnels with gateways monitored by dpinger
- Standard pfSense gateway monitoring (no custom configuration)
Symptoms
Gateways monitored over IPsec VTI interfaces eventually stop reporting status.
The pfSense UI shows them as "Pending" or "Unknown" indefinitely. The only
remediation through normal channels is to restart dpinger
(Status > Services, or setup_gateways_monitor() via PHP shell).
The failure presents in three distinct modes:
- Process missing: dpinger process is gone entirely. No process, no PID
file, no Unix status socket on disk. pfSense never respawns it.
- Process hung: dpinger process is running and the Unix status socket
exists, but querying the socket times out (the usocket_thread is dead
or blocked).
- Process zombie: dpinger process is running and the status socket
responds, but it reports latency=0 stddev=0 loss=100. A manual ICMP
ping with the exact same -S bind_addr monitor_addr parameters that
dpinger uses succeeds with normal latency. The send/recv threads are
orphaned -- they hold file descriptors that are no longer wired to a
live interface.
The third mode is the most insidious because every external indicator
(process, socket, ICMP reachability) appears healthy.
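As a rough illustration of the third mode, a status-socket reply can be screened for the zero-latency, full-loss signature. This is a minimal sketch: it assumes the reply is a single whitespace-separated line of the form "name latency stddev loss" (a layout not spelled out in this report), and is_zombie_candidate is a hypothetical helper name. A match only marks a candidate, since a real outage produces the same numbers.

```sh
#!/bin/sh
# Assumption: a status reply looks like "<name> <latency> <stddev> <loss>".
# is_zombie_candidate is a hypothetical helper, not part of dpinger or pfSense.
#
# Exit status: 0 = zero latency with 100% loss, 1 = anything else,
#              2 = reply does not have the assumed four fields.
is_zombie_candidate() {
    line=$1
    set -f              # disable globbing while splitting
    set -- $line        # split the reply on whitespace into $1..$4
    set +f
    [ "$#" -eq 4 ] || return 2
    [ "$2" = 0 ] && [ "$3" = 0 ] && [ "$4" = 100 ]
}
```

On the firewall the reply could be captured with something like timeout 5 nc -U against one of the /var/run/dpinger_*.sock sockets; that invocation is an assumption here, not taken from the attached script.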
Root cause
The dpinger daemon itself appears correct. Its send_thread handles sendto errors by logging them and continuing -- it does not exit on EHOSTUNREACH or any other transient error. The threads run in while(1) loops and main() blocks on pthread_join of the last thread created (the usocket_thread in pfSense's invocation).
The bug is in pfSense's gateway / IPsec interaction:
- An IPsec gateway briefly fails monitoring (real packet loss, alarm,
or DPD event). dpinger fires /etc/rc.gateway_alarm.
- rc.gateway_alarm calls pfSctl -c "service reload ipsec ${GW}".
- That ultimately invokes ipsec_configure() in /etc/inc/ipsec.inc.
- ipsec_configure() calls ipsec_setup_gwifs(), which calls
interface_ipsec_vti_configure() in /etc/inc/interfaces.inc.
- That function unconditionally destroys and recreates each VTI
interface:
  if (does_interface_exist($ipsecif)) {
      mwexec("/sbin/ifconfig " . escapeshellarg($ipsecif) . " destroy");
  }
  mwexec("/sbin/ifconfig " . escapeshellarg($ipsecif) . " create reqid ...");
- The dpinger process for that gateway is bound to an IP on the
now-destroyed interface. Its raw ICMP socket is left in a broken state.
When the new interface is created with the same IP, the old fd does
not transparently re-bind. dpinger keeps logging sendto error: 65
(EHOSTUNREACH) and reports 100% loss forever.
- After ipsec_configure() completes, no call to setup_gateways_monitor()
is made. The dpinger processes are never restarted, even though the
interfaces they were monitoring were torn down and recreated
underneath them.
The "process missing" and "process hung" modes appear to be downstream
consequences of the same destroy/recreate cycle (e.g. the process being
SIGTERMed during cleanup but the respawn step being skipped, or the
process exiting due to socket state dpinger does not handle).
Reproduction
Trigger packet loss on an IPsec VTI gateway sufficient to cause an
alarm. After the IPsec reload completes, observe that dpinger for that
gateway reports 100% loss permanently, while a manual ping using the
same -S bind_addr monitor_addr parameters succeeds.
Suggested fix (upstream)
Either of:
- Add setup_gateways_monitor() to the end of ipsec_configure() in
/etc/inc/ipsec.inc (or to /etc/rc.ipsec after the ipsec_configure() call).
- Avoid the unconditional ifconfig destroy in
interface_ipsec_vti_configure() when the interface configuration
is unchanged.
The first is the smaller, lower-risk change.
Workaround
A shell script (see attached) run from cron once a minute detects all three failure
modes and restarts gateway monitoring:
- For each gateway pfSense expects to monitor, check that a dpinger
socket and process exist. If not, flag as missing.
- For each socket that does exist, query it with a 5-second timeout.
If the query fails or returns empty, flag as hung.
- If a socket returns 100% loss, manually probe the same monitor_addr
from the same bind_addr with ping -c 4. If pings succeed, flag as
zombie. If pings also fail, the outage is real and dpinger is
correct -- leave alone.
If anything is flagged, run setup_gateways_monitor() via PHP, which
cleanly stops and respawns all dpinger processes.
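The decision logic of the three checks can be sketched as a single function. This is not the attached script: classify_gateway and its yes/no arguments are hypothetical, the observations (process running, socket present, socket reply, ping result) are assumed to be gathered elsewhere, and the reply is assumed to end in the loss percentage as its last whitespace-separated field.

```sh
#!/bin/sh
# Hypothetical decision helper; not the attached script.
# classify_gateway PROC_RUNNING SOCKET_PRESENT REPLY PING_OK
#   PROC_RUNNING, SOCKET_PRESENT, PING_OK: "yes" or "no"
#   REPLY: status-socket reply, empty if the query failed or timed out
#          (assumed layout: "<name> <latency> <stddev> <loss>")
# Prints one of: missing, hung, zombie, outage, ok
classify_gateway() {
    proc=$1
    sock=$2
    reply=$3
    ping_ok=$4

    # Check 1: process or status socket is absent.
    if [ "$proc" != yes ] || [ "$sock" != yes ]; then
        echo missing
        return
    fi

    # Check 2: socket exists but produced no answer within the timeout.
    if [ -z "$reply" ]; then
        echo hung
        return
    fi

    # Check 3: dpinger reports 100% loss; believe it only if ping fails too.
    loss=${reply##* }    # last space-separated field of the reply
    if [ "$loss" = 100 ]; then
        if [ "$ping_ok" = yes ]; then
            echo zombie
        else
            echo outage
        fi
        return
    fi

    echo ok
}
```

Under this sketch, a result of missing, hung, or zombie for any gateway would trigger the restart, while outage and ok leave dpinger alone.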
This has been running on pfSense 2.8.1 against an environment with
four IPsec VTI gateways and one DHCP WAN gateway. It correctly catches
the zombie case (most common in this environment), avoids restarting
during real outages, and runs to completion in well under a second
when nothing is wrong.
Related observations
- The sendto error: 65 log spam is a useful early indicator but
does not by itself cause the hang.
- "exiting on signal 15" entries in syslog correlate with the
process-missing case and confirm something external sends SIGTERM
but no respawn follows.
- The existing pfSense Service Watchdog package does not catch this
because the dpinger service entry represents the collection of
dpinger processes, not individual ones -- when even one is running
the service is considered up.
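Since the sendto error: 65 messages are an early indicator, counting them over a log excerpt gives a cheap signal to watch. A minimal sketch follows; count_sendto_65 is a hypothetical helper that reads log text on standard input, and which log file to feed it is left to the reader because the report does not name one.

```sh
#!/bin/sh
# Hypothetical helper: count log lines containing "sendto error: 65".
# Reads log text from standard input and prints the number of matching lines.
count_sendto_65() {
    # grep -c prints the match count; it exits non-zero when the count is 0,
    # so the status is masked to keep callers running under "set -e" alive.
    grep -c 'sendto error: 65' || true
}
```

A rising count around an IPsec reload would be consistent with the zombie mode described above, though on its own it does not show that monitoring has stalled.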
Files
Updated by Marcos M about 1 month ago
- Status changed from New to Not a Bug
This works as expected in tests with 26.03-RELEASE. Even with the VTI being recreated the existing dpinger socket was still valid and monitoring continued after the tunnel re-initiated.
Updated by Chris Baker about 1 month ago
Marcos M wrote in #note-1:
This works as expected in tests with 26.03-RELEASE. Even with the VTI being recreated the existing dpinger socket was still valid and monitoring continued after the tunnel re-initiated.
Thanks for testing this. Before the ticket is closed I'd like to lay out a couple of points — the technical findings here are pretty firm, but I don't want to overreach on the version-comparison side.
Version difference
The ticket was filed against CE 2.8.1 (FreeBSD 15.0-CURRENT, released 2025-09-04). 26.03-RELEASE is pfSense Plus on FreeBSD 16.0-CURRENT, released 2026-04-01. I don't know how aligned the relevant IPsec/gateway code is between the two branches, so I can't say with certainty whether a clean result on 26.03 should carry over to 2.8.1. Could you confirm whether the 26.03 result was expected to apply to 2.8.1, and whether "Fixed" (with target version) might be a more accurate disposition than "Not a Bug" if the two branches differ here? I'd just like to avoid the CE branch being closed off from a potential backport.
The "dpinger socket was still valid" observation
I want to flag this carefully, because I think the original report may not have made the symptom clear enough. dpinger has two sockets that behave very differently:
- The Unix-domain status socket (/var/run/dpinger_*.sock) served by usocket_thread. Lives on local disk, not affected by the VTI being destroyed.
- The raw ICMP send/recv sockets, bound via -B to an IP on the VTI interface. These are what actually probe the gateway.
The dominant failure mode on 2.8.1 here is that the Unix status socket keeps responding with fresh-looking data, returning latency=0 stddev=0 loss=100, while a manual ping -c 4 -S bind_addr monitor_addr from the same firewall succeeds with normal latency. The status socket being valid does not contradict the bug — the bug is precisely that the status socket remains valid and continues reporting while the underlying ICMP path is dead. Running setup_gateways_monitor() immediately restores monitoring.
There are also two less common modes: the process gone entirely ("exiting on signal 15" in syslog around an IPsec reload, no respawn), and the Unix socket itself becoming unresponsive. All three have been observed on this firewall.
Reproducibility on 2.8.1
A watchdog has been running here that, when dpinger reports 100% loss, probes ping -c 4 -S bind_addr monitor_addr and only flags the gateway if those pings succeed (i.e. dpinger is wrong). It has triggered repeatedly on CE 2.8.1, each trigger correlating with sendto error: 65 bursts in syslog around IPsec reload events. Happy to attach watchdog logs and a syslog excerpt.
Would you be willing to reopen pending another look on 2.8.1, or let me know what specifically to test there to either confirm or rule this out?