Bug #16824
dpinger gateway monitoring fails after IPsec VTI reload
Description
Environment
- pfSense 2.8.1-RELEASE on FreeBSD 15.0
- dpinger 3.3 (FreeBSD port, built 2025-05-22)
- Multiple IPsec VTI tunnels with gateways monitored by dpinger
- Standard pfSense gateway monitoring (no custom configuration)
Symptoms
Gateways monitored over IPsec VTI interfaces eventually stop reporting status.
The pfSense UI shows them as "Pending" or "Unknown" indefinitely. The only
remediation through normal channels is to restart dpinger
(Status > Services, or setup_gateways_monitor() via PHP shell).
The failure presents in three distinct modes:
- Process missing: the dpinger process is gone entirely. No process, no PID
  file, no Unix status socket on disk. pfSense never respawns it.
- Process hung: the dpinger process is running and the Unix status socket
  exists, but querying the socket times out (the usocket_thread is dead
  or blocked).
- Process zombie: the dpinger process is running and the status socket
  responds, but it reports latency=0 stddev=0 loss=100. A manual ICMP
  ping with the exact same -S bind_addr monitor_addr parameters that
  dpinger uses succeeds with normal latency. The send/recv threads are
  orphaned -- they hold file descriptors that are no longer wired to a
  live interface.
The third mode is the most insidious because every external indicator
(process, socket, ICMP reachability) appears healthy.
Root cause
The dpinger daemon itself appears correct. Its send_thread handles
sendto errors by logging them and continuing -- it does not exit on
EHOSTUNREACH or any other transient error. The threads run in
while(1) loops, and main() blocks on pthread_join of the last
thread created (the usocket_thread in pfSense's invocation).
The bug is in pfSense's gateway / IPsec interaction:
- An IPsec gateway briefly fails monitoring (real packet loss, alarm,
  or DPD event). dpinger fires /etc/rc.gateway_alarm.
- rc.gateway_alarm calls pfSctl -c "service reload ipsec ${GW}".
- That ultimately invokes ipsec_configure() in /etc/inc/ipsec.inc.
- ipsec_configure() calls ipsec_setup_gwifs(), which calls
  interface_ipsec_vti_configure() in /etc/inc/interfaces.inc.
- That function unconditionally destroys and recreates each VTI
  interface:
    if (does_interface_exist($ipsecif)) {
        mwexec("/sbin/ifconfig " . escapeshellarg($ipsecif) . " destroy");
    }
    mwexec("/sbin/ifconfig " . escapeshellarg($ipsecif) . " create reqid ...");
- The dpinger process for that gateway is bound to an IP on the
  now-destroyed interface. Its raw ICMP socket is left in a broken state.
  When the new interface is created with the same IP, the old fd does
  not transparently re-bind. dpinger keeps logging
  sendto error: 65 (EHOSTUNREACH) and reports 100% loss forever.
- After ipsec_configure() completes, no call to setup_gateways_monitor()
  is made. The dpinger processes are never restarted, even though the
  interfaces they were monitoring were torn down and recreated
  underneath them.
The "process missing" and "process hung" modes appear to be downstream
consequences of the same destroy/recreate cycle (e.g. the process being
SIGTERMed during cleanup but the respawn step being skipped, or the
process exiting due to socket state dpinger does not handle).
Reproduction
Trigger packet loss on an IPsec VTI gateway sufficient to cause an
alarm. After the IPsec reload completes, observe that dpinger for that
gateway reports 100% loss permanently, while a manual ping using the
same -S bind_addr monitor_addr parameters succeeds.
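The "permanently" part of this check can be made mechanical. The sketch below is illustrative only: the function name loss_is_persistent is hypothetical, and it assumes each dpinger status sample is a single line whose last field is the loss percentage (for example "<name> <latency_us> <stddev_us> <loss_pct>"), which this report does not confirm.

```sh
# Reads dpinger status samples on stdin, one sample per line.
# ASSUMPTION: each line looks like "<name> <latency_us> <stddev_us> <loss_pct>".
# Exits 0 only if at least one sample was read and every sample shows 100% loss.
loss_is_persistent() {
    awk 'NF >= 4 { n++; if ($NF != 100) bad = 1 }
         END { exit (n > 0 && !bad) ? 0 : 1 }'
}
```

Feeding it several samples collected after the IPsec reload, and then running the manual ping with the same -S bind_addr monitor_addr parameters, separates a stale monitor (persistent loss, ping succeeds) from a real outage (persistent loss, ping fails).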
Suggested fix (upstream)
Either of:
- Add setup_gateways_monitor() to the end of ipsec_configure() in
  /etc/inc/ipsec.inc (or to /etc/rc.ipsec after the ipsec_configure()
  call).
- Avoid the unconditional ifconfig destroy in
  interface_ipsec_vti_configure() when the interface configuration
  is unchanged.
The first is the smaller, lower-risk change.
Workaround
A shell script (see attached) run from cron once a minute detects all three failure
modes and restarts gateway monitoring:
- For each gateway pfSense expects to monitor, check that a dpinger
  socket and process exist. If not, flag as missing.
- For each socket that does exist, query it with a 5-second timeout.
  If the query fails or returns empty, flag as hung.
- If a socket returns 100% loss, manually probe the same monitor_addr
  from the same bind_addr with ping -c 4. If pings succeed, flag as
  zombie. If pings also fail, the outage is real and dpinger is
  correct -- leave alone.
If anything is flagged, run setup_gateways_monitor() via PHP, which
cleanly stops and respawns all dpinger processes.
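The decision logic of the three checks can be sketched as a pure function. This is not the attached script: the names classify_gateway and needs_restart are hypothetical, and the status line is assumed to end with the loss percentage as its last whitespace-separated field.

```sh
# Classify one gateway from three observations gathered by the caller.
# $1 = 1 if the dpinger process and status socket both exist, else 0
# $2 = text returned by the socket query ("" on timeout or empty reply)
# $3 = exit status of the manual ping (0 = replies received)
# ASSUMPTION: the last field of the status line is the loss percentage.
classify_gateway() {
    if [ "$1" -ne 1 ]; then echo missing; return; fi
    if [ -z "$2" ]; then echo hung; return; fi
    loss=$(printf '%s\n' "$2" | awk '{ print $NF }')
    if [ "$loss" != "100" ]; then echo ok; return; fi
    # dpinger reports total loss: a working manual ping means dpinger is stale.
    if [ "$3" -eq 0 ]; then echo zombie; else echo outage; fi
}

# Succeeds only for the states that call for restarting gateway monitoring.
needs_restart() {
    case "$1" in
        missing|hung|zombie) return 0 ;;
        *) return 1 ;;
    esac
}
```

Keeping "outage" separate from "zombie" is what prevents restarts during real outages: only missing, hung, and zombie lead to the setup_gateways_monitor() call.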
This has been running on pfSense 2.8.1 against an environment with
four IPsec VTI gateways and one DHCP WAN gateway. It correctly catches
the zombie case (most common in this environment), avoids restarting
during real outages, and runs to completion in well under a second
when nothing is wrong.
Related observations
- The sendto error: 65 log spam is a useful early indicator but
  does not by itself cause the hang.
- "exiting on signal 15" entries in syslog correlate with the
  process-missing case and confirm something external sends SIGTERM
  but no respawn follows.
- The existing pfSense Service Watchdog package does not catch this
  because the dpinger service entry represents the collection of
  dpinger processes, not individual ones -- when even one is running
  the service is considered up.
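Both log indicators can be counted with a small helper. This is a sketch under assumptions: the function names are hypothetical, the log file path is supplied by the caller (which file dpinger logs to is not stated in this report), and the two search strings are the ones quoted above.

```sh
# Count lines in a log file that contain a fixed string.
# $1 = log file path, $2 = string to search for
count_log_matches() {
    [ -r "$1" ] || { echo 0; return; }
    # grep -c prints 0 and exits non-zero when nothing matches.
    grep -F -c -- "$2" "$1" || true
}

# Print one counter per indicator described in this report.
summarize_dpinger_log() {
    echo "sendto_error_65=$(count_log_matches "$1" 'sendto error: 65')"
    echo "sigterm_exits=$(count_log_matches "$1" 'exiting on signal 15')"
}
```

A rising sendto_error_65 count with no matching restart points at the zombie case. A non-zero sigterm_exits count with no dpinger process running afterwards points at the process-missing case.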