Bug #3692
closed
apinger loss % gets stuck
Added by Jeroen van Gelderen over 10 years ago.
Updated almost 10 years ago.
Affected Architecture:
All
Description
I have noticed on multiple (5) independent pfSense installs/locations/ISPs (both active/passive and single-node, x86 and x64) running the pfSense 2.1.x series that the apinger packet loss percentage has a tendency to get "stuck" on a particular baseline value after a high-loss episode.
I have attached two RRD graphs from one router for illustration. As you can see in the attached monthly graph this problem seemingly gets triggered by a real episode of high packet loss but instead of recovering to a baseline of zero percent in this particular example the baseline stays at 6% forever and any actual packet loss is added to that 6%.
Restarting apinger solves the problem and resets the baseline to 0%.
The incorrect packet-loss percentage is reflected both in the RRD graphs as well as the Gateways dashboard widget.
Files
I noticed this yesterday. For a period of time I had a bad episode of packetloss on a WAN gateway and even though the issue was resolved a long time ago it is still indicating there is a problem even more than 24 hours later:
Warning, Packetloss: 82%
I'm having the same issue with 2.1.4.
I have to restart the entire pfsense box to correct it, just restarting the apinger service doesn't help.
Packet loss seems to get stuck at different values, I have no idea what's causing it but there isn't anywhere near that amount of packet loss on my line, which is confirmed after a restart when it drops to 0 again.
Attached is what my rrd history looks like.
I just got bit by this again running 2.1.4. For me, it happens every few weeks. and is always associated with an elevated latency event.
Has anyone looked at this?
i have the same problem on two pfsense machines.
Just got bit with this again. Different symptoms this time... Gateway status (home page) showed 102% loss. RRD graphs showed nan. Restart of apinger service corrected the issue.
And again. Is there any diagnostic information that we can gather to help with this???
I have one location that consistently has this problem if you need us to help you test something. Since the packet loss percent never resets this makes failover 1000% non-functional because interface is eliminated as a failover candidate because of the fake packet loss level. This caused an outage for our client last week because failover didn't work when the other interface REALLY went down.
The 184% packet loss in the attached image accumulated over just 12 hours.
Sorry. This was happening in 2.1.3 and after an upgrade to 2.1.4 it is still happening.
Since upgrading to 2.1.5, I've also had two instances of apinger getting stuck on elevated latency rather than loss. Two successive days. Interestingly, both occurrences began at exactly the same time of day. The first was cured by restarting the firewall (I did not try just restarting apinger) and the second by restarting apinger.
Apinger is such an important service and appears to be growing non-functional. Is there anything we can do?
Denny Page wrote:
Apinger is such an important service and appears to be growing non-functional. Is there anything we can do?
Make it no-op. An excerpt from today:
Sep 2 06:46:25 php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use HEIPV6_TUNNELV6.
Sep 2 06:46:24 php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Sep 2 06:46:17 check_reload_status: Reloading filter
Sep 2 06:46:17 check_reload_status: Restarting OpenVPN tunnels/interfaces
Sep 2 06:46:17 check_reload_status: Restarting ipsec tunnels
Sep 2 06:46:17 check_reload_status: updating dyndns HEIPV6_TUNNELV6
Sep 2 06:46:07 apinger: alarm canceled: HEIPV6_TUNNELV6(2001:470:xx:xx::1) *** delay ***
Sep 2 06:45:47 apinger: alarm canceled: WANGW(37.xx.xx.xx) *** delay ***
Sep 2 06:45:14 apinger: ALARM: WANGW(37.xx.xx.xx) *** delay ***
Sep 2 06:45:10 apinger: ALARM: HEIPV6_TUNNELV6(2001:470:xx:xx::1) *** delay ***
Sorry, all the IPs are static on both sides. None of the endpoints may have changed its IP
Summary
- generally broken, showing nonsensical >100% values for packet loss
- getting stuck randomly
- causing totally useless services restart
And noone from devs can reproduce? Gimme a break.
Please update latest version of 2.2 of rebuilt apinger manually and retry.
- Target version set to 2.2
- Status changed from New to Feedback
Please try again with latest snapshots.
People have confirmed that the behaviour is improved.
Only the graph part needs improvement.
- Status changed from Feedback to Resolved
seems this has been resolved. I haven't been able to replicate the circumstances here since Ermal's last round of fixes and the "please fix apinger" thread ended weeks ago with resolutions after Ermal's fixes, or the problem being a screwed up timecounter in a VM.
I hate to say it but in a new pfsense 2.2 installation (with two wan load balancing and high availability) I have now 2190% packet loss on second wan (and usually more than 100% on first wan).
this is back for me and I dont't know why this suddenly showed up and why restarting apinger no longer fixes this. also seeing ping times of 0.1 ms too
Also available in: Atom
PDF