Project

General

Profile

Actions

Bug #3692

closed

apinger loss % gets stuck

Added by Jeroen van Gelderen over 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Gateways
Target version:
Start date:
06/04/2014
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Affected Version:
2.1.x
Affected Architecture:
All

Description

I have noticed on multiple (5) independent pfSense installs/locations/ISPs (both active/passive and single-node, x86 and x64) running the pfSense 2.1.x series that the apinger packet loss percentage has a tendency to get "stuck" on a particular baseline value after a high-loss episode.

I have attached two RRD graphs from one router for illustration. As you can see in the attached monthly graph this problem seemingly gets triggered by a real episode of high packet loss but instead of recovering to a baseline of zero percent in this particular example the baseline stays at 6% forever and any actual packet loss is added to that 6%.

Restarting apinger solves the problem and resets the baseline to 0%.

The incorrect packet-loss percentage is reflected both in the RRD graphs as well as the Gateways dashboard widget.


Files

apinger-8hrs.png (23.5 KB) apinger-8hrs.png Jeroen van Gelderen, 06/04/2014 04:43 PM
apinger-1mo.png (27.6 KB) apinger-1mo.png Jeroen van Gelderen, 06/04/2014 04:44 PM
Screenshot 2014-07-05 11.01.11.png (38 KB) Screenshot 2014-07-05 11.01.11.png Anonymous, 07/05/2014 03:57 AM
Beyond-100-percent-packet-loss.png (40.2 KB) Beyond-100-percent-packet-loss.png Sam E, 08/27/2014 08:37 AM
Actions #2

Updated by Jason Ross over 10 years ago

I noticed this yesterday. For a period of time I had a bad episode of packetloss on a WAN gateway and even though the issue was resolved a long time ago it is still indicating there is a problem even more than 24 hours later:

Warning, Packetloss: 82%

Actions #3

Updated by Anonymous over 10 years ago

I'm having the same issue with 2.1.4.

I have to restart the entire pfsense box to correct it, just restarting the apinger service doesn't help.

Packet loss seems to get stuck at different values, I have no idea what's causing it but there isn't anywhere near that amount of packet loss on my line, which is confirmed after a restart when it drops to 0 again.

Attached is what my rrd history looks like.

Actions #4

Updated by Denny Page over 10 years ago

I just got bit by this again running 2.1.4. For me, it happens every few weeks. and is always associated with an elevated latency event.

Has anyone looked at this?

Actions #5

Updated by Ralf Hauptmann over 10 years ago

i have the same problem on two pfsense machines.

Actions #6

Updated by Denny Page over 10 years ago

Just got bit with this again. Different symptoms this time... Gateway status (home page) showed 102% loss. RRD graphs showed nan. Restart of apinger service corrected the issue.

Actions #7

Updated by Denny Page over 10 years ago

And again. Is there any diagnostic information that we can gather to help with this???

Actions #8

Updated by Sam E about 10 years ago

I have one location that consistently has this problem if you need us to help you test something. Since the packet loss percent never resets this makes failover 1000% non-functional because interface is eliminated as a failover candidate because of the fake packet loss level. This caused an outage for our client last week because failover didn't work when the other interface REALLY went down.

The 184% packet loss in the attached image accumulated over just 12 hours.

Actions #9

Updated by Sam E about 10 years ago

Sorry. This was happening in 2.1.3 and after an upgrade to 2.1.4 it is still happening.

Actions #10

Updated by Denny Page about 10 years ago

Since upgrading to 2.1.5, I've also had two instances of apinger getting stuck on elevated latency rather than loss. Two successive days. Interestingly, both occurrences began at exactly the same time of day. The first was cured by restarting the firewall (I did not try just restarting apinger) and the second by restarting apinger.

Apinger is such an important service and appears to be growing non-functional. Is there anything we can do?

Actions #11

Updated by Doktor Notor about 10 years ago

Denny Page wrote:

Apinger is such an important service and appears to be growing non-functional. Is there anything we can do?

Make it no-op. An excerpt from today:

Sep 2 06:46:25    php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use HEIPV6_TUNNELV6.
Sep 2 06:46:24    php: rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
Sep 2 06:46:17    check_reload_status: Reloading filter
Sep 2 06:46:17    check_reload_status: Restarting OpenVPN tunnels/interfaces
Sep 2 06:46:17    check_reload_status: Restarting ipsec tunnels
Sep 2 06:46:17    check_reload_status: updating dyndns HEIPV6_TUNNELV6
Sep 2 06:46:07    apinger: alarm canceled: HEIPV6_TUNNELV6(2001:470:xx:xx::1) *** delay ***
Sep 2 06:45:47    apinger: alarm canceled: WANGW(37.xx.xx.xx) *** delay ***
Sep 2 06:45:14    apinger: ALARM: WANGW(37.xx.xx.xx) *** delay ***
Sep 2 06:45:10    apinger: ALARM: HEIPV6_TUNNELV6(2001:470:xx:xx::1) *** delay ***

Sorry, all the IPs are static on both sides. None of the endpoints may have changed its IP

Summary
- generally broken, showing nonsensical >100% values for packet loss
- getting stuck randomly
- causing totally useless services restart

And noone from devs can reproduce? Gimme a break.

Actions #12

Updated by Ermal Luçi about 10 years ago

Please update latest version of 2.2 of rebuilt apinger manually and retry.

Actions #13

Updated by Ermal Luçi about 10 years ago

  • Target version set to 2.2
Actions #14

Updated by Ermal Luçi about 10 years ago

  • Status changed from New to Feedback

Please try again with latest snapshots.

Actions #16

Updated by Ermal Luçi about 10 years ago

People have confirmed that the behaviour is improved.
Only the graph part needs improvement.

Actions #17

Updated by Chris Buechler about 10 years ago

  • Status changed from Feedback to Resolved

seems this has been resolved. I haven't been able to replicate the circumstances here since Ermal's last round of fixes and the "please fix apinger" thread ended weeks ago with resolutions after Ermal's fixes, or the problem being a screwed up timecounter in a VM.

Actions #18

Updated by Mario Giammarco almost 10 years ago

I hate to say it but in a new pfsense 2.2 installation (with two wan load balancing and high availability) I have now 2190% packet loss on second wan (and usually more than 100% on first wan).

Actions #19

Updated by Michael Kellogg almost 10 years ago

this is back for me and I dont't know why this suddenly showed up and why restarting apinger no longer fixes this. also seeing ping times of 0.1 ms too

Actions

Also available in: Atom PDF