Bug #3191

closed

Quality RRD inaccuracies and failure to update status in some circumstances

Added by Chris Buechler over 10 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Category:
Gateway Monitoring
Target version:
Start date:
09/09/2013
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Affected Version:
All
Affected Architecture:

Description

There are some circumstances in which apinger writes incorrect quality RRD data and fails to notice downtime. The details still need to be pinned down, but here are known examples of circumstances that fail.

1) Use a monitor IP with 80% packet loss and "Down" at the default of 10: apinger detects no packet loss and the graph shows none.
2) Use a monitor IP with 80% packet loss and "Down" at 30: the gateway is taken down appropriately, but the RRD data is wrong, showing 100% loss much of the time.
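As a reference point for what the graphs should show in both cases, here is a minimal sketch of a sliding-window loss calculation. It is not apinger's algorithm; the `LossWindow` class and the window size of 10 are hypothetical and chosen only for illustration (the window size is not the "Down" setting).

```python
from collections import deque


class LossWindow:
    """Hypothetical helper: tracks the last `size` probe results."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        # deque with maxlen silently discards the oldest result when full
        self._results = deque(maxlen=size)

    def record(self, replied: bool) -> None:
        """Store one probe result: True if a reply arrived, False if lost."""
        self._results.append(replied)

    def loss_pct(self) -> float:
        """Percentage of probes in the window that got no reply."""
        if not self._results:
            return 0.0
        lost = sum(1 for replied in self._results if not replied)
        return 100.0 * lost / len(self._results)


# Simulate a monitor IP that answers only one probe in five (80% loss).
window = LossWindow(size=10)
for i in range(50):
    window.record(i % 5 == 0)

print(window.loss_pct())  # 80.0
```

With a steady 80% drop rate, a correct monitor should settle near 80% loss, rather than the 0% or 100% described above.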

There may be other specific circumstances that need to be added here.

This is true across all versions that have ever used apinger.

Actions #1

Updated by Ermal Luçi about 10 years ago

  • Target version changed from 2.1.1 to 2.2

I am not sure this is something to be fixed for 2.1.1, so moving it to 2.2.

Actions #2

Updated by Jim Thompson over 9 years ago

  • Assignee set to Chris Buechler

Need more info on 'when' this happens ('why' would be great).

Actions #3

Updated by Chris Buechler over 9 years ago

  • Assignee changed from Chris Buechler to Ermal Luçi

There are a few descriptions of problems in tickets in Kayako under the apinger-badstats tag.

Actions #4

Updated by Ermal Luçi over 9 years ago

  • Status changed from New to Feedback

Patched apinger; need feedback on whether the issue is solved now.

Actions #5

Updated by Ermal Luçi over 9 years ago

For the record, a patch to properly recover from disconnected sockets has been put in.

Actions #6

Updated by Chris Buechler over 9 years ago

  • Status changed from Feedback to New

The first issue as noted originally is still a problem as described. Throw a limiter on an upstream system that drops 80% of the monitor pings, and the packet loss and latency reported by apinger are completely wrong. It reports somewhere around 3 ms with 0% loss, when in reality a pcap read back with tshark shows:

ICMP Service Response Time (SRT) Statistics (all times in ms):
Filter: ip.addr==192.0.2.202

Requests  Replies   Lost      % Loss
38        9         29         76.3%

Minimum   Maximum   Mean      Median    SDeviation     Min Frame Max Frame
90.927    94.974    92.275    92.089    1.379          54        34        

Its pings are getting replies in ~91-95 ms, with a mean of about 92 ms (not the 3 ms it's showing), and the loss doesn't show up at all.
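The loss figure in the tshark output can be checked by hand from the request and reply counts. A minimal sketch follows; the function name `icmp_loss_summary` is hypothetical and is not part of tshark or apinger.

```python
def icmp_loss_summary(requests: int, replies: int) -> dict:
    """Derive lost-probe count and loss percentage from raw counters."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    if not 0 <= replies <= requests:
        raise ValueError("replies must be between 0 and requests")
    lost = requests - replies
    # Round to one decimal place, matching the precision tshark prints.
    return {"lost": lost, "loss_pct": round(100.0 * lost / requests, 1)}


# Counters taken from the capture above: 38 requests, 9 replies.
print(icmp_loss_summary(38, 9))  # {'lost': 29, 'loss_pct': 76.3}
```

This agrees with the 76.3% in the capture, which is close to the 80% drop rate configured on the limiter and far from the 0% apinger reported.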

Change "Down" to 30, and the RRD data is correct for loss, but not latency.

Actions #8

Updated by Chris Buechler over 9 years ago

  • Assignee changed from Ermal Luçi to Chris Buechler

Assigning to me to re-test.

Actions #9

Updated by Chris Buechler over 9 years ago

  • Status changed from New to Feedback
  • Target version changed from 2.2 to 2.3

Things are much better with apinger in general after fixes in the past 1-2 months. I can still replicate some issues here, but it's unlikely they're ones that would occur in the real world. Given that things are better than in current stable releases as is, and touching apinger is fraught with peril, I'm pushing this out to 2.3 for feedback and further review.

Actions #10

Updated by Michael Kellogg over 8 years ago

This is still an issue on troublesome connections.

Actions #11

Updated by Chris Buechler over 8 years ago

  • Category changed from Gateways to Gateway Monitoring
  • Status changed from Feedback to Confirmed

Actions #12

Updated by Chris Buechler over 8 years ago

  • Status changed from Confirmed to Resolved

This was resolved by replacing apinger.
