Bug #4081
closed
Apinger reporting incorrect latency
Added by Denny Page almost 10 years ago.
Updated almost 9 years ago.
Category:
Gateway Monitoring
Affected Architecture:
All
Description
If a gateway has an explicit monitor address, apinger will stop reporting latency to the monitor address and switch to reporting latency for the gateway address. Very obvious if the gateway address is local (~ 1ms) and the monitor address is across a remote link (~10ms). Once the switch occurs, apinger appears to remain stuck reporting the lower value.
It's notable that (hours) after the event, the pings are going to the correct (monitor) address even though the latency being reported corresponds to the local gateway address.
It's unclear what the trigger for this is. I haven't caught the event live yet. I have a tcpdump running in hopes of catching the event live.
Files
I have a bare metal box that I believe this or something related is happing chris has access info if any of the other devs want to look at it
- Category set to Gateway Monitoring
- Status changed from New to Confirmed
it's not reporting latency to the gateway, its calculations become wrong under some circumstance.
also affects rrd graph at same time
- Target version set to 2.2.1
with this about 20% of the time causes a mail storm makes the box inaccessible from webgui a reset of web configurator doesn't help it seems to require a reboot to recover
could this issue be aggravated by using google dns as monitor addresses as they are anycast?
I had this problem on a clean plain install of 2.2 using a cable modem DHCP WAN with no explicitly set or override of the gateway monitor.
When it occurs, the RTT value drops to about 10-20% of the actual value but continues to follow the actual value proportionally.
I was able to "fix" it by rebooting the cable modem or disconnecting and reconnecting the WAN interface cable. I couldn't correlate anything with the trigger.
I've attached RDD graphs of it occurring and the "fix"
- Target version changed from 2.2.1 to 2.2.2
- Target version changed from 2.2.2 to 2.2.3
I have this bug here. It screws up our Multi-WAN setup. Ideally we should use losses/highping to switch gateways, but as these data are almost always wrong/surreal, we disabled it. The only temporary fix I've seen is restarting apinger service every 5-10 minutes :/
I must add that this always happened, with 2.1.4, 2.1.5, 2.2.0, 2.2.1 and 2.2.2. In a XEN or VMWare environment.
I never saw this happen on several 2.1.5 installs but it happened on every (3) 2.2.0 install I tried.
I'll have a test machine I can connect to a cable modem next week in hopes to track this problem down. I've been reviewing the code and have some things to look for when it gets into this state. I need reliable latency and loss tracking so this is a high priority bug for me.
Sending ping #1584 to GWPDP (200.160.6.214)
Recently lost packets: 0
Sending ping #3607 to GWEBT (200.230.251.210)
Recently lost packets: 0
Polling, timeout: 0.999s
#3607 from GWEBT delay: 52.327ms/43.812ms/496.876ms received = 3360
(avg: 49.688ms)
(avg. loss: 0.0%)
Polling, timeout: 0.947s
#1584 from GWPDP delay: 67.941ms/65.886ms/658.732ms received = 1577
(avg: 65.873ms)
(avg. loss: 0.0%)
Polling, timeout: 0.931s
Sending ping #1585 to GWPDP (200.160.6.214)
Recently lost packets: 0
Sending ping #3608 to GWEBT (200.230.251.210)
Recently lost packets:* 1*
Why did it increment GWEBT loss from 0 to 1 if it received the ping sucessfully ?
- Status changed from Confirmed to Needs Patch
- Target version deleted (
2.2.3)
apinger is being replaced in 2.3, which will resolve outstanding issues here.
this does not show up to be tracked for 2.3
Only workaround is cron job every 3-5 minutes or so when is apinger update version gonna be released any idea?
For me, apinger suddenly goes from 5ms to 50ms.
Restarting it recovers.
- Status changed from Needs Patch to Resolved
- Target version set to 2.3
- Affected Version changed from 2.2 to All
fixed in 2.3 by replacing apinger
Thank you, Denny. Appreciate the help getting rid of our apinger woes. I've been through a variety of test scenarios that used to make apinger lose its mind and fail basic math, and have yet to find any inaccuracies in dpinger output.
Also available in: Atom
PDF