Bug #1127

closed

bug in apinger halts failover and load balancing

Added by Luis Soltero almost 14 years ago. Updated almost 14 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 12/23/2010
Due date:
% Done: 0%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: All
Affected Architecture: All

Description

On pfSense 1.2.3-RELEASE, running in failover mode between WAN and OPT1, I noticed that monitoring of the pool occasionally stops after the following error appears in slbd.log:

Dec 18 01:15:53 webxaccelerator apinger: 208.67.222.222: Lost packet count mismatch (-20!=0)!
Dec 18 01:15:53 webxaccelerator apinger: 208.67.222.222: Received packets buffer: ################################################## ####################

ps aux | grep apinger shows that apinger is no longer running. This causes failover and load balancing to stop working, since no process is monitoring the interfaces.

Looking at the source of apinger.c, line 854 shows that apinger exits when this error occurs:

        if (t->recently_lost!=really_lost){
            fprintf(f," lost packet count mismatch (%i!=%i)!\n",t->recently_lost,really_lost);
            logit("%s: Lost packet count mismatch (%i!=%i)!",t->name,t->recently_lost,really_lost);
            logit("%s: Received packets buffer: %s %s\n",t->name,buf2,buf1);
            err=1;
        }
        free(buf1);
        free(buf2);
        fprintf(f,"\n");
    }
    fclose(f);
    if (err) abort();
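
To illustrate why the process disappears from the ps output, here is a minimal sketch of the same "if (err) abort();" pattern run in a child process. The helper name child_killed_by_abort is hypothetical and not part of apinger; the sketch only assumes a POSIX system. abort() raises SIGABRT, which terminates the process rather than returning to the caller.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from apinger): runs the "if (err) abort();"
 * pattern in a child process and reports how the child ended.
 * Returns 1 if the child was killed by SIGABRT, 0 if it was not,
 * and -1 if fork() or waitpid() failed. */
int child_killed_by_abort(int err)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: same control flow as the end of the quoted excerpt. */
        if (err)
            abort();
        _exit(0);
    }

    /* Parent: wait for the child and inspect its termination status. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return WIFSIGNALED(status) && WTERMSIG(status) == SIGABRT;
}
```

With err set to 1 the child dies on SIGABRT, which matches a daemon that vanishes right after logging the mismatch; with err set to 0 it exits normally.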

Patching apinger.c as follows

vmmail3# diff apinger.c apinger.c.orig
858,859c858
< t->recently_lost = really_lost = 0;
< // err=1;
---
> err=1;

prevents apinger from exiting on error. Load balancing and failover now work as expected even when the condition that triggers this error occurs:

Dec 18 20:52:37 webxaccelerator apinger: 208.67.222.222: Lost packet count mismatch (-21!=0)!
Dec 18 20:52:37 webxaccelerator apinger: 208.67.222.222: Received packets buffer: ################################################## ####################
Dec 18 21:05:55 webxaccelerator apinger: ALARM: 208.67.220.220(208.67.220.220) *** down ***
Dec 18 21:06:03 webxaccelerator apinger: alarm canceled: 208.67.220.220(208.67.220.220) *** down ***
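
The behaviour of the patched check can be sketched in isolation as follows. This is only an illustration under stated assumptions: struct target_sketch, WINDOW, count_lost and check_lost_count are hypothetical names, and the fixed-size array stands in for apinger's received-packets buffer. As in the patch above, a mismatch is logged and the counter is reset to 0 instead of setting an error flag that later triggers abort().

```c
#include <stdio.h>

#define WINDOW 8 /* hypothetical window size, chosen for the example */

/* Hypothetical, simplified stand-in for apinger's per-target state. */
struct target_sketch {
    const char *name;
    int received[WINDOW]; /* 1 = reply received, 0 = packet lost */
    int recently_lost;    /* running counter kept alongside the buffer */
};

/* Recount the lost packets directly from the buffer. */
static int count_lost(const struct target_sketch *t)
{
    int lost = 0;
    for (int i = 0; i < WINDOW; i++) {
        if (!t->received[i])
            lost++;
    }
    return lost;
}

/* Compare the running counter with the recount.
 * Returns 1 if a mismatch was found (and the counter reset), else 0.
 * The process keeps running in both cases. */
int check_lost_count(struct target_sketch *t)
{
    int really_lost = count_lost(t);

    if (t->recently_lost != really_lost) {
        fprintf(stderr, "%s: Lost packet count mismatch (%i!=%i)!\n",
                t->name, t->recently_lost, really_lost);
        /* Same recovery as the patch: reset instead of flagging an error. */
        t->recently_lost = really_lost = 0;
        return 1;
    }
    return 0;
}
```

With a buffer in which every packet was received and a counter of -20, as in the first log excerpt, the first call reports the mismatch and resets the counter; a second call then finds the state consistent.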

I have looked at this a little more closely. The version of apinger included in pfPorts seems to have the same issue: if an inconsistency is found in the number of lost packets, apinger exits. In my mind apinger should NEVER exit.

It seems that the apinger in pfPorts is used when building pfSense 2.0, while 1.2.3-RELEASE uses the FreeBSD ports version.

Following is a patch against the FreeBSD ports version of apinger that resolves my issue with failover pools halting when inconsistent packet loss is detected. I don't currently do any work with 2.0, but it would be good if one of the maintainers applied the following patch to the apinger included in pfPorts.

--- apinger.c 2010-12-21 08:47:22.000000000 +0000
+++ apinger.c.new 2010-12-21 08:47:15.000000000 +0000
@@ -787,7 +787,6 @@
     time_t tm;
     int i,qp,really_lost;
     char *buf1,*buf2;
-    int err=0;
 
     if (config->status_file==NULL) return;
 
@@ -855,7 +854,7 @@
             fprintf(f," lost packet count mismatch (%i!=%i)!\n",t->recently_lost,really_lost);
             logit("%s: Lost packet count mismatch (%i!=%i)!",t->name,t->recently_lost,really_lost);
             logit("%s: Received packets buffer: %s %s\n",t->name,buf2,buf1);
-            err=1;
+            t->recently_lost = really_lost = 0;
         }
         free(buf1);
         free(buf2);
@@ -863,7 +862,6 @@
         fprintf(f,"\n");
     }
     fclose(f);
-    if (err) abort();
 }
 
 #ifdef FORKED_RECEIVER