Bug #4031

Notifications mail bomb in some gateway failure circumstances

Added by Chris Buechler over 2 years ago. Updated 7 days ago.

Status: Confirmed
Priority: Normal
Category: Notifications
Target version:
Start date: 11/20/2014
Due date:
% Done: 0%
Affected version:
Affected Architecture:

Description

In certain gateway failure scenarios where things are flapping, a significant number of emails can be generated via notifications. On occasion, this can send dozens to 100+ emails within a few minutes. The usual duplicate notification checking doesn't accommodate this circumstance.

message_count.png - showing the # of alerts sent out (16.2 KB) Luke Hamburg, 05/19/2016 11:38 AM

gateways.png - dpinger log (517 KB) Luke Hamburg, 05/19/2016 11:38 AM

graph.png - only a small blip from dpinger on 1 gateway... (75.2 KB) Luke Hamburg, 05/19/2016 11:38 AM

email2.png - stuffed inbox! (941 KB) Luke Hamburg, 05/19/2016 11:38 AM

History

#1 Updated by Chris Buechler over 2 years ago

  • Target version changed from 2.2 to 2.2.1

This doesn't seem to be as bad as it used to be; will revisit.

#2 Updated by Chris Buechler over 2 years ago

  • Target version changed from 2.2.1 to 2.2.2

#3 Updated by Chris Buechler over 2 years ago

  • Target version changed from 2.2.2 to 2.2.3

#4 Updated by Chris Buechler about 2 years ago

  • Target version changed from 2.2.3 to 2.3

#5 Updated by Jim Thompson over 1 year ago

  • Tracker changed from Bug to Feature
  • Target version changed from 2.3 to 2.4.0

#6 Updated by Michael Kellogg over 1 year ago

This still happens to me, along with other issues, when the WAN goes down on 2.3. It does not seem related to flapping.

#7 Updated by Miikka Karhuluoma about 1 year ago

I am also experiencing this, with 2.3 and now also with 2.3.1. My absolute worst case was 1,500 emails within a couple of hours. A typical burst is 50-100 emails within 1-2 minutes, then a 10-20 minute pause, and then a new burst.

#8 Updated by Luke Hamburg about 1 year ago

I was going to post again about this as well -- 2.3.x is still doing this quite often and it's really crazy bad sometimes. Last night one of my gateways flapped a few times and I got 729 emails in a 6-minute span!

(See attached screenshots)

Is there anything that can be done? I had an idea of how to "fix" this problem, just a conceptual one (I don't know if I could actually translate it to quality PHP code) but basically:

1- choose a "sane" minimum interval (call it "min_int") between alerts (bonus points if this is user-configurable)
2- when an alert is triggered/queued, first hash the message and check it against a simple db (sqlite?) of recently sent messages. The db would have 2 columns, "hash" and "timestamp" (epoch time)
3- first thing would be simple housekeeping on the db: prune away any records where "timestamp" is < ("now" - "min_int")
4- search the db for "hash" -- if there is a matching hash found AND the diff between "now" and "timestamp" is < "min_int" from step 1, then die
5- if there is NOT a matching hash (or the timediff is >min_int) then store the timestamp and hash in the db, and queue the message to be sent
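
The five steps above can be sketched roughly as follows. This is an illustration in Python, not pfSense code and not PHP: the function names (`open_db`, `should_send`), the table name `sent`, the SHA-256 hash, and the 300-second default for `min_int` are all assumptions made for the example.

```python
import hashlib
import sqlite3
import time

MIN_INT = 300  # step 1: hypothetical minimum interval between identical alerts, in seconds


def open_db(path=":memory:"):
    """Open (or create) the small db of recently sent messages."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sent (hash TEXT PRIMARY KEY, timestamp INTEGER)"
    )
    return db


def should_send(db, message, now=None, min_int=MIN_INT):
    """Return True if the message should be queued, False if it is a recent duplicate."""
    now = int(time.time()) if now is None else now
    # Step 2: hash the message.
    digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
    # Step 3: housekeeping, prune records older than (now - min_int).
    db.execute("DELETE FROM sent WHERE timestamp < ?", (now - min_int,))
    # Step 4: a matching hash sent less than min_int ago means "drop it".
    row = db.execute(
        "SELECT timestamp FROM sent WHERE hash = ?", (digest,)
    ).fetchone()
    if row is not None and now - row[0] < min_int:
        db.commit()
        return False
    # Step 5: no recent match, so record hash and timestamp and let it through.
    db.execute(
        "INSERT OR REPLACE INTO sent (hash, timestamp) VALUES (?, ?)", (digest, now)
    )
    db.commit()
    return True
```

With this logic a flapping gateway would produce at most one email per distinct message text per `min_int`. Messages that differ only in an embedded timestamp or counter would hash differently and would still get through, so a real implementation would probably need to normalize the text before hashing.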

I know this over-simplifies things and makes it look easy but some variation of the logic above would be a great stress reliever.

#9 Updated by Nick Peelman about 1 year ago

It would be nice if something similar could be baked in for CARP notifications as well. Our relatively small HA setup can generate a few dozen emails in a couple of minutes just making VLAN changes either on the firewall(s) or the switch.

Would like to turn off email notifications completely, but there isn't an easy way to monitor CARP status via SNMP (that I have found).

#10 Updated by Luke Hamburg about 1 year ago

I noticed the target version was bumped to 2.4.0 and the assignee is still cmb — this one bit me again this morning so I just stopped by to ask if this is going to get reassigned what with Chris leaving etc. I wish I had the skills to code this fix myself. My last post has some simple logic that I believe should work if someone could translate that into PHP code.

#11 Updated by Jim Thompson about 1 year ago

  • Assignee changed from Chris Buechler to Renato Botelho

Luke Hamburg wrote:

I noticed the target version was bumped to 2.4.0 and the assignee is still cmb — this one bit me again this morning so I just stopped by to ask if this is going to get reassigned what with Chris leaving etc. I wish I had the skills to code this fix myself. My last post has some simple logic that I believe should work if someone could translate that into PHP code.

We'll look into it.

#12 Updated by Michael Kellogg about 1 year ago

I too have seen this. I shut off emails because it makes the GUI inaccessible when it starts bombing. No coding skills here, but if I can help test, let me know; I've got some real garbage internet connections here (2).

#13 Updated by Renato Botelho 11 months ago

  • Tracker changed from Feature to Bug

A check was implemented that prevents the mail notification system from sending the same message multiple times. It should be enough to fix this.

#14 Updated by Renato Botelho 11 months ago

  • Status changed from Confirmed to Feedback

#15 Updated by Luke Hamburg 11 months ago

Thank you Renato! How can we test this? Is there a commit hash you can reference?

#16 Updated by Renato Botelho 11 months ago

  • Status changed from Feedback to Confirmed

Not so fast, it was my mistake. I'll work on a proper fix.

#17 Updated by Jim Pingle 9 months ago

Looking at a customer box today made me realize that a good path here would be to queue up the notifications in a file and batch send them every few minutes asynchronously. If the SMTP notifications cannot be sent, attempting to send them blocks other functions, which are forced to wait until the attempts time out. On a unit with ~60 VIPs, it can take 15+ minutes for the system to recover from attempting notifications that all time out.

Another bonus of using a notifications queue is that repeated identical notifications could be reduced to a count, such as "Repeated X times".
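
A minimal sketch of that queue idea, in Python for illustration only. The class name `NotificationQueue` and the `send` callback are hypothetical. The queue here lives in memory rather than in a file, and the "every few minutes" timer is left out: something like a cron job would call `flush` periodically.

```python
from collections import Counter


class NotificationQueue:
    """Collects notifications and sends them as one batch, collapsing repeats."""

    def __init__(self):
        self._pending = []

    def enqueue(self, message):
        # Cheap and non-blocking: no SMTP work happens here.
        self._pending.append(message)

    def flush(self, send):
        """Send everything pending as one message via send(body).

        Returns False if there was nothing to send. If send() raises, the
        pending messages are kept so a later flush can retry them.
        """
        if not self._pending:
            return False
        # Counter keeps first-seen order (dict ordering, Python 3.7+).
        counts = Counter(self._pending)
        lines = [
            msg if n == 1 else f"{msg} (repeated {n} times)"
            for msg, n in counts.items()
        ]
        send("\n".join(lines))
        self._pending.clear()
        return True
```

Because `enqueue` never touches the mail server, an unreachable SMTP host would delay only the periodic flush and not whatever function raised the notification.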

#18 Updated by Pi Ba 7 days ago

This could help quite a bit imho :) https://github.com/pfsense/pfsense/pull/3768
