Bug #4031

closed

Notifications mail bomb in some gateway failure circumstances

Added by Chris Buechler over 9 years ago. Updated about 6 years ago.

Status: Resolved
Priority: Normal
Category: Notifications
Target version:
Start date: 11/20/2014
Due date:
% Done: 100%
Estimated time:
Plus Target Version:
Release Notes:
Affected Version: All
Affected Architecture: All

Description

In certain gateway failure scenarios where things are flapping, a significant number of emails can be generated via notifications. On occasion, this can send dozens to 100+ emails within a few minutes. The usual duplicate notification checking doesn't accommodate this circumstance.


Files

message_count.png (16.2 KB): showing the # of alerts sent out (luckman212, 05/19/2016 11:38 AM)
gateways.png (517 KB): dpinger log (luckman212, 05/19/2016 11:38 AM)
graph.png (75.2 KB): only a small blip from dpinger on 1 gateway... (luckman212, 05/19/2016 11:38 AM)
email2.png (941 KB): stuffed inbox! (luckman212, 05/19/2016 11:38 AM)
Actions #1

Updated by Chris Buechler over 9 years ago

  • Target version changed from 2.2 to 2.2.1

this doesn't seem to be as bad as it used to be, will revisit.

Actions #2

Updated by Chris Buechler about 9 years ago

  • Target version changed from 2.2.1 to 2.2.2
Actions #3

Updated by Chris Buechler about 9 years ago

  • Target version changed from 2.2.2 to 2.2.3
Actions #4

Updated by Chris Buechler almost 9 years ago

  • Target version changed from 2.2.3 to 2.3
Actions #5

Updated by Jim Thompson about 8 years ago

  • Tracker changed from Bug to Feature
  • Target version changed from 2.3 to 2.4.0
Actions #6

Updated by Michael Kellogg about 8 years ago

This still happens to me, along with other issues, when the WAN goes down on 2.3. It does not seem related to flapping.

Actions #7

Updated by Miikka Karhuluoma almost 8 years ago

I am also experiencing this, with 2.3 and now also with 2.3.1. My absolute worst case was 1,500 emails within a couple of hours. A typical burst is 50-100 emails within 1-2 minutes, then a 10-20 minute pause, and then a new burst.

Actions #8

Updated by → luckman212 almost 8 years ago

I was going to post again about this as well -- 2.3.x still doing this quite often and it's really crazy bad sometimes. Last night one of my gateways flapped a few times and I got 729 emails in a 6 minute span!

(See attached screenshots)

Is there anything that can be done? I had an idea of how to "fix" this problem, just a conceptual one (I don't know if I could actually translate it to quality PHP code) but basically:

1- choose a "sane" minimum interval (call it "min_int") between alerts (bonus points if this is user-configurable)
2- when an alert is triggered/queued, first hash the message and check it against a simple db (sqlite?) of recently sent messages. The db would have 2 columns, "hash" and "timestamp" (epoch time)
3- first thing would be simple housekeeping on the db: prune away any records where "timestamp" is < ("now" - "min_int")
4- search the db for "hash" -- if there is a matching hash found AND the diff between "now" and "timestamp" is < "min_int" from step 1, then die
5- if there is NOT a matching hash (or the timediff is >min_int) then store the timestamp and hash in the db, and queue the message to be sent

I know this over-simplifies things and makes it look easy but some variation of the logic above would be a great stress reliever.
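
The five steps listed above could be sketched roughly as follows. This is an illustrative Python sketch, not PHP and not pfSense code: the names `open_db`, `should_send`, and `MIN_INT`, the `sent` table, the choice of SHA-256, and the 300-second default are all assumptions made for the example.

```python
import hashlib
import sqlite3
import time

MIN_INT = 300  # hypothetical minimum interval between identical alerts, in seconds


def open_db(path=":memory:"):
    """Open (or create) the small db of recently sent messages."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sent (hash TEXT PRIMARY KEY, timestamp INTEGER)"
    )
    return db


def should_send(db, message, now=None, min_int=MIN_INT):
    """Return True if the message should be queued, False if it is a recent duplicate."""
    now = int(time.time()) if now is None else now
    digest = hashlib.sha256(message.encode("utf-8")).hexdigest()

    # Step 3: housekeeping, prune records older than (now - min_int).
    db.execute("DELETE FROM sent WHERE timestamp < ?", (now - min_int,))

    # Step 4: a matching hash sent less than min_int seconds ago means "die".
    row = db.execute(
        "SELECT timestamp FROM sent WHERE hash = ?", (digest,)
    ).fetchone()
    if row is not None and now - row[0] < min_int:
        db.commit()
        return False

    # Step 5: no recent match, so record the hash and timestamp and allow the send.
    db.execute(
        "INSERT OR REPLACE INTO sent (hash, timestamp) VALUES (?, ?)", (digest, now)
    )
    db.commit()
    return True
```

A caller would check `should_send(db, message)` before queuing each notification and drop the message when it returns `False`. Passing `now` explicitly is only there to make the logic easy to test.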

Actions #9

Updated by Nick Peelman almost 8 years ago

It would be nice if something similar could be baked in for CARP notifications as well. Our relatively small HA setup can generate a few dozen emails in a couple of minutes just making VLAN changes either on the firewall(s) or the switch.

Would like to turn off email notifications completely, but there isn't an easy way to monitor CARP status via SNMP (that I have found).

Actions #10

Updated by → luckman212 almost 8 years ago

I noticed the target version was bumped to 2.4.0 and the assignee is still cmb — this one bit me again this morning so I just stopped by to ask if this is going to get reassigned what with Chris leaving etc. I wish I had the skills to code this fix myself. My last post has some simple logic that I believe should work if someone could translate that into PHP code.

Actions #11

Updated by Jim Thompson almost 8 years ago

  • Assignee changed from Chris Buechler to Renato Botelho

Luke Hamburg wrote:

I noticed the target version was bumped to 2.4.0 and the assignee is still cmb — this one bit me again this morning so I just stopped by to ask if this is going to get reassigned what with Chris leaving etc. I wish I had the skills to code this fix myself. My last post has some simple logic that I believe should work if someone could translate that into PHP code.

We'll look into it.

Actions #12

Updated by Michael Kellogg almost 8 years ago

I too have seen this. I shut off emails because it makes the GUI inaccessible when it starts bombing. No coding skills here, but if I can help test, let me know. I've got some real garbage internet connections here (2).

Actions #13

Updated by Renato Botelho over 7 years ago

  • Tracker changed from Feature to Bug

A check was implemented that prevents the mail notification system from sending the same message multiple times. It should be enough to fix this.

Actions #14

Updated by Renato Botelho over 7 years ago

  • Status changed from Confirmed to Feedback
Actions #15

Updated by → luckman212 over 7 years ago

Thank you Renato! How can we test this? Is there a commit hash you can reference?

Actions #16

Updated by Renato Botelho over 7 years ago

  • Status changed from Feedback to Confirmed

Not too fast, it was my mistake. I'll work on a proper fix

Actions #17

Updated by Jim Pingle over 7 years ago

Looking at a customer box today made me realize that a good path here would be to queue up the notifications in a file and batch send them asynchronously every few minutes. If the SMTP notifications cannot be sent, attempting to send them blocks other functions, which are forced to wait until the attempts time out. On a unit with ~60 VIPs, it can take 15+ minutes for the system to recover from attempting notifications that all time out.

Another bonus of using a notifications queue is that repeated identical notifications could be reduced to a count, such as "Repeated X times".
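
The queue-and-batch idea, including the "Repeated X times" collapse, might look something like the sketch below. It is a Python illustration under stated assumptions, not the approach that was eventually merged: the `NotificationQueue` class and its `enqueue`/`flush` methods are hypothetical names, the queue lives in memory rather than in a file, and the periodic asynchronous trigger that would call `flush` every few minutes is left out.

```python
from collections import OrderedDict


class NotificationQueue:
    """Hypothetical in-memory queue that collapses identical notifications."""

    def __init__(self):
        # Maps message text to the number of times it was queued,
        # preserving the order in which messages first arrived.
        self._pending = OrderedDict()

    def enqueue(self, message):
        """Record a notification without sending anything yet."""
        self._pending[message] = self._pending.get(message, 0) + 1

    def flush(self, send):
        """Send all pending notifications as one batch via the `send` callable.

        Returns True if a batch was sent, False if the queue was empty.
        The queue is cleared only after `send` returns, so a failed send
        (one that raises) leaves the messages queued for the next attempt.
        """
        if not self._pending:
            return False
        lines = []
        for message, count in self._pending.items():
            if count > 1:
                lines.append(f"{message} (Repeated {count} times)")
            else:
                lines.append(message)
        send("\n".join(lines))
        self._pending.clear()
        return True
```

Because `enqueue` never touches SMTP, the code raising the notification does not block on a mail server that is timing out; only the batch sender waits.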

Actions #18

Updated by Pi Ba almost 7 years ago

This could help quite a bit imho :) https://github.com/pfsense/pfsense/pull/3768

Actions #19

Updated by Renato Botelho over 6 years ago

  • Target version changed from 2.4.0 to 2.4.1
Actions #20

Updated by Jim Pingle over 6 years ago

  • Target version changed from 2.4.1 to 2.4.2

Moving target to 2.4.2 as we need 2.4.1 sooner than anticipated.

Actions #21

Updated by Jim Pingle over 6 years ago

  • Target version changed from 2.4.2 to 2.4.3
Actions #22

Updated by Jim Pingle over 6 years ago

  • Status changed from Confirmed to Feedback
  • % Done changed from 0 to 100
  • Affected Version set to All
  • Affected Architecture All added
  • Affected Architecture deleted ()

PR 3768 was merged a while back and it's working well. Could use some additional testing/feedback but it looks good to me.

Actions #23

Updated by Jim Pingle about 6 years ago

  • Status changed from Feedback to Resolved

This has been working great since it was merged.
