Bug #4031
closed
Notifications mail bomb in some gateway failure circumstances
100%
Description
In certain gateway failure scenarios where things are flapping, a significant number of emails can be generated via notifications. On occasion, this can send dozens to 100+ emails within a few minutes. The usual duplicate notification checking doesn't accommodate this circumstance.
Files
Updated by Chris Buechler almost 10 years ago
- Target version changed from 2.2 to 2.2.1
This doesn't seem to be as bad as it used to be; will revisit.
Updated by Chris Buechler almost 10 years ago
- Target version changed from 2.2.1 to 2.2.2
Updated by Chris Buechler over 9 years ago
- Target version changed from 2.2.2 to 2.2.3
Updated by Chris Buechler over 9 years ago
- Target version changed from 2.2.3 to 2.3
Updated by Jim Thompson almost 9 years ago
- Tracker changed from Bug to Feature
- Target version changed from 2.3 to 2.4.0
Updated by Michael Kellogg over 8 years ago
This still happens to me, along with other issues, when the WAN goes down on 2.3. It does not seem related to flapping.
Updated by Miikka Karhuluoma over 8 years ago
I am also experiencing this, with 2.3 and now also with 2.3.1. My absolute worst case was 1,500 emails within a couple of hours. A typical burst is 50-100 emails within 1-2 minutes, then a 10-20 minute pause, and then a new burst.
Updated by → luckman212 over 8 years ago
- File gateways.png added
- File message_count.png added
- File email2.png added
- File graph.png added
I was going to post again about this as well -- 2.3.x still doing this quite often and it's really crazy bad sometimes. Last night one of my gateways flapped a few times and I got 729 emails in a 6 minute span!
(See attached screenshots)
Is there anything that can be done? I had an idea of how to "fix" this problem, just a conceptual one (I don't know if I could actually translate it to quality PHP code) but basically:
1- choose a "sane" minimum interval (call it "min_int") between alerts (bonus points if this is user-configurable)
2- when an alert is triggered/queued, first hash the message and check it against a simple db (sqlite?) of recently sent messages. The db would have 2 columns, "hash" and "timestamp" (epoch time)
3- first thing would be simple housekeeping on the db: prune away any records where "timestamp" is < ("now" - "min_int")
4- search the db for "hash" -- if there is a matching hash found AND the diff between "now" and "timestamp" is < "min_int" from step 1, then die
5- if there is NOT a matching hash (or the timediff is >min_int) then store the timestamp and hash in the db, and queue the message to be sent
I know this over-simplifies things and makes it look easy but some variation of the logic above would be a great stress reliever.
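The five steps above could be sketched roughly as follows. This is only an illustration in Python rather than the PHP pfSense uses, and it is not the fix that was eventually merged. The table name `sent`, the functions `open_db` and `should_send`, and the 300-second default for `min_int` are all assumptions made up for the example.

```python
import hashlib
import sqlite3
import time


def open_db(path=":memory:"):
    """Open (or create) the small store of recently sent message hashes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sent ("
        "hash TEXT PRIMARY KEY, timestamp REAL NOT NULL)"
    )
    return conn


def should_send(conn, message, min_int=300.0, now=None):
    """Return True if the message should be queued, False if it is a recent duplicate."""
    now = time.time() if now is None else now
    digest = hashlib.sha256(message.encode("utf-8")).hexdigest()

    # Step 3: housekeeping, prune records older than (now - min_int).
    conn.execute("DELETE FROM sent WHERE timestamp < ?", (now - min_int,))

    # Step 4: a matching hash sent less than min_int ago means "do not send".
    row = conn.execute(
        "SELECT timestamp FROM sent WHERE hash = ?", (digest,)
    ).fetchone()
    if row is not None and now - row[0] < min_int:
        conn.commit()
        return False

    # Step 5: no recent match, so record the hash and let the message through.
    conn.execute(
        "INSERT OR REPLACE INTO sent (hash, timestamp) VALUES (?, ?)",
        (digest, now),
    )
    conn.commit()
    return True
```

Because a suppressed duplicate does not refresh its timestamp, a gateway that keeps flapping would still produce one email per `min_int` rather than going silent.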
Updated by Nick Peelman over 8 years ago
It would be nice if something similar could be baked in for CARP notifications as well. Our relatively small HA setup can generate a few dozen emails in a couple of minutes just making VLAN changes either on the firewall(s) or the switch.
Would like to turn off email notifications completely, but there isn't an easy way to monitor CARP status via SNMP (that I have found).
Updated by → luckman212 over 8 years ago
I noticed the target version was bumped to 2.4.0 and the assignee is still cmb — this one bit me again this morning so I just stopped by to ask if this is going to get reassigned what with Chris leaving etc. I wish I had the skills to code this fix myself. My last post has some simple logic that I believe should work if someone could translate that into PHP code.
Updated by Jim Thompson over 8 years ago
- Assignee changed from Chris Buechler to Renato Botelho
Luke Hamburg wrote:
I noticed the target version was bumped to 2.4.0 and the assignee is still cmb — this one bit me again this morning so I just stopped by to ask if this is going to get reassigned what with Chris leaving etc. I wish I had the skills to code this fix myself. My last post has some simple logic that I believe should work if someone could translate that into PHP code.
We'll look into it.
Updated by Michael Kellogg over 8 years ago
I too have seen this. I shut off emails because it makes the GUI inaccessible when it starts bombing. No coding skills here, but if I can help test, let me know; I've got some real garbage internet connections here (2).
Updated by Renato Botelho about 8 years ago
- Tracker changed from Feature to Bug
A check was implemented that prevents the mail notification system from sending the same message multiple times. It should be enough to get this fixed.
Updated by Renato Botelho about 8 years ago
- Status changed from Confirmed to Feedback
Updated by → luckman212 about 8 years ago
Thank you Renato! How can we test this? Is there a commit hash you can reference?
Updated by Renato Botelho about 8 years ago
- Status changed from Feedback to Confirmed
Not so fast; that was my mistake. I'll work on a proper fix.
Updated by Jim Pingle about 8 years ago
Looking at a customer box today made me realize that a good path here would be to queue up the notifications in a file and batch send them every few minutes asynchronously. If the SMTP notifications cannot be sent, attempting to send them blocks other functions, which are forced to wait until the attempts time out. On a unit with ~60 VIPs, it can take 15+ minutes for the system to recover from attempting notifications that all time out.
Another bonus of using a notifications queue is that repeated identical notifications could be reduced to a count, such as "Repeated X times".
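As a rough illustration of the queue idea, here is a minimal Python sketch that collapses repeated identical notifications into a count before a batch is sent. It keeps the queue in memory rather than in a file, and it is not the code from the pfSense pull request. The class name `NotificationQueue` and its methods are hypothetical; a real implementation would call `flush()` from a periodic background job and hand the result to the mailer.

```python
class NotificationQueue:
    """Collects notifications and collapses identical ones into a count."""

    def __init__(self):
        # dict preserves insertion order, so messages keep first-seen order.
        self._counts = {}

    def enqueue(self, message):
        """Record a notification without sending anything (never blocks on SMTP)."""
        self._counts[message] = self._counts.get(message, 0) + 1

    def flush(self):
        """Return one batched body for everything queued so far, then reset."""
        lines = []
        for message, count in self._counts.items():
            if count > 1:
                lines.append(f"{message} (Repeated {count} times)")
            else:
                lines.append(message)
        self._counts.clear()
        return "\n".join(lines)
```

Since `enqueue` only touches local state, callers are not held up by SMTP timeouts; only the batch sender waits on the mail server.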
Updated by Pi Ba over 7 years ago
This could help quite a bit imho :) https://github.com/pfsense/pfsense/pull/3768
Updated by Renato Botelho about 7 years ago
- Target version changed from 2.4.0 to 2.4.1
Updated by Jim Pingle about 7 years ago
- Target version changed from 2.4.1 to 2.4.2
Moving target to 2.4.2 as we need 2.4.1 sooner than anticipated.
Updated by Jim Pingle about 7 years ago
- Target version changed from 2.4.2 to 2.4.3
Updated by Jim Pingle almost 7 years ago
- Status changed from Confirmed to Feedback
- % Done changed from 0 to 100
- Affected Version set to All
- Affected Architecture All added
- Affected Architecture deleted ()
PR 3768 was merged a while back and it's working well. Could use some additional testing/feedback but it looks good to me.
Updated by Jim Pingle over 6 years ago
- Status changed from Feedback to Resolved
This has been working great since it was merged.