In some cases, we have seen a single successful check clear an incident, only for the incident to reopen shortly afterward because the service was still recovering rather than fully recovered. In the heat of the moment, muting or disabling multiple checks to avoid this is not a feasible option for us.
Rather than treating 1 successful check as a recovery, it would be nice to require 3 (or some configurable number of) successful checks in a row before issuing the all clear. This would avoid the case where a server comes back up, is immediately inundated with requests, goes back down, retriggers the check failure, and alerts our paging system again, so our on-calls keep getting paged. It is the same idea as an AWS scaling event: X triggered events to scale a service out, and then Y triggered events to scale that service back in.
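To make the request concrete, here is a minimal sketch of the behavior we have in mind. It is only an illustration, not how the product works today: the `RecoveryGate` class and the `failure_threshold` and `recovery_threshold` settings are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class RecoveryGate:
    """Hypothetical helper that tracks the incident state of one check."""

    failure_threshold: int = 1   # consecutive failures needed to open an incident
    recovery_threshold: int = 3  # consecutive successes needed to resolve it
    incident_open: bool = False
    _failures: int = 0
    _successes: int = 0

    def record(self, check_passed: bool) -> str:
        """Feed in one check result; return 'opened', 'resolved', or 'unchanged'."""
        # Any result resets the opposite streak, so only runs "in a row" count.
        if check_passed:
            self._successes += 1
            self._failures = 0
        else:
            self._failures += 1
            self._successes = 0

        if not self.incident_open and self._failures >= self.failure_threshold:
            self.incident_open = True
            return "opened"
        if self.incident_open and self._successes >= self.recovery_threshold:
            self.incident_open = False
            return "resolved"
        return "unchanged"


gate = RecoveryGate(recovery_threshold=3)
# A flapping service: one pass, another failure, then three passes in a row.
events = [gate.record(passed) for passed in [False, True, False, True, True, True]]
print(events)
```

With this approach the lone success in the middle of the flap does not clear the incident, so it is opened once and resolved once instead of paging the on-call twice.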
In Review
💡 Feature Request
About 1 year ago

Rick Clymer