Smarter alerting based on error rates

We recently had an incident in which our alerts did not fire. The cause was a bad instance behind a load balancer: the system was not "down enough" to trigger an alert, yet the error rate was dramatically higher than normal.

Alerting when the error rate rises above a threshold should help in many scenarios where simpler triggers do not work well. In a cloud-hosted environment, transient errors are common, so being able to tell transient errors apart from a genuine issue is critically important, both to catch real incidents and to avoid false positives.
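
To illustrate the kind of check I have in mind, here is a rough sketch of an error-rate trigger over a sliding time window. It is not how any existing product works; the class name `ErrorRateAlert`, the 5% threshold, the 5-minute window and the minimum request count are all made-up values for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateAlert:
    """Hypothetical sliding-window error-rate check (illustrative only)."""

    threshold: float = 0.05       # assumed: alert when more than 5% of requests fail
    window_seconds: float = 300   # assumed: look at the last 5 minutes
    min_requests: int = 20        # assumed: skip windows with too little traffic
    _events: deque = field(default_factory=deque)

    def record(self, timestamp: float, is_error: bool) -> None:
        """Store one request outcome; timestamps must be non-decreasing."""
        self._events.append((timestamp, is_error))

    def should_alert(self, now: float) -> bool:
        """Return True if the error rate in the window exceeds the threshold."""
        cutoff = now - self.window_seconds
        # Drop outcomes that have fallen out of the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

        total = len(self._events)
        if total < self.min_requests:
            # Too few samples: a single transient error would dominate the rate.
            return False

        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold


# One bad instance out of four behind a load balancer: roughly 25% of requests fail.
degraded = ErrorRateAlert()
for second in range(100):
    degraded.record(second, is_error=(second % 4 == 0))

# A single transient failure among 100 otherwise healthy requests.
transient = ErrorRateAlert()
for second in range(100):
    transient.record(second, is_error=(second == 50))
```

With these assumed settings, `degraded.should_alert(now=100)` fires even though most requests still succeed, while `transient.should_alert(now=100)` stays quiet. The minimum request count is what keeps a lone failure during a quiet period from being read as a 100% error rate.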

Status

In Review

Board

💡 Feature Request

Tags

Alerting

Date

Over 2 years ago

Author

shahiddev
