Escalations notify users when a check has stayed down longer than a designated period of time. Escalate alerts as the severity of a downtime event increases, so that the team member best equipped to respond quickly can act.
Escalations should send a new alert to the same team at intervals you define, raising the sense of urgency first with email, then with SMS, then with phone calls. In short, escalating an incident conveys urgency and gives those with the highest permission levels the data they need to act quickly after known solutions have been explored.
This guide covers best practices for using escalations to build a more thorough alert system.
Creating Smarter Escalations
Smarter escalations allow your team to get alerts sent to various locations based on a sense of urgency:
- Create a Contact for each “tier” of maintenance/response (e.g., first response, higher permissions, admin)
- Establish multiple methods of communication for a single person or team (Contact) that trigger as the outage time extends. For example, create a Contact for Tier I SMS, a Contact for Tier I email that escalates after 15 minutes, and a Contact for Tier I voice call that escalates once the 30-minute mark has passed.
- Integrate with your existing services (We provide push notifications including metrics and alerts to 18 providers, including customized webhooks for your internal configuration)
- Create multiple checks (Reduce false positives, confirm outages faster and perform root cause analysis)
An example of smart escalations in practice: Admin receives a notification of extended downtime after 2 hours pass, but Tiers I and II receive notifications several times before that happens.
The keys to smart escalation are a robust monitoring system and designated contacts for specific levels of response, each with an escalating sense of urgency.
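To make the tiered timing concrete, here is a minimal Python sketch of an escalation plan. It is an illustration only and does not use the Uptime.com API: the `Escalation` class, the `PLAN` list, and the `contacts_alerted` function are hypothetical names. The 15-minute, 30-minute, and 2-hour thresholds come from the examples above; the 60-minute Tier II threshold is an assumption added to fill out the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Escalation:
    contact: str        # hypothetical Contact name
    after_minutes: int  # minutes of continuous downtime before this alert fires


# Hypothetical plan modeled on the tiers described above.
PLAN = [
    Escalation("Tier I SMS", 0),          # initial alert when the check goes down
    Escalation("Tier I email", 15),       # escalate after 15 minutes
    Escalation("Tier I voice call", 30),  # escalate after the 30-minute mark
    Escalation("Tier II SMS", 60),        # assumed threshold, not from the guide
    Escalation("Admin email", 120),       # Admin hears about it after 2 hours
]


def contacts_alerted(plan, downtime_minutes):
    """Return the contacts whose threshold has been reached, in firing order."""
    due = [e for e in plan if e.after_minutes <= downtime_minutes]
    return [e.contact for e in sorted(due, key=lambda e: e.after_minutes)]


if __name__ == "__main__":
    for minutes in (5, 20, 45, 120):
        print(minutes, contacts_alerted(PLAN, minutes))
```

Laying the plan out this way makes it easy to check that each tier is alerted several times before Admin is involved.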
With on-call hours, an alert is sent to a contact only while that contact is on call in the designated time zone. If downtime occurs outside of those on-call hours, the contact will receive an “Up” notification when on-call hours resume on the next day. It’s best to ensure that a contact is designated as Always on Call (the default), or that schedules overlap, so someone always receives a downtime alert in real time, where response time matters.
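One way to check that schedules overlap is to look for hours of the day when nobody is on call. The sketch below is a simplified model, not the Uptime.com scheduler: `OnCallWindow`, `is_on_call`, and `coverage_gaps` are hypothetical names, the two contacts are made up, and time zones are reduced to fixed UTC offsets with no daylight saving adjustments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnCallWindow:
    contact: str
    start_hour: int        # local hour the shift starts (0-23)
    end_hour: int          # local hour the shift ends (exclusive); may wrap past midnight
    utc_offset_hours: int  # simplification: fixed offset, no daylight saving


def is_on_call(window, utc_hour):
    """Return True if the contact is on call during the given UTC hour."""
    local_hour = (utc_hour + window.utc_offset_hours) % 24
    if window.start_hour == window.end_hour:
        return True  # equal start and end is treated here as always on call
    if window.start_hour < window.end_hour:
        return window.start_hour <= local_hour < window.end_hour
    # The shift wraps past midnight, e.g. 22:00 to 06:00.
    return local_hour >= window.start_hour or local_hour < window.end_hour


def coverage_gaps(windows):
    """Return the UTC hours (0-23) during which no contact is on call."""
    return [h for h in range(24) if not any(is_on_call(w, h) for w in windows)]


# Hypothetical schedules: two teams working 08:00-18:00 local time.
WINDOWS = [
    OnCallWindow("Tier I (UTC-5)", 8, 18, -5),
    OnCallWindow("Tier I (UTC+8)", 8, 18, 8),
]

if __name__ == "__main__":
    print(coverage_gaps(WINDOWS))
```

Any hour this reports is a window in which a downtime alert would reach no one in real time. Extending a shift or adding an always-on-call contact closes the gap.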
Improve Alert Response Rate
- Reduce response time
- Ensure recipients get data at a time and in a place to put that data into action
After you have defined tiers of response, you need to tell Uptime.com where to send each escalation. A check may have more than one escalation assigned to it.
You can escalate a severe outage to a second external destination, such as Slack or your internal dashboard, in case the first outage report delivered via email is missed.