Uptime.com monitors your infrastructure for failure, notifying you the moment something goes down. Sometimes, the failure of a check does not signify a real problem is occurring and you need to refine the conditions under which an alert is issued. DNS checks, for example, will query multiple servers that may go down at any given moment. Some systems, like DNS, are designed with multiple layers that safeguard us from these singular failures.
When a check fails, IT personnel could waste effort researching false positives that didn’t signify a real problem. Uptime.com’s Advanced Check Options provide a number of methods to reduce the likelihood of receiving alerts for failed checks that aren’t problematic, and to better control the conditions under which a check functions.
Setting Check Conditions
Once you have created a check, there are a number of options under Advanced, Escalation and Maintenance that control when and how a check alert is issued. Let’s begin with the Advanced tab, which is used to set your Sensitivity and IP Version.
Sensitivity and Timeout
Sensitivity is the number of Locations that must report a failure before a check is considered failed. We recommend that every user monitor from at least three Locations and set the Sensitivity to a value that matches or accounts for the majority of those Locations. This way, Uptime.com provides the best balance between alert speed and accuracy, while avoiding the false positives that a low Sensitivity value could create.
Users can also designate a Timeout, measured in seconds, to further control when and how alerts are issued. A timeout error typically signifies a problem with the connection to a specific function of a website. There may be too many users attempting to access a single resource, or a localized outage affecting connection time. Using the Timeout (available only for HTTPS, API and Transaction checks) provides a first indication, with technical data, that there is a problem connecting to your website that requires investigation.
All checks, except for API and Transaction checks, default to IPv4 for connections unless IPv6 is specified. IPv6 is gaining in popularity as more routers and consumer devices utilize the addressing scheme. In specific instances, such as monitoring uptime to an interconnected device or a specific use of an API, it’s important to utilize an IPv6 address. To use any available address, leave this option on Any.
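To make the majority recommendation concrete, here is a minimal sketch in Python. It is not Uptime.com code: the function names `majority_sensitivity` and `is_check_failed` are hypothetical, and the sketch assumes "majority" means more than half of the monitoring Locations.

```python
def majority_sensitivity(location_count: int) -> int:
    """Return the smallest number of Locations that forms a strict majority.

    Hypothetical helper for illustration; it does not call Uptime.com.
    """
    if location_count < 1:
        raise ValueError("at least one Location is required")
    return location_count // 2 + 1


def is_check_failed(failed_locations: int, sensitivity: int) -> bool:
    """A check counts as failed once enough Locations report a failure."""
    return failed_locations >= sensitivity


# With three Locations, a majority is two of them.
sensitivity = majority_sensitivity(3)
print(sensitivity)                      # 2
print(is_check_failed(1, sensitivity))  # False: a single Location failing is ignored
print(is_check_failed(2, sensitivity))  # True: a majority of Locations agree
```

Setting the Sensitivity equal to the number of Locations is the stricter choice the text also allows: every Location must fail before an alert is sent.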
Escalations allow users to receive a notification that a check has stayed down longer than a designated period of time. You can choose Escalations from the check screen to designate the amount of time that must pass and who will be notified. You can escalate any check, and a wide range of options controls how the escalation works.
The critical infrastructure that keeps our websites live requires periodic planned maintenance. Servers need security patches, updates, or hardware replacements to ensure a flawless customer experience. Setting Maintenance allows Uptime.com users to temporarily ignore all failed checks for a specified period of time.
Maintenance is an alternative to pausing the check, which requires more human input and is therefore more prone to error. Maintenance windows can be set to Under Maintenance Now, where all failed checks are ignored until the feature is returned to No Maintenance Window, or you can Use Maintenance Schedule to specify a timeframe.
Let’s explore these options with a detailed use case.
Use Case - Escalating a Transaction Check Under Certain Circumstances
Transaction checks can be programmed to follow certain steps that complete a goal, creating a tool that continually monitors the availability of key infrastructure. In our Transaction Check use case, we used this tool to send support a specific test email, which we can filter for on our side and which reports valuable uptime data back to us.
Support functionality is invaluable when building customer trust and providing resolutions to complex problems. Yet, even the most experienced engineers understand that tools fail under specific conditions. In this use case, we will amend our existing Transaction Check to make sure it only sends an alert under the following conditions:
- Three geographical locations fail
- A timeout of 30 seconds occurs
We will use an escalation to send an additional alert to our IT team if the problem persists for longer than one hour. Then we’ll explore some additional options to designate maintenance windows and further escalate checks as needed.
This shot of our Advanced screen illustrates the first step in this process:
With our Sensitivity set to “3,” three geographic locations must report a failure, which includes timing out after 30 seconds, before an alert is issued.
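The rule above can be sketched as a small Python model. This is an illustration under stated assumptions, not Uptime.com’s implementation: `ProbeResult`, `probe_failed`, and `should_alert` are hypothetical names, and the sketch assumes a probe that reaches the 30-second timeout counts as a failure for its location.

```python
from dataclasses import dataclass

TIMEOUT_SECONDS = 30.0  # Timeout from the use case
SENSITIVITY = 3         # Locations that must fail before an alert


@dataclass
class ProbeResult:
    """Hypothetical record of one probe run from one location."""
    location: str
    ok: bool
    elapsed_seconds: float


def probe_failed(result: ProbeResult, timeout: float = TIMEOUT_SECONDS) -> bool:
    """A probe fails if it reported an error or ran up to the timeout."""
    return (not result.ok) or result.elapsed_seconds >= timeout


def should_alert(results, sensitivity: int = SENSITIVITY,
                 timeout: float = TIMEOUT_SECONDS) -> bool:
    """Alert only when enough distinct locations report a failure."""
    failed = {r.location for r in results if probe_failed(r, timeout)}
    return len(failed) >= sensitivity


results = [
    ProbeResult("US-East", ok=False, elapsed_seconds=2.1),  # hard failure
    ProbeResult("UK", ok=True, elapsed_seconds=30.0),       # timed out
    ProbeResult("Asia", ok=True, elapsed_seconds=1.4),      # healthy
]
print(should_alert(results))  # False: only two of three locations failed
```

Counting distinct locations in a set keeps a location that fails twice in one round from being counted twice toward the Sensitivity threshold.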
We can also add notes with some specific instruction for employees, which is useful for testing purposes. For our example, we’ll assume the reason this outage occurred is that we’ve moved our support to a different system and we’re having some integration problems. The Notes section can include recommended troubleshooting steps to take in the event of a crash.
Next, let’s make sure our IT team receives an escalation after one hour. If the problem requires more eyes, it’s important to set that response in motion as quickly as possible.
Here, we’ve designated that our in-house IT team will receive a notification when the Transaction Check we’ve created has been failing for 1 hour.
This is the sequence of events that occurs when a check times out under the conditions we’ve set so far in this example:
- Our Default contact receives the Failed Check/Timeout alert when it occurs
- One hour passes with no successful check
- Escalation is issued to the Contact “IT”
Since we will need to test more than one point of fault to determine a solution, we will look at how to schedule a four-hour maintenance block for our team to do their work.
With Maintenance enabled, Uptime.com will ignore all failed checks. No alerts or escalations will occur during this time period, and since we will schedule this block of time we won’t need to further interact with it in any way. Click on Maintenance and then Use Maintenance Schedule. After our settings are filled out, our screen should look something like the example below:
We’ve scheduled maintenance to begin at 5 PM local time, end at 9 PM, and repeat every Monday.
We’re not using the Under Maintenance Now option since we anticipate our maintenance will last beyond work hours. With that option, someone would need to choose No Maintenance Window manually before Uptime.com resumes monitoring. Since our maintenance is planned to go late into the evening, we don’t want a tired human in charge of this critical step.
When we select Use Maintenance Schedule, Uptime.com resumes its monitoring automatically once the block of time has passed on the specified day. However, please be sure to disable maintenance if you do not plan to repeat it.
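As a rough illustration of how a weekly schedule like this behaves, the following Python sketch tests whether a given local time falls inside the Monday 5 PM to 9 PM block. The names `in_maintenance_window` and `alerts_suppressed` are hypothetical, and the sketch assumes the window includes its start time and excludes its end time.

```python
from datetime import datetime, time

MAINTENANCE_WEEKDAY = 0        # Monday (datetime.weekday() counts from Monday = 0)
MAINTENANCE_START = time(17, 0)  # 5 PM local time
MAINTENANCE_END = time(21, 0)    # 9 PM local time


def in_maintenance_window(moment: datetime) -> bool:
    """True if `moment` (local time) falls in the weekly maintenance block."""
    return (moment.weekday() == MAINTENANCE_WEEKDAY
            and MAINTENANCE_START <= moment.time() < MAINTENANCE_END)


def alerts_suppressed(moment: datetime, check_failed: bool) -> bool:
    """A failed check raises no alert while the window is active."""
    return check_failed and in_maintenance_window(moment)


# 1 January 2024 was a Monday.
print(in_maintenance_window(datetime(2024, 1, 1, 18, 30)))  # True
print(in_maintenance_window(datetime(2024, 1, 1, 21, 0)))   # False: window has ended
print(in_maintenance_window(datetime(2024, 1, 2, 18, 30)))  # False: Tuesday
```

Because the window is derived from the clock instead of a manual toggle, monitoring is active again at 9 PM without anyone having to switch it back on.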
Tips and Ideas
These advanced options are intended to improve monitoring and reporting functionality and to give you better control over the conditions under which reports are received.
Beginning with Sensitivity, you might consider multiple checks designed to monitor specific infrastructure in specific regions. If your company is international, checks that are designed only for the UK, Asia, or the US would provide location-specific data about which regions are experiencing outages.
Escalations are useful for tracking significant errors related to infrastructure that might be prone to small outages. TCP/UDP, API, and even Transaction Checks all monitor infrastructure that may experience small outages. It’s not ideal for a shopping cart to go down for five minutes, but the functionality may be restored by the time IT reacts to the outage.
Utilize Escalations for this kind of infrastructure so your team is shielded from false positives.
Escalations can also be useful when your team has tiers of support and monitoring around the clock. A low-level outage may take support technicians an hour or so to fix, but your team should be kept apprised of their efforts. You can use Escalations to reduce the email exchanges spent diagnosing the problem, and to give designated personnel access to critical data after a certain time has passed.