Private Location monitoring is a complex system with many moving parts. Some issues can be self-diagnosed, while others may require a ticket to support@uptime.com. This article documents some common issues and their solutions.
Please note: Confirm that your container is using the most recent image (4.X) to ensure that checks run properly. Also note that the oldest version Uptime.com supports is 3.2.
Understanding Container Status Outputs
Expected Behavior
Docker and Private Location containers can return a status output in JSON format, which will provide details on the Private Location itself, and any associated errors.
These outputs are called via the Command Line Interface (CLI), and are explained in detail on our Github under “Troubleshooting (via CLI)”.
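If you want to inspect the raw logs yourself, standard Docker commands are enough (a minimal sketch; <container_id> is the ID shown by docker ps):
# List running containers to find the Private Location container's ID
docker ps
# Print that container's log output
docker logs <container_id>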
Below is an example of log output that may look like an error but is actually harmless:
Adding password for user nagiosadmin
ERROR:systemctl: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ERROR:systemctl: Oops, 1 unsupported directory settings. You need to create those
before using the service.
ERROR:systemctl: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ERROR:systemctl: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ERROR:systemctl: Oops, 1 unsupported directory settings. You need to create those
before using the service.
ERROR:systemctl: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:systemctl: nagios-nrpe-server.service: the use of /bin/kill is not recommended
for ExecReload as it is asynchronous.
That means all the dependencies will perform the reload simultaneously / out of order.
In this scenario, the messages refer to extra configuration that full systems may have, but that our slimmed-down image doesn't need.
For further assistance with understanding these outputs, please contact support@uptime.com.
Graceful Shutdown
Another message worth noting is Graceful Shutdown. It occasionally appears in the logs to inform you of a past event. The operation is intended to prevent memory leaks and is completely normal.
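If you want to confirm when these events occurred, you can search the container logs with standard Docker and grep (a sketch; the exact log wording may differ from the pattern shown here):
docker logs <container_id> 2>&1 | grep -i "graceful shutdown"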
Restarting your Container
If a container restart is required, it may take up to an hour for tasks to settle and return to normal. Some checks may fail during this period, but they should recover automatically; checks that remain down should be investigated individually.
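For reference, a restart uses standard Docker commands (nothing Uptime.com-specific; <container_id> comes from docker ps):
# Find the container's ID, then restart it
docker ps
docker restart <container_id>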
For further help understanding any logs generated by a Private Location, or if checks remain down over an hour after container restart, please reach out to support@uptime.com.
Common Issues
Not working after upgrading from v2.x or v3.x
Usually, this error indicates that Docker volumes weren't cleared during an upgrade. In those cases, run the following commands to delete the old containers and volumes:
# List running containers and note the container ID
docker ps
# Stop the Private Location container
docker stop <container_id>
# Remove all stopped containers
docker container prune
# Remove all unused volumes (-a also removes named volumes on recent Docker versions)
docker volume prune -a
check_nag and check_stalled_check_detection_log failing consistently in the PLM status
This error is common in Kubernetes or with old Docker versions. It's likely due to the container not having permission to bind to ports 80 and 443. From the logs:
# supervisord.log
2024-02-19 11:53:21,786 INFO spawned: 'apache' with pid 159
2024-02-19 11:53:21,841 INFO exited: apache (exit status 1; not expected)
2024-02-19 11:53:22,843 INFO gave up: apache entered FATAL state, too many start retries too quickly
# apache2.log
(13)Permission denied: AH00072: make_sock: could not bind to address [::]:80
(13)Permission denied: AH00072: make_sock: could not bind to address 0.0.0.0:80
no listening sockets available, shutting down
- If using Docker, upgrade to a newer version or add --cap-add NET_BIND_SERVICE to the docker run command (see the sketch after this list).
- If using Kubernetes, ensure the following block is included in the YAML config (as shown in the sample on Github):
containers:
  - securityContext:
      capabilities:
        add:
          - NET_BIND_SERVICE
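For Docker, a minimal sketch of where the flag fits ([your existing options] and <image> are placeholders for the full run command documented on Github):
# Add the capability flag to your existing docker run command
docker run --cap-add NET_BIND_SERVICE [your existing options] <image>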
Run Test Isn’t Working
# taskqueue.log
sl = self._semlock = _multiprocessing.SemLock(
OSError: [Errno 28] No space left on device
This error may be caused by insufficient shared memory: /dev/shm is either full or too small. To confirm the amount of space allocated to /dev/shm, run the command df -h from inside of the container.
If /dev/shm does have insufficient space, please ensure that the --shm-size=2048m portion of the run command, found on Uptime.com’s Github page, has been entered correctly.
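You can also check /dev/shm from the host without opening a shell in the container (standard Docker; <container_id> comes from docker ps):
# Show size and usage of the container's shared memory filesystem
docker exec <container_id> df -h /dev/shm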
Private Location is running but Transaction Checks are Unreliable
This issue is likely caused by the script detecting too many CPU cores, and thus spawning an abnormal amount of Chrome browsers, taking up excessive resources.
To troubleshoot this, please first ensure that --shm-size=2048m is correctly specified in the Docker run command/k8s YAML. Additionally, it is recommended to review /home/uptime/logs/supervisord.log for any specific error messages.
To fix this issue, please include --env UPTIME_AVAILABLE_CPU_CORES=2 in the Docker run command or equivalent in the k8s YAML. See sample YAML on Uptime.com’s Github.
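In Kubernetes, the equivalent is an env entry on the container (a minimal sketch; see the sample YAML on Github for the full spec):
env:
  - name: UPTIME_AVAILABLE_CPU_CORES
    value: "2"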
SSL/OpenSSL error when running checks from Private Location
Errors of this type are typically the result of outdated SSL libraries.
You may solve this error one of two ways:
- Mount the local SSL certificate folder to the container in the Docker run command with the following:
--mount type=bind,source=/etc/ssl/certs,target=/etc/ssl/certs
- Override the local openssl.cnf file and mount it to the container using the following:
--mount type=bind,source=/etc/ssl/openssl.cnf,target=/etc/ssl/openssl.cnf
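For context, a sketch of how the first option fits into the run command ([your existing options] and <image> are placeholders for the full command documented on Github):
docker run [your existing options] --mount type=bind,source=/etc/ssl/certs,target=/etc/ssl/certs <image>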
Reducing Memory Usage
By default, the container allocates three Chrome browsers per available CPU for running transaction and page speed checks. If you have more than 2 CPUs allocated and/or you don't run large numbers of these checks on the private location, this may use more memory than necessary.
You can limit the number of Chrome browsers allocated by setting the UPTIME_AVAILABLE_CPU_CORES environment variable via the docker run command used to start the container.
For example: --env UPTIME_AVAILABLE_CPU_CORES=2
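To illustrate the math: on an 8-CPU host the default would spawn 24 Chrome browsers, whereas UPTIME_AVAILABLE_CPU_CORES=2 caps it at 6.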
Creating backup.tgz
backup.tgz is an archive of logs and configuration files that the Uptime.com Support team uses to troubleshoot problems with a Private Location. If possible, please generate this file and attach it to your support ticket to expedite the troubleshooting process.
These are the instructions for generating the backup.tgz file:
- Use the command docker ps to retrieve the running container’s ID.
- With the container’s ID, run the command docker run --rm --volumes-from <RUNNING_CONTAINER_ID> -v $(pwd):/backup ubuntu:latest tar -zcvf /backup/backup.tgz /home/uptime/logs /usr/local/nagios/etc /usr/local/nagios/var to create the backup.tgz file in your current working directory.
- Save this file and attach it to your ticket with support@uptime.com for further assistance.
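A worked example (the container ID f3a1b2c4d5e6 is hypothetical; substitute the ID shown by your own docker ps output):
docker ps
# CONTAINER ID   IMAGE   ...
# f3a1b2c4d5e6   ...
docker run --rm --volumes-from f3a1b2c4d5e6 -v $(pwd):/backup ubuntu:latest tar -zcvf /backup/backup.tgz /home/uptime/logs /usr/local/nagios/etc /usr/local/nagios/var
# Confirm the archive was created in the current directory
ls -lh backup.tgz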
FAQ
What type of encryption does Uptime.com use for communication from internal checks to the Uptime.com central check servers?
TLS 1.2
Is my Private Location not working? When checking my Private Location status, I see warnings such as:
“Several alerts were not sent successfully to Uptime.com. This may indicate intermittent network issues. Restarting the container after network issues are resolved should resolve this warning.”
“The process to synchronize check states with the Uptime.com system failed to run or returned with error.”
The errors mentioned above indicate that previous alerts could not be sent. They usually occur when the Private Location loses outgoing network access for a period of time. Private Location servers may show warnings or errors during the first hour of operation following a restart, but these should disappear after that.
When checking my Private Location status, I see the following error: “The private location has failed to check for and/or sync new configuration. Please check that your UPTIME_API_TOKEN setting is correct.”
Although an incorrect token is possible, in most cases upgrading to the latest stable image resolves the issue. If it doesn't, you can confirm the correct API token for your Private Location by contacting support@uptime.com.
When reviewing my Private Location status, I encounter the following error: “No checks have been run in the last 15 minutes. Please check whether any checks have been assigned to this location, and/or restart this private location container. If this error remains, please ensure your container has the NET_BIND_SERVICE capability and that the apache service within is running.”
This is also likely related to the container not having permission to bind to ports 80 and 443. Please refer to the steps in the check_nag and check_stalled_check_detection_log section above.
When running a test, I see the following error: “The test queue is currently full, please try again later or try using a different test server”
While this may happen for other reasons, this error is common when upgrading from 3.x. To fix it, follow the steps under “Not working after upgrading from v2.x or v3.x” above to delete Docker containers and volumes.
For further troubleshooting help, please contact support@uptime.com.