When monitoring a VM hosting the Hyperscience application, you will want to track the standard infrastructure metrics, as well as a few application-specific signals.
Standard Monitoring
RAM
The application is tuned to maximize usage of available RAM when processing submissions, but it will never use more than 95% of available RAM. Alerts should be set up to fire if RAM usage goes above 95%.
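As a simple illustration, a monitoring agent could poll memory usage and compare it against that 95% threshold. The sketch below assumes the third-party psutil package; the threshold constant and the print-based alert are placeholders for whatever alerting tooling you use.

```python
# Minimal sketch of a RAM-usage check, assuming the third-party psutil package.
# The 95% threshold mirrors the guidance above; the print-based alert is a
# placeholder for a real alerting integration (PagerDuty, email, etc.).
import psutil

RAM_ALERT_THRESHOLD = 95.0  # percent

def check_ram_usage() -> None:
    used_percent = psutil.virtual_memory().percent
    if used_percent > RAM_ALERT_THRESHOLD:
        print(f"ALERT: RAM usage at {used_percent:.1f}% exceeds {RAM_ALERT_THRESHOLD}%")

if __name__ == "__main__":
    check_ram_usage()
```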
Storage
Alerts should be set up to fire when any storage volume reaches 95% capacity. There are multiple types of storage in use (a minimal per-volume check is sketched after this list):
- Local storage on the VM
  - All volumes should be independently monitored.
- Networked file storage
  - Check that NFS is mounted correctly, and that it is accessible from all of the VMs.
- DB storage
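The sketch below uses only the Python standard library. The mount points listed are illustrative placeholders; substitute the local volumes and NFS mount actually used in your deployment, and note that a mount check alone does not prove the NFS share is reachable from every VM.

```python
# Minimal sketch of a per-volume storage check, using only the standard library.
# MOUNT_POINTS is a placeholder list; substitute the volumes in your deployment.
import os
import shutil

STORAGE_ALERT_THRESHOLD = 95.0  # percent
MOUNT_POINTS = ["/", "/var/lib/docker", "/mnt/hyperscience-nfs"]  # examples only

def check_storage() -> None:
    for mount in MOUNT_POINTS:
        if not os.path.ismount(mount):
            print(f"ALERT: {mount} is not mounted")
            continue
        usage = shutil.disk_usage(mount)
        used_percent = usage.used / usage.total * 100
        if used_percent > STORAGE_ALERT_THRESHOLD:
            print(f"ALERT: {mount} is {used_percent:.1f}% full")

if __name__ == "__main__":
    check_storage()
```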
CPU
CPU metrics may be useful to have when debugging an issue, but CPU usage is not a good candidate for alerting, since Hyperscience is designed to maximize utilization of all available CPU resources when processing submissions. 100% CPU usage over an extended period of time does not necessarily mean anything is wrong.
Health check and Web UI Availability
A health check should be set up to periodically make a request to our health check API endpoint documented at https://docs.hyperscience.com/#health-check-status. If the health check determines that there is a problem with a given host, all processing for that host will pause until the error is resolved. This will not affect other hosts.
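For example, a simple external probe could poll the health check endpoint on a fixed interval and alert on any non-200 response or request error. The sketch below assumes the third-party requests package; the base URL and endpoint path are placeholders, so consult the linked documentation for the exact endpoint and response format for your version.

```python
# Minimal sketch of a periodic health-check probe, assuming the third-party
# requests package. HEALTH_CHECK_URL is a placeholder; use the endpoint
# documented at https://docs.hyperscience.com/#health-check-status.
import time
import requests

HEALTH_CHECK_URL = "https://hyperscience.example.com/healthcheck"  # placeholder
CHECK_INTERVAL_SECONDS = 60

def poll_health_check() -> None:
    while True:
        try:
            response = requests.get(HEALTH_CHECK_URL, timeout=10)
            if response.status_code != 200:
                print(f"ALERT: health check returned HTTP {response.status_code}")
        except requests.RequestException as exc:
            print(f"ALERT: health check request failed: {exc}")
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    poll_health_check()
```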
Application-Specific Monitoring
Submissions are processed asynchronously through a series of background jobs. Rarely, these background jobs can fail.
Sometimes jobs can fail due to transient conditions, such as a worker dying while processing, likely due to a VM being rebooted or shut down. In this case, the job will automatically be rerun by another worker after a short delay, and no manual intervention is required.
Other times, jobs can fail due to a condition that requires manual intervention. These conditions include:
- a file store filling up
- an outage of an external service such as a message queue
- an unexpected application bug
In this case, the job will not be rerun automatically, since it would just fail again on each attempt. Once the error condition has been remedied, the failed jobs can be rerun manually using the retry functionality on the jobs dashboard in the web UI.
To monitor for job failures, a log-based trigger should be set up for all of the application’s containers. A log line containing either WORKER_FAIL or WORKER_JOB_FAIL indicates that a failure has occurred that needs manual intervention. If possible, container restarts should also be monitored, as they may indicate a similar type of issue.
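If your log tooling does not support string-based triggers directly, a simple watcher can scan the container log stream for those markers. The sketch below assumes the logs are piped to it on stdin (for example, from docker logs --follow); the print-based alert is a placeholder for your alerting integration.

```python
# Minimal sketch of a log watcher for failed background jobs. Assumes container
# logs are streamed to this process on stdin, e.g.:
#   docker logs --follow <container> | python watch_worker_failures.py
# The marker strings come from the guidance above.
import sys

FAILURE_MARKERS = ("WORKER_FAIL", "WORKER_JOB_FAIL")

def watch_for_failures(stream) -> None:
    for line in stream:
        if any(marker in line for marker in FAILURE_MARKERS):
            print(f"ALERT: worker job failure detected: {line.rstrip()}")

if __name__ == "__main__":
    watch_for_failures(sys.stdin)
```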
If a background job fails, a system admin should navigate to /processes/jobs?state=HALTED, click on the date filter, and select Last 7 Days. The failed jobs will appear in the list. Clicking on a job’s ID opens a modal with a section called “State Description,” which will generally provide insight into what has gone wrong. If the admin is unable to resolve the error on their own, they should reach out to Hyperscience support and include the contents of that “State Description” section.