Application Uptime and Defacement Monitoring with AWS

ESTL Lab Notes · 7 min read · Jan 12, 2023

Text: Delon | Content: Delon, Evelyn

Introduction

The DevSecOps team in ESTL manages and maintains our infrastructure in Amazon Web Services (AWS). Part of the work we do here is to monitor our applications for any downtime and website defacement attacks. Beyond compliance, it is in our interest to do so to improve the reliability of our applications and to alert ourselves to any security breaches. The end goal is to promptly notify the application teams of any incidents that occur so that they can take the necessary remedial actions. To achieve this, we decided to leverage AWS services for uptime and defacement monitoring rather than rely on external services.

Technical Design

Our design for this monitoring system must meet the following user requirements:

  1. The application teams should be able to monitor their applications for uptime and defacement.
  2. The application teams should be notified promptly via phone for immediate action and via Slack for the entire team’s awareness.
  3. The application teams should have the ability to turn the monitoring system on and off independently.

We then decided to integrate the following AWS services to achieve the above:

  • Amazon CloudWatch Synthetics
  • Amazon CloudWatch Alarms
  • AWS Systems Manager Incident Manager

The diagram below shows how we integrate these services:

[Diagram: Overview of the services we use]

We’ll now deep-dive into the various components and explore some of the challenges and design decisions we made along the way.

Amazon CloudWatch Synthetics

AWS CloudWatch Synthetics is a monitoring service that allows users to create canaries (AWS Lambda functions running pre-written scripts or blueprints) to monitor applications and endpoints. Custom scripts can also be written for more specific monitoring needs, but the provided canary blueprints often suffice.

The visual monitoring blueprint, which uses NodeJS and Puppeteer to compare screenshots of an application with a baseline, was chosen for our uptime and defacement monitoring needs. It can detect both downtimes and defacement attacks by comparing the delta between screenshots, so we did not need a separate heartbeat monitoring canary.

The visual monitoring canary additionally provides two configurable features: the ability to specify a variance threshold and to set boundaries within the baseline to ignore. These settings allow for dynamic content on the application’s home page to be accounted for and reduce false positives.
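To make this more concrete, the sketch below shows roughly what a visual monitoring canary handler looks like. It is a simplified approximation of the AWS blueprint rather than our exact script, and it assumes the URL and VARIANCE environment variables that our Terraform module (shown later in this post) passes in:

const synthetics = require('Synthetics');
const syntheticsConfiguration = synthetics.getConfiguration();

const visualMonitoring = async function () {
    // Compare each run's screenshot against the saved baseline,
    // failing the run if the difference exceeds the variance threshold
    syntheticsConfiguration.withVisualCompareWithBaseRun(true);
    syntheticsConfiguration.withVisualVarianceThresholdPercentage(Number(process.env.VARIANCE));

    const page = await synthetics.getPage();
    const response = await page.goto(process.env.URL, { waitUntil: 'domcontentloaded', timeout: 30000 });
    if (!response || response.status() < 200 || response.status() > 299) {
        throw new Error('Failed to load page: ' + process.env.URL);
    }
    // Taking the screenshot triggers the visual comparison against the baseline
    await synthetics.takeScreenshot('loaded', 'result');
};

exports.handler = async () => {
    return await visualMonitoring();
};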

Bypassing IP Whitelists

We created a dedicated visual monitoring canary for each of our applications, which runs at one-minute intervals. To ensure the canaries can access our applications, which have IP whitelisting in place, we placed them within our Virtual Private Cloud (VPC), in a subnet behind our Network Address Translation (NAT) gateway. This gives the canaries a static outbound IP address which our load balancers can whitelist.

Custom IP address resolution

We encountered a situation where a service had the same DNS name in both the private and public AWS Route53 Hosted Zones. We needed the canary to resolve the domain name to the private IP, but by default it would resolve to the public IP. To fix this, we added the following code block to the visual monitoring canary’s code:

// URL and RESOLVE_IP are environment variables set in our Terraform module (shown below)
if (process.env.RESOLVE_IP) {
    // Start from the default Chromium launch options used by the Synthetics runtime
    const defaultOptions = await synthetics.getDefaultLaunchOptions();
    // Append a --host-rules flag mapping the domain name to the desired IP address
    const launchArgs = [
        ...defaultOptions.args,
        `--host-rules=MAP ${process.env.URL.replace("https://", "")} ${process.env.RESOLVE_IP}`
    ];
    await synthetics.launch({
        args: launchArgs,
        ignoreHTTPSErrors: true
    });
}

This makes use of the "--host-rules" Chromium launch flag, passed in via the Synthetics library's launch options, to manually map a domain name to an IP address.

Terraform Modules

To streamline the process of creating multiple visual monitoring canaries with different URLs, we created a custom Terraform module for reusability. Here is a snippet of the module that creates the canary:

resource "aws_synthetics_canary" "defacement" {
name = var.name
artifact_s3_location = "s3://${var.artefact_s3_bucket_name}"
execution_role_arn = var.execution_role_arn
handler = "pageLoadBlueprint.handler"
runtime_version = "syn-nodejs-puppeteer-3.6"
zip_file = data.archive_file.defacement_function_code.output_path
start_canary = true

schedule {
expression = "rate(1 minute)"
}

vpc_config {
subnet_ids = var.subnet_ids
security_group_ids = var.security_group_ids
}

run_config {
timeout_in_seconds = 30
environment_variables = {
URL = var.url
VARIANCE = var.variance
RESOLVE_IP = var.resolve_ip # Leave as empty string to use Route53
}
}
}

Note: Other resources to integrate with CloudWatch Alarms and AWS Systems Manager have been omitted for brevity.
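For illustration, an application team's instantiation of this module might look like the snippet below; the module source path and the variable values are hypothetical placeholders, not our actual configuration:

module "defacement_monitor_myapp" {
  source = "./modules/defacement-canary" # hypothetical path to the module described above

  name                    = "myapp-defacement"
  url                     = "https://myapp.example.edu.sg"
  variance                = "5" # variance threshold in percent (placeholder)
  resolve_ip              = ""  # empty string: resolve via Route53
  artefact_s3_bucket_name = "myapp-canary-artefacts"
  execution_role_arn      = aws_iam_role.canary.arn
  subnet_ids              = var.private_subnet_ids
  security_group_ids      = [aws_security_group.canary.id]
  response_plan_arn       = var.response_plan_arn
}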

Unfortunately, we can only set the boundaries to ignore via the AWS CloudWatch Synthetics Canaries console.

Amazon CloudWatch Alarms

To ensure timely notification of any incidents, we set up CloudWatch Alarms to monitor for the “Failed” metric published by AWS CloudWatch Synthetics. These alarms trigger an incident in AWS Systems Manager Incident Manager via an alarm action, using a specified response plan. The following is the Terraform code block for creating the CloudWatch Alarm in our Terraform module:

resource "aws_cloudwatch_metric_alarm" "defacement" {
alarm_name = var.name
comparison_operator = "GreaterThanOrEqualToThreshold"
evaluation_periods = "2"
datapoints_to_alarm = "2"
metric_name = "Failed"
namespace = "CloudWatchSynthetics"
period = "60"
statistic = "Maximum"
threshold = "1"
treat_missing_data = "notBreaching"
alarm_description = "This metric monitors defacement for ${var.url}"
alarm_actions = [var.response_plan_arn]

dimensions = {
"CanaryName" = var.name
}
}

Note: The “treat_missing_data” argument is set to “notBreaching” to prevent missing data points from triggering the incident response plan.

CloudWatch Alarms act as a bridge between the canaries and AWS Systems Manager Incident Manager, allowing for timely notification and response to any issues detected by the canaries.

AWS Systems Manager Incident Manager

AWS Systems Manager (SSM) Incident Manager is a service within SSM that helps users detect and respond to operational issues. It provides a range of tools, including response and escalation plans, runbook automation, and incident tracking. In this case, we used SSM Incident Manager primarily to notify our application teams of incidents.

Setting up

To set up SSM Incident Manager, we first created contacts for each application team member and selected phone calls as the notification method. Adding a contact for phone calls required the contact to receive a verification call from AWS with a 6-digit code to verify the phone number.
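For reference, this contact setup can also be scripted with the AWS CLI. The commands below are an illustrative sketch only; the contact alias, region, account ID, and phone number are placeholders:

# Create a personal contact (the engagement plan can be filled in once channels exist)
aws ssm-contacts create-contact \
  --alias "myapp-oncall-primary" \
  --display-name "MyApp Primary On-call" \
  --type PERSONAL \
  --plan '{"Stages": []}'

# Attach a phone channel; AWS then calls the number with a 6-digit verification code
aws ssm-contacts create-contact-channel \
  --contact-id "arn:aws:ssm-contacts:ap-southeast-1:111122223333:contact/myapp-oncall-primary" \
  --name "mobile" \
  --type VOICE \
  --delivery-address '{"SimpleAddress": "+6581234567"}'

# Complete the verification with the code received over the phone
aws ssm-contacts activate-contact-channel \
  --contact-channel-id "arn:aws:ssm-contacts:ap-southeast-1:111122223333:contact-channel/CHANNEL_ID" \
  --activation-code "123456"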

We then defined escalation plans to dictate the order in which the contacts are called. We kept it simple by calling one contact first, and if there is no response, the second backup contact would be called.

Finally, we created response plans (which are referenced by the CloudWatch Alarms from the previous section) to bring everything together. A response plan allows us to specify one or more escalation plans, as well as chat channels for additional notification. In our case, on top of the phone calls, Slack notifications are sent to each team's respective Slack channel via AWS Chatbot.

We only used a few of the features offered by SSM Incident Manager. It also has runbooks for automating tasks during an incident, such as restarting an EC2 instance, and a comprehensive incident tracking feature that allows you to add notes and track details for post-mortem analysis.

Unfortunately, these features cannot currently be set up using Terraform, as the AWS provider for Terraform does not yet support AWS Incident Manager (as of writing).

Importing Contact Details

One issue we encountered with receiving phone calls from AWS SSM Incident Manager is that the calls may be mistaken for scam calls, as the numbers originate from the US and are different each time. To address this, we raised it with the SSM Incident Manager team at AWS, and they created a virtual card format (.vcf) file that can be imported onto the team's mobile devices, allowing the calls to be easily identified as legitimate (if still agonising for whoever receives them).

Putting it to the test

A script has been created to enable the application teams to turn CloudWatch Synthetics canaries on and off as needed. This allows them to disable monitoring during planned maintenance on their applications and then re-enable it once the maintenance is complete. The script utilises the AWS CLI to directly alter the state of the canaries, which is possible because the application teams already have limited access to make AWS API calls.
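A minimal sketch of such a toggle script is shown below. The wrapper itself (script name and argument handling) is illustrative, but the two AWS CLI calls, stop-canary and start-canary, are the ones it relies on:

#!/bin/bash
# Usage: ./toggle-canary.sh <canary-name> <on|off>
CANARY_NAME="$1"
ACTION="$2"

if [ "$ACTION" = "off" ]; then
  # Pause monitoring before planned maintenance
  aws synthetics stop-canary --name "$CANARY_NAME"
elif [ "$ACTION" = "on" ]; then
  # Resume monitoring once maintenance is complete
  aws synthetics start-canary --name "$CANARY_NAME"
else
  echo "Usage: $0 <canary-name> <on|off>" >&2
  exit 1
fi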

So far, we have been fortunate not to encounter any actual instances of downtime or defacement. However, a few false positives have put the effectiveness of our monitoring system to the test. These occurred when maintenance was performed on an application without disabling the monitoring, and when an application's homepage was updated without updating the baseline screenshot. In both cases, the monitoring system was triggered, the team was promptly notified, and the alerts were confirmed to be false alarms upon investigation.

Summary

By integrating AWS CloudWatch Synthetics, CloudWatch Alarms, and SSM Incident Manager, we were able to achieve the goals of our monitoring system. CloudWatch Synthetics made it easy to set up the necessary monitoring, CloudWatch Alarms evaluated the canary results minute by minute and activated the SSM Incident Manager response plan on failures, and SSM Incident Manager gave us the ability to place phone calls to on-duty personnel and send Slack notifications to the entire team. Additionally, the application teams can turn the monitoring on and off independently using a custom script that makes AWS API calls. As a result, we successfully implemented an incident monitoring system that meets all of our requirements.

In conclusion, we have found that the incident monitoring and response tools offered by AWS have matured enough to give us the confidence to transition from our existing commercial solutions to these native AWS services. This streamlines our incident monitoring and response processes and removes the need to procure and manage multiple services. We also appreciate the proactive approach of the SSM Incident Manager team in addressing user feedback.

P.S. If you like what we are doing here in ESTL, do check out our careers page!

