Amazon CloudWatch: Deep Dive

Managed Services
Infrastructure & DevOps Modernization

Uncover how AWS CloudWatch works under the hood — from metrics and logs to dashboards, alarms, and insights — with best practices for monitoring, troubleshooting, and optimizing your AWS workloads.

This blog was originally written and published by Trek10, which is now part of Caylent.

This is the second in a series of posts about monitoring your production workloads in AWS. In the first post, we gave a high-level overview of cloud monitoring and broke it down into six types of metrics you should be monitoring. Here we’ll dive deeper into one of those areas, Amazon CloudWatch metrics, and give you a few tips for getting the most out of CloudWatch in production.

CloudWatch Metrics is a pretty well-known and straightforward AWS service. If you’re monitoring a production environment in AWS, it should be at the top of your list of services to dive into and get comfortable with. Particularly if you’re building and running apps that increasingly focus on the non-EC2 world (aka platform services or serverless), CloudWatch is the new Linux top: the most fundamental and basic insight into your running environment. For the uninitiated, we’ll start with a quick overview.

Amazon CloudWatch Metrics Overview

CloudWatch is actually composed of three (only loosely related) services: CloudWatch Metrics, CloudWatch Logs, and CloudWatch Events. We will focus only on Metrics here.

Here is a short summary of CloudWatch Metrics:

  • CloudWatch metrics are simply time series data points emitted from AWS services or put into AWS by the API.
  • With EC2, CloudWatch gives you metrics from “outside the VM”… i.e. the hypervisor level. With other services where no VM is exposed, CloudWatch data gives you your only insight into the service’s operation.
  • You can access metrics from the interactive console explorer, console dashboards, the API, or you can pull them into your own monitoring tool.
  • Just about every monitoring tool on the planet now supports importing CloudWatch metrics. If yours doesn’t, try our favorite, Datadog.
  • One-minute resolution data is stored for 15 days, 5-minute resolution for 63 days, and 1-hour resolution for 15 months.
  • Most services deliver metrics at one minute resolution but some are less frequent.
  • You can push custom metrics into CloudWatch and those can be stored with up to 1 second resolution.
  • You can trigger CloudWatch Alarms off of metrics or import them into your own tool for alerting.
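
To make the custom-metric bullet concrete, here is a minimal sketch that builds the arguments for a PutMetricData call with 1-second storage resolution. The namespace, metric name, and dimension are hypothetical placeholders, and the helper `build_custom_metric` is our own illustration rather than part of any AWS SDK. It only builds the request, so it runs without AWS credentials.

```python
from datetime import datetime, timezone


def build_custom_metric(namespace, name, value, unit="Count",
                        dimensions=None, high_resolution=False):
    """Build keyword arguments for CloudWatch's PutMetricData call."""
    datum = {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # StorageResolution 1 requests 1-second storage; 60 is standard resolution.
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


params = build_custom_metric(
    "MyApp/Checkout",                    # hypothetical namespace
    "OrdersPlaced",                      # hypothetical metric name
    3,
    dimensions={"Environment": "prod"},  # hypothetical dimension
    high_resolution=True,
)
print(params["MetricData"][0]["StorageResolution"])
```

With boto3 installed and credentials configured, the result could be sent with `boto3.client("cloudwatch").put_metric_data(**params)`.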

If you’re interested in diving deeper, ACloudGuru has a great set of lessons on CloudWatch metrics as part of its Certified SysOps Administrator Associate course.

Some Tips For Getting More out of Amazon CloudWatch

Enough with the basics. Let’s get into a few more interesting notes and tricks that come from Trek10’s experience with CloudWatch.

Metric Visibility Delays

We often get questions about this from people who are used to seeing their VM metrics in near real time. We find that CloudWatch metrics typically take about 2 minutes to show up in AWS (in the console and API)… in other words, the data point for 10:15 will be visible at roughly 10:17.
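
If you poll the API yourself, one way to account for this delay is to ask only for data points that should already be visible. The sketch below assumes the roughly 2-minute delay described above and a 60-second period; both are parameters rather than guarantees, and the helper name is our own.

```python
from datetime import datetime, timedelta, timezone


def latest_visible_timestamp(now, delay_minutes=2, period_seconds=60):
    """Return the timestamp of the newest data point expected to be visible."""
    # Step back by the assumed visibility delay, then align down to the period.
    cutoff = now - timedelta(minutes=delay_minutes)
    epoch = int(cutoff.timestamp())
    aligned = epoch - (epoch % period_seconds)
    return datetime.fromtimestamp(aligned, tz=timezone.utc)


now = datetime(2024, 1, 1, 10, 17, 30, tzinfo=timezone.utc)
print(latest_visible_timestamp(now).isoformat())  # 2024-01-01T10:15:00+00:00
```

Querying past this timestamp is harmless, but the most recent data points may be missing or incomplete until the delay has elapsed.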

If you are using an external monitoring tool to import your CloudWatch metrics, the polling required for that import adds further delay. We believe that having a tool that can aggregate all of your metrics is well worth this downside as long as the delay is minimal. With Trek10’s monitoring platform of choice, Datadog, the total delay from metric origination to availability in Datadog is about 10-12 minutes. Crucially (and we salute Datadog for developing this awesome feature), they can speed up your polling behind the scenes so that the total delay is only about 4-5 minutes (or just about 2 minutes longer than accessing the data natively in CloudWatch). We find this to be just fine for almost all use cases. Contact Datadog Support if you’d like this feature enabled. One key warning… this will increase your AWS CloudWatch costs. Keep reading…

Watch GetMetricData Costs

If you are using an external monitoring tool, watch out for the cost of GetMetricData API calls. This call costs $0.01 per 1,000 requests. There are some details about what you can get out of one request, but the bottom line is that your costs will increase multiplicatively with the number of AWS services you use, the metric dimensions within those services, and the frequency of polling. For example, a typical Lambda function emits four CloudWatch metrics: number of invocations, duration, errors, and throttles. If you have 50 Lambda functions in your account, your monitoring tool needs to make GetMetricData API calls for 50 x 4 = 200 metric/dimension combinations. This math applies to any dimension used by CloudWatch: Auto Scaling groups, S3 buckets, SNS topics, and on and on. It is worth a brief browse of the CloudWatch console to understand the metrics that can affect this cost.
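
The 50 x 4 arithmetic can be sketched as the list of queries a polling tool would have to build. This is an illustration under assumptions, not how any particular monitoring tool works: the function names are hypothetical, and the batch size of 500 reflects the per-request query limit documented for GetMetricData at the time of writing.

```python
from itertools import product

LAMBDA_METRICS = ["Invocations", "Duration", "Errors", "Throttles"]
MAX_QUERIES_PER_CALL = 500  # assumed GetMetricData per-request limit


def build_lambda_queries(function_names, period_seconds=60):
    """Build one MetricDataQuery per (function, metric) combination."""
    queries = []
    for i, (fn, metric) in enumerate(product(function_names, LAMBDA_METRICS)):
        queries.append({
            "Id": f"q{i}",  # ids must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric,
                    "Dimensions": [{"Name": "FunctionName", "Value": fn}],
                },
                "Period": period_seconds,
                # Duration is more useful as an average; the rest are counts.
                "Stat": "Average" if metric == "Duration" else "Sum",
            },
        })
    return queries


def batches(queries, size=MAX_QUERIES_PER_CALL):
    """Split the queries into chunks that fit in a single request."""
    return [queries[i:i + size] for i in range(0, len(queries), size)]


functions = [f"fn-{n}" for n in range(50)]  # hypothetical function names
queries = build_lambda_queries(functions)
print(len(queries), len(batches(queries)))  # 200 1
```

Each batch could be passed as `MetricDataQueries` to boto3's `get_metric_data`, along with a `StartTime` and `EndTime`.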

If you’re polling AWS every couple of minutes for hundreds or even thousands of metric/dimension combinations, you can see how this cost can quickly add up to hundreds of dollars per month.
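
A rough back-of-the-envelope estimate makes the scaling visible. The sketch below assumes one billed request per metric/dimension combination per poll, the $0.01 per 1,000 requests price quoted above, and a 30-day month; your monitoring tool's actual request pattern may differ.

```python
def monthly_polling_cost(metric_count, poll_interval_minutes,
                         price_per_1000_requests=0.01, days=30):
    """Estimate monthly API cost for polling every metric at a fixed interval."""
    polls_per_month = days * 24 * 60 / poll_interval_minutes
    requests = metric_count * polls_per_month
    return round(requests / 1000 * price_per_1000_requests, 2)


for count in (200, 1000, 5000):
    print(count, monthly_polling_cost(count, poll_interval_minutes=2))
```

Under these assumptions, 200 combinations polled every 2 minutes come to about $43 per month, 1,000 to about $216, and 5,000 to about $1,080.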

AWSWishList: Polling for CloudWatch metrics is remarkably inefficient: AWS really needs to create a better system for bulk export of metrics at high frequency.

Be Thorough

The key to a good CloudWatch monitoring plan is depth. If you monitor just a few obvious things like RDS CPU and Lambda errors you will likely miss out on some critical warning signs of production problems. Every AWS service has thorough documentation of the CloudWatch metrics available to it. To give you an idea, here is the list for IoT Core and another for AWS Step Functions. For every service you are working with, dive deep into this list and understand what is available and why it matters.

Some metrics are obvious candidates for alerting, like DynamoDB throttles: this is a critical production issue if it happens. But even for metrics you may not alert on, you can build incredibly insightful dashboards to analyze problems when they arise. For example, imagine you have a simple serverless REST API with API Gateway, Lambda, and DynamoDB. Your critical metric might be the rate of HTTP 5XX errors on API Gateway, but when this rate hits a concerning threshold you need to be able to quickly dig deeper. Your dashboard might contain API Gateway error rates and request volume as well as Lambda error rates, Lambda throttles, and a variety of DynamoDB error metrics such as ReadThrottleEvents, WriteThrottleEvents, and SystemErrors. Seeing all of these CloudWatch metrics on a single screen will let you quickly zero in on the source of the problem.
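
As a sketch of what such a dashboard could look like in code, the example below builds a CloudWatch dashboard body for the API Gateway, Lambda, and DynamoDB example. The resource names are hypothetical and the layout is only one possibility. SystemErrors is left out because it may need additional dimensions beyond the table name; check the DynamoDB metrics documentation before adding it.

```python
import json


def metric_widget(title, metrics, x, y, region, stat="Sum", period=60):
    """Build one metric widget on CloudWatch's 24-column dashboard grid."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,
            "stat": stat,
            "period": period,
            "region": region,
        },
    }


def build_api_dashboard(api_name, function_name, table_name, region="us-east-1"):
    """Return a dashboard body (JSON string) for a simple serverless REST API."""
    widgets = [
        metric_widget("API Gateway errors and volume", [
            ["AWS/ApiGateway", "5XXError", "ApiName", api_name],
            ["AWS/ApiGateway", "4XXError", "ApiName", api_name],
            ["AWS/ApiGateway", "Count", "ApiName", api_name],
        ], 0, 0, region),
        metric_widget("Lambda errors and throttles", [
            ["AWS/Lambda", "Errors", "FunctionName", function_name],
            ["AWS/Lambda", "Throttles", "FunctionName", function_name],
        ], 12, 0, region),
        metric_widget("DynamoDB throttle events", [
            ["AWS/DynamoDB", "ReadThrottleEvents", "TableName", table_name],
            ["AWS/DynamoDB", "WriteThrottleEvents", "TableName", table_name],
        ], 0, 6, region),
    ]
    return json.dumps({"widgets": widgets})


# Hypothetical resource names for illustration only.
body = build_api_dashboard("orders-api", "orders-handler", "orders")
print(len(json.loads(body)["widgets"]))  # 3
```

The resulting string could be passed as `DashboardBody` to boto3's `put_dashboard`, together with a `DashboardName` of your choosing.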

Trusted Advisor Metrics

One of our favorite hidden CloudWatch metrics is a relatively recent addition: Trusted Advisor metrics. Trusted Advisor is the AWS service, available with Business or Enterprise Support, that checks a wide variety of usage details across your AWS account to deliver insights into cost optimization, performance, security, and fault tolerance.

There are two groups of CloudWatch Trusted Advisor metrics. Green/red/yellow metrics simply count the number of checks, or resources checked, that fit each alert level. So you can easily set up an alarm, for example, if you have at least one red check. More interesting, though, is the second group: Service Limit Metrics. There are a wide variety of service limits across AWS, and hitting one of these limits in production is a surprisingly common cause of outages. These metrics report the percent of utilization against each service limit, giving you a simple one-stop shop for warning against these issues. Just set your warning threshold at, say, 75% for each ServiceLimit, Service, and Region combination and you’re all set.
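
A service-limit alarm along these lines could be sketched as follows. The helper builds the arguments for a PutMetricAlarm call against the `ServiceLimitUsage` metric in the `AWS/TrustedAdvisor` namespace. The dimension values and alarm name are hypothetical, and the threshold is passed in explicitly because you should confirm in the console whether the metric is reported as a fraction or as a 0-100 percentage before choosing a value.

```python
def build_service_limit_alarm(service, service_limit, region, threshold,
                              sns_topic_arn=None):
    """Build keyword arguments for a CloudWatch alarm on one service limit."""
    name = f"service-limit-{service}-{service_limit}-{region}".replace(" ", "-")
    params = {
        "AlarmName": name,
        "Namespace": "AWS/TrustedAdvisor",
        "MetricName": "ServiceLimitUsage",
        "Dimensions": [
            {"Name": "ServiceLimit", "Value": service_limit},
            {"Name": "Service", "Value": service},
            {"Name": "Region", "Value": region},
        ],
        "Statistic": "Maximum",
        "Period": 3600,
        "EvaluationPeriods": 1,
        # Assumption: the caller has checked the metric's scale (0-1 vs 0-100).
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


# Hypothetical dimension values; 75 assumes a 0-100 percentage scale.
alarm = build_service_limit_alarm("VPC", "VPCs", "us-east-1", threshold=75)
print(alarm["AlarmName"])  # service-limit-VPC-VPCs-us-east-1
```

The result could be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`, repeated for each limit you care about.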

Trek10 Team


Founded in 2013, Trek10 helped organizations migrate to and maximize the value of AWS by designing, building, and supporting cloud-native workloads with deep technical expertise. In 2025, Trek10 joined Caylent, forming one of the most comprehensive AWS-only partners in the ecosystem, delivering end-to-end services across strategy, migration and modernization, product innovation, and managed services.
