AWS Unpacked #12: Monitoring and Auditing

Categories: AWS

Tools That Keep Your AWS Systems Healthy and Compliant

TL;DR
AWS provides a powerful set of tools for monitoring and auditing your cloud environment. CloudWatch is your go-to for metrics, logs, alarms, and real-time visibility. EventBridge allows event-driven automation and routing across services. CloudTrail records all API activity for security and auditing. AWS Config tracks resource configurations and compliance status over time. Each tool has a distinct role, and together they give you full operational and security visibility into your AWS environment.

In any well-architected AWS environment, monitoring and auditing aren't just nice to have; they're critical. They help you respond to incidents, maintain visibility, meet compliance requirements, and ensure the health and security of your systems. In this post, we'll break down the essential AWS services for monitoring and auditing, when to use them, and how they work together.

CloudWatch: Your Operations Control Room

CloudWatch Metrics: Taking the Pulse of Your Environment

CloudWatch Metrics tracks the health of your systems the way a fitness tracker monitors your steps, heart rate, or sleep quality. For example, if you’re running an EC2 instance (a virtual server), CloudWatch automatically measures how hard the CPU is working, how much network traffic it’s handling, and how busy its disks are. Memory usage is not in that default set; it needs the agent described below.

You can also send custom metrics—like the number of users currently logged into your app or the duration of a checkout process. These numbers are gathered over time and displayed on graphs, making trends easier to spot. If a spike or drop looks odd, you can dig deeper.
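To make the idea of a custom metric a little more concrete, here is a minimal sketch that builds a request shaped like CloudWatch's PutMetricData API as a plain Python dictionary. The namespace `MyApp/Checkout`, the metric name, and the `Environment` dimension are invented for illustration, and the code only constructs the payload; it does not call AWS.

```python
from datetime import datetime, timezone


def build_put_metric_data(namespace, metric_name, value, unit="Count", dimensions=None):
    """Build keyword arguments shaped like a CloudWatch PutMetricData request."""
    datum = {
        "MetricName": metric_name,
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }
    if dimensions:
        # Dimensions are name/value pairs that identify which thing was measured.
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return {"Namespace": namespace, "MetricData": [datum]}


# Hypothetical example: one checkout took 842 ms in the prod environment.
request = build_put_metric_data(
    "MyApp/Checkout",
    "CheckoutDurationMs",
    842,
    unit="Milliseconds",
    dimensions={"Environment": "prod"},
)
print(request["Namespace"], request["MetricData"][0]["MetricName"])
```

With boto3 installed and credentials configured (both assumptions here), a dictionary like this would typically be passed as `boto3.client("cloudwatch").put_metric_data(**request)`.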

When you launch an EC2 instance, CloudWatch automatically collects a basic set of metrics—these are included by default and require no extra setup. Think of them as the “vital signs” of your instance: CPU utilization, disk I/O, and network traffic. But if you want to monitor things like memory usage, disk space, or the number of running processes, you’ll need to install the CloudWatch Agent. These are considered custom metrics because they require additional configuration and are not exposed by the system unless you explicitly send them to CloudWatch. In short: basic performance comes standard, but deeper insights need a bit of extra setup.

If you need to get your CloudWatch metrics out of AWS and into another system—like a third-party monitoring tool, data lake, or custom analytics platform—CloudWatch Metric Streams is the feature for you. It pushes metric data in near real-time to destinations like Kinesis Data Firehose, which can then send it to S3, Redshift, Splunk, or others.

CloudWatch Logs: Capturing Conversations

Every application leaves a trail of digital breadcrumbs - log files. Think of these like security camera footage or chat transcripts between different systems. CloudWatch Logs gathers these files from your servers or AWS services and stores them centrally so you can search, analyze, and retain them for troubleshooting or audit purposes.

The Live Tail feature is like watching the logs scroll in real-time, great for developers trying to fix bugs or see how users are interacting with their system in the moment.

CloudWatch Agent vs Logs Agent: Which One’s the Messenger?

By default, no logs from your EC2 instance will go to CloudWatch. To get logs and metrics from an EC2 instance into CloudWatch, you need an agent installed:

  • CloudWatch Unified Agent is the modern, all-in-one solution. It sends both logs and system-level metrics. Think of it like a walkie-talkie with a health monitor built in.
  • CloudWatch Logs Agent is the older tool - it only sends logs, no metrics. It’s like sending a written report rather than real-time updates.

CloudWatch Alarms: Your Virtual Smoke Detector

Metrics alone aren’t enough—you need to know when something goes wrong. CloudWatch Alarms let you define thresholds, and when they’re crossed, they alert you or take action.

For example, you might say: “If CPU usage goes above 80% for 5 minutes, send me an SMS or trigger an auto-scaling action.” These alarms are proactive, not reactive—like a smoke detector that calls the fire department when it detects smoke.

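You can also set up Composite Alarms, which monitor the states of several other alarms and fire only when a rule you define over those states is met (for example, two alarms both being in ALARM at the same time).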
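The CPU example and the composite alarm idea can be sketched as request payloads for the PutMetricAlarm and PutCompositeAlarm APIs. This is only an illustration under stated assumptions: the instance ID, the SNS topic ARN, and the alarm names are hypothetical, and nothing here talks to AWS.

```python
def cpu_alarm(instance_id, sns_topic_arn, threshold=80.0):
    """Keyword arguments shaped like a CloudWatch PutMetricAlarm request."""
    return {
        "AlarmName": f"cpu-high-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # length of one evaluation window, in seconds
        "EvaluationPeriods": 1,   # how many consecutive windows must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. an SNS topic that sends the SMS
    }


def composite_alarm(name, child_alarm_names, sns_topic_arn):
    """Keyword arguments shaped like PutCompositeAlarm; fires when ALL children are in ALARM."""
    rule = " AND ".join(f'ALARM("{child}")' for child in child_alarm_names)
    return {"AlarmName": name, "AlarmRule": rule, "AlarmActions": [sns_topic_arn]}


# Hypothetical identifiers, used only for illustration.
TOPIC = "arn:aws:sns:us-east-1:123456789012:ops-alerts"
cpu = cpu_alarm("i-0123456789abcdef0", TOPIC)
combined = composite_alarm("app-degraded", [cpu["AlarmName"], "latency-high"], TOPIC)
print(combined["AlarmRule"])
```

The composite rule means the on-call engineer is paged only when CPU and latency are both unhealthy, which tends to cut down on noisy alerts.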

Logs Insights: Forensic Toolkit for Your Logs

Let’s say you get a CloudWatch Alarm saying your application’s latency is spiking. Now what?

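CloudWatch Logs Insights lets you query your log data with a purpose-built query language that chains commands such as fields, filter, stats, and sort with pipes; it will feel familiar if you know SQL. It’s like opening your security camera footage and scanning for the exact 10 minutes when something went wrong.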

For example, you might search for:

  • Errors from a specific IP address
  • Slow response times between 2–3pm
  • A specific customer ID across log streams

It’s the difference between having a pile of receipts versus having a searchable spreadsheet of all your transactions.
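As a rough sketch of the "specific customer ID" and "between 2 and 3pm" searches, the snippet below assembles a Logs Insights query string and a request shaped like the StartQuery API. The log group name `/myapp/prod/api` and the customer ID are hypothetical, and the code only builds the request rather than running it.

```python
import re
from datetime import datetime, timezone


def customer_query(customer_id, limit=50):
    """Return a Logs Insights query that finds log lines mentioning one customer ID."""
    # Accept only simple IDs so the value can be embedded in the query text as-is.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", customer_id):
        raise ValueError("unexpected characters in customer_id")
    return "\n".join([
        "fields @timestamp, @logStream, @message",
        f'| filter @message like "{customer_id}"',
        "| sort @timestamp desc",
        f"| limit {int(limit)}",
    ])


def start_query_request(log_group, start, end, query):
    """Keyword arguments shaped like a CloudWatch Logs StartQuery request."""
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": query,
    }


# Hypothetical one-hour window: 14:00 to 15:00 UTC.
window_start = datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)
window_end = datetime(2024, 5, 1, 15, 0, tzinfo=timezone.utc)
request = start_query_request(
    "/myapp/prod/api", window_start, window_end, customer_query("cust-1042")
)
print(request["queryString"])
```

Queries run asynchronously: StartQuery returns a query ID, and the results are fetched afterwards with a separate call once the query completes.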

CloudWatch Container Insights: Monitoring the Whole Swarm

If you’re running containers on ECS, EKS, or self-managed Kubernetes, CloudWatch Container Insights is your centralized health dashboard. It tracks memory and CPU usage, the number of running containers, and more across your cluster. Imagine managing a fleet of taxis: Container Insights tells you how many are on the road, how much fuel each is using, and whether any are stuck in traffic.

On EKS and Kubernetes, it works by running the CloudWatch Agent and Fluent Bit as DaemonSets in your cluster; on ECS, you turn it on as a cluster or account setting. Once enabled, it streams performance metrics and logs to CloudWatch for each container, task, pod, and service. Container Insights is available for:

  • Amazon ECS (EC2 and Fargate)
  • Amazon EKS
  • Kubernetes on EC2

Use cases:

  • Spotting underperforming containers
  • Resource optimization
  • Debugging cluster issues

CloudWatch Lambda Insights: Peeking Inside the Black Box

Serverless apps can be mysterious: you deploy a Lambda function, but what happens under the hood? That’s where CloudWatch Lambda Insights comes in. It provides out-of-the-box visibility into:

  • Invocation counts and durations
  • Memory usage
  • Cold start frequency
  • Errors and throttles

Imagine it as putting a diagnostic chip inside each drone in a fleet. You don’t control how the drone flies, but now you can monitor how well it performs.

To enable it, you attach the CloudWatch Lambda Insights extension layer to your Lambda function and grant the necessary IAM permissions. It then sends enhanced metrics and structured logs to CloudWatch.

Use cases:

  • Identifying which functions are running hot or cold
  • Troubleshooting performance bottlenecks
  • Cost optimization

CloudWatch Contributor Insights: Finding the Loudest Voices

Contributor Insights helps you analyze patterns in high-volume logs by identifying the “top talkers.” These could be:

  • The top IPs making requests
  • The most common error codes
  • The busiest endpoints

It’s like reviewing crowd noise to figure out which person is shouting the most during a debate. Contributor Insights works on:

  • CloudWatch Logs
  • CloudTrail events
  • Amazon DynamoDB tables and indexes (through Contributor Insights for DynamoDB)
  • VPC Flow Logs

You define rules that specify what fields to analyze and group by. It then visualizes the top N contributors over time, making anomalies or abusive behaviors easier to spot.
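The "top N contributors" idea is easy to picture with a few lines of local Python. This is not the Contributor Insights API, just a sketch of the grouping and ranking a rule performs; the log records and field names are made up.

```python
from collections import Counter


def top_contributors(events, key, n=3):
    """Count how often each value of `key` appears and return the n most frequent."""
    counts = Counter(event[key] for event in events if key in event)
    return counts.most_common(n)


# Hypothetical parsed log records.
sample = [
    {"ip": "203.0.113.7", "status": 200},
    {"ip": "203.0.113.7", "status": 500},
    {"ip": "198.51.100.2", "status": 200},
    {"ip": "203.0.113.7", "status": 200},
    {"ip": "192.0.2.44", "status": 404},
    {"ip": "198.51.100.2", "status": 200},
]

print(top_contributors(sample, "ip", n=2))
# [('203.0.113.7', 3), ('198.51.100.2', 2)]
```

Switching the key from `ip` to `status` would rank error codes instead, which mirrors how a rule's grouping field determines what counts as a contributor.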

Use cases:

  • Detecting a single user flooding your API
  • Spotting top-performing customers or endpoints
  • Analyzing failure patterns in server logs

CloudWatch Application Insights: Smart Setup for App Monitoring

CloudWatch Application Insights is like a monitoring consultant - it scans your environment and sets up relevant dashboards, alarms, and logs for you. You tell it what type of application you’re running—like a .NET app on Windows or a Java app on Linux—and it auto-detects the resources and configures monitoring for them. It understands typical failure patterns and watches for them. Supported resource types include:

  • EC2-based applications (Java, .NET, SAP, SQL Server)
  • RDS, ALB, Lambda, and more

It’s ideal for people who want deep insights without manually configuring everything.

Use cases:

  • Fast-track observability setup
  • Automated anomaly detection
  • Simplified root cause analysis

EventBridge: The Dispatcher of the Cloud

Amazon EventBridge is like the front desk in a well-run hotel. It listens for all kinds of activity—like someone checking in, a package being delivered, or the fire alarm going off—and routes those messages to the right departments to handle them quickly and appropriately.

For example, if a new file is uploaded to an S3 bucket, EventBridge can:

  • Trigger a Lambda function to scan the file for viruses
  • Notify the compliance team via email or Slack
  • Update a database with metadata about the file

This makes your cloud environment more automated, reactive, and efficient. And it’s not limited to AWS services—SaaS platforms like Zendesk, Datadog, and PagerDuty can also send events to EventBridge. That’s like your hotel front desk getting notifications from Uber Eats, DHL, or a laundry service.
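Routing in EventBridge is driven by event patterns: a rule fires when an incoming event matches the pattern. The sketch below shows a pattern for the S3 upload example together with a deliberately simplified matcher, so you can see the matching logic locally. The bucket name is hypothetical, the matcher handles only exact-value matching (none of EventBridge's richer operators such as prefix or numeric filters), and for S3 the bucket also needs EventBridge notifications turned on.

```python
def matches(pattern, event):
    """Simplified EventBridge-style matching: every pattern key must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: descend into the matching part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # A list in the pattern means "any of these values".
            candidates = actual if isinstance(actual, list) else [actual]
            if not any(candidate in expected for candidate in candidates):
                return False
    return True


# Pattern for "an object was created in this bucket" (bucket name is made up).
PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["uploads-example"]}},
}

# Trimmed-down example event with the same overall shape.
EVENT = {
    "version": "0",
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "uploads-example"},
        "object": {"key": "report.pdf", "size": 1024},
    },
}

print(matches(PATTERN, EVENT))
```

Fields that the pattern does not mention, such as the object key, are ignored, so one rule can cover every upload to the bucket and fan out to several targets.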

Archiving and Replaying Events

Sometimes, you need to look back at what happened—whether for auditing, troubleshooting, or recovery. EventBridge allows you to archive events sent to your event bus, either all events or only those matching specific filters.

This is like keeping a digital ledger at the front desk of every visitor or incident. If something breaks, or a bug is discovered, you can replay archived events to reprocess them using updated logic—no need to simulate or guess what happened.
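For a feel of what archive and replay involve, here is a hedged sketch of the two request payloads, shaped like EventBridge's CreateArchive and StartReplay APIs. All ARNs, names, and dates are hypothetical, and the code builds dictionaries only.

```python
import json
from datetime import datetime, timezone

# Hypothetical ARNs for illustration.
BUS_ARN = "arn:aws:events:us-east-1:123456789012:event-bus/default"
ARCHIVE_ARN = "arn:aws:events:us-east-1:123456789012:archive/s3-uploads"


def archive_request(name, bus_arn, pattern=None, retention_days=30):
    """Keyword arguments shaped like CreateArchive; omit `pattern` to archive every event."""
    request = {
        "ArchiveName": name,
        "EventSourceArn": bus_arn,
        "RetentionDays": retention_days,
    }
    if pattern is not None:
        request["EventPattern"] = json.dumps(pattern)  # the API expects a JSON string
    return request


def replay_request(name, archive_arn, bus_arn, start, end):
    """Keyword arguments shaped like StartReplay for events archived between start and end."""
    return {
        "ReplayName": name,
        "EventSourceArn": archive_arn,
        "EventStartTime": start,
        "EventEndTime": end,
        "Destination": {"Arn": bus_arn},
    }


archive = archive_request("s3-uploads", BUS_ARN, pattern={"source": ["aws.s3"]})
replay = replay_request(
    "reprocess-after-bugfix",
    ARCHIVE_ARN,
    BUS_ARN,
    datetime(2024, 5, 1, tzinfo=timezone.utc),
    datetime(2024, 5, 2, tzinfo=timezone.utc),
)
print(archive["EventPattern"])
```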

Understanding the Shape of Your Events: Schema Discovery

Not all events look the same - some are simple, others are complex. EventBridge can analyze incoming events and infer their “schema”, which is basically the blueprint or structure of the data (like a form template: “Name”, “Timestamp”, “EventType”). This is useful when you’re integrating with unfamiliar services or building new systems, because you don’t need to guess what the event looks like.

EventBridge stores these blueprints in a Schema Registry, where schemas:

  • Can be automatically discovered and cataloged
  • Are versioned, so if the shape of an event changes, you can track what’s new or different
  • Can be used to generate code bindings (in Java, Python, TypeScript, etc.) so your applications know ahead of time what data to expect and how to handle it

This schema registry removes the guesswork from building integrations and makes your applications more reliable.

CloudTrail: Your AWS Security Camera System

Where CloudWatch tells you what is happening, CloudTrail tells you who did what and when. It records every API call made in your AWS account, whether that’s from the console, the CLI, or an SDK. Think of it as a CCTV logbook for your cloud environment. If someone deletes an EC2 instance, updates an IAM policy, or opens an S3 bucket to the public, CloudTrail has the receipts.

For each event, it tells you:

  • What action was taken
  • Which resource was affected
  • Who made the request
  • When and from where the request came
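Those four questions map onto fields in each CloudTrail record. The sketch below pulls them out of a trimmed-down, made-up record; the field names follow the CloudTrail record format, but the values (user, IP address, instance ID) are invented.

```python
import json

# Trimmed-down, hypothetical CloudTrail record.
SAMPLE_RECORD = json.loads("""
{
  "eventTime": "2024-05-01T14:07:31Z",
  "eventSource": "ec2.amazonaws.com",
  "eventName": "TerminateInstances",
  "awsRegion": "eu-west-1",
  "sourceIPAddress": "203.0.113.7",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "alice",
    "arn": "arn:aws:iam::123456789012:user/alice"
  },
  "requestParameters": {
    "instancesSet": {"items": [{"instanceId": "i-0123456789abcdef0"}]}
  }
}
""")


def summarize(record):
    """Reduce a CloudTrail record to who / what / when / where / target."""
    identity = record.get("userIdentity", {})
    who = identity.get("arn") or identity.get("userName") or identity.get("type", "unknown")
    return {
        "what": f'{record["eventSource"]}:{record["eventName"]}',
        "who": who,
        "when": record["eventTime"],
        "where": f'{record.get("sourceIPAddress", "unknown")} ({record["awsRegion"]})',
        # The affected resource usually sits in requestParameters; its shape varies by API.
        "target": record.get("requestParameters"),
    }


summary = summarize(SAMPLE_RECORD)
print(summary["who"], "->", summary["what"], "at", summary["when"])
```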

This is invaluable for audits, security investigations, and even debugging accidental changes. But not all events are created equal - CloudTrail gives you three types of visibility.

Management Events vs Data Events vs Insights Events

Let’s break it down:

Management Events

These cover the control plane - the “what happened” of your AWS infrastructure. Creating an EC2 instance, deleting an S3 bucket, changing IAM permissions - all of these are management actions. Think of this as logging whenever someone opens or locks a door in a building. It’s essential for tracking configuration changes.

  • Default: Yes, they’re captured by default
  • Use cases: Auditing access, compliance, tracing user activity
  • Cost: The first copy of management events delivered by a trail in each Region is free; additional copies are charged

Data Events

These cover the data plane - what users are doing with the data in AWS. Reading or writing objects in S3, querying a DynamoDB table, or invoking a Lambda function are all examples. This is like keeping a log of not just who entered the building, but what files they read or what rooms they visited.

  • Default: Not enabled by default (because of volume and cost)
  • Use cases: Detecting sensitive data access, understanding app behavior
  • Cost: Charged per 100,000 events

Insights Events

These are high-level, intelligent alerts that help you spot unusual activity - like a sudden burst of API calls or a spike in API errors compared with your account’s normal baseline. It’s like having a guard who says, “Hey, this door has never been opened this many times at midnight before. That’s odd.”

  • Default: Off by default; must be enabled explicitly
  • Use cases: Threat detection, anomaly alerts, early warnings
  • Cost: Charged per 100,000 events analyzed

Retention

By default, events are stored in CloudTrail Event History for 90 days, easily searchable in the AWS console, but limited to management events only. For longer-term storage or deeper analysis, you need to set up a CloudTrail trail that delivers events to Amazon S3. Once in S3, you can retain the data indefinitely (based on your own bucket lifecycle policies) and even analyze it using Amazon Athena.

  • Amazon S3: Long-term, cost-effective storage
  • Athena: Lets you run SQL-like queries on your logs without moving them anywhere, like searching through archived CCTV footage without pulling it from tape
  • Glue: You can also catalog the logs with AWS Glue to make Athena queries easier to write
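As an illustration of the Athena option, the sketch below builds a SQL query over CloudTrail logs and a request shaped like Athena's StartQueryExecution API. The table name `cloudtrail_logs`, the database name, and the results bucket are assumptions, and the column names follow the table layout commonly used for CloudTrail in Athena, so adjust them to your own table definition.

```python
import re


def events_by_name_sql(event_name, since_iso, table="cloudtrail_logs", limit=50):
    """Build a query for recent occurrences of one API call, newest first."""
    # Validate the inputs because they are embedded directly in the SQL text.
    if not re.fullmatch(r"[A-Za-z0-9]+", event_name):
        raise ValueError("unexpected characters in event_name")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z", since_iso):
        raise ValueError("since_iso must look like 2024-05-01T00:00:00Z")
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError("unexpected characters in table name")
    return (
        "SELECT eventtime, useridentity.arn AS actor, eventname, sourceipaddress\n"
        f"FROM {table}\n"
        f"WHERE eventname = '{event_name}' AND eventtime >= '{since_iso}'\n"
        "ORDER BY eventtime DESC\n"
        f"LIMIT {int(limit)}"
    )


def athena_request(sql, database, output_location):
    """Keyword arguments shaped like an Athena StartQueryExecution request."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


sql = events_by_name_sql("DeleteBucket", "2024-05-01T00:00:00Z")
request = athena_request(sql, "audit", "s3://example-athena-results/cloudtrail/")
print(request["QueryString"])
```

Because the event time is stored as an ISO 8601 string in this layout, a plain string comparison is enough to filter by date.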

AWS Config: Your Cloud’s Compliance Inspector

Imagine you run a large fleet of vehicles and want to ensure that every one of them has its oil changed every 10,000 km and its tire pressure above 30 PSI. That’s what AWS Config does for your AWS resources.

It tracks the configuration of resources like EC2 instances, security groups, IAM roles, S3 buckets, and more—and tells you:

  • What the current configuration is
  • How that configuration has changed over time
  • Whether it complies with the rules you’ve defined

Further features of AWS Config include:

  • You can use managed rules (e.g., “S3 buckets must not allow public access”) or define custom rules with Lambda (e.g., “EC2 instances in production must use a specific AMI”).
  • It gives you a timeline view of resource history—like an audit trail for configuration, which is different from CloudTrail’s audit trail of activity.
  • You can even set up auto-remediation, like having a mechanic immediately fix any vehicle that’s out of compliance.
  • You can use EventBridge to trigger notifications when AWS resources are non-compliant.
  • You have the ability to send configuration changes and compliance state notifications to SNS.

Note that AWS Config rules are detective, not preventive: they flag non-compliant resources after a change, but they do not block the action from happening (there is no deny).
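To show what a custom rule with Lambda might look like, here is a minimal sketch of the evaluation logic for the "production instances must use a specific AMI" example. It assumes the usual shape of the event that Config sends to a custom rule function (JSON strings under `invokingEvent` and `ruleParameters`); the parameter name `approvedAmiId` and all IDs are hypothetical, and reporting the result back to Config is left out.

```python
import json


def evaluate(configuration_item, rule_parameters):
    """Decide compliance for one resource and return a Config-style evaluation."""
    if configuration_item["resourceType"] != "AWS::EC2::Instance":
        compliance = "NOT_APPLICABLE"
    elif configuration_item["configuration"].get("imageId") == rule_parameters["approvedAmiId"]:
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"
    return {
        "ComplianceResourceType": configuration_item["resourceType"],
        "ComplianceResourceId": configuration_item["resourceId"],
        "ComplianceType": compliance,
        "OrderingTimestamp": configuration_item["configurationItemCaptureTime"],
    }


def evaluate_event(event):
    """Decode the JSON strings in the Lambda event and evaluate the changed resource."""
    invoking_event = json.loads(event["invokingEvent"])
    rule_parameters = json.loads(event.get("ruleParameters") or "{}")
    return evaluate(invoking_event["configurationItem"], rule_parameters)


# Hypothetical event for an instance launched from an unapproved AMI.
sample_event = {
    "invokingEvent": json.dumps({
        "messageType": "ConfigurationItemChangeNotification",
        "configurationItem": {
            "resourceType": "AWS::EC2::Instance",
            "resourceId": "i-0123456789abcdef0",
            "configurationItemCaptureTime": "2024-05-01T14:07:31.000Z",
            "configuration": {"imageId": "ami-0ffffffffffffffff"},
        },
    }),
    "ruleParameters": json.dumps({"approvedAmiId": "ami-0abc1234def567890"}),
    "resultToken": "example-token",
}

result = evaluate_event(sample_event)
print(result["ComplianceResourceId"], result["ComplianceType"])
```

In a real function, the handler would send this evaluation back to Config through the PutEvaluations API together with the event's `resultToken`; the rule then shows the instance as non-compliant but, as noted above, does not stop it from running.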

Comparing CloudWatch, CloudTrail, and AWS Config

Here’s a breakdown to clarify their roles:

| Feature | CloudWatch | CloudTrail | AWS Config |
| --- | --- | --- | --- |
| Focus | Monitoring runtime metrics, logs and dashboards (performance monitoring such as CPU, network) | Logging API activity (who did what) | Tracking configuration over time, evaluate resources against compliance rules |
| Real-time? | Yes | Near real-time (w/ EventBridge) | Near real-time |
| Retains History? | Short-term by default (unless saved) | Yes (long-term in S3) | Yes (config snapshots + timeline) |
| Alerting? | Yes (via Alarms) | Yes (via EventBridge) | Yes (via compliance rules) |
| Automation? | Yes (e.g., auto scale) | Yes (w/ EventBridge) | Yes (auto-remediation w/ Lambda) |
| Main Use Case | Operational Monitoring | Security/Audit Logging | Compliance & Drift Detection |

Final Thoughts

Monitoring and auditing in AWS isn’t a one-size-fits-all task. It’s more like building an observability ecosystem:

  • Use CloudWatch for live visibility: track system health, detect issues early, and trigger automated responses.
  • Use Logs Insights to investigate incidents and understand your application’s behavior.
  • Use EventBridge to connect systems, respond to events in real time, and build automation.
  • Use CloudTrail to keep a full record of what actions were taken, by whom, and when.
  • Use AWS Config to ensure your infrastructure stays in a compliant, known-good state.

Together, they help you move from simply reacting to problems toward a posture of proactive control and visibility—a core goal of any well-architected cloud environment.

About the Author

Dawie Loots is a data scientist with a keen interest in using technology to solve real-world problems. He combines data science expertise with his background as a Chartered Accountant and executive leader to help organizations build scalable, value-driven solutions.
