The EC2 Monitoring Detective: When Default Metrics Aren't Enough

Situation: The Black Friday Breakdown 🛍️

Picture this: You're the lead developer at "SnapBuy," a thriving e-commerce platform. Everything seems perfect until Black Friday arrives. Despite CloudWatch showing normal CPU usage and network traffic, customers are abandoning their carts in droves, and checkout times are crawling.

The standard CloudWatch metrics painted a deceptive picture:

  • ✅ CPU Utilization: Normal

  • ✅ Network Traffic: Stable

  • ✅ Disk Usage: Within limits

  • ❌ Yet customers are experiencing significant delays

Task: Uncover the Hidden Metrics 🔍

Our mission became clear; we needed to:

  1. Identify what metrics we weren't seeing

  2. Set up comprehensive monitoring beyond CloudWatch defaults

  3. Implement automated, scalable monitoring solutions

  4. Prevent future Black Friday disasters

The investigation revealed several "unmonitorable" suspects:

  • Memory usage patterns

  • Database connection pool status

  • Application-level response times

  • Shopping cart processing metrics

  • System-level performance indicators

Action: Implementing the Solution 🛠️

Phase 1: Quick Investigation with Python

First, we deployed a rapid-response Python script to track the critical metrics:

import boto3
import psutil
import time

class CheckoutMonitor:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')

    def get_active_shopping_carts(self):
        # Placeholder: in production this queried SnapBuy's cart service
        return 0

    def get_db_connection_count(self):
        # Placeholder: in production this queried the app's connection pool
        return 0

    def track_checkout_metrics(self):
        # Get current checkout stats
        active_carts = self.get_active_shopping_carts()
        # Host-level memory, which the default CloudWatch metrics don't cover
        memory_usage = psutil.virtual_memory().percent
        db_connections = self.get_db_connection_count()

        # Report to CloudWatch (requires cloudwatch:PutMetricData permission)
        self.cloudwatch.put_metric_data(
            Namespace='SnapBuy/Checkout',
            MetricData=[
                {
                    'MetricName': 'ActiveCarts',
                    'Value': active_carts,
                    'Unit': 'Count'
                },
                {
                    'MetricName': 'MemoryUsage',
                    'Value': memory_usage,
                    'Unit': 'Percent'
                },
                {
                    'MetricName': 'DatabaseConnections',
                    'Value': db_connections,
                    'Unit': 'Count'
                }
            ]
        )

# Run every minute
if __name__ == '__main__':
    monitor = CheckoutMonitor()
    while True:
        monitor.track_checkout_metrics()
        time.sleep(60)
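
Once the monitor has pushed a few data points, it's worth confirming the custom namespace is actually populated. A quick check, assuming the AWS CLI is configured with credentials that can read CloudWatch:

aws cloudwatch list-metrics --namespace "SnapBuy/Checkout"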

Phase 2: Automated Long-term Solution

After identifying the critical metrics, we implemented a permanent, automated solution:

  1. Set Up IAM Role
# Create trust policy
cat > trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
    --role-name CustomMetricsRole \
    --assume-role-policy-document file://trust-policy.json
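
The trust policy only lets EC2 assume the role; to actually publish metrics, the role also needs cloudwatch:PutMetricData, and EC2 attaches roles through an instance profile (the CustomMetricsProfile used in step 3). A minimal sketch, with the policy and file names chosen here for illustration:

# Allow the role to publish custom metrics
cat > metrics-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }
  ]
}
EOF

aws iam put-role-policy \
    --role-name CustomMetricsRole \
    --policy-name PutMetricsPolicy \
    --policy-document file://metrics-policy.json

# Wrap the role in an instance profile so it can be attached to instances
aws iam create-instance-profile \
    --instance-profile-name CustomMetricsProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name CustomMetricsProfile \
    --role-name CustomMetricsRole
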
  2. Create Automated Monitoring Script
#!/bin/bash
# IMDSv2: request a session token before querying instance metadata
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/placement/region)

# Collect metrics
memory_used=$(free | grep Mem | awk '{print $3/$2 * 100.0}')
# Steal time in /proc/stat is a cumulative counter (jiffies since boot), not a percentage
cpu_steal=$(awk '/^cpu / {print $9}' /proc/stat)
load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk -F',' '{print $1}')

# Push to CloudWatch ($INSTANCE_ID is handy if you later add per-instance
# dimensions; the Phase 3 alarm watches the dimensionless metrics below)
aws cloudwatch put-metric-data \
    --region "$REGION" \
    --namespace "SnapBuy/SystemMetrics" \
    --metric-data "[
        {
            \"MetricName\": \"MemoryUsedPercent\",
            \"Value\": $memory_used,
            \"Unit\": \"Percent\"
        },
        {
            \"MetricName\": \"CPUStealTime\",
            \"Value\": $cpu_steal,
            \"Unit\": \"Count\"
        },
        {
            \"MetricName\": \"SystemLoadAverage\",
            \"Value\": $load_avg,
            \"Unit\": \"Count\"
        }
    ]"
  3. Deploy Using User Data
aws ec2 run-instances \
    --image-id ami-12345678 \
    --instance-type t3.micro \
    --iam-instance-profile Name=CustomMetricsProfile \
    --user-data file://user-data.sh \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=SnapBuy-Monitor}]'
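
The user-data.sh file referenced above wasn't shown; here's a minimal sketch of it, assuming the monitoring script from step 2 is installed as /opt/custom-metrics/collect-metrics.sh (the path the troubleshooting guide below checks) and scheduled with cron:

#!/bin/bash
# user-data.sh -- runs once at first boot
mkdir -p /opt/custom-metrics

# Install the collection script from step 2 (full contents elided here)
cat > /opt/custom-metrics/collect-metrics.sh << 'SCRIPT'
# ... monitoring script from step 2 ...
SCRIPT
chmod +x /opt/custom-metrics/collect-metrics.sh

# Run the collector every minute
echo '* * * * * root /opt/custom-metrics/collect-metrics.sh' > /etc/cron.d/custom-metrics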

Phase 3: Set Up Proactive Alerts

Created CloudWatch alarms for early warning:

aws cloudwatch put-metric-alarm \
    --alarm-name HighMemoryUsage \
    --alarm-description "Memory usage too high" \
    --metric-name MemoryUsedPercent \
    --namespace SnapBuy/SystemMetrics \
    --threshold 90 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2 \
    --period 300 \
    --statistic Average \
    --alarm-actions arn:aws:sns:region:account-id:notification-topic
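
The alarm-actions ARN above is a placeholder; if no topic exists yet, a minimal SNS setup might look like this (the topic name and email address are illustrative):

aws sns create-topic --name notification-topic
aws sns subscribe \
    --topic-arn arn:aws:sns:region:account-id:notification-topic \
    --protocol email \
    --notification-endpoint oncall@example.com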

Results: Case Closed! 🎉

Our investigation and implementation yielded significant improvements:

Immediate Findings

  1. Memory usage was hitting 95% during peak hours

  2. Database connection pool was maxed out

  3. CPU steal time was spiking during high traffic

Long-term Benefits

  1. Performance Improvements

    • 60% reduction in checkout time

    • 75% decrease in cart abandonment rate

    • 99.9% uptime during peak hours

  2. Operational Improvements

    • Early warning system for resource issues

    • Automated scaling based on custom metrics

    • Clear visibility into system health

Lessons Learned

  1. Monitor What Matters

    • Business metrics (cart abandonment, checkout times)

    • System metrics (memory, connections)

    • Customer experience metrics

  2. Automate Everything

    • Monitoring setup through user data

    • Alert responses

    • Scaling actions

  3. Be Proactive

    • Set up alerts before issues occur

    • Monitor trends over time

    • Regular system health checks

Troubleshooting Guide 🔧

If your metrics aren't showing up:

  1. Check IAM roles: aws iam get-role --role-name CustomMetricsRole

  2. Verify cron is running: systemctl status crond

  3. Test metric collection: /opt/custom-metrics/collect-metrics.sh

  4. Check CloudWatch logs for errors
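
If the metrics exist but the graphs look empty, pulling the last hour of datapoints directly can show whether collection or alarming is at fault (assumes GNU date, as on most Linux AMIs):

aws cloudwatch get-metric-statistics \
    --namespace "SnapBuy/SystemMetrics" \
    --metric-name MemoryUsedPercent \
    --statistics Average \
    --period 300 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)"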

Remember: In EC2 monitoring, the best defense is a good offense. Don't wait for customers to report issues – catch them before they become problems!

P.S. After implementing these solutions, our Cyber Monday was smoother than a well-oiled shopping cart 🛒✨
