The EC2 Monitoring Detective: When Default Metrics Aren't Enough

Situation: The Black Friday Breakdown 🛍️

Picture this: You're the lead developer at "SnapBuy," a thriving e-commerce platform. Everything seems perfect until Black Friday arrives. Despite CloudWatch showing normal CPU usage and network traffic, customers are abandoning their carts in droves, and checkout times are crawling.

The standard CloudWatch metrics painted a deceptive picture:

  • ✅ CPU Utilization: Normal

  • ✅ Network Traffic: Stable

  • ✅ Disk Usage: Within limits

  • ❌ Yet customers are experiencing significant delays

Task: Uncover the Hidden Metrics 🔍

Our mission became clear; we needed to:

  1. Identify what metrics we weren't seeing

  2. Set up comprehensive monitoring beyond CloudWatch defaults

  3. Implement automated, scalable monitoring solutions

  4. Prevent future Black Friday disasters

The investigation revealed several "unmonitorable" suspects:

  • Memory usage patterns

  • Database connection pool status

  • Application-level response times

  • Shopping cart processing metrics

  • System-level performance indicators

Action: Implementing the Solution 🛠️

Phase 1: Quick Investigation with Python

First, we deployed a rapid-response Python script to track the critical metrics:

import boto3
import psutil
import time

class CheckoutMonitor:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')

    def get_active_shopping_carts(self):
        # Placeholder: in production this queried SnapBuy's cart service
        return 0

    def get_db_connection_count(self):
        # Placeholder: in production this queried the app's connection pool
        return 0

    def track_checkout_metrics(self):
        # Get current checkout stats
        active_carts = self.get_active_shopping_carts()
        # Host-level memory, which the default CloudWatch metrics don't cover
        memory_usage = psutil.virtual_memory().percent
        db_connections = self.get_db_connection_count()

        # Report to CloudWatch (requires cloudwatch:PutMetricData permission)
        self.cloudwatch.put_metric_data(
            Namespace='SnapBuy/Checkout',
            MetricData=[
                {
                    'MetricName': 'ActiveCarts',
                    'Value': active_carts,
                    'Unit': 'Count'
                },
                {
                    'MetricName': 'MemoryUsage',
                    'Value': memory_usage,
                    'Unit': 'Percent'
                },
                {
                    'MetricName': 'DatabaseConnections',
                    'Value': db_connections,
                    'Unit': 'Count'
                }
            ]
        )

# Run every minute
if __name__ == '__main__':
    monitor = CheckoutMonitor()
    while True:
        monitor.track_checkout_metrics()
        time.sleep(60)
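
Once the monitor has pushed a few data points, it's worth confirming the custom namespace is actually populated. A quick check, assuming the AWS CLI is configured with credentials that can read CloudWatch:

aws cloudwatch list-metrics --namespace "SnapBuy/Checkout"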

Phase 2: Automated Long-term Solution

After identifying the critical metrics, we implemented a permanent, automated solution:

  1. Set Up IAM Role
# Create trust policy
cat > trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
    --role-name CustomMetricsRole \
    --assume-role-policy-document file://trust-policy.json
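
The trust policy only lets EC2 assume the role; to actually publish metrics, the role also needs cloudwatch:PutMetricData, and EC2 attaches roles through an instance profile (the CustomMetricsProfile used in step 3). A minimal sketch, with the policy and file names chosen here for illustration:

# Allow the role to publish custom metrics
cat > metrics-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }
  ]
}
EOF

aws iam put-role-policy \
    --role-name CustomMetricsRole \
    --policy-name PutMetricsPolicy \
    --policy-document file://metrics-policy.json

# Wrap the role in an instance profile so it can be attached to instances
aws iam create-instance-profile \
    --instance-profile-name CustomMetricsProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name CustomMetricsProfile \
    --role-name CustomMetricsRole
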
  2. Create Automated Monitoring Script
#!/bin/bash
# IMDSv2: request a session token before querying instance metadata
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/placement/region)

# Collect metrics
memory_used=$(free | grep Mem | awk '{print $3/$2 * 100.0}')
# Steal time in /proc/stat is a cumulative counter (jiffies since boot), not a percentage
cpu_steal=$(awk '/^cpu / {print $9}' /proc/stat)
load_avg=$(uptime | awk -F'load average:' '{print $2}' | awk -F',' '{print $1}')

# Push to CloudWatch ($INSTANCE_ID is handy if you later add per-instance
# dimensions; the Phase 3 alarm watches the dimensionless metrics below)
aws cloudwatch put-metric-data \
    --region "$REGION" \
    --namespace "SnapBuy/SystemMetrics" \
    --metric-data "[
        {
            \"MetricName\": \"MemoryUsedPercent\",
            \"Value\": $memory_used,
            \"Unit\": \"Percent\"
        },
        {
            \"MetricName\": \"CPUStealTime\",
            \"Value\": $cpu_steal,
            \"Unit\": \"Count\"
        },
        {
            \"MetricName\": \"SystemLoadAverage\",
            \"Value\": $load_avg,
            \"Unit\": \"Count\"
        }
    ]"
  3. Deploy Using User Data
aws ec2 run-instances \
    --image-id ami-12345678 \
    --instance-type t3.micro \
    --iam-instance-profile Name=CustomMetricsProfile \
    --user-data file://user-data.sh \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=SnapBuy-Monitor}]'
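
The user-data.sh file referenced above wasn't shown; here's a minimal sketch of it, assuming the monitoring script from step 2 is installed as /opt/custom-metrics/collect-metrics.sh (the path the troubleshooting guide below checks) and scheduled with cron:

#!/bin/bash
# user-data.sh -- runs once at first boot
mkdir -p /opt/custom-metrics

# Install the collection script from step 2 (full contents elided here)
cat > /opt/custom-metrics/collect-metrics.sh << 'SCRIPT'
# ... monitoring script from step 2 ...
SCRIPT
chmod +x /opt/custom-metrics/collect-metrics.sh

# Run the collector every minute
echo '* * * * * root /opt/custom-metrics/collect-metrics.sh' > /etc/cron.d/custom-metrics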

Phase 3: Set Up Proactive Alerts

Created CloudWatch alarms for early warning:

aws cloudwatch put-metric-alarm \
    --alarm-name HighMemoryUsage \
    --alarm-description "Memory usage too high" \
    --metric-name MemoryUsedPercent \
    --namespace SnapBuy/SystemMetrics \
    --threshold 90 \
    --comparison-operator GreaterThanThreshold \
    --evaluation-periods 2 \
    --period 300 \
    --statistic Average \
    --alarm-actions arn:aws:sns:region:account-id:notification-topic
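
The alarm-actions ARN above is a placeholder; if no topic exists yet, a minimal SNS setup might look like this (the topic name and email address are illustrative):

aws sns create-topic --name notification-topic
aws sns subscribe \
    --topic-arn arn:aws:sns:region:account-id:notification-topic \
    --protocol email \
    --notification-endpoint oncall@example.com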

Results: Case Closed! 🎉

Our investigation and implementation yielded significant improvements:

Immediate Findings

  1. Memory usage was hitting 95% during peak hours

  2. Database connection pool was maxed out

  3. CPU steal time was spiking during high traffic

Long-term Benefits

  1. Performance Improvements

    • 60% reduction in checkout time

    • 75% decrease in cart abandonment rate

    • 99.9% uptime during peak hours

  2. Operational Improvements

    • Early warning system for resource issues

    • Automated scaling based on custom metrics

    • Clear visibility into system health

Lessons Learned

  1. Monitor What Matters

    • Business metrics (cart abandonment, checkout times)

    • System metrics (memory, connections)

    • Customer experience metrics

  2. Automate Everything

    • Monitoring setup through user data

    • Alert responses

    • Scaling actions

  3. Be Proactive

    • Set up alerts before issues occur

    • Monitor trends over time

    • Regular system health checks

Troubleshooting Guide 🔧

If your metrics aren't showing up:

  1. Check IAM roles: aws iam get-role --role-name CustomMetricsRole

  2. Verify cron is running: systemctl status crond

  3. Test metric collection: /opt/custom-metrics/collect-metrics.sh

  4. Check CloudWatch logs for errors
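
If the metrics exist but the graphs look empty, pulling the last hour of datapoints directly can show whether collection or alarming is at fault (assumes GNU date, as on most Linux AMIs):

aws cloudwatch get-metric-statistics \
    --namespace "SnapBuy/SystemMetrics" \
    --metric-name MemoryUsedPercent \
    --statistics Average \
    --period 300 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)"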

Remember: In EC2 monitoring, the best defense is a good offense. Don't wait for customers to report issues – catch them before they become problems!

P.S. After implementing these solutions, our Cyber Monday was smoother than a well-oiled shopping cart 🛒✨
