Situation: The Black Friday Breakdown
Picture this: You're the lead developer at "SnapBuy," a thriving e-commerce platform. Everything seems perfect until Black Friday arrives. Despite CloudWatch showing normal CPU usage and network traffic, customers are abandoning their carts in droves, and checkout times are crawling.
The standard CloudWatch metrics painted a deceptive picture:
- CPU utilization: normal
- Network traffic: stable
- Disk usage: within limits

Yet customers were experiencing significant delays.
Task: Uncover the Hidden Metrics
Our mission became clear. We needed to:
- Identify what metrics we weren't seeing
- Set up comprehensive monitoring beyond CloudWatch defaults
- Implement automated, scalable monitoring solutions
- Prevent future Black Friday disasters
The investigation revealed several "unmonitorable" suspects:
- Memory usage patterns
- Database connection pool status
- Application-level response times
- Shopping cart processing metrics
- System-level performance indicators
Action: Implementing the Solution
Phase 1: Quick Investigation with Python
First, we deployed a rapid response solution using Python to track critical metrics:
```python
import boto3
import psutil
import time


class CheckoutMonitor:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')

    def get_active_shopping_carts(self):
        # Placeholder: replace with the real query against the application or database
        return 0

    def get_db_connection_count(self):
        # Placeholder: replace with a query against the connection pool
        return 0

    def track_checkout_metrics(self):
        # Get current checkout stats
        active_carts = self.get_active_shopping_carts()
        memory_usage = psutil.virtual_memory().percent
        db_connections = self.get_db_connection_count()

        # Report to CloudWatch
        self.cloudwatch.put_metric_data(
            Namespace='SnapBuy/Checkout',
            MetricData=[
                {
                    'MetricName': 'ActiveCarts',
                    'Value': active_carts,
                    'Unit': 'Count'
                },
                {
                    'MetricName': 'MemoryUsage',
                    'Value': memory_usage,
                    'Unit': 'Percent'
                },
                {
                    'MetricName': 'DatabaseConnections',
                    'Value': db_connections,
                    'Unit': 'Count'
                }
            ]
        )


# Run every minute
monitor = CheckoutMonitor()
while True:
    monitor.track_checkout_metrics()
    time.sleep(60)
```
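Running the monitor only needed the two Python dependencies and credentials allowed to call cloudwatch:PutMetricData (the IAM role set up in Phase 2 covers this). A minimal way to launch it during the incident might look like the following; the file name is illustrative:

```bash
# Install the monitor's dependencies and run it in the background
# (checkout_monitor.py is an assumed file name for the class above)
pip3 install boto3 psutil
nohup python3 checkout_monitor.py > checkout-monitor.log 2>&1 &
```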
Phase 2: Automated Long-term Solution
After identifying the critical metrics, we implemented a permanent, automated solution:
- Set Up IAM Role
```bash
# Create trust policy
cat > trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
  --role-name CustomMetricsRole \
  --assume-role-policy-document file://trust-policy.json
```
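The role also needs permission to publish metrics, and EC2 attaches roles through an instance profile; those steps aren't shown above. A minimal sketch, assuming an inline policy named PutCustomMetrics and the CustomMetricsProfile profile referenced in the deployment step below:

```bash
# Allow the role to publish custom metrics (minimal inline policy; policy name is an assumption)
cat > metrics-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name CustomMetricsRole \
  --policy-name PutCustomMetrics \
  --policy-document file://metrics-policy.json

# EC2 attaches the role via an instance profile (name matches the run-instances call later)
aws iam create-instance-profile --instance-profile-name CustomMetricsProfile
aws iam add-role-to-instance-profile \
  --instance-profile-name CustomMetricsProfile \
  --role-name CustomMetricsRole
```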
- Create Automated Monitoring Script
```bash
#!/bin/bash
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)

# Collect metrics
memory_used=$(free | grep Mem | awk '{print $3/$2 * 100.0}')
cpu_steal=$(cat /proc/stat | grep '^cpu ' | awk '{print $9}')
load_avg=$(uptime | awk -F'load average:' '{ print $2 }' | awk -F',' '{ print $1 }')

# Push to CloudWatch
aws cloudwatch put-metric-data \
  --region $REGION \
  --namespace "SnapBuy/SystemMetrics" \
  --metric-data "[
    {
      \"MetricName\": \"MemoryUsedPercent\",
      \"Value\": $memory_used,
      \"Unit\": \"Percent\"
    },
    {
      \"MetricName\": \"CPUStealTime\",
      \"Value\": $cpu_steal,
      \"Unit\": \"Count\"
    },
    {
      \"MetricName\": \"SystemLoadAverage\",
      \"Value\": $load_avg,
      \"Unit\": \"Count\"
    }
  ]"
```
- Deploy Using User Data
```bash
aws ec2 run-instances \
  --image-id ami-12345678 \
  --instance-type t3.micro \
  --iam-instance-profile Name=CustomMetricsProfile \
  --user-data file://user-data.sh \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=SnapBuy-Monitor}]'
```
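The user-data.sh file referenced above isn't shown in the post; a minimal sketch, assuming an Amazon Linux 2 AMI and the cron-based setup from the previous step, could look like this:

```bash
#!/bin/bash
# Minimal user-data sketch (assumptions: Amazon Linux 2, cron-based scheduling)
yum install -y awscli cronie

mkdir -p /opt/custom-metrics
# Embed the collect-metrics.sh contents from Phase 2 here
cat > /opt/custom-metrics/collect-metrics.sh << 'SCRIPT'
# ... collect-metrics.sh from Phase 2 ...
SCRIPT
chmod +x /opt/custom-metrics/collect-metrics.sh

# Schedule the collector every minute and make sure cron is running
echo '* * * * * root /opt/custom-metrics/collect-metrics.sh' > /etc/cron.d/snapbuy-metrics
systemctl enable --now crond
```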
Phase 3: Set Up Proactive Alerts
Created CloudWatch alarms for early warning:
```bash
aws cloudwatch put-metric-alarm \
  --alarm-name HighMemoryUsage \
  --alarm-description "Memory usage too high" \
  --metric-name MemoryUsedPercent \
  --namespace SnapBuy/SystemMetrics \
  --threshold 90 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --period 300 \
  --statistic Average \
  --alarm-actions arn:aws:sns:region:account-id:notification-topic
```
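The alarm action points at an SNS topic placeholder; creating a topic and subscribing an address is straightforward (the topic name and email below are illustrative):

```bash
# Create the notification topic and capture its ARN
TOPIC_ARN=$(aws sns create-topic --name snapbuy-alerts --query 'TopicArn' --output text)

# Subscribe an on-call address (confirm the subscription from the email that arrives)
aws sns subscribe \
  --topic-arn "$TOPIC_ARN" \
  --protocol email \
  --notification-endpoint oncall@example.com

# Use $TOPIC_ARN as the --alarm-actions value in the put-metric-alarm call above
```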
Results: Case Closed!
Our investigation and implementation yielded significant improvements:
Immediate Findings
- Memory usage was hitting 95% during peak hours
- Database connection pool was maxed out
- CPU steal time was spiking during high traffic

Long-term Benefits

Performance Improvements
- 60% reduction in checkout time
- 75% decrease in cart abandonment rate
- 99.9% uptime during peak hours

Operational Improvements
- Early warning system for resource issues
- Automated scaling based on custom metrics
- Clear visibility into system health
Lessons Learned

Monitor What Matters
- Business metrics (cart abandonment, checkout times)
- System metrics (memory, connections)
- Customer experience metrics

Automate Everything
- Monitoring setup through user data
- Alert responses
- Scaling actions

Be Proactive
- Set up alerts before issues occur
- Monitor trends over time
- Regular system health checks
Troubleshooting Guide
If your metrics aren't showing up:
Check IAM roles:

```bash
aws iam get-role --role-name CustomMetricsRole
```

Verify cron is running:

```bash
systemctl status crond
```

Test metric collection:

```bash
/opt/custom-metrics/collect-metrics.sh
```

Check CloudWatch logs for errors.
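It also helps to confirm that datapoints are actually reaching CloudWatch; for example (the time range and GNU date usage below are assumptions):

```bash
# List the custom metrics that CloudWatch has received
aws cloudwatch list-metrics --namespace "SnapBuy/SystemMetrics"

# Spot-check the last hour of one metric
aws cloudwatch get-metric-statistics \
  --namespace "SnapBuy/SystemMetrics" \
  --metric-name MemoryUsedPercent \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```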
Remember: In EC2 monitoring, the best defense is a good offense. Don't wait for customers to report issues; catch them before they become problems!
P.S. After implementing these solutions, our Cyber Monday was smoother than a well-oiled shopping cart.