How to Build a 90-Day Utilization Baseline from CloudWatch

The 90-day utilization baseline is the foundation of reliable EC2 right-sizing. Without it, right-sizing decisions are made on incomplete data — a 14-day snapshot that may not capture the workload's full cycle, or a 30-day average that smooths over the weekly patterns that determine actual peak demand. Building the baseline correctly is the difference between right-sizing recommendations that engineering teams trust and implement, and recommendations they decline because the numbers do not match their intuition about the workload.

This article walks through the technical approach to collecting, storing, and analyzing 90 days of CloudWatch metrics for right-sizing purposes: which metrics to collect, what sampling interval to use, how to handle day-of-week segmentation, and how to identify and exclude anomalous periods that would distort the baseline.

Metric Selection: What to Collect and Why

For a complete right-sizing baseline, collect five metric categories:

1. CPU utilization: CPUUtilization from EC2 at 1-minute resolution via Detailed Monitoring. The primary right-sizing signal for compute-bound workloads.

2. Memory utilization: mem_used_percent from the CloudWatch Agent. Requires agent installation and configuration; critical for memory-bound workloads (JVM applications, in-memory databases, analytics engines).

3. Network throughput: NetworkIn and NetworkOut from EC2 at 1-minute resolution. Required to detect instances where network bandwidth, not CPU, is the binding constraint.

4. Disk I/O: DiskReadBytes and DiskWriteBytes from EC2, or VolumeReadBytes and VolumeWriteBytes from EBS, at 1-minute resolution. Required for storage-bound workloads.

5. Instance metadata: instance type, launch time, tags, and placement (AZ and region). Required for correlating utilization data with cost calculations.

Most right-sizing analyses collect only CPUUtilization because it is the only metric available without agent installation or additional configuration. The resulting recommendations have a systematic error for any workload where CPU is not the binding constraint. For a typical production EC2 fleet, 25-35% of instances are better characterized by memory or network constraints than by CPU, and those instances will receive incorrect right-sizing recommendations from a CPU-only analysis.
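The five categories can be written down as a collection spec that later pipeline stages iterate over. The sketch below uses the CloudWatch namespace and metric names cited above; the dict layout itself is an illustrative assumption, not a fixed schema.

```python
# Collection spec for the five metric categories. Namespace and metric
# names are the CloudWatch ones described above; the structure of this
# spec is an assumption for illustration.
METRIC_SPEC = {
    "cpu":     {"namespace": "AWS/EC2", "metric": "CPUUtilization",   "stat": "Average"},
    "memory":  {"namespace": "CWAgent", "metric": "mem_used_percent", "stat": "Average"},
    "net_in":  {"namespace": "AWS/EC2", "metric": "NetworkIn",        "stat": "Sum"},
    "net_out": {"namespace": "AWS/EC2", "metric": "NetworkOut",       "stat": "Sum"},
    "disk_r":  {"namespace": "AWS/EC2", "metric": "DiskReadBytes",    "stat": "Sum"},
    "disk_w":  {"namespace": "AWS/EC2", "metric": "DiskWriteBytes",   "stat": "Sum"},
}
```

Instance metadata (the fifth category) comes from the EC2 API rather than CloudWatch, so it is collected by a separate describe-instances pass rather than through this spec.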

Storage Architecture for 90 Days of 1-Minute Metrics

The data volume for a 90-day, 1-minute resolution baseline for a single EC2 instance is approximately: 90 days × 24 hours × 60 minutes = 129,600 data points per metric, per instance. For 5 metric categories across 100 instances, that is 64.8 million data points. At approximately 20 bytes per data point in compact format, the total is around 1.3 GB for 100 instances.
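The arithmetic above is worth keeping as executable code so it can be re-run against your own fleet size and metric count:

```python
# Back-of-envelope data volume for the 90-day, 1-minute baseline.
points_per_metric = 90 * 24 * 60                           # 129,600 points per metric, per instance
instances, categories = 100, 5
total_points = points_per_metric * instances * categories  # 64.8 million data points
total_gb = total_points * 20 / 1e9                         # ~20 bytes per point -> ~1.3 GB
```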

CloudWatch retains 1-minute resolution data for 15 days and 5-minute resolution data for 63 days before aggregating to hourly. If you need 90 days of 1-minute data, you must pull and store it yourself continuously, before each window ages out. For right-sizing purposes, 5-minute resolution over 90 days is an acceptable compromise: it reduces data volume by 5x and extends the backfill window from 15 days to 63, with the caveat that sub-5-minute spikes are not captured.

S3 with Parquet format provides the most cost-efficient storage for this data. Parquet compression reduces the 1.3 GB estimate above to approximately 150-200 MB for 100 instances over 90 days. Query performance with Athena or a local DuckDB instance is acceptable for batch analysis workflows that run nightly or weekly rather than real-time.
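Partitioning the Parquet files by instance and day (Hive-style key=value path segments) lets Athena or DuckDB prune partitions instead of scanning the full dataset. The key layout below is a hypothetical example; the prefix and partition columns are assumptions, not a required convention.

```python
from datetime import date

# Hypothetical Hive-style S3 key layout so query engines can prune
# partitions by instance and day. Prefix and column names are
# illustrative assumptions.
def parquet_key(instance_id: str, day: date, prefix: str = "metrics") -> str:
    return (f"{prefix}/instance_id={instance_id}/"
            f"date={day.isoformat()}/data.parquet")
```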

The GetMetricStatistics API vs. GetMetricData

CloudWatch offers two APIs for retrieving historical metric data: GetMetricStatistics (older) and GetMetricData (current). For collecting 90-day baselines, use GetMetricData. It supports batch retrieval of multiple metrics in a single API call, returns data in time-sorted order, and supports higher throughput via the MetricDataQuery structure. It is billed per metric retrieved rather than per call, so batching many metrics into one request reduces call volume without changing the metric-retrieval cost.

GetMetricStatistics retrieves one metric per call and has a lower throughput ceiling. For a fleet of 100 instances with 5 metrics each, the throughput difference between the two APIs is significant for the initial data collection pass. Subsequent incremental collection (fetching the last 24 hours of data nightly) has lower throughput requirements and works with either API.

Rate limits for GetMetricData are 50 requests per second and 180,000 data points per second. For collecting 90 days of 5-minute resolution data for 100 instances, the initial collection run stays within these limits when batched into groups of 20 instances per API call with a 100ms sleep between calls.
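A minimal sketch of that batched collection loop is below. The query builder constructs the MetricDataQuery list that get_metric_data expects; the collect function assumes a boto3 CloudWatch client created elsewhere and is passed in as a parameter. Function names, batch size, and the 100ms sleep mirror the text; the rest is an illustrative assumption, not a production collector (it omits pagination via NextToken, retries, and error handling).

```python
import time

def build_queries(instance_ids, metric="CPUUtilization", period=300):
    """Build one MetricDataQuery entry per instance for GetMetricData."""
    queries = []
    for i, instance_id in enumerate(instance_ids):
        queries.append({
            "Id": f"q{i}",  # must be unique within the request and start with a letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": metric,
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": period,  # 300 s = 5-minute resolution
                "Stat": "Average",
            },
            "ReturnData": True,
        })
    return queries

def collect(client, instance_ids, start, end, batch_size=20):
    """Fetch metrics in batches of `batch_size` instances per call,
    sleeping 100 ms between calls to stay inside the rate limits.
    `client` is a boto3 CloudWatch client created by the caller."""
    results = []
    for i in range(0, len(instance_ids), batch_size):
        batch = instance_ids[i:i + batch_size]
        resp = client.get_metric_data(
            MetricDataQueries=build_queries(batch),
            StartTime=start, EndTime=end)
        results.extend(resp["MetricDataResults"])
        time.sleep(0.1)
    return results
```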

Day-of-Week Segmentation: The Implementation

The correct right-sizing baseline is not a single 90-day average — it is a per-day-of-week p95 calculated independently for each day type. The implementation requires timestamped metric data (which CloudWatch provides), a local time conversion based on the instance's region and the team's business timezone, and a grouping step that separates the 90 days of data into seven day-of-week buckets.

From each bucket, calculate the p95 of the metric values. The right-sizing baseline for the instance is the maximum p95 across all seven day-of-week buckets, plus the configured headroom percentage. For a production web service that peaks on Mondays at 67% CPU but averages 31% across the week, the Monday p95 of 67% is the correct baseline — not the weekly average. Sizing for the 31% average would result in CPU throttling on every Monday morning.
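The bucketing-and-p95 step above can be sketched in a few lines. This assumes timezone-aware UTC timestamps from CloudWatch and uses a nearest-rank p95; the function names and default headroom are illustrative assumptions.

```python
import math
from zoneinfo import ZoneInfo
from datetime import datetime

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def dow_baseline(samples, tz="UTC", headroom=0.30):
    """samples: iterable of (utc_timestamp, value) pairs.
    Converts timestamps to the team's business timezone, buckets by
    day of week, and returns the max per-day-of-week p95 scaled by
    (1 + headroom)."""
    zone = ZoneInfo(tz)
    buckets = {d: [] for d in range(7)}  # 0 = Monday
    for ts, value in samples:
        buckets[ts.astimezone(zone).weekday()].append(value)
    peaks = [p95(vals) for vals in buckets.values() if vals]
    return max(peaks) * (1 + headroom)
```

For the example in the text, a workload with a Monday p95 of 67% and 31% elsewhere yields a baseline of 67% plus headroom, regardless of how low the weekly average is.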

Weekend versus weekday segmentation is important for identifying non-production environments (as described in our article on non-production scheduling) and also for identifying workloads where the right-sizing calculation should use a lower baseline on weekends. For a production environment that serves a global audience with heavier weekend traffic, the Saturday and Sunday p95 values may actually be higher than weekday values — and the correct baseline reflects that.

Identifying and Excluding Anomalous Periods

A 90-day utilization window may include periods of anomalously high utilization that are not representative of normal operating load: a major product launch, a DDoS incident, a runaway process, or a one-time data migration. Including these periods in the baseline inflates the p95 and produces right-sizing recommendations that are sized for the anomalous event rather than for normal operation.

Identifying anomalous periods requires detecting statistical outliers in the 90-day time series. A practical approach: calculate the rolling 7-day median for each metric, then flag any period where the observed value exceeds that median by more than 3.5 standard deviations (or, more robustly, 3.5 median absolute deviations) for 30 or more consecutive minutes. Periods meeting this criterion are candidate anomalies for manual review.
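A simplified version of that scan, operating on an ordered list of 5-minute samples, might look like the following. A 7-day window at 5-minute resolution is 2,016 samples and 30 minutes is 6 consecutive samples; the thresholds mirror the text, while the exact windowing and function names are illustrative assumptions.

```python
import statistics

WINDOW = 7 * 24 * 12  # 7 days of 5-minute samples
RUN = 6               # 30 minutes of consecutive samples

def flag_anomalies(values, window=WINDOW, run=RUN, k=3.5):
    """Return (start, end) index ranges where values exceed the rolling
    median by more than k standard deviations for >= `run` samples.
    These are candidates for manual review, not automatic exclusion."""
    flags, ranges, start = [], [], None
    for i, v in enumerate(values):
        hist = values[max(0, i - window):i] or [v]
        med = statistics.median(hist)
        sd = statistics.pstdev(hist) if len(hist) > 1 else 0.0
        flags.append(sd > 0 and v > med + k * sd)
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        elif not f and start is not None:
            if i - start >= run:
                ranges.append((start, i))
            start = None
    if start is not None and len(flags) - start >= run:
        ranges.append((start, len(flags)))
    return ranges
```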

Not all anomalies should be excluded. A product launch that drove a permanent step-change in traffic should be included in the baseline because the post-launch traffic level is the new normal. A DDoS incident that lasted 4 hours and then returned to baseline should be excluded because it is not a recurring demand pattern the instance needs to be sized for. The distinction requires human judgment — which is why the anomaly detection step flags periods for review rather than automatically excluding them.

Headroom Configuration: Not a Single Number

The headroom percentage added to the p95 baseline to determine the right-sizing target is not a single organization-wide number. It should be configurable per workload category because the cost of under-sizing varies significantly by workload type.

Recommended headroom tiers:

- 30% for stateless production services with auto-scaling. A 30% headroom is appropriate because auto-scaling can cover demand spikes above the headroom threshold.
- 50% for stateful production services without auto-scaling. Higher headroom because manual intervention is required to respond to demand spikes above the baseline.
- 15% for non-production services. These environments tolerate performance degradation; aggressive sizing is appropriate.
- 60% for databases and persistent stores. Database performance degradation is user-visible and potentially data-affecting; conservative sizing is warranted.
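The tiers reduce to a small lookup, with the sizing target being the p95 baseline scaled by (1 + headroom). The category keys below are illustrative assumptions:

```python
# Per-category headroom percentages from the tiers above; the key
# names are illustrative assumptions.
HEADROOM = {
    "stateless_autoscaled": 0.30,
    "stateful_no_autoscale": 0.50,
    "non_production": 0.15,
    "database": 0.60,
}

def sizing_target(p95_util, category):
    """Headroom-adjusted right-sizing target for a p95 baseline."""
    return p95_util * (1 + HEADROOM[category])
```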

Putting It Together: The Baseline Calculation

With 90 days of per-instance utilization data at 5-minute resolution, the complete baseline calculation produces the following outputs per instance: per-day-of-week p95 for CPU, memory, network, and disk; the maximum p95 across all day-of-week buckets for each metric; the headroom-adjusted target for each metric; the smallest EC2 instance type that satisfies all four headroom-adjusted targets simultaneously; and the projected monthly savings from downsizing to the recommended instance type at current On-Demand pricing.
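The final selection step, picking the smallest type that satisfies all four headroom-adjusted targets, can be sketched as a scan over a size-ordered catalog. The mini-catalog below is hypothetical example data (capacities and monthly costs are not real pricing), and the dimension names are assumptions:

```python
# Hypothetical catalog ordered smallest-first. Capacities and monthly
# On-Demand costs are illustrative example values, not real pricing.
CATALOG = [
    ("m5.large",   {"vcpu": 2, "mem_gib": 8,  "net_gbps": 10, "disk_mbps": 500},  70),
    ("m5.xlarge",  {"vcpu": 4, "mem_gib": 16, "net_gbps": 10, "disk_mbps": 1000}, 140),
    ("m5.2xlarge", {"vcpu": 8, "mem_gib": 32, "net_gbps": 10, "disk_mbps": 2000}, 280),
]

def recommend(targets, current_type):
    """Smallest catalog entry whose capacity meets every headroom-adjusted
    target, plus the projected monthly savings vs. the current type."""
    costs = {name: cost for name, _, cost in CATALOG}
    for name, caps, cost in CATALOG:  # ordered smallest first
        if all(caps[dim] >= need for dim, need in targets.items()):
            return name, costs[current_type] - cost
    return None  # nothing in the catalog satisfies all targets
```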

That final output — smallest instance type satisfying all four dimensions plus savings projection — is the right-sizing proposal. When that proposal is accompanied by the underlying utilization evidence (the 90-day charts, the day-of-week breakdown, the p95 values for each dimension), engineering teams accept it at a substantially higher rate than proposals presented without evidence. The data transparency is what converts a recommendation from an instruction to an argument — and arguments are what engineers respond to.

Get right-sizing proposals backed by 90-day utilization data

KernelRun ingests 90 days of CloudWatch metrics, applies day-of-week segmentation, and presents each proposal with the full utilization evidence attached. Connect your first account in 4 minutes.

Request a Demo