What We Learned From Our First 50 Customer Accounts

Cloud cost findings across 50 accounts

After analyzing 50 engineering teams' AWS accounts through KernelRun's initial access period, some findings were expected and some were not. The expected finding: nearly all accounts had identifiable waste that exceeded 25% of total compute spend. The unexpected finding: the distribution of waste types was more consistent across accounts than we anticipated. Three categories — non-production scheduling, EBS snapshot accumulation, and RDS over-provisioning — accounted for 61% of the total savings identified across the dataset. The long tail of remaining savings came from 11 other categories, each individually representing under 6% of the total.

This concentration matters for prioritization. A FinOps program that focuses on these three categories first will capture the majority of available savings before tackling the more complex optimization work of EC2 right-sizing, commitment purchasing, and architecture changes. Here is what the data showed across each category.

Category 1: Non-Production Scheduling (28% of Total Savings)

Non-production scheduling was the single largest savings category, representing 28% of identified savings across the 50 accounts. This was higher than we expected going into the analysis. The median savings from non-production scheduling was $3,400/month per account, with a range from $400 (small teams with few environments) to $22,000 (accounts with large staging and QA fleets running multiple services).

The pattern is consistent: development and staging environments running 24 hours a day, 7 days a week, with near-zero utilization during nights and weekends. The environments are not running continuously because the team needs them to be always available — they are running continuously because nobody has implemented a schedule. When asked why the environments are not scheduled, the most common answer from engineering teams is "we tried once and it broke something, and we never had time to fix the approach."

The "it broke something" event is almost always a dependency ordering issue at startup (as described in our guide to non-production scheduling) that is straightforward to resolve but requires an upfront analysis of the environment dependency graph. Teams that implement scheduling correctly the first time, with proper startup sequencing and Slack-based overrides, sustain the schedules without disruption.
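The two pieces that make scheduling survive its first contact with reality — a clear definition of the "up" window and dependency-ordered startup — are both small. Here is a minimal sketch in Python; the working-hours window and the service dependency graph are illustrative assumptions, not values from the dataset:

```python
from datetime import datetime
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical schedule: up weekdays 07:00-18:59 local time.
UP_DAYS = range(0, 5)    # Monday=0 .. Friday=4
UP_HOURS = range(7, 19)

def should_be_running(now: datetime) -> bool:
    """True if the non-production environment is inside its scheduled window."""
    return now.weekday() in UP_DAYS and now.hour in UP_HOURS

def startup_order(deps: dict[str, set[str]]) -> list[str]:
    """Order services so every dependency starts before its dependents.

    deps maps each service to the set of services it depends on,
    e.g. {"api": {"db", "cache"}, "db": set(), "cache": set()}.
    """
    return list(TopologicalSorter(deps).static_order())
```

The topological sort is what prevents the classic "it broke something" failure: starting the API tier before its database comes back. A Slack-based override would simply force `should_be_running` to return True for a given environment until a timer expires.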

Category 2: EBS Snapshot Accumulation (19% of Total Savings)

EBS snapshot accumulation was the second-largest category and the one most teams were least aware of. Across the 50 accounts, the average orphaned snapshot cost was $1,800/month, with several accounts exceeding $8,000/month in snapshots attached to terminated instances or generated by deprecated backup policies that were never cleaned up.

The distribution by account age was striking: accounts running for more than 3 years had average orphaned snapshot costs of $3,200/month, while accounts running for less than 18 months had average costs under $400/month. Snapshot accumulation is a compounding problem — it grows continuously as instances are terminated and backup policies continue to run. The longer an account has been running without a snapshot audit, the larger the orphaned snapshot cost.

Identifying orphaned snapshots requires one specific query: enumerate all EBS snapshots in the account, cross-reference each snapshot's source VolumeId against currently active volumes and their instance attachments, and flag snapshots whose source volume and source instance no longer exist. This query is straightforward in the AWS CLI or SDK but is not exposed through a standard Cost Explorer view, which is why it goes undetected in most accounts.
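The cross-reference step itself is a simple set difference. A sketch in Python, using plain data structures shaped like EC2 `DescribeSnapshots` entries in place of live API responses (in practice the inputs would come from `boto3` calls):

```python
# Each snapshot record carries the ID of the volume it was taken from.
# A snapshot is a candidate orphan when that volume no longer exists.
def find_orphaned_snapshots(snapshots: list[dict], active_volume_ids: set[str]) -> list[str]:
    """Return the SnapshotIds whose source volume is gone.

    snapshots: dicts with at least 'SnapshotId' and 'VolumeId' keys,
               mirroring the EC2 DescribeSnapshots response shape.
    active_volume_ids: IDs of volumes currently present in the account.
    """
    return [
        s["SnapshotId"]
        for s in snapshots
        if s["VolumeId"] not in active_volume_ids
    ]
```

Feeding this function the full snapshot list and the set of active volume IDs yields the candidate list for the retention-reason review described below; nothing should be deleted directly from its output.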

One nuance: not all snapshots of terminated instances should be deleted. Some snapshots serve as disaster recovery archives and should be retained regardless of whether the source instance still exists. The cleanup workflow should include a "retention reason" classification step so that legitimate archive snapshots are not deleted alongside orphaned ones.

Category 3: RDS Over-Provisioning (14% of Total Savings)

RDS over-provisioning was the third-largest category. Unlike EC2 over-provisioning, which is widely recognized and has tooling support from AWS Compute Optimizer, RDS over-provisioning is less frequently analyzed and more likely to persist unreviewed. Across the 50 accounts, the average RDS right-sizing opportunity was $2,100/month, concentrated primarily in three configurations: Multi-AZ on non-critical databases, read replicas with zero query traffic, and storage allocation far exceeding actual usage.

The Multi-AZ pattern mirrors what we see in ElastiCache: it is applied as a default for production databases regardless of criticality. A nightly batch job database with an 8-hour maintenance window and a recovery time objective measured in hours does not require Multi-AZ automatic failover at $0.48/hour for the standby instance. Disabling Multi-AZ on non-critical RDS instances is the highest-ROI single RDS change in most accounts.
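At the $0.48/hour rate quoted above (which varies by instance class and region), the standby adds roughly $350 per month per database. A quick back-of-envelope in Python:

```python
# Monthly cost of one Multi-AZ standby instance at the illustrative
# $0.48/hour rate; 730 is the conventional hours-per-month figure.
HOURS_PER_MONTH = 730
standby_hourly_rate = 0.48

monthly_standby_cost = standby_hourly_rate * HOURS_PER_MONTH  # ~$350
```

Across a handful of non-critical databases, that is often several thousand dollars per month for a failover capability the workload's recovery objective never required.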

RDS read replicas with zero query traffic are a direct analog to the ElastiCache pattern described in our article on ElastiCache cost findings. They are provisioned during traffic spikes and remain after traffic returns to normal. The DatabaseConnections metric per replica identifies them: a read replica with fewer than 10 connections over 30 days is receiving no meaningful query traffic.
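The idle-replica check reduces to a threshold test over the metric window. A sketch in Python, with plain numbers standing in for a CloudWatch `GetMetricStatistics` query on `DatabaseConnections` (the 10-connection cutoff is the one used above):

```python
IDLE_CONNECTION_THRESHOLD = 10  # connections; cutoff from the analysis above

def is_idle_replica(daily_max_connections: list[float]) -> bool:
    """True if the replica never reached the threshold over the window.

    daily_max_connections: the daily Maximum statistic of the
    DatabaseConnections metric over the 30-day window.
    """
    return max(daily_max_connections, default=0) < IDLE_CONNECTION_THRESHOLD
```

Using the daily Maximum rather than Average avoids dismissing a replica that serves short, bursty read traffic once a day.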

Where the Remaining 39% Comes From

The remaining 39% of identified savings is distributed across 11 categories. The largest of these are EC2 right-sizing (9%), ElastiCache optimization (7%), and data transfer charge reduction (6%). The remaining 17% covers orphaned Elastic IPs, unused load balancers, oversized Lambda function memory allocations, underutilized NAT Gateways, unused Route 53 hosted zones, stale CloudFormation stacks, and Savings Plan optimization.

The finding most at odds with common assumptions in the FinOps community was that EC2 right-sizing represented only 9% of the total savings identified, even though it is often treated as the primary cost optimization lever. Our interpretation: EC2 right-sizing is harder to implement (requiring multi-dimensional analysis, headroom calibration, and engineering approval) and is therefore less likely to be complete even in accounts that have done some previous cost optimization work. The lower percentage may partly reflect that teams implementing prior cost programs prioritized the easier categories (non-production scheduling, snapshot cleanup) and left EC2 right-sizing partially complete.

What the Teams With the Smallest Gaps Did Differently

Across the 50 accounts, 8 had identified savings below 15% of their total compute spend. These accounts shared three practices that the higher-waste accounts lacked. First, they had an explicit cost review cadence: a monthly 1-hour meeting where FinOps or platform engineering reviewed the top 5 cost anomalies and assigned owners. Second, they had tagging coverage above 80% with explicit team-level attribution. Third, they had already implemented non-production scheduling and were running it consistently.

None of the 8 low-waste accounts used a best-in-class cost optimization platform — most were using a combination of native AWS tools (Cost Explorer, Compute Optimizer, Trusted Advisor) with manual review. The difference was organizational: they had assigned ownership, regular cadence, and the accumulated benefit of years of incremental cleanup. The accounts with the highest waste were not using worse tools. They had not institutionalized the review and cleanup process.

What This Means for Prioritization

For an engineering team starting a cost optimization program with no previous work, the data from these 50 accounts suggests a clear starting sequence: first, implement non-production scheduling for all identifiable dev/staging/QA environments (28% of available savings, low risk, implementable in 2-4 weeks); second, run the orphaned EBS snapshot audit and delete confirmed orphans (19% of available savings, zero risk, implementable in one day with proper review); third, review RDS configurations for unnecessary Multi-AZ and zero-traffic read replicas (14% of available savings, moderate risk, implementable in 1-2 weeks per account).

These three categories, executed sequentially, typically capture over 60% of the available savings in an account and build the organizational muscle for cost optimization work. The remaining categories — EC2 right-sizing, commitment optimization, architecture-level data transfer reduction — can follow as the team's capacity and tooling mature.

Find out what your account looks like in 20 minutes

KernelRun analyzes all three primary waste categories within 20 minutes of connecting your first account. Average first-cycle savings identified: 34% of compute spend. Connect in 4 minutes.

Request a Demo