
Anatomy of a Cloud Bill Spike: Three Root Causes We See Repeatedly


When an engineering team notices that their AWS bill jumped 40% month-over-month without a corresponding traffic increase, the first assumption is usually a misconfigured NAT Gateway. NAT Gateway data processing charges are notoriously opaque, and they do show up as unexpected line items. But in KernelRun's cost anomaly detection data across 50+ customer accounts, NAT Gateway surprises represent only about 18% of spike events. The remaining 82% come from a longer tail of causes, and three categories in that tail are both less frequently discussed and harder to catch with standard budget alerts.

Understanding these three categories matters because each requires a different remediation approach and a different detection method. A static budget threshold catches all three eventually, after the damage is done. A multi-variate baseline model with per-service anomaly detection catches them within hours.

Root Cause 1: Spot Fleet Interruption Cascades

Spot instances are priced at 60-90% below On-Demand rates, and workloads that can tolerate interruption should be using them. The problem occurs when a spot fleet is interrupted and the Auto Scaling group's fallback configuration switches to On-Demand instances without a corresponding notification or capacity review.

The typical pattern: a spot fleet covering 60% of a batch processing cluster gets interrupted during a capacity crunch in a specific availability zone. The Auto Scaling group successfully replaces the interrupted instances with On-Demand equivalents, as designed. The workload continues without any user-visible impact. The engineering team sees no alert because the infrastructure behaved correctly — from an availability standpoint. What they miss is that their On-Demand instance count has tripled and the spot fleet, once interrupted, may not recover to spot capacity for days.

We have seen this pattern generate $18,000 in unexpected monthly charges for a team running a large data processing cluster. The On-Demand fallback ran for 11 days before anyone noticed, because the budget alert threshold was set at the monthly level and the overage only became visible 12 days into the month when Cost Explorer caught up.

The detection method that works: monitor On-Demand instance count per Auto Scaling group in real time, not per billing cycle. A 24-hour period where the On-Demand percentage for a historically spot-heavy group exceeds a threshold (we default to 30% deviation from the 30-day baseline) triggers an immediate alert. This catches the cascade within hours rather than weeks.
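That check can be sketched in a few lines, assuming instance counts are already being collected per Auto Scaling group (the function name, inputs, and sample numbers below are hypothetical, not a production implementation):

```python
# Illustrative sketch: flag an Auto Scaling group whose current On-Demand
# share deviates more than `threshold` from its 30-day baseline share.
def on_demand_deviation_alert(on_demand: int, spot: int,
                              baseline_ratio: float,
                              threshold: float = 0.30) -> bool:
    total = on_demand + spot
    if total == 0:
        return False  # empty group: nothing to alert on
    current_ratio = on_demand / total
    return (current_ratio - baseline_ratio) > threshold

# A group that normally runs ~10% On-Demand is suddenly at 75% after a
# spot interruption cascade:
print(on_demand_deviation_alert(on_demand=30, spot=10, baseline_ratio=0.10))
# A group sitting at its baseline share stays quiet:
print(on_demand_deviation_alert(on_demand=4, spot=36, baseline_ratio=0.10))
```

In practice the counts would come from periodic DescribeAutoScalingGroups polling plus instance lifecycle metadata; the point is that the comparison runs continuously, not once per billing cycle.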

Root Cause 2: EBS Snapshot Accumulation from Automated Backups

EBS snapshots are incremental after the first full snapshot, but they are not free: snapshot storage is billed at $0.05 per GB-month. For a database with 1TB of storage running a daily backup policy with 30-day retention, snapshot costs come to roughly $50/month for the full snapshot plus a smaller charge for the incremental deltas — predictable and budgeted. The problem occurs when retention policies are not enforced, instances are terminated without snapshot cleanup, or automated backup tools create additional snapshots outside the primary retention policy.

In practice, EBS snapshot proliferation happens because multiple tools often manage snapshots for the same instance. AWS Backup runs its retention policy. The operations team has a manual snapshot script from three years ago that predates AWS Backup adoption. A third-party DR tool also creates snapshots. Nobody owns the coordination between these three systems, and the snapshots accumulate.

One customer's account had 847 snapshots whose source EC2 instances had been terminated. No retention policy covered terminated-instance snapshots because the policies were instance-attached, and the instances were gone. The orphaned snapshots were accumulating charges at $0.05/GB-month with no visibility in Cost Explorer, because snapshot costs appear as a single "Amazon Elastic Block Store" line item, not per-snapshot or per-source-instance.

The detection and remediation approach is straightforward but requires dedicated analysis: enumerate all EBS snapshots, cross-reference source instance IDs against currently running instances, and flag snapshots whose source instance no longer exists. For most accounts that have been running for more than two years, the first pass of this analysis identifies between $200 and $1,200/month in orphaned snapshot charges.
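The cross-reference step can be sketched with hard-coded sample records standing in for the DescribeSnapshots and DescribeInstances responses (all IDs and sizes here are invented for illustration):

```python
# Flag snapshots whose source instance no longer exists, then estimate
# the monthly charge at $0.05 per GB-month of snapshot storage.
def find_orphaned_snapshots(snapshots, live_instance_ids):
    return [s for s in snapshots if s["source_instance"] not in live_instance_ids]

def monthly_cost(snapshots, rate_per_gb_month=0.05):
    return sum(s["size_gb"] for s in snapshots) * rate_per_gb_month

snapshots = [
    {"id": "snap-01", "source_instance": "i-aaa", "size_gb": 500},
    {"id": "snap-02", "source_instance": "i-bbb", "size_gb": 200},
    {"id": "snap-03", "source_instance": "i-aaa", "size_gb": 500},
]
# Only i-bbb is still running; both snapshots of i-aaa are orphaned.
orphans = find_orphaned_snapshots(snapshots, live_instance_ids={"i-bbb"})
print([s["id"] for s in orphans], f"${monthly_cost(orphans):.2f}/month")
```

A real pass would paginate through the EC2 API with owner-scoped filters and would also check AMI-backed snapshots before deleting anything; treat the flagged list as a review queue, not a deletion queue.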

Root Cause 3: Data Transfer Charges from Architecture Changes

AWS data transfer pricing is structured to charge for traffic that crosses availability zones, regions, and the public internet. Most engineering teams understand the high-level structure but miss the specific patterns that create large bills. The spike trigger is almost always an architecture change that inadvertently introduces cross-AZ traffic where none existed before.

The most common variant: a team migrates an application from EC2 to ECS Fargate tasks. Fargate tasks in a new service get assigned to availability zones based on capacity, not based on the AZ placement of the services they communicate with. An application that previously ran co-located on EC2 instances within the same AZ now has Fargate tasks in us-east-1a making API calls to an RDS instance in us-east-1b. Every request crosses an AZ boundary at $0.01/GB in each direction.

For a high-throughput service making 10M API calls per day averaging 150KB of response payload, cross-AZ data transfer costs approximately $900/month that did not exist before the migration (the $0.01/GB charge applies on each side of the AZ boundary, so the effective rate is $0.02 per GB moved). The application works perfectly. No alerts fire. The cost shows up as an uptick in "Data Transfer" charges without any obvious link to the Fargate migration.
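The arithmetic is easy to sanity-check with a back-of-the-envelope helper (the call volume and payload size are illustrative figures, not measurements):

```python
# Cross-AZ transfer cost estimate. AWS bills $0.01/GB on each side of the
# AZ boundary, so the effective rate is $0.02 per GB moved.
def cross_az_monthly_cost(calls_per_day, payload_kb, days=30,
                          rate_each_direction=0.01):
    gb_per_day = calls_per_day * payload_kb / 1_000_000  # KB -> GB
    return gb_per_day * days * rate_each_direction * 2

# 10M calls/day at a 150KB average response payload:
print(f"${cross_az_monthly_cost(10_000_000, 150):,.0f}/month")
```

At this volume the service moves about 1.5TB/day across the boundary, which is why a penny-per-gigabyte rate compounds into a visible line item.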

As discussed in our article on tag inference and Cost Explorer gaps, the challenge is that data transfer charges in Cost Explorer are attributed to the source service but not to the architectural decision that generated them. Correlating a billing spike with a specific deployment requires cross-referencing deployment timestamps with billing data — which most teams do not have set up as a standard practice.

Why Standard Budget Alerts Miss These Patterns

AWS Budgets supports threshold-based alerts: notify when spend exceeds X% of the monthly budget. This is a necessary baseline but it has two structural limitations for the three root causes above.

First, monthly budget alerts operate on a lag. If a spot fleet cascade happens on day 3 of the month, the budget alert will not fire until the accumulated overage reaches the threshold — which for typical threshold settings happens around day 20. By that point, three weeks of On-Demand overspend has already occurred.

Second, budget alerts operate on total spend, not on spend anomalies relative to a per-service baseline. A team whose AWS bill is growing at 15% month-over-month due to legitimate business growth will set their budget threshold at 20% above current spend. An anomalous spike of 12% will not trigger the alert because it is within the budget threshold — even though it represents $8,000 of unexpected spend from an identifiable root cause.
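To make the gap concrete, assume an illustrative $66,000/month bill (a figure consistent with the $8,000 spike above, but not taken from any specific account):

```python
# Why a within-threshold anomaly never trips a budget alert.
baseline = 66_000            # assumed current monthly spend
budget = baseline * 1.20     # threshold set 20% above current spend
spike = baseline * 0.12      # anomalous 12% jump from one root cause

print(baseline + spike <= budget)        # the alert condition never fires
print(f"${spike:,.0f} of unexpected spend goes unnoticed")
```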

Cost anomaly detection with per-service baselines solves both problems. KernelRun's anomaly engine builds a multi-variate baseline per service using 90 days of history and flags deviations within 4 hours. The first flag is a warning (3-sigma deviation); the second is an alert (5-sigma deviation requiring immediate investigation). This structure catches all three root causes above within the same business day they occur.
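The two-tier structure can be illustrated with a plain z-score check against a per-service spend history. KernelRun's actual baseline is multi-variate, so this sketch shows only the sigma-threshold logic, with invented daily figures:

```python
import statistics

# Classify today's spend for one service against its historical baseline:
# 3 sigma above the mean is a warning, 5 sigma is an alert.
def classify_spend(today, history):
    mean = statistics.fmean(history)
    sigma = statistics.stdev(history)
    z = (today - mean) / sigma
    if z >= 5:
        return "alert"
    if z >= 3:
        return "warning"
    return "normal"

history = [100, 102, 98, 101, 99, 103, 97, 100]  # daily spend in dollars
print(classify_spend(108, history))  # modest deviation above baseline
print(classify_spend(140, history))  # far outside normal variation
```

Seasonality and growth trends break a naive z-score in practice, which is why a production baseline needs more than a single mean and standard deviation; the tiering idea, however, carries over unchanged.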

Building the Detection Infrastructure

For teams that want to build this detection capability without a dedicated platform, the minimum viable approach is: enable AWS Cost Anomaly Detection (free), configure monitors for EC2, RDS, and data transfer separately (not a single monitor for all services), and set alert thresholds based on dollar impact rather than percentage — a 50% increase in data transfer costs is more actionable than a 0.5% increase in total spend.

AWS Cost Anomaly Detection uses a machine learning baseline model and covers the basic version of this detection need. Its limitations are that it does not correlate anomalies with infrastructure changes, does not track per-Auto Scaling group metrics, and does not identify orphaned resources as a root cause category. It is a starting point, not a complete solution.

For teams managing more than $50,000/month in AWS spend, the cost of missing these anomalies for even a few days is large enough to justify dedicated anomaly detection tooling. The three root causes described in this article account for approximately 42% of the spike events we observe — and all three are preventable with the right detection infrastructure in place.

Catch cost anomalies within 4 hours

KernelRun's anomaly detection builds a multi-variate baseline per service and flags deviations the same day they occur. Connect your first AWS account in 4 minutes.

Request a Demo