
Tag Inference: When Your Cost Explorer Data Has Gaps


AWS resource tagging is the foundation of cost attribution: if you cannot tag resources, you cannot report spend by team or product. Most FinOps programs start with a tagging initiative, spend several months improving tag coverage, reach 55-65% coverage, and then slow down. The last 35-45% of resources are the hard ones — created by infrastructure automation that does not propagate tags, created before the tagging policy was established, or created by services like Lambda that generate child resources without inheriting parent tags.

The 35-45% of untagged spend is not trivial. For a company spending $400,000/month on AWS, that gap represents $140,000-$180,000 in cost that cannot be attributed to any team or product. It shows up as "unallocated" in FinOps reports, which means it gets allocated as overhead rather than charged back to the teams that created it. This removes the accountability signal that makes cost optimization programs work.

Tag inference — attributing cost to teams without native tags — is not a perfect solution, but it is a practical one. This article describes the inference approaches that work, their accuracy levels, and where each approach breaks down.

Why Tag Coverage Stalls Below 70%

Tagging programs typically stall for three structural reasons. First, some AWS services do not support resource tagging at all, or support it inconsistently. Data transfer charges, Route 53 query charges, and some support charges cannot be tagged. These represent a floor of unattributable spend.

Second, ephemeral resources created by Auto Scaling, Lambda invocations, ECS task execution, and Fargate task launches often do not inherit tags from their parent resources. An EC2 Auto Scaling group properly tagged with "Team: payments" will launch instances that inherit that tag — but only if the Auto Scaling group is configured to propagate tags at launch. Many are not.
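A quick way to find this gap is to audit which Auto Scaling group tags are set to propagate. A minimal sketch, operating on a dict shaped like the output of boto3's `describe_auto_scaling_groups` (the group name and tags below are hypothetical sample data):

```python
# Flag Auto Scaling group tags that will NOT be copied to instances
# at launch, i.e. tags whose PropagateAtLaunch flag is false.
def non_propagating_tags(asg):
    """Return tag keys missing the PropagateAtLaunch flag."""
    return [t["Key"] for t in asg.get("Tags", []) if not t.get("PropagateAtLaunch")]

# Hypothetical sample mirroring the describe_auto_scaling_groups shape.
asg = {
    "AutoScalingGroupName": "payments-prod-asg",
    "Tags": [
        {"Key": "Team", "Value": "payments", "PropagateAtLaunch": False},
        {"Key": "Env", "Value": "prod", "PropagateAtLaunch": True},
    ],
}
print(non_propagating_tags(asg))  # instances launch without the Team tag
```

Running this across every group in an account gives a remediation list: each flagged key is a tag that exists on the group but never reaches the instances it launches.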

Third, the resources created earliest in an account's history are often the hardest to tag. Infrastructure provisioned before IaC adoption tends to be long-running, manually managed, and poorly tagged. These are often the largest instances — the ones that have been running for two years because nobody wants to touch them — and they represent a disproportionate share of untagged spend.

Inference Method 1: VPC and Subnet Association

The simplest inference method for EC2 instances and ECS services is VPC and subnet association. Most engineering organizations structure their VPCs by environment or team — "team-payments-vpc," "team-platform-staging-subnet," and so on. An untagged EC2 instance in the payments VPC can be attributed to the payments team with reasonable confidence.

Accuracy for this method is typically 70-80%. The error cases are resources in shared VPCs (common for shared services, databases, and monitoring infrastructure) and resources in VPCs with generic names that predate the team-based VPC structure.

VPC inference is quick to implement — it requires only a mapping table from VPC ID and subnet ID to team, which can usually be derived from existing naming conventions. The resulting attribution is less precise than explicit tagging but substantially better than "unallocated."
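The mapping table can be as simple as two dicts keyed by VPC and subnet ID. A minimal sketch with hypothetical IDs and team names, where subnet matches win over VPC matches because subnet names tend to be more specific:

```python
# Hypothetical mapping tables derived from naming conventions
# like "team-payments-vpc" and "team-platform-staging-subnet".
VPC_TEAM = {"vpc-0a1b2c3d": "payments", "vpc-0e4f5a6b": "platform"}
SUBNET_TEAM = {"subnet-0f00dcafe": "platform"}

def infer_from_network(vpc_id, subnet_id=None):
    """Return (team, confidence); subnets are more specific than VPCs."""
    if subnet_id in SUBNET_TEAM:
        return SUBNET_TEAM[subnet_id], 0.80
    if vpc_id in VPC_TEAM:
        return VPC_TEAM[vpc_id], 0.75
    return None, 0.0

print(infer_from_network("vpc-0a1b2c3d"))  # ('payments', 0.75)
```

The confidence values are illustrative; in practice they should be calibrated against a sample of resources that do carry explicit tags.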

Inference Method 2: IAM Role and Instance Profile Cross-Reference

EC2 instances are launched with IAM instance profiles that grant them access to specific AWS services and resources. Those instance profiles are often named with team or service conventions even when the instance itself is not tagged. An instance running with the "payments-service-prod-role" profile is almost certainly a payments service production instance.

IAM role inference works particularly well for instances launched by infrastructure automation (CloudFormation, Terraform, CDK) where the role name follows a consistent pattern derived from the same naming convention as the rest of the infrastructure. Accuracy for this method is 80-90% for infrastructure provisioned with modern IaC tooling.

The limitation is instance profiles with generic names ("ec2-instance-profile," "app-role") that predate the naming convention. For those, IAM role inference provides no attribution signal.
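Role-name inference reduces to pattern matching against the naming convention. A sketch assuming a hypothetical "<team>-<service>-<env>-role" convention, with an explicit deny-list for the generic names that carry no signal:

```python
import re

# Hypothetical convention: "<team>-<service>-<env>-role".
ROLE_RE = re.compile(r"^(?P<team>[a-z]+)-[a-z0-9-]+-(?:prod|staging|dev)-role$")
# Generic profiles that predate the convention provide no signal.
GENERIC = {"ec2-instance-profile", "app-role"}

def infer_from_role(role_name):
    """Return the team encoded in the role name, or None."""
    if role_name in GENERIC:
        return None
    m = ROLE_RE.match(role_name)
    return m.group("team") if m else None

print(infer_from_role("payments-service-prod-role"))  # payments
print(infer_from_role("app-role"))                    # None
```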

Inference Method 3: Security Group Association

Security groups often encode service-level information in their names and descriptions even when the instances using them are not tagged. A security group named "payments-service-alb-sg" attached to an EC2 instance provides a strong attribution signal. Security groups are also more consistently named than instances because they are typically created and managed by the platform engineering team rather than by individual developers.

Security group inference is most useful when combined with VPC inference. An untagged instance in an ambiguous VPC that belongs to the "payments-service-alb-sg" security group can be attributed to payments with higher confidence than VPC inference alone provides.
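The combination can be expressed as a small scoring function. A sketch assuming a hypothetical "<team>-<service>-...-sg" naming convention, where agreement between the two independent signals earns a higher confidence than either alone:

```python
def team_from_sg(sg_name):
    """Extract the leading team segment from an SG name, if it follows
    the assumed "<team>-...-sg" convention."""
    return sg_name.split("-")[0] if sg_name.endswith("-sg") else None

def combine(vpc_team, sg_name):
    """Merge VPC and security-group signals into (team, confidence)."""
    sg_team = team_from_sg(sg_name)
    if sg_team and sg_team == vpc_team:
        return sg_team, 0.85   # two independent signals agree
    if sg_team:
        return sg_team, 0.80   # SG name alone
    if vpc_team:
        return vpc_team, 0.75  # fall back to VPC inference
    return None, 0.0

print(combine("payments", "payments-service-alb-sg"))  # ('payments', 0.85)
```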

Inference Method 4: CloudTrail Cross-Reference for Instance Origin

AWS CloudTrail logs every API call, including who called it, when, and with what parameters. For EC2 instances without tags, the CloudTrail RunInstances event shows which IAM user or role launched the instance. That IAM principal is typically associated with a team — either through the role naming convention or through an IAM user-to-team mapping in your identity provider.

CloudTrail inference is the highest-accuracy method for identifying who created a resource. The challenge is data retention — CloudTrail's default event history covers only the last 90 days, and trails delivered to S3 are often pruned by lifecycle policies. For a long-running instance created 18 months ago, the creation event is frequently gone. CloudTrail inference works reliably only for instances less than 90 days old unless the organization archives its trail logs to a longer-retention store.
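Extracting the launching principal is a matter of parsing the event record. A sketch over a trimmed, hypothetical RunInstances record following the CloudTrail record shape (`eventName`, `userIdentity.arn`):

```python
def launcher_role(event):
    """Pull the launching IAM role name out of a CloudTrail
    RunInstances record; returns None for other events."""
    if event.get("eventName") != "RunInstances":
        return None
    arn = event.get("userIdentity", {}).get("arn", "")
    # Assumed-role ARNs look like:
    #   arn:aws:sts::ACCOUNT:assumed-role/ROLE/SESSION
    if ":assumed-role/" in arn:
        return arn.split(":assumed-role/")[1].split("/")[0]
    return None

# Trimmed hypothetical record.
event = {
    "eventName": "RunInstances",
    "userIdentity": {
        "arn": "arn:aws:sts::123456789012:assumed-role/payments-deploy-role/ci"
    },
}
print(launcher_role(event))  # payments-deploy-role
```

The extracted role name then feeds the same naming-convention lookup used for IAM role inference.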

Inference Method 5: Commit-Level Telemetry Cross-Reference

For organizations that deploy infrastructure via CI/CD pipelines, the deployment pipeline logs contain a mapping from git commit to infrastructure change. A Terraform apply triggered by a commit in the "payments-service" repository created a specific set of EC2 instances. The git repository attribution can be used to tag those instances even if the Terraform code itself did not apply tags.

This method requires instrumenting the CI/CD pipeline to emit structured deployment events that include: git repository, commit SHA, IAM role used for deployment, and a list of resource ARNs created or modified. With that event log, a resource that was created by a deployment from the "payments-service" repository can be attributed to the payments team retroactively.

Pipeline-level telemetry is the most accurate inference method for recently created resources, but it requires upfront instrumentation work. For organizations already using deployment platforms that emit structured events (GitHub Actions, Atlantis, Spacelift), the instrumentation is relatively lightweight — capturing the Terraform state diff and the triggering repository at deployment time.
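The deployment event and the retroactive lookup it enables can be sketched as follows. The field names and values are hypothetical, not a standard schema — the point is the ARN-to-repository index built from the event stream:

```python
# Hypothetical structured deployment events, one per CI/CD apply.
events = [
    {
        "repository": "payments-service",
        "commit_sha": "9f8e7d6",
        "deploy_role": "payments-deploy-prod-role",
        "resource_arns": [
            "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123",
        ],
    },
]

def arn_to_repo(events):
    """Index resource ARN -> source repository for retroactive attribution."""
    return {arn: ev["repository"] for ev in events for arn in ev["resource_arns"]}

index = arn_to_repo(events)
print(index["arn:aws:ec2:us-east-1:123456789012:instance/i-0abc123"])
# payments-service
```

With a repository-to-team mapping on top of this index, any resource created through the pipeline is attributable even if the IaC code never applied tags.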

Combining Inference Methods: A Priority Stack

In practice, inference accuracy improves when methods are combined in a priority stack. The recommended priority order:

1. Explicit tag (highest confidence).
2. CloudTrail + pipeline telemetry cross-reference (85-90% accuracy for recent resources).
3. IAM role name inference (80-90% accuracy for IaC-provisioned resources).
4. Security group association combined with VPC inference (75-85% accuracy).
5. VPC and subnet association alone (70-80% accuracy).

Applying this stack to an untagged resource fleet, most organizations can attribute 80-90% of previously unattributed spend. The remaining 10-20% consists of genuinely ambiguous resources — shared infrastructure, third-party managed services, and resources with no retrievable origin signal — that remain as "shared/unattributed" overhead.
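The stack itself is a first-match-wins loop over methods in priority order. A minimal sketch in which the two methods shown are stand-ins for the real implementations:

```python
def attribute(resource, stack):
    """`stack` is a list of (tier_name, method) pairs in priority order;
    each method returns a team name or None. First hit wins."""
    for tier, method in stack:
        team = method(resource)
        if team:
            return {"team": team, "tier": tier}
    return {"team": None, "tier": "unattributed"}

# Hypothetical stand-in methods: explicit tag first, then role-name prefix.
stack = [
    ("explicit", lambda r: r.get("tags", {}).get("Team")),
    ("iam-role", lambda r: r.get("role", "").split("-")[0] or None),
]

print(attribute({"tags": {}, "role": "payments-service-prod-role"}, stack))
# {'team': 'payments', 'tier': 'iam-role'}
```

Recording the tier alongside the team makes the later confidence reporting straightforward: the tier is simply carried through to the cost line item.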

Attribution Accuracy Limits and How to Communicate Them

Inferred attribution is more useful than no attribution, but it should be presented to teams with accuracy metadata. A team receiving a cost report that shows $12,000 attributed to them based on VPC inference should know that the attribution is inferred at 75% confidence, not explicit. That context prevents disputes and helps teams identify which resources they should tag explicitly to improve confidence.

The practical approach: apply a confidence tier to each attributed cost item (Explicit/High/Medium/Low), surface the tier in cost reports, and create a remediation dashboard that shows untagged resources sorted by spend so teams can prioritize which resources to tag first. Tagging the top 20 resources by spend typically raises explicit tag coverage by 15-20 percentage points for most accounts, because cost distribution is heavily right-tailed.
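The remediation dashboard's core query is just a spend-ranked sort. A sketch over hypothetical sample data, also computing how much of the untagged spend the top of the queue covers:

```python
# Hypothetical untagged resources with their monthly spend.
untagged = [
    {"id": "i-0aaa", "monthly_spend": 3100.0},
    {"id": "i-0bbb", "monthly_spend": 240.0},
    {"id": "i-0ccc", "monthly_spend": 870.0},
]

def remediation_queue(resources, top_n=20):
    """Rank untagged resources by monthly spend, highest first."""
    return sorted(resources, key=lambda r: r["monthly_spend"], reverse=True)[:top_n]

queue = remediation_queue(untagged, top_n=2)
total = sum(r["monthly_spend"] for r in untagged)
covered = sum(r["monthly_spend"] for r in queue) / total
print([r["id"] for r in queue], f"{covered:.0%}")
```

Because spend is so concentrated, the coverage fraction for even a short queue is typically high — which is exactly why tagging a handful of top resources moves the coverage number so much.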

Attribute 90%+ of your cloud spend with confidence

KernelRun applies multi-method tag inference to fill attribution gaps and presents inferred cost with confidence tiers. Connect your first account in about four minutes.

Request a Demo