Tracking IOPS vs Baseline Storage Spend in RDS

This page walks through the exact Python you need to correlate provisioned RDS IOPS against the baseline storage you are actually paying for, so idle gp2 volumes and over-provisioned io2 capacity stop hiding inside a blended bill.

Amazon RDS deliberately decouples the performance you buy from the capacity you buy, and the two are billed on different clocks. With gp2, baseline IOPS are tied to volume size at 3 IOPS per GB, with a floor of 100 and a cap of 16,000: $I_{base} = \max(100, \min(3 \times S_{GB}, 16000))$. Teams therefore routinely over-allocate storage purely to unlock IOPS. With gp3 and io1/io2, IOPS are provisioned and billed as an explicit line item independent of capacity. This is exactly the compute-versus-storage disaggregation problem, narrowed to a single question: for each instance, how much of the baseline IOPS you pay for is your workload actually using? Answering it means joining three sources (instance metadata, CloudWatch telemetry, and resource-level Cost Explorer spend) and doing it resiliently, using the same fallback routing for cost APIs that keeps the pipeline honest when an endpoint degrades. The resulting utilization signal then feeds directly into hard and soft quota boundaries.

Prerequisites

Before running the pipeline, confirm the following are in place.

IAM permissions: the execution role needs read-only access to RDS metadata, CloudWatch metrics, and Cost Explorer. Scope to least privilege — this is part of broader access control for cost data, so never reuse an admin credential for read-only extraction.
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RdsIopsCostCorrelation",
      "Effect": "Allow",
      "Action": [
        "rds:DescribeDBInstances",
        "cloudwatch:GetMetricData",
        "ce:GetCostAndUsage",
        "ce:GetCostAndUsageWithResources"
      ],
      "Resource": "*"
    }
  ]
}
```
cloudwatch:GetMetricData and ce:GetCostAndUsage do not support resource-level ARNs, so Resource stays "*"; tighten access instead with an IAM condition or an SCP restricting the account and region.
Cost Explorer resource-level data: enable resource-level granularity in the Cost Management console. Without it, RESOURCE_ID filtering returns empty and per-instance spend is unavailable.
Python: 3.10 or newer (the code uses parenthesized async with blocks and modern asyncio APIs).
Libraries: install the async AWS client alongside boto3 for the synchronous fallback path.
```
pip install "aiobotocore>=2.13" "boto3>=1.34"
```

Step-by-Step Implementation

The pipeline discovers every RDS instance, derives its baseline IOPS from storage configuration, pulls 30-day CloudWatch utilization concurrently, and joins it against resource-level Cost Explorer spend — with a synchronous fallback and a local cache so a Cost Explorer outage degrades gracefully rather than failing the run.

Step 1 — Derive baseline IOPS from storage type

The derivation is a pure, side-effect-free function so it can be unit-tested in isolation. Each storage type follows a different rule, and getting this wrong silently corrupts every downstream utilization figure.

```
from typing import Optional


class CostAttributionError(Exception):
    """Raised when cost/metric correlation fails irrecoverably."""


def calculate_baseline_iops(
    storage_gb: int,
    storage_type: str,
    provisioned_iops: Optional[int],
) -> int:
    """Derive baseline IOPS from an RDS volume's storage configuration."""
    if storage_type == "gp2":
        # gp2 grants 3 IOPS/GB, floored at 100, capped at 16,000.
        return max(100, min(storage_gb * 3, 16000))
    if storage_type == "gp3":
        # gp3 ships a free 3,000 IOPS baseline; anything above is billed.
        return provisioned_iops if provisioned_iops else 3000
    if storage_type in ("io1", "io2"):
        # io1/io2 baseline == the explicitly provisioned (and billed) IOPS.
        return provisioned_iops or 0
    raise CostAttributionError(f"Unsupported storage type: {storage_type}")
```

For a 500 GB gp2 volume this returns 1500; for a gp3 volume provisioned at 6000 it returns 6000 (of which 3,000 are free and 3,000 are billed).
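
The free-versus-billed split mentioned above can be made explicit with a small helper. This is a minimal sketch, assuming the 3,000 IOPS gp3 baseline stated in this guide; `split_gp3_iops` and `GP3_FREE_BASELINE_IOPS` are hypothetical names, not part of the pipeline.

```python
from typing import Dict, Optional

# Assumption: the included gp3 baseline used throughout this guide.
GP3_FREE_BASELINE_IOPS = 3000


def split_gp3_iops(provisioned_iops: Optional[int]) -> Dict[str, int]:
    """Split a gp3 volume's IOPS into the included baseline and the billed excess."""
    total = provisioned_iops or GP3_FREE_BASELINE_IOPS
    billed = max(0, total - GP3_FREE_BASELINE_IOPS)
    return {"total": total, "free": total - billed, "billed": billed}


print(split_gp3_iops(6000))  # {'total': 6000, 'free': 3000, 'billed': 3000}
print(split_gp3_iops(None))  # {'total': 3000, 'free': 3000, 'billed': 0}
```

Reporting the billed portion separately makes it easier to see which gp3 volumes carry an IOPS charge at all.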

Step 2 — Discover RDS instances asynchronously

describe_db_instances paginates. The aiobotocore paginator yields pages as an async iterator, so hundreds of instances stream in without blocking the event loop.

```
from typing import Dict, List


async def fetch_rds_instances(rds_client) -> List[Dict]:
    """Retrieve RDS instance metadata with async pagination."""
    instances: List[Dict] = []
    paginator = rds_client.get_paginator("describe_db_instances")
    async for page in paginator.paginate():
        for db in page.get("DBInstances", []):
            instances.append({
                "db_instance_id": db["DBInstanceIdentifier"],
                "storage_type": db.get("StorageType", "gp2"),
                "allocated_storage_gb": db["AllocatedStorage"],
                "provisioned_iops": db.get("Iops"),
                "engine": db["Engine"],
            })
    return instances
```

Step 3 — Pull 30-day IOPS from CloudWatch

Request ReadIOPS and WriteIOPS in a single get_metric_data call with a daily period, then average the daily means across the window. A ClientError here degrades to zero rather than aborting the whole run.

```
from datetime import datetime, timedelta, timezone

from botocore.exceptions import ClientError


async def fetch_cloudwatch_iops(cw_client, instance_id: str) -> Dict[str, float]:
    """Pull the 30-day average Read/Write IOPS for one instance."""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=30)

    def _query(qid: str, metric: str) -> Dict:
        return {
            "Id": qid,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": metric,
                    "Dimensions": [
                        {"Name": "DBInstanceIdentifier", "Value": instance_id}
                    ],
                },
                "Period": 86400,   # one datapoint per day
                "Stat": "Average",
            },
        }

    try:
        response = await cw_client.get_metric_data(
            MetricDataQueries=[
                _query("read_ops", "ReadIOPS"),
                _query("write_ops", "WriteIOPS"),
            ],
            StartTime=start_time,
            EndTime=end_time,
        )
        series = {m["Id"]: m.get("Values", []) for m in response["MetricDataResults"]}
        read_vals = series.get("read_ops") or [0.0]
        write_vals = series.get("write_ops") or [0.0]
        return {
            "avg_read_iops": sum(read_vals) / len(read_vals),
            "avg_write_iops": sum(write_vals) / len(write_vals),
        }
    except ClientError as exc:
        code = exc.response["Error"]["Code"]
        logger.warning("CloudWatch fetch failed for %s: %s", instance_id, code)
        return {"avg_read_iops": 0.0, "avg_write_iops": 0.0}
```

Step 4 — Fetch per-instance spend with a cached fallback

Resource-level spend comes from the Cost Explorer GetCostAndUsageWithResources operation (get_cost_and_usage_with_resources in boto3), which is the call that accepts a RESOURCE_ID filter, so the execution role also needs ce:GetCostAndUsageWithResources. It requires DAILY (or hourly) granularity and only serves the trailing 14 days, so the query window is bounded accordingly. On any failure the function falls back to the last cached value, keeping the run alive during a Cost Explorer outage.

```
import json
import os

import boto3

COST_CACHE = os.getenv("RDS_COST_CACHE", "/var/cache/rds_cost_cache.json")


def get_cost_explorer_spend(instance_id: str) -> Optional[float]:
    """Return trailing-14-day UnblendedCost for one instance, with cache fallback."""
    try:
        ce = boto3.client("ce", region_name="us-east-1")  # CE is global via us-east-1
        end = datetime.now(timezone.utc).date()
        start = end - timedelta(days=14)
        # RESOURCE_ID filtering is served by the *WithResources operation.
        # Cost data often keys RDS resources by ARN; if the identifier matches
        # nothing, try the instance ARN instead (verify against your account).
        response = ce.get_cost_and_usage_with_resources(
            TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
            Granularity="DAILY",
            Filter={"Dimensions": {"Key": "RESOURCE_ID", "Values": [instance_id]}},
            Metrics=["UnblendedCost"],
        )
        total = sum(
            float(day["Total"]["UnblendedCost"]["Amount"])
            for day in response.get("ResultsByTime", [])
        )
        return round(total, 2)
    except Exception as exc:  # ThrottlingException, DataUnavailable, endpoint errors
        logger.error("Cost Explorer lookup failed for %s: %s", instance_id, exc)
        if os.path.exists(COST_CACHE):
            with open(COST_CACHE, "r", encoding="utf-8") as fh:
                return json.load(fh).get(instance_id)
        return None
```
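
The fallback above only reads the cache; something still has to populate it after a successful run. One way to do that is sketched below, assuming the cache is a flat JSON object mapping instance ID to spend, as the reader expects. `update_cost_cache` is a hypothetical helper, not part of the original pipeline.

```python
import json
import os
import tempfile
from typing import Dict


def update_cost_cache(cache_path: str, spend_by_instance: Dict[str, float]) -> None:
    """Merge fresh spend figures into the JSON cache using an atomic replace."""
    cached: Dict[str, float] = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            cached = json.load(fh)
    cached.update(spend_by_instance)

    # Write to a temp file in the same directory, then swap it into place so a
    # concurrent reader never sees a half-written cache.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(cached, fh)
    os.replace(tmp_path, cache_path)
```

Calling it once per run with only the non-None spend values keeps older entries available for instances whose lookup failed this time.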

Step 5 — Correlate and orchestrate with bounded concurrency

A semaphore caps in-flight CloudWatch calls so a large estate cannot exhaust the connection pool or trip API rate limits. asyncio.gather(..., return_exceptions=True) isolates a single failing instance from the batch, and failures are logged rather than silently dropped. Because get_cost_explorer_spend is a blocking boto3 call, it runs in a worker thread via asyncio.to_thread so it does not stall the event loop.

```
import asyncio
import logging

from aiobotocore.session import get_session
from botocore.config import Config

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("rds_iops_cost_tracker")

REGION = os.getenv("AWS_DEFAULT_REGION", "us-east-1")
MAX_CONCURRENCY = int(os.getenv("MAX_CONCURRENCY", "10"))


async def correlate_instance_cost(cw_client, instance: Dict) -> Dict:
    """Join baseline IOPS, observed IOPS, and spend into one attribution record."""
    db_id = instance["db_instance_id"]
    baseline = calculate_baseline_iops(
        instance["allocated_storage_gb"],
        instance["storage_type"],
        instance["provisioned_iops"],
    )
    telemetry = await fetch_cloudwatch_iops(cw_client, db_id)
    observed = telemetry["avg_read_iops"] + telemetry["avg_write_iops"]
    utilization = (observed / baseline * 100) if baseline > 0 else 0.0

    return {
        "db_instance_id": db_id,
        "storage_type": instance["storage_type"],
        "baseline_iops": baseline,
        "avg_observed_iops": round(observed, 2),
        "utilization_pct": round(utilization, 2),
        # The Cost Explorer call is synchronous, so run it off the event loop.
        "trailing_14d_spend_usd": await asyncio.to_thread(get_cost_explorer_spend, db_id),
    }


async def run_attribution_pipeline() -> List[Dict]:
    """Discover instances, correlate them concurrently, and return the records."""
    cfg = Config(retries={"max_attempts": 3, "mode": "adaptive"})
    session = get_session()
    async with (
        session.create_client("rds", region_name=REGION, config=cfg) as rds_client,
        session.create_client("cloudwatch", region_name=REGION, config=cfg) as cw_client,
    ):
        instances = await fetch_rds_instances(rds_client)
        logger.info("Discovered %d RDS instances for attribution.", len(instances))

        sem = asyncio.Semaphore(MAX_CONCURRENCY)

        async def _bounded(inst: Dict) -> Dict:
            async with sem:
                return await correlate_instance_cost(cw_client, inst)

        results = await asyncio.gather(
            *(_bounded(i) for i in instances), return_exceptions=True
        )
        records: List[Dict] = []
        for inst, result in zip(instances, results):
            if isinstance(result, dict):
                records.append(result)
            else:
                # Surface the failure instead of silently dropping the instance.
                logger.error("Attribution failed for %s: %r", inst["db_instance_id"], result)
        logger.info("Correlated %d of %d instances.", len(records), len(instances))
        return records


if __name__ == "__main__":
    for record in asyncio.run(run_attribution_pipeline()):
        print(record)
```

[Figure: the multi-step async fetch, retry, and fallback sequence.]

Expected output for a small estate is one record per instance:

```
{'db_instance_id': 'prod-orders-1', 'storage_type': 'gp2', 'baseline_iops': 1500, 'avg_observed_iops': 214.6, 'utilization_pct': 14.31, 'trailing_14d_spend_usd': 61.44}
{'db_instance_id': 'prod-ledger-1', 'storage_type': 'io2', 'baseline_iops': 12000, 'avg_observed_iops': 11380.2, 'utilization_pct': 94.84, 'trailing_14d_spend_usd': 812.90}
```

Verification

Confirm the correlation is trustworthy before wiring it into any dashboard or right-sizing job.

Assert the join shape. Every record must carry a baseline and a utilization figure; a missing spend value means the Cost Explorer lookup failed and the cache had no entry for that instance.

```
records = asyncio.run(run_attribution_pipeline())
for r in records:
    assert r["baseline_iops"] >= 0 and 0 <= r["utilization_pct"] <= 200
    if r["trailing_14d_spend_usd"] is None:
        logger.warning("spend missing for %s: lookup failed and cache had no entry", r["db_instance_id"])
```

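Cross-check baselines against the console. In RDS > Databases, a gp2 instance’s Modify page shows allocated storage; multiply by 3, clamp the result to the 100 to 16,000 range, and confirm it equals baseline_iops. For gp3 and io1/io2, the provisioned IOPS field must match exactly (the pipeline reports 3,000 for a gp3 volume with no provisioned value).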
Reconcile spend against Cost Explorer. In Cost Management > Cost Explorer, group by Resource and filter Service = Relational Database Service over the same 14-day window. The per-resource totals should equal each record’s trailing_14d_spend_usd within rounding.

Gotchas & Edge Cases

Resource-level Cost Explorer is time-boxed. RESOURCE_ID filtering only works with DAILY/hourly granularity and only serves the trailing 14 days. Requesting a 30-day window raises a DataUnavailableException instead of returning partial data, which is why the spend window is 14 days while the IOPS window is 30.
gp2 burst credits mask true utilization. A volume under 1 TB earns burst credits and can burst to 3,000 IOPS, well above its baseline. A utilization figure over 100% means the workload is spending burst credits; sustained bursting signals impending credit exhaustion and latency spikes, not comfortable headroom.
CloudWatch IOPS are counts, not throughput. ReadIOPS/WriteIOPS are operations per second; do not confuse them with ReadThroughput/WriteThroughput (bytes/sec). Correlating the wrong pair against a baseline IOPS figure produces nonsense.
Storage-type migration changes the baseline mid-window. If a volume moved from gp2 to gp3 inside the 30-day window, DescribeDBInstances reports only the current type, so the baseline no longer matches the historical IOPS. Pin correlations to a period that starts after the last modification.
Cost data propagation lag. Newly launched instances can take up to 24 hours to surface in resource-level Cost Explorer. Treat a None spend on a fresh instance as propagation delay, not a bug.
Low utilization is a migration signal, not an idle signal. An instance sitting under 30% of its gp2 baseline is a candidate to move to gp3 (or downsize), but only after confirming the low reading is not an artifact of burst credits or a quiet period. Feed that decision into your quota boundary policies rather than acting on a single snapshot.
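
The two utilization thresholds discussed in this list (over 100% and under 30% of a gp2 baseline) can be turned into a first-pass triage label. This is a minimal sketch over the record shape produced in Step 5; `classify_utilization` and its label strings are hypothetical, and a label is a prompt for review rather than a decision.

```python
from typing import Dict


def classify_utilization(record: Dict, low_pct: float = 30.0) -> str:
    """Label one attribution record for follow-up review."""
    pct = record["utilization_pct"]
    if pct > 100.0:
        # Above baseline: the volume is drawing on burst credits.
        return "bursting"
    if pct < low_pct and record["storage_type"] == "gp2":
        # Well under a size-derived baseline: worth reviewing for gp3 or a downsize.
        return "review_for_migration"
    return "ok"


sample = {"db_instance_id": "prod-orders-1", "storage_type": "gp2", "utilization_pct": 14.31}
print(classify_utilization(sample))  # review_for_migration
```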

Frequently Asked Questions

Why is my utilization above 100% on a small gp2 volume?

Because gp2 volumes under 1 TB burst above their baseline using I/O credits. A reading over 100% means the workload is drawing on burst credits faster than the baseline replenishes them. Sustained bursting is an early warning of credit exhaustion — the fix is more baseline IOPS (a larger gp2 volume or a gp3 migration), not a wider dashboard band.
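
To gauge how long a burst can last, the credit arithmetic can be worked through directly. The sketch below assumes the standard EBS gp2 credit model (a bucket of 5.4 million I/O credits, a 3,000 IOPS burst ceiling, refill at the baseline rate) applies to a single, unstriped volume; `burst_seconds` is a hypothetical helper and RDS volume striping may change the numbers.

```python
# Assumptions: single-volume EBS gp2 credit model.
GP2_MAX_CREDITS = 5_400_000
GP2_BURST_IOPS = 3000


def burst_seconds(
    baseline_iops: int,
    demand_iops: float,
    credits: float = GP2_MAX_CREDITS,
) -> float:
    """Seconds until the credit bucket drains at a sustained demand level."""
    demand = min(demand_iops, GP2_BURST_IOPS)  # a volume cannot exceed the burst ceiling
    if demand <= baseline_iops:
        return float("inf")  # at or below baseline, credits never drain
    # Credits drain at (demand - baseline) per second while they refill at baseline.
    return credits / (demand - baseline_iops)


# 100 GB gp2 volume (baseline 300 IOPS) driven at the full 3,000 IOPS burst.
print(burst_seconds(300, 3000))  # 2000.0
```

A full bucket on that volume lasts roughly 33 minutes at full burst, which is why sustained readings above 100% deserve attention.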

Should I migrate everything under 30% utilization from gp2 to gp3?

Usually yes, but verify first. gp3 decouples IOPS from capacity and its 3,000 free baseline covers most low-utilization workloads at lower cost. Confirm the volume genuinely needs less than 3,000 IOPS across peaks (not just the 30-day average), because a gp3 volume that must provision extra IOPS to match a large gp2’s size-derived baseline can cost more, not less.
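
Checking peaks rather than the average could look like the sketch below. It assumes you have already collected one peak IOPS figure per day (for example by querying CloudWatch with the Maximum statistic instead of Average); `gp3_baseline_fit` is a hypothetical helper and the 3,000 default is the gp3 baseline used in this guide.

```python
from typing import Dict, Sequence


def gp3_baseline_fit(daily_peak_iops: Sequence[float], baseline: int = 3000) -> Dict[str, object]:
    """Report whether every daily peak stays within the gp3 baseline."""
    if not daily_peak_iops:
        raise ValueError("need at least one daily peak")
    peak = max(daily_peak_iops)
    days_over = sum(1 for value in daily_peak_iops if value > baseline)
    return {"peak_iops": peak, "days_over_baseline": days_over, "fits": days_over == 0}


print(gp3_baseline_fit([1200.0, 2800.0, 3400.0]))
# {'peak_iops': 3400.0, 'days_over_baseline': 1, 'fits': False}
```

A volume that fits on average but exceeds the baseline on some days would need extra provisioned IOPS on gp3, which is the case where the migration may not save money.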

Why does Cost Explorer return nothing for a valid instance ID?

Two common causes: resource-level granularity is not enabled on the account, or you requested more than 14 days of RESOURCE_ID-filtered data. The pipeline caps the window at 14 days for exactly this reason. Also confirm you are calling the ce client in us-east-1, where the global Cost Explorer endpoint lives.

How do I attribute IOPS cost on Aurora instead of standard RDS?

Aurora does not use gp2/gp3/io2 volumes — its distributed storage layer bills I/O per request under distinct usage types, so the baseline model here does not apply. Attribute Aurora I/O by pulling the VolumeReadIOPs/VolumeWriteIOPs metrics and joining them to the Aurora:StorageIOUsage line item instead.

Can I run this across every region in one pass?

Yes. RDS and CloudWatch are regional, so loop over boto3.Session().get_available_regions("rds") and create per-region clients. run_attribution_pipeline reads a module-level REGION, so change it to accept the region as a parameter first. Cost Explorer stays global on us-east-1; only the CloudWatch and RDS clients change region.

Related

How to Separate Compute and Storage Costs in Azure SQL: the Azure-side equivalent, splitting a blended bill by meter category.
Building async Python parsers for AWS Cost Explorer: the same async extraction pattern applied to raw Cost Explorer records.
Graceful degradation when billing APIs are down: hardening the fallback-and-cache path this pipeline relies on.
Compute vs Storage Cost Breakdowns: the parent topic covering compute/storage disaggregation across managed database engines.