Batch Processing for Historical Metrics

Batch processing for historical metrics is the discipline of reconstructing accurate, tenant-attributed database cost records over long past time windows — backfilling gaps, re-pricing corrected telemetry, and producing an auditable ledger that quota policies and chargeback reports can trust.

Part of Metric Extraction & Aggregation Pipelines.

Unlike the live telemetry handled by real-time metric streaming, a backfill runs against bounded, immutable history: you know the exact window boundaries up front, you re-read data that may already be partially loaded, and every retry must converge on the same ledger state. That makes idempotency, deterministic windowing, and checkpoint-driven resumption the three properties that separate a production backfill from a script that quietly double-bills a tenant. The sections below decompose the billing model that makes historical attribution hard, the extraction and normalization work, idiomatic Python automation, how the resulting aggregates feed quota enforcement, and the failure modes you will actually hit in production.

The diagram below traces a single backfill from time-window chunking through idempotent batch jobs, schema validation, and bulk load into the cost warehouse.

Billing Model & Attribution Challenges

Historical cost attribution is hard because the billing dimensions you need were rarely recorded at billable granularity when the workload ran. Providers expose consumption after the fact, blended across accounts, and often re-stated days later. A backfill has to reconcile three moving targets: the raw telemetry (vCPU-seconds, provisioned IOPS, storage-GB-hours), the provider’s own billed line items, and the tenant mapping that assigns each unit to a team, schema, or database instance.

The first challenge is blended versus disaggregated billing. AWS Cost Explorer and the Cost and Usage Report (CUR) present both blended and unblended rates; RDS instances under a Savings Plan or Reserved Instance are billed at an effective rate that only resolves after the amortization window closes. If your backfill prices historical usage at the on-demand list rate, it will systematically over-attribute cost versus what the invoice actually shows. The reconciliation model in compute vs storage cost breakdowns is the reference for splitting a single instance-hour into its compute and storage components before you re-price it.

The second challenge is late-arriving and re-stated data. CloudWatch metric statistics for a period can change for up to a few hours after the period closes as delayed data points land; Azure Cost Management restates amortized costs as reservations are applied. A backfill that runs the morning after has to treat any window younger than the provider’s settlement horizon as provisional and flag it for a later re-run, rather than sealing it into the ledger as final.

The third challenge is attribution edge cases: instances that were resized mid-window, multi-tenant databases where a single physical instance hosts several billed tenants, and resources whose cost-allocation tags were applied after the usage occurred. Tags do not apply retroactively — an instance tagged team=payments today was untagged last month, so a naive GROUP BY tag over historical data drops that spend into an “untagged” bucket. Historical backfills must join usage against a point-in-time tag history (an SCD-Type-2 dimension keyed on (resource_id, valid_from, valid_to)) rather than the current tag state. The canonical cost record a historical backfill emits is:

\begin{aligned} \text{total\_cost} &= \text{compute\_hours} \times \text{compute\_rate} + \text{provisioned\_GB} \times \text{storage\_rate} + \text{io\_requests} \times \text{io\_rate} \end{aligned}

where each rate is the effective (amortized, unblended) rate for the resource at the window it was consumed, not the current rate. Getting that temporal join right is the single largest source of historical chargeback disputes.
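The point-in-time tag join and the cost record above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `TagVersion`, `team_at`, and `total_cost` are hypothetical names, the tag history is held in memory rather than in a warehouse dimension, and the rates passed in are assumed to be the effective rates already resolved for the window.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class TagVersion:
    """One row of a hypothetical SCD Type 2 tag-history dimension."""
    resource_id: str
    team: str
    valid_from: datetime
    valid_to: Optional[datetime]  # None means the tag is still current


def team_at(history: list[TagVersion], resource_id: str, at: datetime) -> str:
    """Return the team tag in force at `at`, or "untagged" if none applied."""
    for row in history:
        if (row.resource_id == resource_id
                and row.valid_from <= at
                and (row.valid_to is None or row.valid_to > at)):
            return row.team
    return "untagged"


def total_cost(compute_hours: Decimal, compute_rate: Decimal,
               provisioned_gb: Decimal, storage_rate: Decimal,
               io_requests: Decimal, io_rate: Decimal) -> Decimal:
    """Apply the cost record formula; Decimal avoids float drift in money."""
    return (compute_hours * compute_rate
            + provisioned_gb * storage_rate
            + io_requests * io_rate)
```

Looking up the tag at the window start, rather than reading the live tag table, is what keeps spend from last month out of the team that only tagged the instance this month.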

Telemetry Extraction & Metric Normalization

Extracting history efficiently means minimizing production impact while maximizing data density per API call. The extraction pattern is deterministic: enumerate fixed time windows (24-hour or 72-hour chunks are a good default — small enough to checkpoint frequently, large enough to amortize API overhead), then pull each window with an indexed, bounded query rather than an unbounded scan.

When harvesting counters from live database engines rather than provider APIs, reuse the safe polling techniques from system view querying patterns: indexed temporal predicates, cursor-based pagination, and filtered projections that isolate only cost-relevant signals — vCPU utilization, provisioned and consumed IOPS, storage throughput, connection counts, and backup retention volume. Avoid full-table scans and unbounded BETWEEN clauses that trigger lock contention or memory pressure on production instances.

For CloudWatch, get_metric_data is the correct API for backfills because it returns many metrics in a single paginated call and lets you set the Period explicitly to match the metric’s native aggregation cadence: 60 seconds for standard metrics, 1 second for high-resolution. CloudWatch retains 1-minute data points for only 15 days and 5-minute data points for 63 days, after which only 1-hour aggregates remain, so the finest usable period also depends on how old the window is; the example below defaults to 300 seconds. Misaligning the period against the provider’s aggregation boundaries produces statistical artifacts (partial-period averages that understate peaks), so pin the period to the provider granularity and downsample locally only after validation.

import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")

def extract_window(instance_id: str, start: datetime, end: datetime, period: int = 300):
    """Pull cost-relevant RDS metrics for one backfill window.

    Uses get_metric_data with a paginator so a wide window that exceeds
    the API's 100,800 data-point response cap streams correctly.
    """
    queries = [
        {
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
                },
                "Period": period,
                "Stat": "Average",
            },
            "ReturnData": True,
        },
        {
            "Id": "iops",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "WriteIOPS",
                    "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
                },
                "Period": period,
                "Stat": "Average",  # WriteIOPS is a per-second rate; summing rate samples is not a request count
            },
            "ReturnData": True,
        },
    ]

    paginator = cw.get_paginator("get_metric_data")
    results: dict[str, list] = {}
    for page in paginator.paginate(
        MetricDataQueries=queries,
        StartTime=start,
        EndTime=end,
        ScanBy="TimestampAscending",
    ):
        for series in page["MetricDataResults"]:
            results.setdefault(series["Id"], [])
            results[series["Id"]].extend(zip(series["Timestamps"], series["Values"]))
    return results

Normalization turns these provider-shaped series into the unified usage schema every downstream stage expects. Each raw point becomes a record keyed on (instance_id, metric_name, window_start, window_end) with a canonical unit — vCPU-seconds, IO-requests, GB-hours — regardless of source cloud. When a backfill spans more than one provider, run the raw exports through the same canonical model described in multi-cloud cost normalization so a Postgres-on-RDS instance and an Azure SQL database resolve to the same compute-unit denomination before pricing. Schema drift — a provider renaming a metric or adding a dimension — is caught downstream by schema validation for billing data, which acts as the contract gate before any record reaches the ledger.
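As a rough sketch of that normalization step, the function below converts the `{id: [(timestamp, value), ...]}` shape returned by extract_window into canonical records. Several things here are assumptions for illustration: the name `normalize_points`, the `CANONICAL` mapping, the fixed `vcpus` argument (in practice this would come from instance metadata), and the premise that each series holds per-period averages.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping from the query Ids used at extraction time to a
# canonical metric name and a conversion into that metric's unit.
CANONICAL = {
    # CPUUtilization is an average percentage over the period.
    "cpu": ("vcpu_seconds", lambda v, period, vcpus: v / 100.0 * vcpus * period),
    # An average IOPS rate times the period length gives a request count.
    "iops": ("io_requests", lambda v, period, vcpus: v * period),
}


def normalize_points(raw: dict, instance_id: str, period: int = 300, vcpus: int = 2) -> list:
    """Turn raw per-period points into canonical usage records."""
    records = []
    for query_id, points in raw.items():
        metric_name, convert = CANONICAL[query_id]
        for ts, value in points:
            records.append({
                "instance_id": instance_id,
                "metric_name": metric_name,
                "window_start": ts,
                "window_end": ts + timedelta(seconds=period),
                "value": convert(value, period, vcpus),
            })
    # A stable sort order keeps re-runs of the same window identical.
    records.sort(key=lambda r: (r["metric_name"], r["window_start"]))
    return records
```

Sorting the output is a small design choice that makes two runs over the same raw data directly comparable, which helps when auditing a replayed window.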

Python Automation Patterns

The core of a batch backfill is a driver that enumerates windows, checks the checkpoint registry to skip already-completed work, processes each window idempotently, and commits a checkpoint only after a window is durably loaded. The composite key (instance_id, metric_name, window_start, window_end) is the upsert key in the warehouse, and its window-level part (instance_id, window_start, window_end) is the checkpoint identity, since one checkpoint covers every metric pulled for that window. Both derive from the same deterministic boundaries, so a re-run of any window produces byte-identical ledger rows.

import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Window:
    instance_id: str
    start: datetime
    end: datetime

    @property
    def key(self) -> str:
        raw = f"{self.instance_id}|{self.start.isoformat()}|{self.end.isoformat()}"
        return hashlib.sha256(raw.encode()).hexdigest()


def iter_windows(instance_id: str, since: datetime, until: datetime,
                 chunk: timedelta = timedelta(hours=24)):
    """Deterministically enumerate fixed backfill windows."""
    cursor = since
    while cursor < until:
        yield Window(instance_id, cursor, min(cursor + chunk, until))
        cursor += chunk


def run_backfill(instance_id: str, since: datetime, until: datetime, store, warehouse):
    settlement_horizon = timedelta(hours=6)  # provider late-data window
    now = datetime.now(timezone.utc)

    for window in iter_windows(instance_id, since, until):
        if store.is_committed(window.key):
            continue  # idempotent skip — already durably loaded

        raw = extract_window(window.instance_id, window.start, window.end)
        records = normalize(raw, window)
        valid, rejected = validate(records)

        if rejected:
            warehouse.dead_letter(rejected)  # forensic copy, pipeline continues

        warehouse.upsert(valid, key_cols=("instance_id", "metric_name",
                                          "window_start", "window_end"))

        provisional = window.end > (now - settlement_horizon)
        store.commit(window.key, provisional=provisional)

The store.is_committed / store.commit pair is the checkpoint registry: a small relational metadata table or key-value store recording every completed window and whether it was sealed as final or provisional. Provisional windows are re-run on the next pass once they age past the settlement horizon, which means is_committed must stop reporting a provisional window as committed at that point; final windows are never touched again. Because upsert is keyed on the composite identity, replaying a window that was already partially loaded before a crash overwrites in place rather than appending duplicates, and that property is what makes the whole pipeline safe to retry.
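One possible shape for such a registry is sketched below using SQLite from the standard library. The class name, table layout, and the extra `window_end` argument to `commit` are assumptions made for this example (the driver above passes only the key and the provisional flag); a production registry would more likely live in the warehouse's own metadata schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


class SqliteCheckpointStore:
    """Hypothetical checkpoint registry backed by a single SQLite table."""

    def __init__(self, path: str = ":memory:",
                 settlement_horizon: timedelta = timedelta(hours=6)):
        self.conn = sqlite3.connect(path)
        self.horizon = settlement_horizon
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            " window_key TEXT PRIMARY KEY,"
            " window_end TEXT NOT NULL,"
            " provisional INTEGER NOT NULL)"
        )

    def commit(self, window_key: str, window_end: datetime, provisional: bool) -> None:
        # INSERT OR REPLACE makes a repeated commit overwrite the earlier row.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (window_key, window_end.isoformat(), int(provisional)),
            )

    def is_committed(self, window_key: str, now: datetime = None) -> bool:
        now = now or datetime.now(timezone.utc)
        row = self.conn.execute(
            "SELECT window_end, provisional FROM checkpoints WHERE window_key = ?",
            (window_key,),
        ).fetchone()
        if row is None:
            return False
        window_end, provisional = datetime.fromisoformat(row[0]), bool(row[1])
        if not provisional:
            return True  # sealed as final: never re-run
        # Provisional windows are skipped only while still inside the horizon.
        return window_end > now - self.horizon
```

With this behaviour a provisional window is left alone while the provider may still re-state it, then becomes eligible for a re-run, at which point the driver commits it again as final.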

For wide fan-outs across hundreds of instances, wrap the per-window I/O in the same async, semaphore-controlled concurrency described in async usage parsing workflows, using aioboto3 so extraction, normalization, and load overlap without exhausting connection pools or tripping provider rate limits:

import asyncio
import aioboto3

async def process_windows(windows, store, warehouse, max_concurrency: int = 8):
    sem = asyncio.Semaphore(max_concurrency)
    session = aioboto3.Session()

    async with session.client("cloudwatch", region_name="us-east-1") as cw:
        async def worker(window):
            async with sem:  # bound in-flight API calls to respect throttle limits
                if store.is_committed(window.key):
                    return
                raw = await extract_window_async(cw, window)
                records = normalize(raw, window)
                valid, rejected = validate(records)
                if rejected:
                    warehouse.dead_letter(rejected)  # keep the forensic copy, as in run_backfill
                await warehouse.upsert_async(valid)
                store.commit(window.key)

        await asyncio.gather(*(worker(w) for w in windows))

Rate-limit resilience belongs in a retry decorator with exponential backoff and jitter so that a fleet of workers recovering from a throttle does not synchronize into a thundering herd. Because that logic is shared across every stage of the platform, it lives with the retry and circuit-breaker patterns in error handling in cost pipelines rather than being reimplemented per job.
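For readers without that shared module to hand, a minimal sketch of exponential backoff with full jitter follows. The decorator name and its parameters are illustrative assumptions, not the platform's actual API; `sleep` and `rng` are injectable only so the behaviour can be tested without waiting.

```python
import functools
import random
import time


def retry_with_backoff(retryable=(Exception,), max_attempts: int = 5,
                       base_delay: float = 0.5, max_delay: float = 30.0,
                       sleep=time.sleep, rng=random.random):
    """Retry a call on the given exceptions with exponential backoff and full jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retryable:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    # The cap doubles each attempt up to max_delay.
                    cap = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Full jitter: sleep a random fraction of the cap so
                    # workers recovering together do not retry in lockstep.
                    sleep(rng() * cap)
        return wrapper
    return decorator
```

When wrapping boto3 calls, the retryable exception would typically be botocore's ClientError, filtered on a throttling error code before retrying; that filter is omitted here to keep the sketch free of third-party imports.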

Quota Enforcement Integration

A backfill is not an end in itself — its output is the historical baseline that quota policy is calibrated against. Once a window is sealed as final in the ledger, the aggregated cost per tenant becomes an input to two enforcement paths.

First, threshold calibration. Hard and soft quota limits are only defensible if they are derived from real historical consumption rather than guessed. The rolling per-tenant aggregates a backfill produces feed directly into the limit-setting model in database quota boundary design, where a p95 of the last 90 days of a tenant’s daily spend becomes the soft alert threshold and a multiple of it becomes the hard cap. Without an accurate backfill, those percentiles are computed over gappy data and the limits are wrong.
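The calibration arithmetic can be illustrated as follows. This is a sketch under stated assumptions: the nearest-rank percentile method, the function names, and the 1.5x hard-cap multiple are all choices made for the example, since the text above only specifies a p95 and "a multiple of it".

```python
import math


def percentile_nearest_rank(values, pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of the data."""
    if not values:
        raise ValueError("no data to calibrate against")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def calibrate_limits(daily_spend, hard_multiple: float = 1.5) -> dict:
    """Derive soft and hard limits from a tenant's sealed daily spend."""
    soft = percentile_nearest_rank(daily_spend, 95)
    return {"soft_alert": soft, "hard_cap": soft * hard_multiple}
```

For a tenant whose 90 sealed daily totals happen to be 1.0 through 90.0, `calibrate_limits` yields a soft alert at 86.0 and a hard cap at 129.0. A gap of missing days in the input shifts the rank and therefore the threshold, which is why the backfill has to be complete before limits are derived.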

Second, retroactive reconciliation. When a tenant disputes a chargeback or a quota breach is investigated, the historical ledger is the evidence. Because every record is keyed on the deterministic composite identity and priced with the effective rate for its window, the enforcement engine can replay exactly which windows pushed a tenant over budget. The mapping from a normalized cost signal to an actual enforcement action — throttle, alert, or block — is where the aggregation layer hands off to the quota control plane, and the accuracy of that decision is only as good as the backfill underneath it.

Provisional windows are deliberately excluded from hard enforcement: a limit should never be tripped by data the provider may still re-state. The backfill’s provisional flag therefore does double duty — it gates re-runs and it tells the enforcement engine which aggregates are safe to act on versus merely display.

Failure Modes & Troubleshooting

ThrottlingException / Rate exceeded from CloudWatch or Cost Explorer. A wide fan-out or too small a window multiplies API calls. Increase the window chunk size to reduce call count, lower the semaphore ceiling, and confirm the retry decorator applies backoff with jitter. Cost Explorer in particular has a low request quota, so batch instances into a single get_cost_and_usage call with a GroupBy of type DIMENSION or TAG rather than looping per instance.
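A batched Cost Explorer request of that kind might be assembled as below. The helper name, the `team` tag key, the choice of AmortizedCost, and the RDS service filter value are assumptions for illustration; the function only builds the request arguments so it can be inspected without AWS credentials.

```python
from datetime import date


def build_cost_query(start: date, end: date, tag_key: str = "team") -> dict:
    """Build get_cost_and_usage arguments covering all instances in one call."""
    return {
        # End is exclusive in Cost Explorer time periods.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["AmortizedCost"],
        # Assumed service name; restricts the result to RDS line items.
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Relational Database Service"],
            }
        },
        # One grouped request replaces a per-instance loop.
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }


# Usage (not executed here; requires boto3 and credentials):
#   ce = boto3.client("ce")
#   page = ce.get_cost_and_usage(**build_cost_query(date(2024, 1, 1), date(2024, 2, 1)))
#   Follow page.get("NextPageToken") until it is absent.
```

Grouping in the request keeps the call count proportional to the number of time ranges rather than the number of instances, which is what matters against a low request quota.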

Duplicate ledger rows / double-billed tenant. Almost always a broken idempotency key. Verify the upsert targets the full composite key (instance_id, metric_name, window_start, window_end) and that window boundaries are computed deterministically (UTC, fixed chunk arithmetic) — a window boundary drifting by even one second between runs produces a new key and a duplicate row. Never derive end from datetime.now() inside the loop.
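The idempotency property is easy to check locally. The sketch below uses SQLite as a stand-in for the cost warehouse (an assumption; the table name and `value` column are also illustrative) and relies on `ON CONFLICT ... DO UPDATE`, which needs SQLite 3.24 or newer.

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS ledger (
    instance_id  TEXT NOT NULL,
    metric_name  TEXT NOT NULL,
    window_start TEXT NOT NULL,
    window_end   TEXT NOT NULL,
    value        REAL NOT NULL,
    PRIMARY KEY (instance_id, metric_name, window_start, window_end)
)
"""

UPSERT = """
INSERT INTO ledger (instance_id, metric_name, window_start, window_end, value)
VALUES (:instance_id, :metric_name, :window_start, :window_end, :value)
ON CONFLICT (instance_id, metric_name, window_start, window_end)
DO UPDATE SET value = excluded.value
"""


def upsert(conn: sqlite3.Connection, records: list) -> None:
    """Write records so that replaying a window overwrites rather than appends."""
    with conn:  # one transaction per batch
        conn.executemany(UPSERT, records)


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
```

Running `upsert` twice with the same records leaves the row count unchanged. If the count grows on a replay, the window boundaries feeding the key are not deterministic.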

Silent under-attribution into an “untagged” bucket. The point-in-time tag join is missing or joining against current tag state. Confirm the usage-to-tag join uses valid_from <= window_start AND (valid_to IS NULL OR valid_to > window_start) against the tag-history dimension, not a join to the live tag table.

Off-by-a-period aggregation errors. The extraction Period does not match the provider’s native granularity, so partial periods at window edges are averaged into neighbouring buckets. Pin Period to 60 or 300 seconds for standard metrics, fall back to 3600 seconds for windows older than 63 days (CloudWatch keeps only 1-hour aggregates beyond that point), and align window boundaries to period multiples.
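A small helper can pick the period and align the boundaries together. The sketch below assumes CloudWatch's standard retention tiers (1-minute data for 15 days, 5-minute data for 63 days, 1-hour data after that) and timezone-aware UTC datetimes; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def finest_period(window_start: datetime, now: datetime) -> int:
    """Return the finest period (seconds) still retained for a window of this age."""
    age = now - window_start
    if age <= timedelta(days=15):
        return 60
    if age <= timedelta(days=63):
        return 300
    return 3600


def align_down(ts: datetime, period: int) -> datetime:
    """Floor a timezone-aware timestamp to the previous multiple of `period`."""
    epoch_seconds = int(ts.timestamp())
    return datetime.fromtimestamp(epoch_seconds - epoch_seconds % period, tz=timezone.utc)
```

Aligning `since` once, before enumerating windows, means every chunk boundary after it is also a period multiple, provided the chunk size itself is a multiple of the period.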

Schema mismatch mid-backfill. A provider renamed a metric or changed a dimension partway through the historical range. The validation gate routes these to the dead-letter queue with the original payload preserved; inspect the DLQ, extend the schema contract in schema validation for billing data, and re-run only the affected windows — the checkpoint registry means completed windows are skipped automatically.

Backfill re-prices at list rate, not invoice rate. The effective-rate lookup is falling back to on-demand pricing. Cross-check against the amortized reconciliation in query execution cost modeling and the fallback behaviour in fallback routing for cost APIs so a missing rate surfaces as a flagged, re-runnable window rather than a silently wrong number.

Related

System View Querying Patterns — safe engine-level polling of pg_stat_activity, V$SESSION, and warehouse query history that feeds historical extraction.
Async Usage Parsing Workflows — semaphore-controlled concurrency for fanning a backfill across hundreds of instances.
Schema Validation for Billing Data — the contract gate that quarantines malformed historical records before they reach the ledger.
Error Handling in Cost Pipelines — retry decorators, circuit breakers, and compensating rollbacks shared by every batch job.
Real-Time Metric Streaming Setup — the live path a completed backfill hands off to for continuous cost attribution.
