Implementing retry logic for failed metric pulls

This page shows the exact async Python needed to retry a failed cloud billing metric pull — an AWS Cost Explorer get_cost_and_usage call that throttled or timed out — without amplifying the outage or corrupting a chargeback ledger.

A single dropped metric pull is never just a missing row. For a pipeline that feeds quota enforcement and chargeback, one silently failed get_cost_and_usage window understates a tenant’s spend, skews the allocation, and can leave a database cluster provisioning against a stale budget. The naive fixes — a bare time.sleep(5) loop or an unbounded while True retry — make it worse: they synchronise every worker into a thundering herd that hammers a billing API already returning ThrottlingException, and they retry payloads that were never going to succeed. The pattern below classifies each failure first, backs off with full jitter, and trips a circuit breaker so a sustained outage short-circuits to the degradation path instead of stalling the pipeline. It sits directly under the async semaphore-controlled parsing that drives concurrency, and hands off to graceful degradation when billing APIs are down once retries are exhausted — retries recover from a blip, degradation survives a sustained outage.

The flowchart below traces the decision path a single metric pull follows through classification, backoff, and circuit-breaker enforcement.

Prerequisites

Before running the retry engine, confirm the following are in place.

Cloud permissions: the identity running the extractor needs read-only access to cost and usage data — nothing more. On AWS, attach a minimal policy scoped to Cost Explorer; this least-privilege posture is part of broader access control for cost data.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyCostExplorer",
      "Effect": "Allow",
      "Action": [
        "ce:GetCostAndUsage"
      ],
      "Resource": "*"
    }
  ]
}

Python: 3.10 or newer (the code uses X | None unions and modern asyncio APIs).
Libraries: install the async AWS SDK. aioboto3 pulls in aiobotocore and botocore, which supply the real exception types the classifier keys on.
```
pip install "aioboto3>=13.0"
```

Step-by-Step Implementation

The engine wraps any async metric pull in a retry loop that (1) classifies the failure, (2) computes a jittered backoff delay, (3) records failures against a sliding-window breaker, and (4) returns a typed outcome the caller can branch on. Each step is independently testable.

Step 1 — Classify transient versus terminal failures

Not every failure warrants a retry. Retrying a 400 or an expired credential just burns API quota and delays the real fix. Classify at both the transport and application layers before deciding to back off.

import asyncio
from botocore.exceptions import (
    ClientError,
    EndpointConnectionError,
    ConnectTimeoutError,
    ReadTimeoutError,
)

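# Cost Explorer / STS error codes that are safe to retry.
# LimitExceededException is Cost Explorer's "too many calls" error; it is
# listed by code because it may arrive with a 4xx status rather than 429.
TRANSIENT_ERROR_CODES = {
    "ThrottlingException", "Throttling", "TooManyRequestsException",
    "RequestLimitExceeded", "LimitExceededException",
    "ServiceUnavailable", "InternalFailure",
}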
TRANSIENT_STATUS = {429, 500, 502, 503, 504}


def is_transient(exc: Exception) -> bool:
    """Return True when the failure is worth retrying, False to fail fast."""
    if isinstance(exc, (EndpointConnectionError, ConnectTimeoutError,
                        ReadTimeoutError, asyncio.TimeoutError)):
        return True
    if isinstance(exc, ClientError):
        code = exc.response.get("Error", {}).get("Code", "")
        status = exc.response.get("ResponseMetadata", {}).get("HTTPStatusCode")
        return code in TRANSIENT_ERROR_CODES or status in TRANSIENT_STATUS
    return False


# Expected:
#   is_transient(asyncio.TimeoutError())                       -> True
#   is_transient(ClientError({"Error": {"Code": "Throttling"}}, "op"))  -> True
#   is_transient(ClientError({"Error": {"Code": "AccessDenied"}}, "op")) -> False

Step 2 — Compute the backoff delay with full jitter

Blind fixed retries re-synchronise workers into a thundering herd. Full jitter spreads attempts across a random interval while still guaranteeing an upper bound. The delay for attempt $n$ (zero-indexed) is drawn uniformly from zero to a capped exponential ceiling:

$$\text{delay}_n = U\!\left(0,\ \min\!\left(\text{cap},\ \text{base} \cdot 2^{\,n}\right)\right)$$

import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform over [0, capped ceiling]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


# Expected (ceilings, not exact values — the draw is random):
#   attempt 0 -> [0, 1s]    attempt 3 -> [0, 8s]
#   attempt 1 -> [0, 2s]    attempt 6 -> [0, 30s]  (capped)

Step 3 — Track failures with a sliding-window circuit breaker

Retries handle a blip; a breaker handles a sustained outage. It counts failures inside a rolling window and, once the threshold is crossed, reports open so the caller can stop hammering a degraded endpoint and reserve the worker pool for healthy tenants.

import time


class CircuitBreaker:
    def __init__(self, threshold: int = 10, window: float = 60.0):
        self.threshold = threshold
        self.window = window
        self._failures: list[float] = []

    def record_failure(self) -> None:
        now = time.monotonic()  # monotonic: an NTP correction can't skew the window
        self._failures.append(now)
        self._failures = [t for t in self._failures if t > now - self.window]

    def is_open(self) -> bool:
        cutoff = time.monotonic() - self.window
        self._failures = [t for t in self._failures if t > cutoff]
        return len(self._failures) >= self.threshold

    def reset(self) -> None:
        self._failures.clear()

Step 4 — Wrap the metric pull in the retry engine

The engine takes a factory — a zero-argument callable that returns a fresh coroutine — because a coroutine object can only be awaited once; retrying the same awaited object raises RuntimeError. It returns a (payload, RetryOutcome) tuple so the caller can route success, exhaustion, an open breaker, and a terminal error down different paths.

import logging
from dataclasses import dataclass
from enum import Enum
from typing import Any, Awaitable, Callable

import aioboto3

logger = logging.getLogger(__name__)


class RetryOutcome(Enum):
    SUCCESS = "success"
    EXHAUSTED = "exhausted"
    CIRCUIT_OPEN = "circuit_open"
    TERMINAL_ERROR = "terminal_error"


@dataclass
class RetryConfig:
    max_attempts: int = 5
    base_delay: float = 1.0
    max_delay: float = 30.0


async def pull_with_retry(
    factory: Callable[[], Awaitable[Any]],
    config: RetryConfig,
    breaker: CircuitBreaker,
) -> tuple[Any, RetryOutcome]:
    """Retry an async metric pull with full jitter and circuit breaking."""
    if breaker.is_open():
        logger.warning("Circuit breaker open; short-circuiting metric pull")
        return None, RetryOutcome.CIRCUIT_OPEN

    for attempt in range(config.max_attempts):
        try:
            payload = await factory()          # fresh coroutine each attempt
            breaker.reset()
            return payload, RetryOutcome.SUCCESS
        except Exception as exc:
            if not is_transient(exc):
                logger.error("Terminal error, failing fast: %s", exc)
                return None, RetryOutcome.TERMINAL_ERROR

            breaker.record_failure()
            if breaker.is_open():
                logger.warning("Breaker tripped mid-retry; short-circuiting")
                return None, RetryOutcome.CIRCUIT_OPEN
            if attempt == config.max_attempts - 1:
                break

            delay = backoff_delay(attempt, config.base_delay, config.max_delay)
            logger.warning(
                "Transient failure on attempt %d/%d; retrying in %.2fs (%s)",
                attempt + 1, config.max_attempts, delay, exc,
            )
            await asyncio.sleep(delay)

    logger.error("Retries exhausted after %d attempts", config.max_attempts)
    return None, RetryOutcome.EXHAUSTED


def cost_explorer_factory(start: str, end: str) -> Callable[[], Awaitable[Any]]:
    """Bind a Cost Explorer window into a re-awaitable factory."""
    async def _call() -> list[dict]:
        session = aioboto3.Session()
        async with session.client("ce", region_name="us-east-1") as ce:
            resp = await ce.get_cost_and_usage(
                TimePeriod={"Start": start, "End": end},
                Granularity="DAILY",
                Metrics=["UnblendedCost"],
                GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
            )
            return resp["ResultsByTime"]
    return _call


# Usage:
#   breaker = CircuitBreaker(threshold=10, window=60.0)
#   factory = cost_explorer_factory("2026-06-01", "2026-06-02")
#   results, outcome = await pull_with_retry(factory, RetryConfig(), breaker)
#   if outcome is RetryOutcome.SUCCESS:
#       ingest(results)

Because the retry loop is generic over the factory, the same engine wraps a Postgres system-view cost query or an Azure Cost Management pull without change — only the coroutine inside the factory differs.
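
One caveat on the Cost Explorer factory above: get_cost_and_usage can spread a response across several pages, signalled by a NextPageToken field. The sketch below shows one way to drain every page inside the coroutine so that a single retryable unit covers the whole window. FakeCE is a hypothetical stand-in that only exists to make the example runnable without AWS; in real code `ce` would be the aioboto3 client from the factory.

```python
import asyncio
from typing import Any


async def fetch_all_pages(ce: Any, start: str, end: str) -> list[dict]:
    """Collect ResultsByTime across pages by following NextPageToken."""
    results: list[dict] = []
    kwargs: dict[str, Any] = {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
    }
    while True:
        resp = await ce.get_cost_and_usage(**kwargs)
        results.extend(resp.get("ResultsByTime", []))
        token = resp.get("NextPageToken")
        if not token:
            return results
        kwargs["NextPageToken"] = token  # ask for the next page


class FakeCE:
    """Hypothetical stand-in for the Cost Explorer client: serves two pages."""

    def __init__(self) -> None:
        self.calls = 0

    async def get_cost_and_usage(self, **kwargs: Any) -> dict:
        self.calls += 1
        if "NextPageToken" not in kwargs:
            return {"ResultsByTime": [{"day": 1}], "NextPageToken": "page-2"}
        return {"ResultsByTime": [{"day": 2}]}


fake = FakeCE()
pages = asyncio.run(fetch_all_pages(fake, "2026-06-01", "2026-06-03"))
print(pages)  # [{'day': 1}, {'day': 2}]
```

If a later page throttles, the engine retries the whole window from the first page. That repeats some requests but keeps each payload internally consistent.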

Verification

Confirm the engine behaves correctly before it guards real ingestion. A deterministic fake that fails a fixed number of times, then succeeds, exercises the whole path without touching AWS.

import asyncio


async def verify() -> None:
    breaker = CircuitBreaker(threshold=10, window=60.0)
    attempts = {"n": 0}

    def flaky_factory():
        async def _call():
            attempts["n"] += 1
            if attempts["n"] < 3:                # fail twice, then succeed
                raise asyncio.TimeoutError("simulated timeout")
            return [{"TimePeriod": {"Start": "2026-06-01"}, "cost": "12.00"}]
        return _call

    payload, outcome = await pull_with_retry(
        flaky_factory(), RetryConfig(base_delay=0.01, max_delay=0.05), breaker
    )
    assert outcome is RetryOutcome.SUCCESS, outcome
    assert attempts["n"] == 3, "should succeed on the third attempt"
    print("retry recovered after", attempts["n"], "attempts:", payload)


asyncio.run(verify())
# -> retry recovered after 3 attempts: [{'TimePeriod': {'Start': '2026-06-01'}, 'cost': '12.00'}]

A successful pull hands back the raw ResultsByTime list, each element shaped like {"TimePeriod": {"Start": "2026-06-01", "End": "2026-06-02"}, "Groups": [{"Keys": ["Amazon RDS"], "Metrics": {"UnblendedCost": {"Amount": "12.4100", "Unit": "USD"}}}], "Total": {}} — feed that straight into strict schema validation for billing data before it reaches the ledger.
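
Before validation it often helps to flatten that nested shape into one row per group. The following is a minimal sketch that assumes the element shape shown above; the row layout (start, end, keys, metric, amount, unit) is a hypothetical choice for illustration, not a schema the pipeline prescribes.

```python
from decimal import Decimal


def flatten_results(results_by_time: list[dict]) -> list[dict]:
    """Turn nested ResultsByTime entries into one flat row per group and metric."""
    rows = []
    for window in results_by_time:
        period = window["TimePeriod"]
        for group in window.get("Groups", []):
            for metric, value in group["Metrics"].items():
                rows.append({
                    "start": period["Start"],
                    "end": period["End"],
                    "keys": tuple(group["Keys"]),
                    "metric": metric,
                    # Decimal keeps the string amount exact; float would round.
                    "amount": Decimal(value["Amount"]),
                    "unit": value["Unit"],
                })
    return rows


sample = [{
    "TimePeriod": {"Start": "2026-06-01", "End": "2026-06-02"},
    "Groups": [{
        "Keys": ["Amazon RDS"],
        "Metrics": {"UnblendedCost": {"Amount": "12.4100", "Unit": "USD"}},
    }],
    "Total": {},
}]
rows = flatten_results(sample)
print(rows[0]["keys"], rows[0]["amount"])  # ('Amazon RDS',) 12.4100
```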

Gotchas & Edge Cases

Retry a factory, never a coroutine. Awaiting the same coroutine object twice raises RuntimeError: cannot reuse already awaited coroutine. The zero-argument factory rebuilds the coroutine — and its client session — on every attempt; this is the single most common bug when bolting retries onto existing async code.
ThrottlingException means slow down, not fail. Cost Explorer enforces a low request-per-second ceiling. Treat it as transient, but pair the backoff with async semaphore-controlled concurrency so you are not retrying your way through a wall you built by fanning out too wide in the first place.
Per-tenant breakers, not one global breaker. A shared circuit breaker lets one noisy neighbour that exhausts its API quota trip the breaker for every healthy tenant. Key a breaker per account, cluster, or resource group so a single bad tenant is isolated.
A successful retry can still be a bad window. Cost Explorer often returns a provisional figure for the current day that is later restated. A pull that succeeds is not final — reconcile recent windows against batch historical aggregation rather than trusting the first success.
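Bound total retry time inside batch jobs. With the defaults, the backoff sleeps alone add up to at most 15 seconds (1 + 2 + 4 + 8), but each attempt can also block for its full connect and read timeout, so five attempts can stall a worker for well over a minute. Wrap the whole call in asyncio.wait_for() so a hard job deadline, such as an end-of-month reconciliation window, is never blown by one stubborn endpoint.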
Blended versus unblended is a dimension mismatch, not an error. If a retry succeeds against UnblendedCost but downstream expects blended figures, the numbers will look wrong and tempt a “fix” via more retries. Normalise both sides first through normalizing provider billing exports into a unified schema.

Frequently Asked Questions

How many retry attempts should I configure?

For interactive quota checks, three to five attempts against a 30-second cap balances recovery against latency. For batch backfills that can tolerate more delay, raise the cap rather than the attempt count — more attempts inside a tight window mostly just delay the inevitable fall-through to the degradation path. Always bound the total with asyncio.wait_for() so retries can never exceed the job’s deadline.
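
To size these numbers, it can help to compute the worst-case sleep budget for a given configuration. The sketch below mirrors the loop in Step 4, which sleeps once fewer than it attempts; backoff_ceilings is a hypothetical helper introduced here, not part of the engine.

```python
def backoff_ceilings(max_attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Upper bound of each sleep; there is one sleep fewer than attempts."""
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


defaults = backoff_ceilings(5)
print(defaults, sum(defaults))  # [1.0, 2.0, 4.0, 8.0] 15.0

longer = backoff_ceilings(8)
print(longer, sum(longer))      # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0] 91.0
```

With a base of one second, the 30-second cap is not reached within five attempts, so raising the cap only changes behaviour once base_delay or max_attempts is larger as well. These figures cover sleeps only; time spent inside each call is extra.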

Why full jitter instead of plain exponential backoff?

Plain exponential backoff still fires every worker at the same instant — base * 2^n is deterministic — so a fleet recovering from an outage re-synchronises into a second thundering herd. Full jitter draws each delay uniformly from zero up to the exponential ceiling, decorrelating the workers while keeping the same worst-case bound. AWS’s own architecture guidance settled on full jitter for exactly this reason.
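
A small simulation makes the difference visible. This is an illustrative sketch, not a benchmark: it draws the fourth-attempt delay for 100 hypothetical workers under each strategy, and the seed is there only so the demo is repeatable.

```python
import random


def plain_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Deterministic exponential backoff: every worker gets the same value."""
    return min(cap, base * 2 ** attempt)


def jittered_delay(rng: random.Random, attempt: int,
                   base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: uniform draw between zero and the same ceiling."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


rng = random.Random(42)  # seeded only so the demo is repeatable
workers = 100
plain = {plain_delay(3) for _ in range(workers)}
jittered = {jittered_delay(rng, 3) for _ in range(workers)}

print(len(plain))                        # 1: all workers wake at the same instant
print(len(jittered) > 90)                # True: wake-ups are spread across [0, 8s]
print(max(jittered) <= plain_delay(3))   # True: same worst-case bound
```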

What should happen when the circuit breaker opens?

An open breaker means retrying is pointless: the endpoint is in a sustained outage, not a momentary blip. Stop calling it and hand off to graceful degradation when billing APIs are down, which serves the last known-good snapshot from a local cache and switches enforcement to conservative mode. The breaker converts a retry problem into a degradation problem, which is a much safer place to be.
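
One possible shape for that hand-off is sketched below. The route function, the dict-based cache, and the "live" and "conservative" mode labels are all hypothetical, and raising on a terminal error is an assumption chosen so that stale data cannot hide a bad request or expired credential.

```python
from enum import Enum
from typing import Any


class RetryOutcome(Enum):  # mirrors the enum from Step 4
    SUCCESS = "success"
    EXHAUSTED = "exhausted"
    CIRCUIT_OPEN = "circuit_open"
    TERMINAL_ERROR = "terminal_error"


def route(outcome: RetryOutcome, payload: Any, cache: dict, key: str) -> tuple[Any, str]:
    """Return (data, mode); mode tells enforcement how much to trust the data."""
    if outcome is RetryOutcome.SUCCESS:
        cache[key] = payload  # refresh the last known-good snapshot
        return payload, "live"
    if outcome in (RetryOutcome.EXHAUSTED, RetryOutcome.CIRCUIT_OPEN):
        return cache.get(key), "conservative"  # stale data, cautious enforcement
    # TERMINAL_ERROR: serving stale data here would mask the real fault.
    raise RuntimeError(f"terminal failure for {key}; needs operator attention")


cache: dict = {}
live = route(RetryOutcome.SUCCESS, [{"cost": "12.00"}], cache, "tenant-a")
degraded = route(RetryOutcome.CIRCUIT_OPEN, None, cache, "tenant-a")
print(live)      # ([{'cost': '12.00'}], 'live')
print(degraded)  # ([{'cost': '12.00'}], 'conservative')
```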

Do I need a separate circuit breaker per tenant?

Yes, if fairness matters. One global breaker means a single tenant that exhausts its own API quota can short-circuit pulls for everyone. Keying breakers per account or per database cluster contains the blast radius so a noisy neighbour degrades only its own metrics, not the whole estate.
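
A minimal sketch of per-tenant keying, assuming tenants are identified by a string key. The registry is a hypothetical addition; the breaker is a compact copy of the Step 3 class so the example runs on its own, and the threshold of 3 is chosen only to keep the demo short.

```python
import time
from collections import defaultdict


class CircuitBreaker:
    """Compact copy of the Step 3 breaker so this example is self-contained."""

    def __init__(self, threshold: int = 10, window: float = 60.0):
        self.threshold = threshold
        self.window = window
        self._failures: list[float] = []

    def record_failure(self) -> None:
        self._failures.append(time.monotonic())

    def is_open(self) -> bool:
        cutoff = time.monotonic() - self.window
        self._failures = [t for t in self._failures if t > cutoff]
        return len(self._failures) >= self.threshold


# One breaker per tenant key, created lazily on first use.
breakers: defaultdict[str, CircuitBreaker] = defaultdict(
    lambda: CircuitBreaker(threshold=3, window=60.0)
)

for _ in range(3):
    breakers["noisy-tenant"].record_failure()

print(breakers["noisy-tenant"].is_open())    # True
print(breakers["healthy-tenant"].is_open())  # False
```

Each call to pull_with_retry would then receive breakers[tenant_id] instead of a shared instance.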

How do I keep retries from double-counting cost records?

Make the ledger write idempotent, not the retry. Key each write on a deterministic reconciliation key (tenant + resource + billing window) so that a payload which arrives twice, for example when an attempt succeeded upstream but the client saw a timeout and pulled the same window again, upserts rather than duplicates. Retry safety lives in the write path, not the fetch path.
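
As an illustration, the sketch below uses an in-memory SQLite table whose primary key is the reconciliation key. The table name, column names, and the choice to store the amount as text are assumptions made for the example; a production ledger would use its own store and schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger (
        tenant         TEXT NOT NULL,
        resource       TEXT NOT NULL,
        billing_window TEXT NOT NULL,
        amount         TEXT NOT NULL,
        PRIMARY KEY (tenant, resource, billing_window)
    )
    """
)


def write_cost(conn: sqlite3.Connection, tenant: str, resource: str,
               billing_window: str, amount: str) -> None:
    """Upsert on the reconciliation key so a repeated payload overwrites."""
    conn.execute(
        "INSERT OR REPLACE INTO ledger (tenant, resource, billing_window, amount) "
        "VALUES (?, ?, ?, ?)",
        (tenant, resource, billing_window, amount),
    )
    conn.commit()


# The same payload delivered twice leaves exactly one row.
write_cost(conn, "tenant-a", "Amazon RDS", "2026-06-01", "12.4100")
write_cost(conn, "tenant-a", "Amazon RDS", "2026-06-01", "12.4100")
row_count = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
print(row_count)  # 1
```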

Related

Graceful degradation when billing APIs are down: where an exhausted retry or open breaker hands off to serve cached cost data.
Async usage parsing workflows: semaphore-controlled concurrency that keeps you from retrying into a self-inflicted throttle.
Schema validation for billing data: validate a successfully retried payload before it reaches the ledger.
Error Handling in Cost Pipelines: the parent topic covering retries, circuit breakers, and idempotent writes end to end.

Python’s native asyncio event loop supplies the scheduling and timeout primitives this pattern relies on — see the official asyncio documentation — and the backoff strategy follows the AWS Architecture Blog’s guidance on exponential backoff and jitter.
