Graceful degradation when billing APIs are down

This page shows the Python needed to keep quota enforcement alive when a cloud provider’s billing API is throttling, timing out, or fully down: serve cost data from a local fallback cache instead of blocking on the dead upstream.

A billing outage is not just a reporting gap. For a control plane that gates provisioning, it is an availability incident. If enforcement threads block synchronously on GetCostAndUsage or the Cost Management Query API, one provider-side latency spike stalls every quota decision and either freezes provisioning entirely or, worse, fails open and lets spend run unchecked. The fix is to treat cost data as eventually consistent: an asynchronous client populates a highly available local cache, and enforcement always reads from that cache, flagging records as degraded when the live feed is stale. This complements the jittered retries in implementing retry logic for failed metric pulls and the endpoint-level failover in fallback routing for cost APIs: retries recover from a blip, degradation survives a sustained outage.

Prerequisites

Before running the client, confirm the following are in place.

Cloud permissions: the identity running the extractor needs read-only access to cost and usage data — nothing more. On AWS, attach a minimal policy scoped to Cost Explorer; this least-privilege posture is part of broader access control for cost data.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyCostExplorer",
      "Effect": "Allow",
      "Action": [
        "ce:GetCostAndUsage",
        "ce:GetCostForecast"
      ],
      "Resource": "*"
    }
  ]
}

Python: 3.10 or newer (the code uses X | None unions and modern asyncio APIs).
Libraries: install the async HTTP client. sqlite3 ships with the standard library, so no extra dependency is needed for the fallback store.
```
pip install "httpx>=0.27"
```
A warm cache: the fallback is only useful if it holds a recent known-good snapshot. Seed it during normal operation (every successful fetch writes through) and pre-warm it before peak provisioning windows so a cold start never coincides with an outage.

Step-by-Step Implementation

The client fetches per-cluster billing metrics over HTTP, writes every success through to a local SQLite store, and — when a circuit breaker is open or the request fails — serves the last known-good snapshot with an is_degraded flag set. Downstream enforcement reads that flag and tightens its behavior.

Step 1 — Model the circuit state and the degraded record

Define the three breaker states, the metric record (carrying a degradation flag), and a typed exception, so callers can distinguish “served stale” (the flag) from “no data at all” (the exception).

import time
from dataclasses import dataclass
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # live requests flow normally
    OPEN = "open"            # upstream is failing; serve cache
    HALF_OPEN = "half_open"  # allow one probe to test recovery


@dataclass
class BillingMetrics:
    cluster_id: str
    compute_cost: float
    storage_cost: float
    timestamp: float
    is_degraded: bool = False


class BillingAPIDownError(Exception):
    """Raised when the API is down AND no valid cache entry exists."""


# Expected:
#   BillingMetrics("db-1", 12.0, 3.5, time.time()).is_degraded  -> False
#   CircuitState.OPEN.value                                     -> "open"

Step 2 — Build the local SQLite fallback store

The cache is a single table keyed on cluster_id. INSERT OR REPLACE makes the write-through idempotent, so a retry storm that replays the same snapshot still leaves exactly one row per cluster. Reads always return the record with is_degraded=True, because a cache hit only happens when the live path could not be used.

import sqlite3
from contextlib import closing
from pathlib import Path


class FallbackCache:
    # Pass a file path: every method opens a fresh connection, so ":memory:"
    # would hand each call a new, empty database.
    def __init__(self, cache_path: str = "billing_cache.db"):
        self.cache_path = Path(cache_path)
        # closing() releases the handle; the inner `conn` context commits.
        with closing(sqlite3.connect(self.cache_path)) as conn, conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS billing_cache (
                    cluster_id   TEXT PRIMARY KEY,
                    compute_cost REAL,
                    storage_cost REAL,
                    last_updated REAL
                )
            """)

    def write_through(self, metrics: BillingMetrics) -> None:
        with closing(sqlite3.connect(self.cache_path)) as conn, conn:
            conn.execute(
                "INSERT OR REPLACE INTO billing_cache VALUES (?, ?, ?, ?)",
                (metrics.cluster_id, metrics.compute_cost,
                 metrics.storage_cost, metrics.timestamp),
            )

    def read(self, cluster_id: str) -> BillingMetrics | None:
        with closing(sqlite3.connect(self.cache_path)) as conn:
            row = conn.execute(
                "SELECT compute_cost, storage_cost, last_updated "
                "FROM billing_cache WHERE cluster_id = ?",
                (cluster_id,),
            ).fetchone()
        if row is None:
            return None
        return BillingMetrics(
            cluster_id=cluster_id,
            compute_cost=row[0],
            storage_cost=row[1],
            timestamp=row[2],
            is_degraded=True,  # a cache read is always a degraded read
        )
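
To see why the write-through is safe to replay, the following standalone sketch runs the same INSERT OR REPLACE statement against a throwaway in-memory database. It uses a single connection only so the table survives between statements; the figures are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # one connection, so the table persists
conn.execute(
    "CREATE TABLE billing_cache ("
    "cluster_id TEXT PRIMARY KEY, compute_cost REAL, "
    "storage_cost REAL, last_updated REAL)"
)

upsert = "INSERT OR REPLACE INTO billing_cache VALUES (?, ?, ?, ?)"

# A retry storm replays the same successful fetch three times.
for _ in range(3):
    conn.execute(upsert, ("db-1", 12.0, 3.5, 1751000000.0))

# A later fetch for the same cluster replaces the row rather than adding one.
conn.execute(upsert, ("db-1", 14.0, 3.5, 1751000600.0))

rows = conn.execute(
    "SELECT cluster_id, compute_cost, last_updated FROM billing_cache"
).fetchall()
print(rows)
conn.close()
```

Note that REPLACE is last-writer-wins: it does not compare timestamps, so a delayed write carrying an older snapshot would overwrite a newer one. If out-of-order writes are possible in your pipeline, that case needs its own guard.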

Step 3 — Wire the circuit breaker and fallback fetch

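The breaker trips open after failure_threshold consecutive failures, then skips the live call and serves from the cache for recovery_timeout seconds. Once that window elapses it moves to half-open and lets a probe through; as written, every call that arrives while the breaker is half-open is allowed through, not only the first. A successful live fetch writes through and closes the breaker; a failed one falls back to the cache, so a warm cache still returns usable (degraded) data instead of raising.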

The code below traces one fetch through the breaker, the live call, and the cache fallback.

import logging
import httpx

logger = logging.getLogger(__name__)


class ResilientBillingClient:
    def __init__(
        self,
        api_url: str,
        cache: FallbackCache,
        timeout: float = 5.0,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
    ):
        self.api_url = api_url
        self.cache = cache
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout

        self._state = CircuitState.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._client = httpx.AsyncClient(timeout=timeout)

    def _allow_request(self) -> bool:
        """Return True if the breaker permits a live call right now."""
        if self._state is CircuitState.OPEN:
            if time.monotonic() - self._opened_at >= self.recovery_timeout:
                self._state = CircuitState.HALF_OPEN
                logger.info("Circuit breaker HALF_OPEN: probing billing API")
                return True
            return False
        return True

    def _record_success(self) -> None:
        self._failures = 0
        self._state = CircuitState.CLOSED

    def _record_failure(self) -> None:
        self._failures += 1
        self._opened_at = time.monotonic()
        if self._failures >= self.failure_threshold:
            self._state = CircuitState.OPEN
            logger.warning("Circuit breaker OPEN: serving degraded cache")

    async def fetch(self, cluster_id: str) -> BillingMetrics:
        # Breaker open: skip the doomed call and serve cache if we have it.
        if not self._allow_request():
            cached = self.cache.read(cluster_id)
            if cached:
                logger.debug("Breaker open; serving degraded cache for %s", cluster_id)
                return cached
            raise BillingAPIDownError("Circuit open and no local fallback available")

        try:
            resp = await self._client.get(f"{self.api_url}/metrics/{cluster_id}")
            resp.raise_for_status()
            data = resp.json()
            metrics = BillingMetrics(
                cluster_id=cluster_id,
                compute_cost=data["compute_cost"],
                storage_cost=data["storage_cost"],
                timestamp=time.time(),
            )
            self.cache.write_through(metrics)
            self._record_success()
            return metrics
        except (httpx.HTTPStatusError, httpx.RequestError, KeyError, ValueError) as exc:
            # ValueError covers json.JSONDecodeError raised by resp.json() on a non-JSON body.
            self._record_failure()
            logger.error("Billing API request failed: %s", exc)
            cached = self.cache.read(cluster_id)
            if cached:
                return cached
            raise BillingAPIDownError("Primary API failed and cache is cold") from exc
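
The state transitions are easier to check without a network in the loop. The sketch below is a simplified stand-in for the breaker logic above, not the client itself: `Breaker`, `State`, and the injectable `clock` parameter are illustrative names, and the fake clock replaces time.monotonic() so the recovery window can be advanced by hand.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Breaker:
    """Simplified stand-in for the breaker logic, with an injectable clock."""

    def __init__(self, clock, failure_threshold=3, recovery_timeout=30.0):
        self.clock = clock
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN
                return True
            return False
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        self.opened_at = self.clock()
        if self.failures >= self.failure_threshold:
            self.state = State.OPEN


now = [0.0]                              # fake monotonic clock, advanced by hand
breaker = Breaker(clock=lambda: now[0])
trace = []

for _ in range(3):                       # three consecutive failures trip it
    breaker.record_failure()
trace.append(breaker.state)              # OPEN

now[0] = 10.0
trace.append(breaker.allow_request())    # False: still inside recovery_timeout

now[0] = 30.0
trace.append(breaker.allow_request())    # True: probe admitted
trace.append(breaker.state)              # HALF_OPEN

breaker.record_success()
trace.append(breaker.state)              # CLOSED
print(trace)
```

Injecting the clock is a testing convenience assumed here; the client above reads time.monotonic() directly, which behaves the same way in production.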

Step 4 — Enforce conservatively while degraded

When is_degraded is True, downstream enforcement must stop trusting exact figures and switch to policy-driven guardrails. The safe default is to bound spend from above: estimate cost at the highest known unit rate rather than the last observed rate, so an outage can never understate a tenant’s spend and let it run away unchecked.

$$C_{\text{degraded}} = \text{quantity} \times \text{max\_unit\_cost}$$

def degraded_enforcement(metrics: BillingMetrics, staleness_s: float,
                         max_staleness_s: float = 3600.0) -> str:
    """Pick an enforcement mode from cache freshness and degradation state."""
    if not metrics.is_degraded and staleness_s < max_staleness_s:
        return "active"       # full soft/hard limit enforcement
    if staleness_s >= max_staleness_s:
        return "frozen"       # cache too old to trust; hold all provisioning
    return "conservative"     # enforce against highest known unit cost


# Expected:
#   degraded_enforcement(live_fresh, 30)      -> "active"
#   degraded_enforcement(cached_hit, 120)     -> "conservative"
#   degraded_enforcement(cached_hit, 7200)    -> "frozen"

This mirrors the rule used across Error Handling in Cost Pipelines: fail closed on new provisioning, fail open on observation. Existing workloads keep running against the last known-good snapshot; new capacity is held until live data returns and reconciliation — via batch historical aggregation — backfills the missing windows. The conservative estimate itself feeds directly into the hard and soft quota limits that gate provisioning.
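
As a worked instance of the conservative estimate, the sketch below bills a quantity at the highest unit rate on record. `conservative_cost` is a hypothetical helper and the rates and quantity are invented for illustration; how you collect the known unit costs is left open.

```python
def conservative_cost(quantity: float, known_unit_costs: list[float]) -> float:
    """Upper-bound estimate: bill the quantity at the highest known unit rate."""
    if not known_unit_costs:
        raise ValueError("no known unit costs; cannot bound spend")
    return quantity * max(known_unit_costs)


# Hypothetical figures: 40 vCPU-hours, three unit rates seen for this cluster.
rates = [0.25, 0.5, 0.375]          # the last observed rate is 0.375
bound = conservative_cost(40, rates)
last_observed = 40 * rates[-1]
print(bound, last_observed)         # 20.0 15.0
```

The gap between the two numbers is the safety margin: enforcement that checks limits against 20.0 trips earlier than one that trusts the last observed 15.0.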

Verification

Confirm the fallback actually engages before you rely on it in an incident.

Force the breaker open and assert a degraded read. Point the client at an unreachable URL after seeding one record, then check that fetch returns cached data flagged degraded rather than raising.

import asyncio
import tempfile
from pathlib import Path

# Use a real file: FallbackCache opens a new connection per call, and each
# new ":memory:" connection would start with an empty database.
cache_file = Path(tempfile.mkdtemp()) / "billing_cache.db"
cache = FallbackCache(str(cache_file))
cache.write_through(BillingMetrics("db-1", 12.0, 3.5, time.time()))

client = ResilientBillingClient("http://127.0.0.1:1", cache,
                                failure_threshold=1, timeout=0.2)
result = asyncio.run(client.fetch("db-1"))
assert result.is_degraded is True, "expected a degraded cache read"
assert result.compute_cost == 12.0, "cached figure should survive the outage"
print("degraded fallback OK:", result)

Confirm the cold-cache failure is explicit. A database cluster with no seeded row must raise BillingAPIDownError, never return silent zeros — a zero would read as “free” and disable enforcement.
```
try:
    asyncio.run(client.fetch("never-seeded"))
except BillingAPIDownError:
    print("cold cache correctly refused to fabricate data")
```

The degraded record shape enforcement should expect is a BillingMetrics instance with is_degraded=True and a timestamp older than the current sync interval — for example BillingMetrics(cluster_id='db-1', compute_cost=12.0, storage_cost=3.5, timestamp=1751000000.0, is_degraded=True).

Gotchas & Edge Cases

A stale cache is not a fresh one — track staleness explicitly. The breaker only knows the upstream is down; it says nothing about how old the cached figure is. Always compare time.time() - metrics.timestamp against a max-staleness threshold and freeze provisioning once the snapshot is too old to defend to a finance team.
time.monotonic() for the breaker, time.time() for the record. The breaker’s timers must use monotonic() so an NTP correction cannot accidentally reopen or hold a circuit; the record timestamp must use wall-clock time.time() so staleness is comparable to real billing windows. Mixing them silently breaks both.
Cold start during an outage is the worst case. If the process restarts while the API is down, an in-memory or unseeded cache has nothing to serve. Persist the SQLite file on durable storage and pre-warm it, rather than relying on write-through alone.
Blended vs unblended drift looks like corruption. If the live path returns unblended cost but a stale cache holds a blended figure (or vice versa), enforcement sees a jump when the breaker closes. Normalize both sides first — see normalizing provider billing exports into a unified schema — so a recovery is not misread as a spend spike.
Half-open thundering herd. With many workers, every one may probe at the same instant the recovery timeout elapses and re-hammer a still-fragile API. Jitter the recovery_timeout per worker, or gate probes through a shared lock, so recovery is sampled rather than stampeded.
Never let a KeyError masquerade as an outage. The client treats a missing compute_cost key as a failure and serves cache — correct for a truncated payload, but a permanent schema change would then hide behind the breaker forever. Alert on sustained degradation so a contract change is caught, not silently absorbed.

Frequently Asked Questions

How long should the cache serve stale data before I stop enforcing at all?

Tie it to your billing granularity, not a fixed number. For daily-granularity cost data, a snapshot a few hours old is still defensible for conservative enforcement; past a full billing window it is not. Set max_staleness_s to well under one billing period, return "conservative" up to that point, and "frozen" beyond it so provisioning halts rather than trusting a figure you cannot reconcile.
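
One way to derive the threshold from the billing period, sketched under assumptions: `max_staleness_for` and `staleness_s` are hypothetical helpers, and the default fraction of one quarter of a period is an illustrative choice for "well under one billing period", not a prescribed value.

```python
def max_staleness_for(billing_period_s: float, fraction: float = 0.25) -> float:
    """Derive the freeze threshold as a fraction of one billing period."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return billing_period_s * fraction


def staleness_s(record_timestamp: float, now: float) -> float:
    """Age of a cached record in seconds, clamped so skew never goes negative."""
    return max(0.0, now - record_timestamp)


DAY_S = 86_400
limit = max_staleness_for(DAY_S)                # 21600.0 seconds, i.e. 6 hours
age = staleness_s(1751000000.0, 1751007200.0)   # snapshot is 2 hours old
mode = "frozen" if age >= limit else "conservative"
print(limit, age, mode)                         # 21600.0 7200.0 conservative
```

The computed limit is what you would pass as max_staleness_s to degraded_enforcement in Step 4.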

Should degraded mode fail open or fail closed?

Split the decision by action. Fail open for observation — existing workloads keep running on the last known-good snapshot so a billing outage never takes down live databases. Fail closed for provisioning — hold new capacity until real-time data returns, because that is the only path that can create unbounded, unattributed spend during the outage.
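
The split can be expressed as a small gate. This is a sketch: `allow` and the action names "observe" and "provision" are assumptions, while the mode strings are the ones returned by degraded_enforcement in Step 4.

```python
def allow(action: str, mode: str) -> bool:
    """Fail open for observation, fail closed for provisioning."""
    if action == "observe":
        return True                 # existing workloads are never interrupted
    if action == "provision":
        return mode == "active"     # new capacity only on live, fresh data
    raise ValueError(f"unknown action: {action!r}")


decisions = {
    (action, mode): allow(action, mode)
    for action in ("observe", "provision")
    for mode in ("active", "conservative", "frozen")
}
print(decisions)
```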

Why SQLite instead of Redis or Memcached for the fallback?

SQLite has zero network dependency, which is exactly what you want when the network to a cost API is already unreliable. It survives process restarts, needs no separate service to stay healthy during an incident, and is fast enough for per-cluster reads. Reach for Redis only when many processes must share one warm cache and you can guarantee its own availability.

How do I reconcile the provisional figures once the API recovers?

Every degraded read is provisional. When the breaker closes, re-pull the affected windows and write them through the same idempotent path used during normal operation; because ledger writes are keyed on a deterministic reconciliation key, the real figures overwrite the placeholders without double-counting. Backfilling those windows is the job of batch historical aggregation.
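
The overwrite-without-double-counting property can be illustrated with a plain dictionary standing in for the ledger. Everything here is hypothetical: the key shape (cluster ID plus window start), the `record` helper, and the figures are assumptions for the sketch, not the ledger's actual schema.

```python
Ledger = dict[tuple[str, int], dict]


def reconciliation_key(cluster_id: str, window_start: int) -> tuple[str, int]:
    """Deterministic key: the same cluster and window always map to one entry."""
    return (cluster_id, window_start)


def record(ledger: Ledger, cluster_id: str, window_start: int,
           cost: float, provisional: bool) -> None:
    """Upsert by key, so a re-pull overwrites a placeholder instead of adding a row."""
    ledger[reconciliation_key(cluster_id, window_start)] = {
        "cost": cost,
        "provisional": provisional,
    }


ledger: Ledger = {}
# During the outage: a conservative placeholder for one hourly window.
record(ledger, "db-1", 1751018400, cost=20.0, provisional=True)
# After recovery: the re-pulled figure for the same window, replayed twice.
record(ledger, "db-1", 1751018400, cost=15.5, provisional=False)
record(ledger, "db-1", 1751018400, cost=15.5, provisional=False)

total = sum(entry["cost"] for entry in ledger.values())
print(len(ledger), total)   # 1 15.5
```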

What should I monitor to know degradation is working?

Track three signals: circuit breaker state transitions (CLOSED → OPEN → HALF_OPEN), the fallback hit rate (share of quota checks served from cache), and cache staleness (seconds since the last successful sync). Route them through structured telemetry so an on-call engineer can see at a glance whether enforcement is running on live or degraded data, and stream them live as described in real-time metric streaming setup.

Related

Implementing retry logic for failed metric pulls: jittered exponential backoff that recovers from a blip before degradation is needed.
Fallback routing for cost APIs: endpoint-level failover to a secondary billing source, complementary to a local cache.
Batch Processing for Historical Metrics: reconciling the windows held out while the API was down.
Error Handling in Cost Pipelines: the parent topic covering retries, circuit breakers, and idempotent writes end to end.

Python’s native asyncio event loop provides the primitives for scheduling background cache-refresh tasks without blocking enforcement; see the official asyncio documentation. Aligning degradation trade-offs with the FinOps Framework keeps the cost-visibility compromises documented and approved by finance stakeholders.
