Extracting pg_stat_activity for cost tracking

This page walks through the exact Python and SQL needed to turn PostgreSQL’s pg_stat_activity view into per-tenant cost records, so a shared instance’s blended compute bill becomes an auditable charge against the backend that actually consumed it.

pg_stat_activity is the authoritative, session-granular signal for what a PostgreSQL instance is doing right now: which backends are active, what they are waiting on, and how long each statement has been running. Provider invoices report the instance-hour total only hours after the fact; this view lets you disaggregate that total live, at the moment spend is being committed. Feeding it into your Metric Extraction & Aggregation Pipelines is what makes showback, chargeback, and pre-emptive quota enforcement possible instead of reactive spreadsheet reconciliation. The catch is that the view is a point-in-time snapshot and observer-sensitive: query it naively and the measurement itself becomes a cost and a stability risk. This page implements the disciplined extraction pattern that the System View Querying Patterns tier is built on.

Prerequisites

Before running the extractor, confirm the following are in place.

Database permissions: a non-superuser sees full detail only for its own sessions; for everyone else's rows, columns such as state and wait_event come back NULL and query reads <insufficient privilege>. For cost attribution you need to read every backend, so grant the built-in pg_monitor role (PostgreSQL 10+) rather than superuser:
```
-- Least-privilege monitoring identity for the extractor
CREATE ROLE finops_metrics LOGIN PASSWORD 'rotate-me-via-secrets-manager';
GRANT pg_monitor TO finops_metrics;  -- includes pg_read_all_stats

-- Optional: a per-database default cost-center tag the query can read back
ALTER DATABASE analytics SET app.cost_center = 'team-analytics';
```
pg_monitor bundles pg_read_all_stats, which unlocks full visibility into other sessions’ state and query text without the blast radius of superuser. On Amazon RDS and Aurora, where true superuser is not available, run the GRANT pg_monitor TO ... statement as the master user or another member of rds_superuser. Credential handling for this role should follow the same rotation discipline described in securing access to cost data.
Python: 3.10 or newer (the code uses modern asyncio and typing syntax).
Libraries: install the async driver and the synchronous fallback driver.
```
pip install "asyncpg>=0.29" "psycopg2-binary>=2.9"
```

Step-by-Step Implementation

The extractor provisions a read-only role, runs a filtered active-backend query that computes duration server-side, pulls the result over asyncpg with a synchronous psycopg2 fallback and exponential backoff, validates each record, then attributes the active seconds to a cost center. Build it in four steps.

Step 1 — Write the low-overhead extraction query

Querying pg_stat_activity in production demands non-blocking, low-payload patterns. A bare SELECT * drags idle connections, background workers, and the polling connection itself across the wire on every cycle. Filter aggressively by backend_type, drop idle sessions, exclude your own pid, and compute elapsed time server-side so no timezone drift creeps in during transit.

```
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    backend_type,
    wait_event_type,
    wait_event,
    EXTRACT(EPOCH FROM (now() - query_start)) AS active_seconds,
    current_setting('app.cost_center', true)  AS cost_center_tag
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND state <> 'idle'
  AND pid <> pg_backend_pid()      -- never bill the extractor for observing
  AND query_start IS NOT NULL;
```

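The true second argument to current_setting is the missing_ok flag: it returns NULL instead of raising when the app.cost_center GUC was never set, so an untagged poll degrades to a null tag rather than crashing. Note that current_setting is evaluated in the session that runs the query, so every row in one poll carries the value in effect for the extractor's own connection (for example the per-database default from the prerequisites), not a value set inside the observed backend; pg_stat_activity does not expose other sessions' settings. To separate tenants that share a database, key on usename or application_name, which the query already returns. Server-side EXTRACT(EPOCH ...) keeps the arithmetic on the instance’s clock, and excluding pg_backend_pid() stops the measurement from inflating its own bill.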

Step 2 — Extract asynchronously with a synchronous fallback

Production polling must survive transient network failures, pool exhaustion, and failovers without losing a billing window. The extractor below uses asyncpg for high-throughput I/O, retries with exponential backoff, and falls back to a synchronous psycopg2 query when the async driver is blocked or degraded. This is the same resilience posture covered under error handling in cost pipelines, applied to a live system view.

```
import asyncio
import contextlib
import logging
import os
from typing import Any

import asyncpg
import psycopg2
import psycopg2.extras

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("pg_cost_extractor")

EXTRACTION_QUERY = """
SELECT
    pid, usename, application_name, client_addr, state, backend_type,
    wait_event_type, wait_event,
    EXTRACT(EPOCH FROM (now() - query_start)) AS active_seconds,
    current_setting('app.cost_center', true)  AS cost_center_tag
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND state <> 'idle'
  AND pid <> pg_backend_pid()
  AND query_start IS NOT NULL;
"""


async def fetch_activity_async(pool: asyncpg.Pool) -> list[dict[str, Any]]:
    """Run the optimized query over an asyncpg pool connection."""
    async with pool.acquire() as conn:
        rows = await conn.fetch(EXTRACTION_QUERY)
        return [dict(row) for row in rows]


def fetch_activity_sync(dsn: str) -> list[dict[str, Any]]:
    """Synchronous psycopg2 fallback for degraded async environments."""
    # psycopg2's own connection context manager only ends the transaction;
    # closing() makes sure the connection itself is released afterwards.
    with contextlib.closing(psycopg2.connect(dsn)) as conn:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute(EXTRACTION_QUERY)
            return [dict(record) for record in cur.fetchall()]


async def extract_with_retry(
    dsn: str, max_retries: int = 3, base_delay: float = 1.0
) -> list[dict[str, Any]]:
    """Async extraction with exponential backoff, then a sync driver fallback."""
    pool: asyncpg.Pool | None = None
    try:
        for attempt in range(max_retries):
            try:
                if pool is None:
                    pool = await asyncpg.create_pool(dsn, min_size=2, max_size=5, timeout=10)
                records = await fetch_activity_async(pool)
                logger.info("Async extraction succeeded: %d active backends", len(records))
                return records
            # OSError covers refused or reset sockets, which asyncpg surfaces unwrapped.
            except (asyncpg.exceptions.PostgresConnectionError, OSError, asyncio.TimeoutError) as exc:
                logger.warning("Async attempt %d/%d failed: %s", attempt + 1, max_retries, exc)
                if attempt == max_retries - 1:
                    logger.info("Falling back to synchronous psycopg2 extraction")
                    # Run the blocking driver in a worker thread so the event loop stays free.
                    return await asyncio.to_thread(fetch_activity_sync, dsn)
                await asyncio.sleep(base_delay * (2 ** attempt))  # 1s, then 2s with the defaults
        return []
    finally:
        # Release the pool's connections so the extractor does not leak backends.
        if pool is not None:
            await pool.close()
```
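To budget the retry loop against the poll interval, it can help to list the sleeps it will actually perform. The sketch below mirrors the loop above, where the final failed attempt triggers the fallback instead of a sleep; `backoff_delays` and `worst_case_wait` are hypothetical helper names, and the figures exclude connection timeouts and query time.

```python
def backoff_delays(max_retries: int = 3, base_delay: float = 1.0) -> list[float]:
    """Delays slept between attempts; no sleep follows the final attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries - 1)]


def worst_case_wait(max_retries: int = 3, base_delay: float = 1.0) -> float:
    """Total time spent sleeping if every async attempt fails."""
    return sum(backoff_delays(max_retries, base_delay))


default_delays = backoff_delays()          # [1.0, 2.0]
longer_budget = worst_case_wait(5, 0.5)    # 0.5 + 1 + 2 + 4 = 7.5
print(default_delays, longer_budget)
```

If the worst-case wait plus connection timeouts approaches the poll interval, a retried poll can overlap the next scheduled one, so it is probably worth keeping the sum well below the cadence.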

The diagram below traces this resilient flow from the async query through retry and fallback to the final cost attribution in Step 4.

Step 3 — Validate before anything reaches the ledger

A snapshot row can arrive with a null usename (background activity that slipped the filter) or a non-numeric active_seconds. Enforce the record contract at the pipeline boundary so a malformed row is dropped, not silently written as a bogus charge — the same principle formalized in schema validation for billing data.

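```
import math

REQUIRED_FIELDS = {"pid", "usename", "application_name", "state", "active_seconds"}


def validate(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only rows with the required fields and a usable active_seconds."""
    valid: list[dict[str, Any]] = []
    for record in records:
        if not REQUIRED_FIELDS.issubset(record.keys()):
            logger.warning("Schema validation failed for pid=%s", record.get("pid"))
            continue
        # A key can be present but NULL; a charge needs a backend and an owner.
        if record["pid"] is None or record["usename"] is None:
            logger.warning("Null pid or usename for pid=%s", record.get("pid"))
            continue
        try:
            seconds = float(record["active_seconds"])
        except (ValueError, TypeError):
            logger.warning("Non-numeric active_seconds for pid=%s", record.get("pid"))
            continue
        # Reject NaN, infinity, and negative durations before they become weights.
        if not math.isfinite(seconds) or seconds < 0:
            logger.warning("Out-of-range active_seconds for pid=%s", record.get("pid"))
            continue
        record["active_seconds"] = seconds
        valid.append(record)
    return valid
```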

Step 4 — Attribute active seconds to a cost center

pg_stat_activity exposes primitives, not dollars: each row is one backend’s current active duration. Because the view is a snapshot rather than a cumulative counter, cost is reconstructed by sampling at a fixed cadence and dividing the instance rate across the concurrently active backends, weighted by their active time over the interval. For a poll interval of duration $\Delta t$, the charge to cost center $c$ is:

$$ \text{cost}_{c} = \sum_{s \in c} \frac{a_s}{\sum_{j} a_j} \times \text{rate}_{\text{inst}} \times \Delta t $$

where $a_s$ is a session’s active_seconds and $\text{rate}_{\text{inst}}$ is the instance’s per-second price. The snippet folds validated rows into a per-tag charge for one interval.

```
def attribute(records: list[dict[str, Any]], rate_inst: float, interval_s: float) -> dict[str, float]:
    """Proportionally charge the instance rate to each cost center for one poll."""
    total_active = sum(r["active_seconds"] for r in records)
    if total_active == 0:
        return {}
    charges: dict[str, float] = {}
    for r in records:
        tag = r.get("cost_center_tag") or "untagged"
        share = r["active_seconds"] / total_active
        charges[tag] = charges.get(tag, 0.0) + share * rate_inst * interval_s
    return charges


async def main() -> None:
    dsn = os.getenv("PG_DSN", "postgresql://finops_metrics@localhost:5432/analytics")
    raw = await extract_with_retry(dsn)
    records = validate(raw)
    charges = attribute(records, rate_inst=0.000116, interval_s=30.0)  # ~$0.42/hr instance
    logger.info("Interval charges by cost center: %s", charges)
    # Downstream routing to Kafka, S3, or a FinOps aggregator would occur here.


if __name__ == "__main__":
    asyncio.run(main())
```
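One possible reading of "active time over the interval" is to cap each session's weight at the poll interval, so a statement that has been running for hours does not swamp the split for a 30-second window. This is an assumption layered on the formula above, not something the original extractor does, and `clamp_to_interval` is a hypothetical helper name.

```python
from typing import Any


def clamp_to_interval(records: list[dict[str, Any]], interval_s: float) -> list[dict[str, Any]]:
    """Return copies of the records with active_seconds capped at interval_s."""
    return [
        {**r, "active_seconds": min(float(r["active_seconds"]), interval_s)}
        for r in records
    ]


# Hypothetical sample rows: a two-hour report next to a short query.
rows = [
    {"pid": 101, "active_seconds": 7200.0},
    {"pid": 102, "active_seconds": 12.0},
]
clamped = clamp_to_interval(rows, interval_s=30.0)
weights = [r["active_seconds"] for r in clamped]
print(weights)  # [30.0, 12.0]
```

With raw durations pid 101 would take 7200 / 7212 of the interval charge; after clamping its share is 30 / 42. The clamped list could be passed to attribute() in place of the validated rows if that weighting suits your chargeback policy.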

Verification

Confirm the extractor produces the expected record shape before you schedule it.

Assert the record contract on a live pull. Every validated row must carry a numeric active_seconds and survive deduplication on pid.

```
records = validate(asyncio.run(extract_with_retry(os.environ["PG_DSN"])))
assert all(isinstance(r["active_seconds"], float) for r in records)
assert len({r["pid"] for r in records}) == len(records)  # one row per backend
```

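Cross-check the count against the engine. The extractor’s row count should roughly match a direct query for active client backends, minus one for the connection running the count itself. The two reads are separate snapshots, so a small difference on a busy instance is expected: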
```
SELECT count(*) FROM pg_stat_activity
WHERE backend_type = 'client backend' AND state <> 'idle';
```
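Inspect one attributed record. A healthy interval charge maps each tag to a positive float, for example {'team-analytics': 0.00261, 'untagged': 0.00087}, which is the shape that merges cleanly into the aggregation tier. With the rate and interval used in main(), the values for one interval always sum to 0.000116 × 30 = 0.00348.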
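Because attribute() splits a fixed amount across tags, one more check follows directly from the formula: the charges for a non-empty interval should sum to rate_inst × interval_s. A minimal sketch of that invariant, where `check_interval_total` is a hypothetical helper name and the tolerance is an assumption:

```python
import math


def check_interval_total(
    charges: dict[str, float],
    rate_inst: float,
    interval_s: float,
    rel_tol: float = 1e-9,
) -> bool:
    """True when per-tag charges are non-negative and sum to the interval's cost."""
    if not charges:
        return True  # no active backends in this poll, so nothing was attributed
    if any(value < 0 for value in charges.values()):
        return False
    expected = rate_inst * interval_s
    return math.isclose(sum(charges.values()), expected, rel_tol=rel_tol)


# Hypothetical interval: a 75/25 split of 0.000116 * 30 = 0.00348.
ok = check_interval_total({"team-analytics": 0.00261, "untagged": 0.00087}, 0.000116, 30.0)
print(ok)
```

A failing check usually points at rows being dropped or duplicated between validation and attribution rather than at the arithmetic itself.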

Gotchas & Edge Cases

It is a snapshot, not a running total. Unlike Oracle’s cumulative V$SESSTAT counters, each pg_stat_activity row reflects the instant you queried it. Cost must be integrated by sampling across polls — reading a row as if it were a lifetime total wildly over-charges a long-running query.
idle in transaction still holds resources. The filter drops state = 'idle', but idle in transaction sessions pin locks and an open snapshot. Decide deliberately whether to bill them; excluding them can under-attribute a leaked transaction that is genuinely costing the instance.
query_start is null for some backends. Autovacuum workers, the walwriter, and freshly spawned backends can have a null query_start; the IS NOT NULL guard prevents active_seconds from computing against nothing and emitting a bogus duration.
client_addr is null for local sockets. Unix-socket and same-host connections report a null client_addr. Never key attribution on it — use usename, application_name, or the app.cost_center GUC instead.
Polling too fast wastes cycles. The current-activity fields in this view are kept up to date rather than flushed on a timer, so a faster poll does return fresh data, but every poll scans all backend slots and competes with the workload being measured, which adds real CPU cost for little gain in billing accuracy. A 15–30 s cadence satisfies most billing windows, and the PostgreSQL monitoring statistics documentation describes how the view is maintained. For sub-second needs, push the extractor behind a real-time metric streaming setup rather than tightening the poll.
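Cost-center tags reflect the extractor's session, not the observed backend. current_setting('app.cost_center', true) returns the value in effect for the connection running the poll (applied per-database, per-role, or per-session for that connection) and NULL when none applies; it cannot read a GUC that another session set for itself. Derive per-tenant tags from usename or application_name where you need them, and bucket untagged active time into an explicit "untagged" charge so it stays visible instead of vanishing from the ledger.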
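Several of the points above come down to choosing a stable attribution key. The sketch below tries the per-row identity columns first and treats cost_center_tag as a connection-level default, since current_setting is evaluated in the session that runs the query. The mapping tables, prefixes, and the `resolve_tag` name are all hypothetical; substitute your own conventions.

```python
from typing import Any

# Hypothetical conventions: application_name prefixes and role names per team.
APP_PREFIX_TAGS = {"analytics-": "team-analytics", "billing-": "team-billing"}
ROLE_TAGS = {"etl_loader": "team-data-eng"}


def resolve_tag(record: dict[str, Any]) -> str:
    """Pick a cost-center tag from per-row identity, then the default, then 'untagged'."""
    app = record.get("application_name") or ""
    for prefix, tag in APP_PREFIX_TAGS.items():
        if app.startswith(prefix):
            return tag
    role_tag = ROLE_TAGS.get(record.get("usename") or "")
    if role_tag:
        return role_tag
    # client_addr is deliberately ignored: it is NULL for local sockets.
    return record.get("cost_center_tag") or "untagged"


sample = {"application_name": "analytics-dash", "usename": "bob", "cost_center_tag": "team-default"}
print(resolve_tag(sample))  # team-analytics
```

Writing the result back into each record's cost_center_tag before attribution would let the existing attribute() function use it unchanged.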

Frequently Asked Questions

Why not just use pg_stat_statements instead of pg_stat_activity?

They answer different questions. pg_stat_statements accumulates execution statistics per normalized query, keyed by user and database, which is ideal for finding expensive statements but tells you nothing about which sessions are running them right now. pg_stat_activity is per-live-session, carrying usename, application_name, and client_addr, the identity dimensions you need to attribute active seconds to a tenant or cost center. Many pipelines poll both: activity for attribution, statements for query-level cost modeling.

How often should I poll pg_stat_activity for cost tracking?

Match the poll interval to your billing window, not to how fast you can query. A 15–30 second cadence captures active-time proportions accurately for hourly or daily chargeback without straining the shared pool. Polling much faster adds real CPU cost on every cycle for little gain in billing accuracy, and it competes with the very workload you are trying to measure.
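A fixed cadence is easiest to hold when the next tick is scheduled from a monotonic clock rather than by sleeping a constant amount after each poll, which would let query time accumulate as drift. The following is a minimal sketch under that assumption; `poll_at_cadence` and `fake_poll` are hypothetical names, and the bounded `iterations` argument exists only to keep the demo finite.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_at_cadence(
    poll_once: Callable[[], Awaitable[None]],
    interval_s: float,
    iterations: int,
) -> int:
    """Call poll_once every interval_s seconds, measured on the loop's monotonic clock."""
    loop = asyncio.get_running_loop()
    next_tick = loop.time()
    completed = 0
    for _ in range(iterations):
        await poll_once()
        completed += 1
        next_tick += interval_s
        delay = next_tick - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        else:
            # The poll overran its slot: resynchronise instead of firing a burst.
            next_tick = loop.time()
    return completed


calls: list[int] = []


async def fake_poll() -> None:
    """Stand-in for one extract, validate, attribute cycle."""
    calls.append(len(calls))


completed = asyncio.run(poll_at_cadence(fake_poll, interval_s=0.01, iterations=3))
print(completed, calls)
```

In a real deployment poll_once would wrap the extract, validate, and attribute steps, and interval_s would match the value passed to attribute().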

Do I need superuser to read every session’s activity?

No, and you should not use it. Grant the built-in pg_monitor role, which includes pg_read_all_stats and gives the extractor full visibility into other backends’ state and query text without superuser’s blast radius. On managed platforms like RDS and Aurora, grant pg_monitor through the provider’s privileged role rather than requesting raw superuser.

Why does the extractor fall back to psycopg2 instead of just retrying asyncpg?

Because the two drivers fail independently. If the async event loop is starved, the connector library is misconfigured, or an environment blocks asyncpg’s binary protocol, retrying the same driver just repeats the failure. A synchronous psycopg2 path on the final attempt is a different code route to the same data, so a billing window survives a degradation that would otherwise drop it — the graceful-degradation discipline the pipeline applies everywhere.

How do these active-second charges feed quota enforcement?

The per-cost-center charges are additive across intervals, so summing them over a rolling window gives the running spend you compare against a policy threshold. When a tenant’s accumulated active-second cost crosses its limit, the aggregate is what triggers the alert or throttle defined in hard and soft quota boundary design. Keeping attribution idempotent per pid and interval is what stops a retried poll from double-counting and firing a false breach.
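As a rough illustration of that rolling comparison, the sketch below keeps per-interval charges in a deque, ignores a retried interval, and sums a tag's spend over the window. It simplifies idempotency to one entry per interval start and assumes intervals arrive in increasing time order; `RollingSpend` and its methods are hypothetical names, not part of any quota library.

```python
from collections import deque


class RollingSpend:
    """Rolling-window spend per cost center, idempotent per interval start."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._entries: deque[tuple[float, dict[str, float]]] = deque()
        self._seen: set[float] = set()

    def add(self, interval_start: float, charges: dict[str, float]) -> None:
        if interval_start in self._seen:
            return  # a retried poll for the same interval must not double-count
        self._seen.add(interval_start)
        self._entries.append((interval_start, charges))
        cutoff = interval_start - self.window_s
        # Evict intervals that have fallen out of the window.
        while self._entries and self._entries[0][0] <= cutoff:
            old_start, _ = self._entries.popleft()
            self._seen.discard(old_start)

    def total(self, tag: str) -> float:
        return sum(charges.get(tag, 0.0) for _, charges in self._entries)

    def breached(self, tag: str, limit: float) -> bool:
        return self.total(tag) > limit


spend = RollingSpend(window_s=60.0)
spend.add(0.0, {"team-analytics": 0.002})
spend.add(30.0, {"team-analytics": 0.002})
spend.add(30.0, {"team-analytics": 0.002})  # retry of the same interval, ignored
spend.add(60.0, {"team-analytics": 0.002})  # pushes the interval at 0.0 out of the window
print(spend.total("team-analytics"), spend.breached("team-analytics", 0.003))
```

The alert or throttle itself would still live wherever your quota policy is defined; this only produces the running figure it compares against.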

Related

Querying Oracle V$SESSION for resource usage: the sibling pattern for Oracle’s cumulative session views, where cost is a delta rather than an integrated snapshot.
System View Querying Patterns: the parent topic covering how to read engine internals into a canonical, tenant-attributed usage schema.
Schema validation for billing data: the record-contract enforcement that keeps malformed activity rows out of the ledger.
Real-time metric streaming setup: where to push this extractor when sub-second latency is required and tightening the poll is no longer sensible.
