Detecting API degradation before thresholds are crossed

Author: AnchorFlow

by AnchorFlow · Apr 21, 2026 · 4 points · 0 comments


AI Analysis

●● Solid · Big Brain · Niche Gem

Rate-of-change detection catches degradation before P95 thresholds fire.

Strengths

• Dynamic baseline learning adapts to traffic patterns without manual tuning
• Redis Streams storage enables full incident replay and debugging
• 259 passing tests, with a memory mode for Redis-free deployments

Weaknesses

• FastAPI-only support limits the audience, since most enterprises use multiple frameworks
• Anomaly detection already exists in Datadog, New Relic, and Honeycomb

Post Description

While building a payment orchestration system, I ran into a problem: most monitoring tools alert only once a threshold has already been breached (e.g. P95 > 1000ms). In practice, systems often degrade before hitting those limits, especially under bursty traffic. So I experimented with detecting degradation before thresholds are crossed, directly inside a FastAPI app. I built a small middleware that:

• Tracks P95 latency per route template (e.g. /users/{id})
• Learns a baseline dynamically from recent traffic
• Detects spikes using rate-of-change (not just static thresholds)
• Computes a 0–100 health score with trend direction (improving / stable / degrading)
• Stores events in Redis Streams for replay and debugging
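To make the list above concrete, here is a minimal sketch of per-route P95 tracking, baseline learning, and a 0–100 score with a trend label. It is not taken from fastapi-alertengine: the class name `RouteHealth`, the parameters `window`, `alpha`, and `tolerance`, the exponentially weighted baseline, and the linear scoring formula are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


class RouteHealth:
    """Hypothetical tracker; names, defaults, and formulas are assumptions."""

    def __init__(self, window=100, alpha=0.2, tolerance=0.05):
        self.alpha = alpha          # weight of each new reading in the baseline
        self.tolerance = tolerance  # relative change still counted as "stable"
        self.samples = defaultdict(lambda: deque(maxlen=window))
        self.baseline = {}          # route template -> learned P95 baseline
        self.last_p95 = {}          # route template -> previous P95 reading

    def observe(self, route, latency_ms):
        """Record one request latency for a route template such as /users/{id}."""
        self.samples[route].append(latency_ms)

    def evaluate(self, route):
        """Return (p95, score, trend); requires at least one observed sample."""
        current = p95(self.samples[route])
        base = self.baseline.get(route, current)
        previous = self.last_p95.get(route, current)

        # Score: 100 at or below the baseline, falling linearly to 0 at 2x baseline.
        ratio = current / base if base > 0 else 1.0
        score = round(100 * max(0.0, min(1.0, 2.0 - ratio)))

        # Trend: relative change in P95 since the previous evaluation.
        change = (current - previous) / previous if previous > 0 else 0.0
        if change > self.tolerance:
            trend = "degrading"
        elif change < -self.tolerance:
            trend = "improving"
        else:
            trend = "stable"

        # Fold the new reading into the baseline (exponentially weighted average).
        self.baseline[route] = base + self.alpha * (current - base)
        self.last_p95[route] = current
        return current, score, trend
```

Calling `observe()` on every request and `evaluate()` on a timer keeps the sorting cost off the request path. Because the baseline keeps absorbing new readings, a sustained slowdown eventually becomes the new normal; whether that is desirable depends on the workload.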

One interesting result: in synthetic load tests (a gradual latency ramp from ~200ms to ~1200ms over 60 seconds, with a P95 warning threshold at 1000ms), rate-of-change detection consistently surfaced degradation slightly before the static threshold alert fired. The lead time is small, but it was often enough to notice system stress before an alert threshold was crossed.
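The comparison can be sketched on an idealized version of that ramp. The code below is an illustration under stated assumptions, not the project's detector or its test harness: the P95 is modeled as a noise-free linear ramp sampled once per second, and the rate-of-change rule (`lookback=5`, `max_growth=0.25`) is a hypothetical choice.

```python
def ramp_p95(t, start_ms=200.0, end_ms=1200.0, duration_s=60):
    """Idealized P95 at second t on a linear ramp (no noise, no window lag)."""
    return start_ms + (end_ms - start_ms) * t / duration_s


def first_static_alert(series, threshold_ms=1000.0):
    """First index where the P95 exceeds the static threshold, else None."""
    for t, value in enumerate(series):
        if value > threshold_ms:
            return t
    return None


def first_rate_alert(series, lookback=5, max_growth=0.25):
    """First index where the P95 grew by more than max_growth over lookback samples."""
    for t in range(lookback, len(series)):
        earlier = series[t - lookback]
        if earlier > 0 and (series[t] - earlier) / earlier > max_growth:
            return t
    return None


series = [ramp_p95(t) for t in range(61)]  # one reading per second, t = 0..60
static_t = first_static_alert(series)
rate_t = first_rate_alert(series)
print(f"static threshold fired at t={static_t}s, rate-of-change at t={rate_t}s")
```

On this clean ramp the relative-growth rule trips within the first few seconds, far earlier than the static threshold. That gap overstates what to expect in practice: real latency is noisy, a rolling P95 lags the underlying change, and a detector tuned to avoid false positives needs smoothing, all of which shrink the lead toward the small window described above.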

Design constraints:

• Near-zero overhead on the request path (async, fire-and-forget writes)
• Must fail silently if Redis is unavailable
• No external monitoring stack required (runs in-app)
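The first two constraints can be sketched with plain asyncio. This is an assumption-laden illustration rather than the package's code: `emit`, `_safe_write`, and the `write` callable are hypothetical names, and `write` stands in for whatever coroutine appends an event to the store (for example a Redis Streams append).

```python
import asyncio
import logging

logger = logging.getLogger("metrics")
_pending = set()  # strong references so scheduled tasks are not garbage-collected


async def _safe_write(write, event):
    """Await the write and swallow any failure so requests are never affected."""
    try:
        await write(event)
    except Exception:
        logger.debug("metrics write dropped", exc_info=True)


def emit(write, event):
    """Schedule the write without awaiting it; call from inside a running loop."""
    task = asyncio.create_task(_safe_write(write, event))
    _pending.add(task)
    task.add_done_callback(_pending.discard)
    return task


async def demo():
    stored = []

    async def good_write(event):
        stored.append(event)

    async def broken_write(event):
        raise ConnectionError("store unavailable")  # simulates an outage

    tasks = [
        emit(good_write, {"route": "/users/{id}", "ms": 212.0}),
        emit(broken_write, {"route": "/users/{id}", "ms": 230.0}),
    ]
    await asyncio.gather(*tasks)  # only the demo waits; a request handler would not
    return stored


print(asyncio.run(demo()))
```

The request handler returns as soon as `emit()` has scheduled the task, so a slow or unavailable store costs the request only the time needed to create a task. The trade-off is that dropped events are invisible unless the debug log is enabled.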

Example usage (Python):

    app.add_middleware(RequestMetricsMiddleware, alert_engine=engine)
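For readers curious what a middleware of this kind does per request, here is a hedged sketch written as a plain ASGI callable. It is not the package's `RequestMetricsMiddleware`: the class `LatencyMiddleware` and its `record` callback are hypothetical, and reading the route template from `scope["route"]` is an assumption about what the router leaves in the scope, with the raw path as a fallback.

```python
import asyncio
import time


class LatencyMiddleware:
    """Hypothetical pure-ASGI middleware that times each HTTP request."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # callable(route_key: str, elapsed_ms: float)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            # Assumption: the router stored the matched route object in
            # scope["route"]; otherwise fall back to the raw request path.
            route = scope.get("route")
            key = getattr(route, "path", None) or scope.get("path", "")
            try:
                self.record(key, elapsed_ms)
            except Exception:
                pass  # metrics must never break the request


async def demo():
    seen = []

    async def fake_app(scope, receive, send):
        await send({"type": "http.response.start", "status": 200, "headers": []})
        await send({"type": "http.response.body", "body": b"ok"})

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        pass  # discard the response in this demo

    wrapped = LatencyMiddleware(fake_app, lambda key, ms: seen.append((key, ms)))
    await wrapped({"type": "http", "path": "/users/42"}, receive, send)
    return seen


print(asyncio.run(demo()))
```

In a FastAPI app a class with this shape would be registered the same way as in the example usage, e.g. `app.add_middleware(LatencyMiddleware, record=some_callback)`, where `some_callback` is whatever function collects the samples.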

Context: this is part of a larger system I'm building that integrates cloud services with mobile money APIs (EcoCash, etc.), where partial failures and latency spikes are common. It is still early and hasn't been tested under real production traffic yet.

Curious how others are handling early degradation detection in FastAPI or similar systems.

Repo: https://github.com/Tandem-Media/fastapi-alertengine
PyPI: https://pypi.org/project/fastapi-alertengine/