GitHub Repository

🪂 Latency-aware model cascade for agentic LLM workflows. Auto-switches to a faster model when yours is slow.

4 stars · Python

glide – LLM cascade proxy, auto-switches models before timeout

by phanisaimuni116·Mar 7, 2026·1 point·1 comment

AI Analysis

●● Solid · Solve My Problem · Niche Gem

Model fallback driven by time to first token (TTFT): avoids timeouts by automatically hedging between Opus, Sonnet, and Haiku.

Strengths
  • Proactive p95 routing skips slow models without waiting, reducing perceived latency.
  • Request hedging fires parallel models and streams whichever answers first, canceling losers.
  • Zero-config setup: single shell command starts proxy, integrates via base URL env var.
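The request hedging described above can be sketched roughly as follows. This is a minimal illustration of the general pattern, not glide's actual code: `fake_model`, `hedge`, and `run_demo` are hypothetical names, and the model calls are simulated with sleeps rather than real API requests.

```python
import asyncio


async def fake_model(name, ttft_s):
    # Hypothetical stand-in for a streaming model call; the sleep
    # simulates the time to first token (TTFT).
    await asyncio.sleep(ttft_s)
    return f"{name}: first token"


async def hedge(calls):
    # Start every candidate at once and wait until the first one finishes.
    tasks = [asyncio.create_task(c) for c in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Cancel the losers and wait for the cancellations to settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


def run_demo():
    # The slower candidate is cancelled once the faster one answers.
    return asyncio.run(
        hedge([fake_model("slow-model", 0.5), fake_model("fast-model", 0.01)])
    )
```

A real proxy would stream tokens from the winner instead of returning a single string, and would likely start the hedge only after a delay to limit duplicate spend.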
Weaknesses
  • Limited to Anthropic models plus local Ollama; no support for OpenAI, Gemini, or Grok.
  • Routing logic depends on historical TTFT percentiles, so it may thrash between models when load is bursty or unpredictable.
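To make the percentile-based routing concrete, here is a small sketch under stated assumptions: a sliding window of recent TTFT samples per model, a nearest-rank p95, and a cascade ordered from most to least preferred. `TTFTTracker`, `pick_model`, the window size, and the model names are all hypothetical and not taken from the project.

```python
from collections import deque


class TTFTTracker:
    """Keeps a sliding window of recent TTFT samples (seconds) for one model."""

    def __init__(self, window=100):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically

    def record(self, ttft_s):
        self.samples.append(ttft_s)

    def p95(self):
        if not self.samples:
            return 0.0  # assumption: a model with no history is treated as fast
        ordered = sorted(self.samples)
        # Nearest-rank percentile, using integer math for ceil(0.95 * n).
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]


def pick_model(cascade, trackers, budget_s):
    # Walk the cascade in preference order and take the first model whose
    # recent p95 TTFT fits the latency budget; otherwise use the last one.
    for name in cascade:
        if trackers[name].p95() <= budget_s:
            return name
    return cascade[-1]
```

Because the decision rests on a fixed window of past samples, a short burst of slow responses can flip the choice back and forth, which is the thrashing risk noted above. A longer window or some hysteresis around the budget would dampen that, at the cost of slower reaction.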
Target Audience

Developers building agentic LLM workflows, AI engineers managing inference latency.

Similar To

LiteLLM · Anthropic batch API · request hedging patterns in Envoy/Thrift

Similar Projects

Developer Tools · ●● Solid

Preventing runaway LLM agents (enforcement layer)

VERONICA puts an enforcement shim between your agent and the model so you can halt costly spirals before a request hits the provider. It natively exposes hard budget enforcement, circuit breakers, retry containment, and degradation levels. The README and the runnable runaway-loop demo make the failure mode concrete, and the API (BudgetEnforcer, RuntimeContext, BudgetExceeded) is small and practical. I'd like to see richer observability and adapter docs for common agent frameworks, but as an enforcement-first primitive this is a clever, useful tool.

Niche Gem · Big Brain
amabito
124mo ago