InferShrink – Cut LLM API costs 10x with automatic model routing

Author: doronp

by doronp · Feb 25, 2026 · 2 points · 0 comments


AI Analysis

●●● Banger · Solve My Problem · Ship It

Three-line wrapper cuts LLM costs 80%+ via prompt classification and same-provider routing.

Strengths

• Sub-millisecond classification overhead adds negligible latency to existing workflows; the three-line integration (wrap the client, no other code changes) keeps adoption friction low
• Same-provider routing avoids surprise cross-provider failures; costs drop measurably on mixed-complexity workloads without user intervention
• The optional FAISS retrieval and LLMLingua compression stages work together; 539 tests plus Semgrep/Trivy scanning signal production intent

Weaknesses

• Classification heuristics are opaque; there are no published benchmarks showing when routing fails, and no accuracy metrics for complexity prediction
• Relies on optional third-party tools (LLMLingua, FAISS) that add dependencies; gains are marginal on simple workloads or under tight latency budgets

Post Description

I built this to solve my own problem — paying for GPT-4/Claude on prompts that Gemini Flash handles fine.

InferShrink wraps your existing OpenAI/Anthropic/Google client in 3 lines. It classifies prompt complexity and routes to the cheapest model that can handle it. Same provider, no surprise switches.
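To make the idea of "classify, then route to the cheapest capable model on the same provider" concrete, here is a minimal sketch. It is not InferShrink's actual API or heuristics: the tier table, the placeholder model names, the `classify` and `route` functions, and the keyword signals are all assumptions made for illustration.

```python
import re

# Hypothetical model tiers, cheapest first. Names are placeholders, not real model IDs.
TIERS = {
    "openai": ["openai-small", "openai-mid", "openai-large"],
    "anthropic": ["anthropic-small", "anthropic-mid", "anthropic-large"],
    "google": ["google-small", "google-mid", "google-large"],
}

# Assumed signals of a harder prompt; a real classifier would be tuned on data.
HARD_HINTS = re.compile(
    r"\b(prove|derive|step by step|analy[sz]e|refactor|debug)\b", re.IGNORECASE
)


def classify(prompt: str) -> int:
    """Return a complexity tier: 0 (simple), 1 (moderate), 2 (complex)."""
    score = 0
    if len(prompt.split()) > 200:  # long prompts count as harder
        score += 1
    if HARD_HINTS.search(prompt):  # reasoning or code-work keywords
        score += 1
    return score


def route(provider: str, prompt: str) -> str:
    """Pick the cheapest model for the tier, never leaving the caller's provider."""
    models = TIERS[provider]  # unknown provider raises KeyError instead of switching
    return models[min(classify(prompt), len(models) - 1)]
```

Because the lookup is keyed by provider, an OpenAI caller can only ever be routed to an OpenAI tier, which is the "no surprise switches" property described above.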

The pipeline: classify → compress (LLMLingua, optional) → retrieve (FAISS, optional) → route → track. Combining all stages yields a 10x+ cost reduction on mixed workloads.
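The stage order can be sketched as a small function with the two optional stages passed in as plain callables. Everything here is an assumption for illustration: `run_pipeline`, the whitespace token count, the integer prices, and the stand-in `compress` used in the usage note are hypothetical and do not call LLMLingua, FAISS, or InferShrink itself.

```python
from typing import Callable, Optional

# Hypothetical prices in micro-dollars per token; real prices differ and change.
PRICES = {"small": 1, "large": 10}


def count_tokens(text: str) -> int:
    """Crude token proxy: whitespace-separated words."""
    return len(text.split())


def run_pipeline(
    prompt: str,
    classify: Callable[[str], int],
    compress: Optional[Callable[[str], str]] = None,
    retrieve: Optional[Callable[[str], str]] = None,
    baseline: str = "large",
) -> dict:
    tier = classify(prompt)                           # 1. classify
    text = compress(prompt) if compress else prompt   # 2. compress (optional)
    if retrieve:                                      # 3. retrieve (optional)
        text = retrieve(text)
    model = "small" if tier == 0 else "large"         # 4. route
    # 5. track: compare against sending the original prompt to the baseline model
    baseline_cost = count_tokens(prompt) * PRICES[baseline]
    actual_cost = count_tokens(text) * PRICES[model]
    return {"model": model, "baseline_cost": baseline_cost, "actual_cost": actual_cost}
```

With these made-up prices, a 100-word simple prompt that a stand-in compressor halves costs 50 units instead of 1000: `run_pipeline("hello " * 100, classify=lambda p: 0, compress=lambda t: " ".join(t.split()[:50]))`. The ratio depends entirely on the assumed price gap and compression rate, so it illustrates the arithmetic rather than reproducing the reported savings.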

Key design decisions:

• Same-provider routing only. If you use OpenAI, it stays on OpenAI. No cross-provider surprises.
• Sub-millisecond classification overhead
• Optional FAISS retrieval + LLMLingua compression for RAG pipelines
• 539 tests, Semgrep + Trivy scanned

pip install infershrink

Blog post with the reasoning: https://musashimiyamoto1-cloud.github.io/infershrink-site/bl...

Happy to answer questions about the routing heuristics or compression tradeoffs.

Similar Projects

Developer Tools · ●● Solid

NadirClaw, LLM router that cuts costs by routing prompts right

If you're burning through Claude/OpenAI credits, this is a low-friction stopgap: it classifies prompts in ~10ms and routes trivial tasks to cheaper/local models while reserving premium APIs for complex work. The agentic-task detection, reasoning-aware routing, session pinning and context-window fallback are practical touches that avoid mid-thread model bouncing and 429 failures. It isn't reinventing the space (OpenRouter and others exist), but it's focused on real-world cost tradeoffs and drop-in compatibility.
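Session pinning, as mentioned above, can be pictured as caching the first routing decision per conversation. The sketch below is an assumption about the general technique, not NadirClaw's implementation; `PinnedRouter` and `choose_model` are hypothetical names.

```python
from typing import Callable, Dict


class PinnedRouter:
    """Route the first prompt of a session, then reuse that model for the whole thread."""

    def __init__(self, choose_model: Callable[[str], str]) -> None:
        self._choose = choose_model      # any prompt -> model-name function
        self._pins: Dict[str, str] = {}  # session_id -> pinned model

    def route(self, session_id: str, prompt: str) -> str:
        if session_id not in self._pins:
            # Only the first prompt in a session triggers a routing decision.
            self._pins[session_id] = self._choose(prompt)
        return self._pins[session_id]
```

Pinning trades some cost savings for consistency: a thread that starts simple stays on the cheaper model even if later prompts get harder, so a real router would likely need an escalation rule on top.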

Solve My Problem · Niche Gem

amirdor

113mo ago

Developer Tools · ●● Solid

API router that picks the cheapest model that fits each query

Komilion turns model sprawl into a cost-control layer you drop in by swapping a base_url: requests are classified (regex fast path + tiny LLM) and matched to ~400 models so cheap models handle the easy stuff and premium models only run when needed. The ~60% zero‑call regex fast path and benchmark-driven routing (LMArena) are clever, pragmatic moves; the hard questions left are model-quality drift across providers and how routing decisions map to real-world user satisfaction.
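A regex fast path of the kind described here can be sketched as a list of patterns tried before any model-based classifier runs. The patterns, tier names, and `route` function below are illustrative assumptions, not Komilion's rules.

```python
import re
from typing import Callable, Tuple

# Hypothetical fast-path rules: (pattern, tier). Matching costs no LLM call.
FAST_PATH = [
    (re.compile(r"^\s*(hi|hello|thanks|thank you)\b", re.IGNORECASE), "cheap"),
    (re.compile(r"\b(translate|summari[sz]e)\b", re.IGNORECASE), "cheap"),
]


def route(prompt: str, slow_classifier: Callable[[str], str]) -> Tuple[str, bool]:
    """Return (tier, used_fast_path). Fall back to the classifier on no match."""
    for pattern, tier in FAST_PATH:
        if pattern.search(prompt):
            return tier, True
    return slow_classifier(prompt), False
```

The second element of the return value makes it easy to measure what share of traffic the fast path absorbs, which is the figure the description quotes as roughly 60%.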

Solve My Problem · Wizardry · Slick

robinbanner

113mo ago

Gaming · ●●● Banger

Dwarf.land – autonomous dwarf civilization SIM with AI model routing

Four-tier AI model routing with $8.50/hr budget cap is genuinely clever engineering.

Rabbit Hole · Wizardry · Big Brain

kulesh

453mo ago

Infrastructure · ● Mid

Nexus Gateway – Reduce LLM API Costs Using Semantic Caching

Semantic caching for LLM APIs already exists (Anthropic prompt caching, LangChain, Miniplex, vLLM); gateway routing is table stakes.
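For readers unfamiliar with the term, semantic caching means returning a stored response when a new prompt is close enough in meaning to an earlier one. The sketch below shows the general technique with a toy bag-of-words embedding and cosine similarity. It is an assumption for illustration, not Nexus Gateway's design: a real system would use a learned embedding model and a vector index, and `SemanticCache`, `embed`, and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        """Return the cached response of the most similar prompt, if similar enough."""
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: caller should query the LLM and then put()
```

The threshold controls the trade-off: set too low, unrelated prompts get stale answers; set too high, the cache rarely hits.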

Ship It · Solve My Problem

Sunnyanand_dev

213mo ago

Infrastructure · ●● Solid

ARouter – drop-in OpenAI/Anthropic proxy that cuts cost and fails over

OpenAI SDK calls Claude through one proxy with conformance-tested wire translation.

Ship It · Big Brain

sricola

216d ago

AI/ML · ●● Solid

LLM-use – cost-effective LLM orchestrator for agents

Smart local‑first routing that only escalates to expensive cloud planners when necessary is the standout idea — combined with per‑run cost accounting and full Ollama offline support it solves a real operational itch. The repo is a pragmatic, CLI/TUI-focused toolkit (scraping + cache, MCP server mode) that feels useful for teams wanting a no‑friction orchestrator, but it’s playing in a crowded space of agent frameworks so the novelty is incremental rather than revolutionary.

Niche Gem · Big Brain

justvugg

213mo ago