InferShrink – Cut LLM API costs 10x with automatic model routing

by doronp·Feb 25, 2026·2 points·0 comments

AI Analysis

●●● Banger · Solve My Problem · Ship It

Three-line wrapper cuts LLM costs 80%+ via prompt classification and same-provider routing.

Strengths
  • Sub-millisecond classification overhead means a negligible latency tax on existing workflows; the three-line integration (wrap the client, no other code changes) lowers adoption friction dramatically
  • Same-provider routing eliminates surprise API failures; costs drop measurably on mixed-complexity workloads without user intervention
  • Optional FAISS retrieval + LLMLingua compression stack work together; 539 tests + Semgrep/Trivy scanning signal production intent
Weaknesses
  • Classification heuristics opaque; no published benchmarks showing when routing fails or accuracy metrics on complexity prediction
  • Relies on optional third-party tools (LLMLingua, FAISS) that add dependencies; marginal gains on simple workloads or tight latency budgets
Target Audience

Cost-conscious teams using OpenAI or Anthropic APIs who want automatic spend reduction without changing their code.

Similar To

Ollama · llm-cost-optimizer · prompt-engineering libraries

Post Description

I built this to solve my own problem — paying for GPT-4/Claude on prompts that Gemini Flash handles fine.

InferShrink wraps your existing OpenAI/Anthropic/Google client in 3 lines. It classifies prompt complexity and routes to the cheapest model that can handle it. Same provider, no surprise switches.
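To make the idea concrete, here is a minimal sketch of complexity classification with same-provider routing. It is not InferShrink's actual code or API: the tier table, the model names, the keyword list, and the functions `classify_complexity` and `route` are all hypothetical, and the heuristic is deliberately crude.

```python
import re

# Hypothetical tier table: placeholder model names, ordered cheapest first.
MODEL_TIERS = {
    "openai": ["openai-small", "openai-mid", "openai-large"],
    "anthropic": ["anthropic-small", "anthropic-mid", "anthropic-large"],
}

# Assumed keyword hints that a prompt needs more reasoning capacity.
REASONING_HINTS = re.compile(
    r"\b(prove|derive|refactor|architecture|step[- ]by[- ]step|debug)\b",
    re.IGNORECASE,
)


def classify_complexity(prompt: str) -> int:
    """Return a rough complexity score: 0 (simple), 1, or 2 (complex)."""
    score = 0
    if len(prompt.split()) > 200:  # long prompts count as harder
        score += 1
    if REASONING_HINTS.search(prompt) or "Traceback" in prompt:
        score += 1
    return score


def route(provider: str, prompt: str) -> str:
    """Pick the cheapest tier for the score, never leaving the provider."""
    tiers = MODEL_TIERS[provider]
    return tiers[min(classify_complexity(prompt), len(tiers) - 1)]
```

Because `route` only indexes into the caller's own provider list, a request that starts on one provider cannot end up on another, which is the "no surprise switches" property described above.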

The pipeline: classify → compress (LLMLingua, optional) → retrieve (FAISS, optional) → route → track. When all stages are combined, the result is a 10x+ cost reduction on mixed workloads.
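The stage order can be sketched as a chain of small functions where the optional stages are simply skipped when not enabled. Everything below is an illustrative assumption rather than InferShrink's implementation: the `Request` type, the stage functions, the model names, and the prices are made up, the compressor is a whitespace-collapsing stand-in for LLMLingua, and the retriever is a plain callable standing in for a FAISS lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical prices per 1K tokens, for illustration only.
PRICE_PER_1K = {"cheap-model": 0.1, "premium-model": 1.0}


@dataclass
class Request:
    prompt: str
    complexity: int = 0
    model: str = ""
    cost: float = 0.0
    log: List[str] = field(default_factory=list)


def classify(req: Request) -> Request:
    # Toy heuristic: long prompts are treated as complex.
    req.complexity = 1 if len(req.prompt.split()) > 50 else 0
    req.log.append("classify")
    return req


def compress(req: Request) -> Request:
    # Stand-in for a real compressor: collapse redundant whitespace.
    req.prompt = " ".join(req.prompt.split())
    req.log.append("compress")
    return req


def make_retrieve(retriever: Callable[[str], List[str]]) -> Callable[[Request], Request]:
    def retrieve(req: Request) -> Request:
        # Prepend whatever passages the retriever returns as context.
        passages = retriever(req.prompt)
        req.prompt = "\n".join(passages + [req.prompt])
        req.log.append("retrieve")
        return req
    return retrieve


def route(req: Request) -> Request:
    req.model = ["cheap-model", "premium-model"][req.complexity]
    req.log.append("route")
    return req


def track(req: Request) -> Request:
    # Word count is used as a crude proxy for token count.
    req.cost = len(req.prompt.split()) / 1000 * PRICE_PER_1K[req.model]
    req.log.append("track")
    return req


def run_pipeline(
    prompt: str,
    use_compression: bool = False,
    retriever: Optional[Callable[[str], List[str]]] = None,
) -> Request:
    stages = [classify]
    if use_compression:
        stages.append(compress)
    if retriever is not None:
        stages.append(make_retrieve(retriever))
    stages += [route, track]
    req = Request(prompt)
    for stage in stages:
        req = stage(req)
    return req
```

Calling `run_pipeline(prompt)` with no options runs only classify, route, and track; passing `use_compression=True` or a `retriever` adds the optional stages in the order listed above.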

Key design decisions:

• Same-provider routing only. If you use OpenAI, it stays on OpenAI. No cross-provider surprises.
• Sub-millisecond classification overhead
• Optional FAISS retrieval + LLMLingua compression for RAG pipelines
• 539 tests, Semgrep + Trivy scanned

pip install infershrink

Blog post with the reasoning: https://musashimiyamoto1-cloud.github.io/infershrink-site/bl...

Happy to answer questions about the routing heuristics or compression tradeoffs.

Similar Projects

Developer Tools · ●● Solid

NadirClaw, LLM router that cuts costs by routing prompts right

If you're burning through Claude/OpenAI credits, this is a low-friction stopgap: it classifies prompts in ~10ms and routes trivial tasks to cheaper/local models while reserving premium APIs for complex work. The agentic-task detection, reasoning-aware routing, session pinning and context-window fallback are practical touches that avoid mid-thread model bouncing and 429 failures. It isn't reinventing the space (OpenRouter and others exist), but it's focused on real-world cost tradeoffs and drop-in compatibility.

Solve My Problem · Niche Gem
amirdor
113mo ago
Developer Tools · ●● Solid

API router that picks the cheapest model that fits each query

Komilion turns model sprawl into a cost-control layer you drop in by swapping a base_url: requests are classified (regex fast path + tiny LLM) and matched to ~400 models so cheap models handle the easy stuff and premium models only run when needed. The ~60% zero‑call regex fast path and benchmark-driven routing (LMArena) are clever, pragmatic moves; the hard questions left are model-quality drift across providers and how routing decisions map to real-world user satisfaction.

Solve My Problem · Wizardry · Slick
robinbanner
113mo ago
Gaming · ●●● Banger

Dwarf.land – autonomous dwarf civilization SIM with AI model routing

Four-tier AI model routing with $8.50/hr budget cap is genuinely clever engineering.

Rabbit Hole · Wizardry · Big Brain
kulesh
453mo ago
AI/ML · ●● Solid

LLM-use – cost-effective LLM orchestrator for agents

Smart local‑first routing that only escalates to expensive cloud planners when necessary is the standout idea — combined with per‑run cost accounting and full Ollama offline support it solves a real operational itch. The repo is a pragmatic, CLI/TUI-focused toolkit (scraping + cache, MCP server mode) that feels useful for teams wanting a no‑friction orchestrator, but it’s playing in a crowded space of agent frameworks so the novelty is incremental rather than revolutionary.

Niche Gem · Big Brain
justvugg
213mo ago