AI Cost Firewall – OpenAI-compatible gateway with semantic caching
LLM gateway with Redis + Qdrant caching, but LiteLLM does this.

Semantic caching for LLM APIs already exists (Anthropic prompt caching, LangChain, Miniplex, vLLM); gateway routing is table stakes.
Application developers using multiple LLM providers, cost-conscious AI teams, infrastructure engineers
Anthropic prompt caching · vLLM · Helicone
I'm building Nexus Gateway, an AI gateway that helps developers reduce LLM API costs.
Problem: Many applications send repeated or semantically similar prompts to LLMs, which leads to unnecessary API calls and higher costs.
Solution: Nexus Gateway uses semantic caching to detect similar prompts and serve cached responses instead of calling the LLM again.
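To make the idea concrete, here is a minimal sketch of a semantic cache lookup. It is not Nexus Gateway's actual implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, and the names `SemanticCache`, `complete`, and `call_llm`, as well as the 0.8 threshold, are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical in-memory cache; a real gateway would use a vector database."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cutoff, needs tuning
        self.entries = []           # list of (embedding, response) pairs

    def lookup(self, prompt):
        """Return the best cached response at or above the threshold, else None."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, prompt, response):
        self.entries.append((embed(prompt), response))


def complete(prompt, cache, call_llm):
    """Serve from cache when a similar prompt was seen; otherwise call the LLM."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached, True
    response = call_llm(prompt)  # call_llm is a caller-supplied function
    cache.store(prompt, response)
    return response, False
```

With this sketch, "What is the capital of France?" followed by "what is the capital of france" results in one upstream call and one cache hit. The threshold is the main design choice: set too low, users receive answers to questions they did not ask; set too high, the cache rarely hits.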
Features:
• Semantic caching to reduce repeated API calls
• Multi-model support (OpenAI, Gemini, Llama, Anthropic)
• BYOK support
• PII protection and sovereign AI layer (in progress)
Goal: Reduce LLM costs by 40–70% while improving latency.
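As a rough sanity check on that target, the sketch below relates cache hit rate to cost savings. It rests on simplifying assumptions that are not from the post: every request pays a fixed lookup cost (embedding plus vector search), only cache misses pay the LLM cost, and all calls cost the same. The function name and the dollar figures are hypothetical.

```python
def estimated_savings(hit_rate, llm_cost_per_call, cache_cost_per_lookup):
    """Fraction of the no-cache bill saved, under the stated assumptions."""
    baseline = llm_cost_per_call
    # Every request pays for the lookup; only misses go on to pay for the LLM.
    with_cache = cache_cost_per_lookup + (1 - hit_rate) * llm_cost_per_call
    return (baseline - with_cache) / baseline


# Hypothetical numbers: $0.01 per LLM call, $0.0005 per cache lookup.
for hit_rate in (0.3, 0.5, 0.75):
    saving = estimated_savings(hit_rate, 0.01, 0.0005)
    print(f"hit rate {hit_rate:.0%}: about {saving:.0%} saved")
```

Under these assumptions the savings always sit slightly below the hit rate, so a 40–70% cost reduction implies that somewhat more than 40–70% of traffic must be served from cache.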
I’d really appreciate feedback from the community.
Website: https://www.nexus-gateway.org
94% GPU reduction claim needs verifiable benchmarks to stand out.
Zero-trust networking via zrok beats LiteLLM when your GPUs sit behind NAT.
Semantic caching with dependency invalidation beats standard Redis wrappers for agent costs.
Multi-model LLM router with semantic cache, but caching and fallback already exist (Anthropic, LangSmith, Unify).
Tool result caching for agents, but GPTCache and LangChain already do semantic caching.