Caliper – Auto Instrumented LLM Observability with Custom Metadata

Author: OliverGuy

by OliverGuy·Mar 8, 2026·2 points·0 comments


AI Analysis

Solid · Solve My Problem · Ship It

Zero-code instrumentation via monkey-patching, but LangSmith, Helicone, and Arize already do this.

Strengths

• Single init() call required; no SDK code changes needed for basic metrics.
• Flexible metadata capture: pre/post-request KV pairs via caliper_metadata and annotate().
• Multiple export targets (local JSONL, S3) with configurable batching and retry logic.

Weaknesses

• LiteLLM support promised but not shipped; currently only OpenAI and Anthropic.
• Crowded space: LangSmith, Helicone, Arize, and Datadog already solve LLM observability with larger ecosystems.

Post Description

Caliper auto-instruments LLM calls in Python. It monkey-patches the OpenAI and Anthropic SDKs (I've got plans to add LiteLLM so you can use any provider you want), so it's almost completely invisible to you as the developer, and for basic metrics it can slot in as a single init() call at startup.

It can also gather custom metadata about a call. This can be any KV pairs you want, both pre- and post-request.

```python
import caliper
import anthropic

# This is all that's required for basic observability; no changes are needed to LLM calls for basic metrics
caliper.init(target="s3")

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Anthropic Messages API
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    caliper_metadata={"campaign": "q4"},  # Pre-request metadata
)

print(response.content[0].text)

caliper.annotate(sentiment="positive")  # Post-request metadata
```
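
The example above passes an extra caliper_metadata keyword to an SDK method that does not normally accept it. The post does not show how Caliper does this internally, so the following is only a minimal sketch of the general monkey-patching technique, run against a stand-in class instead of a real SDK. `FakeMessages`, `instrument`, and `captured` are hypothetical names invented for illustration.

```python
import functools
import time


class FakeMessages:
    """Stand-in for an SDK resource class; not a real client."""

    def create(self, *, model, messages):
        return {"model": model, "text": "4"}


captured = []  # records collected by the patched method


def instrument(cls, method_name="create"):
    """Replace cls.<method_name> with a wrapper that records timing and metadata."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, caliper_metadata=None, **kwargs):
        start = time.perf_counter()
        # The extra keyword is consumed here and never reaches the original method.
        response = original(self, *args, **kwargs)
        captured.append(
            {
                "model": kwargs.get("model"),
                "latency_s": time.perf_counter() - start,
                "metadata": caliper_metadata or {},
            }
        )
        return response

    setattr(cls, method_name, wrapper)


instrument(FakeMessages)

response = FakeMessages().create(
    model="demo-model",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    caliper_metadata={"campaign": "q4"},
)
print(response["text"], captured[0]["metadata"])
```

Because the wrapper is installed on the class, existing call sites keep working unchanged. This sketch records only calls that return successfully; a real tool would likely also capture errors.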

You can use this to track the effectiveness of model changes against different user tiers. Maybe your free tier users don't notice if you use a cheaper model but your paying users do? How do you know if a recent system prompt change was effective? You can track the version of the prompt in metadata and compare post-request rating annotations between prompt versions.
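
As a rough illustration of that prompt-version comparison, here is a sketch that averages a rating per prompt version from JSONL records. The post does not describe Caliper's export schema, so the keys "metadata", "prompt_version", "annotations", and "rating" are assumptions, as is the helper name `mean_rating_by_prompt_version`.

```python
import json
from collections import defaultdict
from statistics import mean


def mean_rating_by_prompt_version(jsonl_lines):
    """Average a post-request rating per prompt version.

    Assumes (hypothetically) that each line is a JSON object with a
    "metadata" dict holding "prompt_version" and an "annotations" dict
    holding a numeric "rating".
    """
    ratings = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        version = record.get("metadata", {}).get("prompt_version")
        rating = record.get("annotations", {}).get("rating")
        if version is None or rating is None:
            continue  # skip calls that were not tagged or not rated
        ratings[version].append(rating)
    return {version: mean(values) for version, values in ratings.items()}


sample = [
    '{"metadata": {"prompt_version": "v1"}, "annotations": {"rating": 3}}',
    '{"metadata": {"prompt_version": "v1"}, "annotations": {"rating": 4}}',
    '{"metadata": {"prompt_version": "v2"}, "annotations": {"rating": 5}}',
    '{"metadata": {"prompt_version": "v2"}}',
]
print(mean_rating_by_prompt_version(sample))
```

The same grouping works for user tiers or model names by swapping the metadata key.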

It has a dev mode that logs locally, and it can also send files to S3. The SDK has a background queue and a worker that flushes in batches; both the batch size and the time between flushes are configurable. It exports to S3 as batched JSON files that integrate readily into most data engineering pipelines, or you can just query them directly with a tool like DuckDB.
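
The queue-and-worker pattern described here can be sketched with the standard library. This is not Caliper's implementation; `BatchingWorker` and its parameters `max_batch_size` and `flush_interval` are hypothetical names, and the flush callback stands in for whatever writes to disk or S3.

```python
import queue
import threading
import time


class BatchingWorker:
    """Collects records on a queue and flushes them in batches from a background thread."""

    def __init__(self, flush, max_batch_size=100, flush_interval=5.0):
        self._flush = flush  # callable that receives a list of records
        self._max_batch_size = max_batch_size
        self._flush_interval = flush_interval
        self._queue = queue.Queue()
        self._stop = object()  # sentinel that tells the worker to exit
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, record):
        """Enqueue a record without blocking the caller on I/O."""
        self._queue.put(record)

    def close(self):
        """Flush whatever is pending and stop the worker."""
        self._queue.put(self._stop)
        self._thread.join()

    def _run(self):
        batch = []
        deadline = time.monotonic() + self._flush_interval
        while True:
            stopping = False
            timeout = max(0.0, deadline - time.monotonic())
            try:
                item = self._queue.get(timeout=timeout)
                if item is self._stop:
                    stopping = True
                else:
                    batch.append(item)
            except queue.Empty:
                pass  # the flush interval elapsed with nothing new

            interval_elapsed = time.monotonic() >= deadline
            if stopping or interval_elapsed or len(batch) >= self._max_batch_size:
                if batch:
                    self._flush(batch)
                    batch = []
                deadline = time.monotonic() + self._flush_interval
            if stopping:
                return


flushed = []
worker = BatchingWorker(flushed.append, max_batch_size=3, flush_interval=60.0)
for i in range(7):
    worker.submit({"call": i})
worker.close()
print([len(batch) for batch in flushed])  # [3, 3, 1]
```

Keeping the flush on a separate thread means the LLM call path only pays for a queue put; the trade-off is that records still in the queue are lost if the process dies before `close()` runs.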

PyPI: https://pypi.org/project/caliper-sdk/

Edits: formatting and PyPI link