GitHub Repository

LLM benchmark for datetime format generation reliability

5 stars · Python

Datetime-bench: which datetime formats LLMs get right (and wrong)

by diwank · Mar 26, 2026 · 2 points · 1 comment

AI Analysis

●●● Banger · Solve My Problem · Dark Horse

RFC 3339 hits 88% accuracy, while Unix epoch fails 50% of the time.

Strengths
  • Tests 22 models across 7 formats with 95% confidence intervals
  • Fills genuine gap — temporal reasoning benchmarks ignore output format reliability
  • Actionable findings: JS Date wrong 1 in 4 times, epoch drops to 40% on arithmetic
Weaknesses
  • Benchmark repo, not a daily-use tool for most developers
  • Limited to datetime formats, doesn't cover other common output contracts
Target Audience

AI engineers building production LLM applications

Similar To

TimeBench · TRAM · HELM

Post Description

tl;dr

* If you need an LLM to parse OR emit a timestamp, use:

RFC 3339 ( e.g. 2024-03-26 10:30:00-05:00 )

* Python date format also works well

* Do NOT use Unix epoch or JavaScript Date formats.

* Smaller models and non-reasoning models still make a LOT of mistakes in time parsing / formatting.

---

There are lots of temporal reasoning benchmarks (like TimeBench, TRAM, etc.), but they test whether models understand time concepts. None of them measure which datetime output format models get right most often, so we built that benchmark ourselves.

We tested 22 models across Google, Anthropic, OpenAI, Qwen, and GLM on 235 scenarios and 7 different formats.

The two that surprised us the most were JavaScript Date and unix epoch. JavaScript Date is probably the most commonly used format and it's wrong ~1 in 4 times on parsing. Unix epoch drops to 40% on arithmetic tasks. If you need epoch, just have the model output a string and convert it yourself in code.
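A minimal sketch of that last suggestion, assuming the model was asked to return an RFC 3339 string with an explicit UTC offset. `rfc3339_to_epoch` is a hypothetical helper written for illustration (it is not part of the benchmark repo), and it relies only on Python's standard library:

```python
from datetime import datetime


def rfc3339_to_epoch(text: str) -> int:
    """Convert an RFC 3339 timestamp string (e.g. from an LLM) to Unix epoch seconds."""
    cleaned = text.strip()
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    if cleaned[-1:] in ("Z", "z"):
        cleaned = cleaned[:-1] + "+00:00"
    # Raises ValueError if the model's output is not a valid timestamp.
    parsed = datetime.fromisoformat(cleaned)
    # Refuse naive timestamps: without an offset the epoch value would
    # silently depend on the local timezone of the machine running this.
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {text!r}")
    return int(parsed.timestamp())


if __name__ == "__main__":
    print(rfc3339_to_epoch("2024-03-26 10:30:00-05:00"))  # 1711467000
```

The point of doing it this way is that the model only has to produce the format it is most reliable at, and the arithmetic happens in deterministic code. A malformed or offset-less string raises `ValueError`, so a bad generation fails loudly instead of turning into a wrong number.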
