Tested 12 LLMs with few-shot examples
Research article revealing few-shot collapse patterns, not a usable tool or product.
A framework for few-shot evaluation of language models.
Industry-standard benchmark harness, refactored with lighter installs and new SGLang support.
ML researchers and LLM developers
lmms-eval · HELM · BigBench
Kokoro voice cloning with multilingual support, but the voice-cloning space is crowded.
Another Whisper wrapper with a nice UI, but it offers little novelty over what Hugging Face Spaces already hosts.
Scores 90.3 on BrowseComp with a verification-centric model architecture.
Interactive DuckDB-WASM benchmark beats static leaderboards for agentic SQL eval.
Batch OCR with free Tesseract plus bring-your-own-key Vision/AI; Bangla support fills a real niche.