Tested 12 LLMs with few-shot examples


by shuntaro-okuma·Mar 26, 2026·2 points·0 comments

AI Analysis

Mid · Dark Horse

Research article revealing few-shot collapse patterns, not a usable tool or product.

Strengths
  • 27k+ evaluations across 12 models provide statistically meaningful data.
  • Uncovers Gemini 3 Flash dropping 63 points (93% to 30%) on route optimization between zero-shot and 8-shot.
  • Demo data included for exploring patterns without API costs.
Weaknesses
  • Blog post format rather than interactive tool or reusable product.
  • Findings may not generalize beyond the five tasks tested.
Category
Target Audience

ML engineers, researchers evaluating LLM prompting strategies

Similar To

LMSys Chatbot Arena · Hugging Face Open LLM Leaderboard

Post Description

I evaluated 12 models (6 cloud, 6 local) across 5 tasks at shot counts 0, 1, 2, 4, and 8, with 3 trials each. That comes to 60 model-task pairs and 27k+ evaluations. Three patterns stood out:
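To make the size of the evaluation grid described above concrete, here is a minimal sketch that enumerates it. The model and task names are hypothetical placeholders, since the post does not list them all here. Only the counts (12 models, 5 tasks, 5 shot counts, 3 trials) come from the post.

```python
from itertools import product

# Placeholder names (assumption): the post names only a few models and tasks.
MODELS = [f"model_{i}" for i in range(1, 13)]  # 6 cloud + 6 local in the post
TASKS = [f"task_{i}" for i in range(1, 6)]
SHOT_COUNTS = [0, 1, 2, 4, 8]
TRIALS = 3

# One learning curve per (model, task) pair.
pairs = list(product(MODELS, TASKS))

# One run per (model, task, shot count, trial) combination.
runs = list(product(MODELS, TASKS, SHOT_COUNTS, range(TRIALS)))

print(len(pairs))  # 60
print(len(runs))   # 900
```

The grid has 900 runs, so the 27k+ figure implies that each run scores multiple test items. The exact number per run is not stated in this post.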

1. Few-shot can cause collapse: Gemini 3 Flash scored 93% at zero-shot on route optimization, then crashed to 30% at 8-shot. A local model from the same family, Gemma 3 27B, stayed stable at 90%.

2. Most models benefit from few-shot: On classification, all models scored 0-20% at zero-shot. At 8-shot, scores spread from 27% to 80%. Picking a model from zero-shot results alone would have led to the wrong choice.

3. Task mismatch ≠ collapse: Reasoning-specialized models scored low on summarization regardless of shot count. They are not "collapsing"; they are simply not suited to the task.
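The three patterns above can be told apart mechanically from a score-versus-shot-count curve. The sketch below is one possible way to do that, not the method used in the article or in adapt-gauge-core. The function name `classify_curve` and all three thresholds are assumptions chosen for illustration.

```python
def classify_curve(scores, drop=20.0, gain=20.0, floor=25.0):
    """Label a curve given {shot_count: score_percent}.

    Thresholds are illustrative assumptions, not values from the post:
      floor - a curve that never reaches this score is a task mismatch
      drop  - points lost from the peak to the highest shot count for "collapse"
      gain  - points gained from the lowest to the highest shot count for "improves"
    """
    shots = sorted(scores)
    first, last = scores[shots[0]], scores[shots[-1]]
    peak = max(scores.values())

    if peak < floor:
        return "task mismatch"  # low everywhere, so there is nothing to collapse from
    if peak - last >= drop:
        return "collapse"       # was good, got worse with more examples
    if last - first >= gain:
        return "improves"       # few-shot examples helped
    return "stable"


# Endpoints taken from the post (intermediate shot counts are not given here).
print(classify_curve({0: 93, 8: 30}))  # collapse (Gemini 3 Flash, route optimization)
print(classify_curve({0: 90, 8: 90}))  # stable   (Gemma 3 27B, assuming 90% at both ends)
print(classify_curve({0: 20, 8: 80}))  # improves (upper end of the classification range)
# Made-up numbers standing in for a reasoning model on summarization.
print(classify_curve({0: 15, 8: 18}))  # task mismatch
```

Checking the floor first is the design choice that keeps pattern 3 separate from pattern 1: a model that is weak at every shot count is never labeled as collapsing.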

A 27B local model (Gemma 3) matched Claude Haiku's adaptation efficiency (AUC 0.814 vs 0.815). The 12-model results are included as default demo data, so you can explore the patterns without API keys.
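The post does not define how the AUC is computed. As a rough illustration of the idea, the sketch below takes the trapezoidal area under the score-versus-shot-count curve and divides it by the shot range, so a model that scores 1.0 at every shot count gets 1.0. The function name `adaptation_auc` and this normalization are assumptions. The repository may use a different definition (for example, a log-scaled shot axis).

```python
def adaptation_auc(scores):
    """Normalized trapezoidal area under {shot_count: score in [0, 1]}.

    Assumed definition for illustration only; not taken from adapt-gauge-core.
    """
    shots = sorted(scores)
    if len(shots) < 2:
        raise ValueError("need scores for at least two shot counts")

    area = 0.0
    for left, right in zip(shots, shots[1:]):
        # Average height of the segment times its width.
        area += (scores[left] + scores[right]) / 2 * (right - left)

    # Divide by the total width so the result stays in [0, 1].
    return area / (shots[-1] - shots[0])


# Made-up curves: both end at 0.8, but one adapts after a single example.
early = {0: 0.2, 1: 0.8, 2: 0.8, 4: 0.8, 8: 0.8}
late = {0: 0.2, 1: 0.2, 2: 0.2, 4: 0.2, 8: 0.8}
print(adaptation_auc(early) > adaptation_auc(late))  # True
```

Under this definition, two models with the same 8-shot score can still differ in AUC: the one that reaches its plateau with fewer examples scores higher, which is what an "adaptation efficiency" number is meant to reward.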

Article: https://dev.to/shuntarookuma/i-tested-12-llms-with-few-shot-...

GitHub (MIT): https://github.com/ShuntaroOkuma/adapt-gauge-core

Similar Projects

AI/ML · ●● Solid

ModelSweep - Open-Source Benchmarking for Local LLMs

Postman for local LLMs with LLM-as-Judge and Elo ratings built in.

Ship It · Niche Gem · Slick
leonickson
203mo ago