Tested 12 LLMs with few-shot examples

Author: shuntaro-okuma

by shuntaro-okuma·Mar 26, 2026·2 points·0 comments

AI Analysis

● Mid · Dark Horse

Research article revealing few-shot collapse patterns, not a usable tool or product.

Strengths

•27k+ evaluations across 12 models provide statistically meaningful data.
•Shows Gemini 3 Flash dropping 63 points on route optimization (93% at zero-shot to 30% at 8-shot).
•Demo data included for exploring patterns without API costs.

Weaknesses

•Blog post format rather than interactive tool or reusable product.
•Findings may not generalize beyond the five tasks tested.

Post Description

I evaluated 12 models (6 cloud, 6 local) across 5 tasks at shot counts 0, 1, 2, 4, and 8, with 3 trials each. 60 model-task pairs, 27k+ evaluations. Three patterns stood out:

1. Few-shot can cause collapse: Gemini 3 Flash scored 93% at zero-shot on route optimization, then crashed to 30% at 8-shot. Gemma 3 27B, a local model from the same family, stayed stable at 90%.

2. Most models benefit from few-shot: On classification, all models scored 0-20% at zero-shot. At 8-shot, scores spread from 27% to 80%. Zero-shot benchmarks would have led to the wrong model choice.

3. Task mismatch ≠ collapse: Reasoning-specialized models scored low on summarization regardless of shot count. They're not "collapsing"; they're just not suited to the task.
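
The three patterns above can be told apart mechanically from a shot-count-to-score curve. The following is a minimal sketch, not the method used in adapt-gauge-core: the function name `classify_curve`, the label strings, and both thresholds are hypothetical choices made for illustration. Only the Gemini 3 Flash endpoints (93% and 30%) and the classification range (up to 20% at zero-shot, up to 80% at 8-shot) come from the post; the other curves are invented.

```python
from typing import Dict


def classify_curve(
    scores: Dict[int, float],
    drop_threshold: float = 0.2,  # assumed: minimum change that counts as a real shift
    low_threshold: float = 0.4,   # assumed: a peak below this means the task is a poor fit
) -> str:
    """Label a {shot_count: score} curve as one of four illustrative patterns."""
    shots = sorted(scores)
    values = [scores[s] for s in shots]
    first, final, peak = values[0], values[-1], max(values)

    # Low at every shot count: a task mismatch, not a collapse.
    if peak < low_threshold:
        return "task_mismatch"
    # Was good at some point, then fell by the last shot count: collapse.
    if peak - final >= drop_threshold:
        return "collapse"
    # Ended well above where it started: few-shot helped.
    if final - first >= drop_threshold:
        return "benefits"
    return "stable"


# Endpoints taken from the post (intermediate shot counts omitted).
gemini_route = {0: 0.93, 8: 0.30}
# Upper ends of the reported classification ranges.
best_classifier = {0: 0.20, 8: 0.80}
# Invented curves for illustration only.
flat_high = {0: 0.90, 1: 0.90, 2: 0.90, 4: 0.90, 8: 0.90}
flat_low = {0: 0.25, 1: 0.20, 2: 0.25, 4: 0.20, 8: 0.25}

labels = {
    "gemini_route": classify_curve(gemini_route),
    "best_classifier": classify_curve(best_classifier),
    "flat_high": classify_curve(flat_high),
    "flat_low": classify_curve(flat_low),
}
```

Checking the peak before checking the drop is what keeps a consistently low curve from being reported as a collapse.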

A 27B local model (Gemma 3) matched Claude Haiku's adaptation efficiency (AUC 0.814 vs 0.815). The 12-model results are included as default demo data, so you can explore the patterns without API keys.
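
The post does not define how the adaptation-efficiency AUC is computed. One plausible reading is the area under the score-versus-shot-count curve, normalized so that a model scoring 100% at every shot count gets 1.0. The sketch below assumes exactly that: trapezoidal integration on a linear shot-count axis. The function name `normalized_auc` is hypothetical, and the repository may use a different axis scaling or normalization, so this is not expected to reproduce the 0.814 and 0.815 figures.

```python
from typing import Dict


def normalized_auc(scores: Dict[int, float]) -> float:
    """Trapezoidal area under a {shot_count: score} curve, divided by the shot range.

    Assumption: the shot-count axis is linear. A log-scaled or index-based
    axis would weight the low-shot region more heavily.
    """
    shots = sorted(scores)
    if len(shots) < 2:
        raise ValueError("need at least two shot counts to integrate")

    area = 0.0
    for left, right in zip(shots, shots[1:]):
        # Area of one trapezoid: width times the mean of the two heights.
        area += (right - left) * (scores[left] + scores[right]) / 2.0

    # Divide by the total width so a constant score of s yields exactly s.
    return area / (shots[-1] - shots[0])


# Invented curves at the post's shot counts (0, 1, 2, 4, 8), for illustration only.
flat_curve = {0: 0.9, 1: 0.9, 2: 0.9, 4: 0.9, 8: 0.9}
rising_curve = {0: 0.0, 1: 0.5, 2: 0.5, 4: 1.0, 8: 1.0}

flat_auc = normalized_auc(flat_curve)
rising_auc = normalized_auc(rising_curve)
```

On a linear axis the 4-to-8 segment covers half the total width, so under this assumption a model's late-shot behavior dominates its score.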

Article: https://dev.to/shuntarookuma/i-tested-12-llms-with-few-shot-...

GitHub (MIT): https://github.com/ShuntaroOkuma/adapt-gauge-core