I built a thinking framework for Claude

Author: bengia

by bengia·Feb 17, 2026·2 points·0 comments


AI Analysis

Solid · Big Brain · Niche Gem

The Take

The site pairs a concrete 5-step rubric (ground facts, stress-test failures, reframe, trace implications, audit reasoning) with a blind A/B UI so humans can judge whether structure actually helps an LLM, a level of experimental rigor that most prompt experiments skip. It's clever and pragmatic, but the idea isn't brand-new (it's essentially formalized chain-of-thought prompting) and the scope is narrow (Claude-only, with a small sample of topics).

Post Description

I built an open-source Claude Code skill called /think that applies a structured 5-element analysis framework (ground in facts, stress-test for failure, reframe the question, trace implications, audit your own reasoning) before synthesizing a recommendation.

The obvious question: does it actually produce better output than just asking Claude directly?

To test this, I ran blind A/B comparisons. Two isolated Claude Opus 4.6 agents get the same question: one runs /think, one responds naturally. Both responses are anonymized (framework markers stripped, sections retitled by content) and presented blind.

The test covers 5 topics any professional would recognize: scaling a team post-fundraise, build vs buy decisions, when to pivot a product, SaaS pricing strategy, and the remote/hybrid/office debate.

An AI judge scored /think winning all 5 pairs. But AI judging AI is circular, which is why the blind test is live for humans to judge.

What I found so far (~21 comparisons across calibration + blind tests):

- /think wins ~69% of comparisons overall.
- Risk coverage is the clearest advantage (17-2 across all tests): it consistently surfaces failure modes the organic response misses.
- Decision impact is nearly even: organic Claude is often more actionable for practical problems.
- Novel insight is mostly a wash: both find similar core insights, just different ones.
- No decisive gaps in either direction. The advantage is depth and rigor, not dramatic superiority.
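For anyone who wants to reproduce the blind setup, here is a minimal sketch of the presentation and tallying step, not the project's actual code. It assumes each pair is shown under randomly assigned labels A and B, with a hidden key used to map verdicts back to the condition afterwards. The names `BlindPair`, `make_blind_pair`, and `tally` are hypothetical, and the response texts are placeholders.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class BlindPair:
    topic: str
    shown: dict  # label ("A"/"B") -> response text shown to the judge
    key: dict    # label ("A"/"B") -> hidden condition ("think"/"organic")


def make_blind_pair(topic, think_text, organic_text, rng):
    """Randomly assign the two responses to labels A and B."""
    conditions = [("think", think_text), ("organic", organic_text)]
    rng.shuffle(conditions)
    shown = {label: text for label, (_, text) in zip("AB", conditions)}
    key = {label: cond for label, (cond, _) in zip("AB", conditions)}
    return BlindPair(topic, shown, key)


def tally(pairs, verdicts):
    """Map each verdict ("A", "B", or anything else as a tie) to its condition."""
    counts = Counter()
    for pair, verdict in zip(pairs, verdicts):
        counts[pair.key.get(verdict, "tie")] += 1
    return counts


# Placeholder data: two topics, one verdict per pair.
rng = random.Random(0)
pairs = [
    make_blind_pair("build vs buy", "response from /think", "organic response", rng),
    make_blind_pair("SaaS pricing", "response from /think", "organic response", rng),
]
verdicts = ["A", "tie"]
print(tally(pairs, verdicts))
```

Randomizing the label per pair keeps position from leaking the condition. It does nothing about stylistic tells inside the text itself.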

Honest limitations:

- All judges so far are AI. The whole point of publishing the blind test is to get human validation.
- ~21 comparisons is a pattern, not statistical significance.
- Anonymization isn't perfect: /think responses have stylistic tells (confidence assessments, "what would change this conclusion" sections).
- The framework costs significantly more tokens.
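To put a rough number on the sample-size caveat, here is a back-of-the-envelope exact sign test. It is a sketch under assumptions: ~69% of 21 is taken to mean 14 or 15 wins, ties are ignored, and the comparisons are treated as independent, which they may not be. The function name `sign_test_p` is made up for this example.

```python
from math import comb


def sign_test_p(wins, n):
    """Two-sided exact binomial p-value against a 50/50 null."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


n = 21
for wins in (14, 15):  # assumed counts bracketing ~69% of 21
    print(f"{wins}/{n} wins: p = {sign_test_p(wins, n):.3f}")
# 14/21 wins: p = 0.189
# 15/21 wins: p = 0.078
```

Under those assumptions, neither count clears the usual 0.05 threshold, which is consistent with calling the overall result a pattern and not a significant one.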

The skill itself is a recursive learning agent: it persists what it learns to a .think/ directory and loads that context in future sessions. Over time it builds project-specific knowledge. It also used its own framework to diagnose and fix its own weaknesses after the first round of testing.

Everything is open source: https://github.com/bengiaventures/effective-thinking-skill

I'd genuinely like to know if the blind test matches what the AI judges found, or if humans see something different. Takes about 15 minutes.