Agent-skills-eval – Test whether Agent Skills improve outputs
Lightweight A/B testing for SKILL.md files when LangSmith feels too heavy.
🔬 Verifiable AI-Augmented Engineering Framework - Stop AI hallucinations with formal traceability (REQ→ART→TC). Agent Skills for Claude Code, Cursor, VS Code & Copilot. Enterprise-grade: ISO 9001, ISO 27001, GxP-ready. Red Team verification, multi-cycle lifecycle, behavioral anti-patterns.
Enforces test independence in AI agents to break the confirmation bias loop.
AI engineering teams, DevOps managers
LangChain · AutoGen · Cognition
Most AI coding workflows are just "write this, now test this." The agent that writes the code also writes the tests for the code it just wrote. That's not testing, that's confirmation bias in a loop.
Curious whether others have tried to enforce test independence structurally in agentic workflows, and whether the Skills format (vs. system prompts or tool definitions) is something people are actually using or think is the right abstraction for this.
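To make "structural" concrete, here is a minimal sketch of one way independence could be enforced: the test-writing agent is handed only the requirement, never the implementation. This is an illustration, not the project's actual mechanism. Every name in it (`Requirement`, `Artifact`, `build_test_author_context`, `leaks_implementation`) is hypothetical, and it assumes artifacts carry the ID of the requirement they trace to, in the spirit of REQ→ART→TC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    req_id: str
    text: str


@dataclass(frozen=True)
class Artifact:
    req_id: str  # requirement this artifact traces back to
    source: str  # implementation produced by the code-writing agent


def build_test_author_context(req: Requirement, art: Artifact) -> dict:
    """Build the context for the test-writing agent (hypothetical helper).

    Only the requirement is passed on. The artifact is used solely to
    check traceability, so the test author never sees the implementation.
    """
    if req.req_id != art.req_id:
        raise ValueError("artifact does not trace to this requirement")
    return {"req_id": req.req_id, "requirement": req.text}


def leaks_implementation(context: dict, art: Artifact) -> bool:
    """Return True if the implementation source appears anywhere in the context."""
    return any(art.source in str(value) for value in context.values())
```

The point of the design is that independence comes from what the test author is allowed to read, not from an instruction asking it to "be objective". A check like `leaks_implementation` could run as a gate before the test-writing agent is invoked.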
Generates verified MCP tools from URLs with smoke tests before agents touch them.
Security scanning catches data exfiltration before skills go live.
QA test documents generated before code—that verification shift is genuinely clever.
Fully observable multi-agent drama: every LLM call logged, every decision traceable, deterministic at scale.
Agent-native eval workflow beats LangSmith's manual dashboard setup.