OctopusGarden – An autonomous software factory (specs in, code out)
Orchestrates AI agents to iterate code until tests pass—but StrongDM already ships this.

Fix task 14 of 30 without restarting—Cursor's all-or-nothing approach can't do this.
Developers using LLMs for code generation
Cursor · Continue · Windsurf
The core idea: specs are the source of truth and the human stays in the loop. You review the plan before anything gets built. You can step through tasks one at a time, or let it run and intervene on failures. When task 14 of 30 fails, you fix that task and keep going instead of starting over. When requirements change, you update the spec and only the affected parts get rebuilt.
The workflow is validate -> audit -> build. Validate checks structure (no LLM). Audit sends specs to an LLM for review and generates the build plan. Build executes it task by task with verification after each step.
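The three stages can be sketched roughly like this. This is an illustration of the validate -> audit -> build shape, not the tool's real interface. The function names, the `REQUIRED_KEYS` spec shape, and the `review`, `execute`, and `verify` callables (stand-ins for the LLM call, the code generation step, and the per-task check) are all assumptions.

```python
from typing import Callable

REQUIRED_KEYS = ("name", "description")   # assumed spec shape, for illustration only


def validate(spec: dict) -> list[str]:
    """Structural checks only, no LLM involved. Returns a list of problems."""
    return [f"missing field: {key}" for key in REQUIRED_KEYS if not spec.get(key)]


def audit(spec: dict, review: Callable[[dict], list[str]]) -> list[str]:
    """Send the spec for review and get back an ordered build plan.

    `review` stands in for the LLM call.
    """
    return review(spec)


def build(plan: list[str],
          execute: Callable[[str], None],
          verify: Callable[[str], bool]) -> list[str]:
    """Run tasks one at a time, verifying after each. Stops at the first failure."""
    done = []
    for task in plan:
        execute(task)
        if not verify(task):
            break
        done.append(task)
    return done


def pipeline(spec: dict, review, execute, verify) -> tuple[list[str], list[str]]:
    """validate -> audit -> build. Returns (plan, tasks that passed verification)."""
    problems = validate(spec)
    if problems:
        raise ValueError("; ".join(problems))
    plan = audit(spec, review)
    return plan, build(plan, execute, verify)
```

Keeping `validate` free of any model call means a malformed spec fails fast and costs nothing. In this sketch the point where a human would review the plan sits between `audit` and `build`.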
Works with Anthropic, OpenAI, Mistral, Google, etc., and with local models through Ollama.
Python 3.14+, MIT licensed.
GitHub: https://github.com/ossature/ossature
Docs: https://docs.ossature.dev
Some examples: https://github.com/ossature/ossature-examples
This is v0.0.1. Right now it works best for greenfield projects. I want to support workflows against existing codebases but honestly I'm not sure yet what the right approach looks like there. If you have thoughts on that or anything else, open an issue or start a discussion on the repo.
LLM cost optimizer, but Anthropic's batch API and local quantization solve this cheaper.
Schema-valid evidence packs for AI agents when generic evals miss domain nuance.
Catches LLMs cheating on evals with a 9-pattern catalog nobody else documents.
Math-spec approach for LLM-generated code, but lacks working examples and doesn't solve the reasoning-accuracy problem.
Task-specific LLM benchmarking beats generic leaderboards that ignore your actual workload.