OWASP VulnerableApp: Break It. Scan It. Benchmark Against It. Improve It.
Scanner benchmarking for DAST tools. DVWA and Juice Shop dominate security training.
Open SAST benchmark corpus: 9 languages, 18 frameworks, combinatorial vulnerable/safe cases, anti-leakage by construction. Cross-file CWE chains, polyglot microservice taint, and adversarial modern-threat cases (evasion, supply-chain, AI-era). Score any SARIF 2.1.0 tool's accuracy (TPR/FPR, Youden's J). OWASP Top 10 2025.
First public SAST benchmark for Go and Rust with adversarial evasion test cases.
Security engineers and SAST tool developers
OWASP Benchmark for Java · OWASP Benchmark for Python
Show HN: A benchmark for SAST exploit chain and evasion detection
Traditional SAST benchmarks are great at measuring simple source-to-sink taint flows, but real-world attacks have moved past that. I spent some time building a benchmark suite to test the things that current static analysis tools structurally struggle to see.
Design Principles
- Test cases written from security knowledge, not from knowledge of any specific SAST engine's detection capabilities.
- No vulnerability hints in source code -- the CSV answer key is the ONLY ground truth. No comments, no CWE references, no category names in filenames or function names.
- 50/50 TP/TN balance prevents classifier gaming -- a tool that flags everything scores 0%, not 100%.
- Category-averaged scoring prevents large categories from dominating small ones.
- Minimum 25 TP + 25 TN per category ensures statistical significance (Youden's J per-case swing ≤ 4%).
- Tool-agnostic SARIF-based scoring -- any SAST tool that exports SARIF 2.1.0 can be scored.
- 1 file = 1 test case for the baseline language benchmarks (standalone functions with no cross-file dependencies), while the Chain Detection tests explicitly use multi-file application structures.
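To make the scoring principles above concrete, here is a minimal sketch of category-averaged Youden's J (TPR minus FPR) computed from a SARIF 2.1.0 document. It is not the benchmark's actual scorer. The answer-key shape (a list of `(file, category, is_vulnerable)` tuples), the neutral file names, the category labels, and the rule "a file counts as flagged if any SARIF result points at it" are all assumptions made for illustration; the real CSV layout and matching logic may differ.

```python
from collections import defaultdict


def flagged_files(sarif: dict) -> set:
    """Collect the URI of every file that has at least one SARIF result."""
    flagged = set()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                uri = (
                    loc.get("physicalLocation", {})
                    .get("artifactLocation", {})
                    .get("uri")
                )
                if uri:
                    flagged.add(uri)
    return flagged


def youden_j(tp: int, fn: int, fp: int, tn: int) -> float:
    """Youden's J = TPR - FPR, ranging from -1 to 1."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr - fpr


def category_averaged_score(answer_key, flagged):
    """Score each category separately, then take the unweighted mean.

    answer_key: iterable of (file, category, is_vulnerable) tuples
                (hypothetical shape, not the benchmark's real CSV schema).
    flagged:    set of file URIs the tool reported.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for path, category, vulnerable in answer_key:
        hit = path in flagged
        if vulnerable:
            counts[category]["tp" if hit else "fn"] += 1
        else:
            counts[category]["fp" if hit else "tn"] += 1
    per_category = {c: youden_j(**v) for c, v in counts.items()}
    overall = sum(per_category.values()) / len(per_category)
    return overall, per_category


# Hypothetical answer key: two categories, one vulnerable and one safe case each.
answer_key = [
    ("case_0001.go", "cat_a", True),
    ("case_0002.go", "cat_a", False),
    ("case_0003.go", "cat_b", True),
    ("case_0004.go", "cat_b", False),
]

# A tool that flags every file gets TPR = 1 and FPR = 1, so J = 0.
flag_everything = {path for path, _, _ in answer_key}
print(category_averaged_score(answer_key, flag_everything)[0])  # 0.0
```

Because each category contributes one J value to the mean, a category with 25 cases weighs as much as one with 500. With 25 vulnerable cases in a category, a single missed case moves that category's TPR by 1/25, which is the 4% per-case swing mentioned above.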
It focuses heavily on two main areas:
- Chain Detection: 500 test cases that measure if a tool can correlate multiple low-severity findings across different files into a compound exploit path.
- Adversarial Evasion: Tests to see if a tool can detect intentional concealment, like payloads hidden inside invisible Unicode characters or visual deception using Bidi overrides.
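For a sense of what the evasion cases ask of a tool, the sketch below scans source text for Unicode bidirectional control characters and other invisible format characters. It is only an illustration of the concealment class described above, not code from the benchmark. The function name `find_hidden_characters` and the two finding labels are made up for this example, and a real detector would need an allow-list, since some format characters are legitimate in string literals.

```python
import unicodedata

# Explicit bidirectional controls defined by the Unicode Bidirectional Algorithm.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def find_hidden_characters(source: str):
    """Return (line, column, code point, kind) for each suspicious character.

    Lines and columns are 1-based. "bidi-control" marks explicit Bidi
    overrides/isolates; "invisible-format" marks any other character in
    Unicode general category Cf (zero-width space, zero-width joiner, ...).
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                kind = "bidi-control"
            elif unicodedata.category(ch) == "Cf":
                kind = "invisible-format"
            else:
                continue
            findings.append((line_no, col, f"U+{ord(ch):04X}", kind))
    return findings


# A right-to-left override inside a string literal, plus a trailing zero-width space.
sample = 'if access_level != "user\u202e":\n    pass\u200b\n'
for finding in find_hidden_characters(sample):
    print(finding)
```

A character-level scan like this only shows that something is hidden. Deciding whether the hidden content changes program behaviour is the harder part, and presumably the part the benchmark is meant to measure.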
Since there was no public ground truth for Go, Rust, Bash, PHP, and Ruby, I also built baseline vulnerability benchmarks for those languages as part of the suite, bringing the total to over 7,700 test cases.
Building ground truth at this scale as a solo developer is a massive undertaking, and right now I have a serious echo chamber problem. I am the student taking the exam, the teacher who wrote it, and the professor grading it. It sucks, and I know I have blind spots in my test designs.
I am releasing this openly because imperfect ground truth that invites correction is more valuable than no ground truth at all. If you work in AppSec, build SAST engines, or just enjoy breaking logic, I would love your scrutiny. Finding my misclassifications and edge cases will make this infinitely more valuable for everyone.
Repo link: https://github.com/TheAuditorTool/sast-benchmark // ThailandJohn. TheAuditorTool Maintainer.