GitHub Repository

Open SAST benchmark corpus: 9 languages, 18 frameworks, combinatorial vulnerable/safe cases, anti-leakage by construction. Cross-file CWE chains, polyglot microservice taint, and adversarial modern-threat cases (evasion, supply-chain, AI-era). Score any SARIF 2.1.0 tool's accuracy (TPR/FPR, Youden's J). OWASP Top 10 2025.

6 stars · Python

A benchmark for SAST exploit chain and evasion detection

by ThailandJohn·Apr 8, 2026·2 points·1 comment

AI Analysis

●● Solid · Niche Gem · Bold Bet

First public SAST benchmark for Go and Rust with adversarial evasion test cases.

Strengths
  • 50/50 TP/TN balance prevents classifier gaming — flagging everything scores 0%.
  • Adversarial evasion benchmark tests Unicode hiding and visual deception attacks.
  • Fills a genuine gap: no public SAST ground truth previously existed for Go, Rust, Bash, PHP, or Ruby.
Weaknesses
  • Under active development with acknowledged potential misclassifications in test cases.
  • Benchmark suites are reference material, not products with recurring utility.
Target Audience

Security engineers and SAST tool developers

Similar To

OWASP Benchmark for Java · OWASP Benchmark for Python

Post Description


Show HN: A benchmark for SAST exploit chain and evasion detection

Traditional SAST benchmarks are great at measuring simple source-to-sink taint flows, but real-world attacks have moved past that. I spent some time building a benchmark suite to test the things that current static analysis tools structurally struggle to see.

Design Principles

  • Test cases written from security knowledge, not from knowledge of any specific SAST engine's detection capabilities.
  • No vulnerability hints in source code -- the CSV answer key is the ONLY ground truth. No comments, no CWE references, no category names in filenames or function names.
  • 50/50 TP/TN balance prevents classifier gaming -- a tool that flags everything scores 0%, not 100%.
  • Category-averaged scoring prevents large categories from dominating small ones.
  • Minimum 25 TP + 25 TN per category ensures statistical significance (Youden's J per-case swing ≤ 4%).
  • Tool-agnostic SARIF-based scoring -- any SAST tool that exports SARIF 2.1.0 can be scored.
  • 1 file = 1 test case for the baseline language benchmarks (standalone functions with no cross-file dependencies), while the Chain Detection tests explicitly use multi-file application structures.
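To make the scoring principles concrete, here is a minimal Python sketch of category-averaged Youden's J (J = TPR - FPR). It is not the repository's scorer. The answer-key shape (file, category, is_vulnerable), the file names, and the rule that any SARIF result located in a file counts as flagging that test case are all assumptions made for illustration; a real scorer would likely also match on CWE or category.

```python
from collections import defaultdict


def youden_j(tp, fn, fp, tn):
    """Youden's J for one category: true-positive rate minus false-positive rate."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr - fpr


def flagged_files(sarif_log):
    """Collect artifact URIs that carry at least one result in a SARIF 2.1.0 log (as a dict)."""
    uris = set()
    for run in sarif_log.get("runs", []):
        for result in run.get("results", []):
            for location in result.get("locations", []):
                physical = location.get("physicalLocation", {})
                uri = physical.get("artifactLocation", {}).get("uri")
                if uri:
                    uris.add(uri)
    return uris


def category_averaged_score(answer_key, flagged):
    """Average per-category J so large categories cannot dominate small ones.

    answer_key: iterable of (file, category, is_vulnerable) tuples.
    This tuple layout is hypothetical, not the benchmark's actual CSV schema.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for file, category, is_vulnerable in answer_key:
        hit = file in flagged
        if is_vulnerable:
            counts[category]["tp" if hit else "fn"] += 1
        else:
            counts[category]["fp" if hit else "tn"] += 1
    per_category = {cat: youden_j(**c) for cat, c in counts.items()}
    return sum(per_category.values()) / len(per_category), per_category


# Hypothetical answer key: two categories, each with 25 vulnerable and 25 safe cases.
answer_key = [
    (f"{cat}/case_{i:03d}.go", cat, i < 25)
    for cat in ("sqli", "cmdi")
    for i in range(50)
]

# A tool that flags every file: TPR = 1 and FPR = 1, so J = 0.
flag_everything_score, _ = category_averaged_score(
    answer_key, {file for file, _, _ in answer_key}
)

# A tool that flags exactly the vulnerable files.
perfect = {file for file, _, vulnerable in answer_key if vulnerable}
perfect_score, _ = category_averaged_score(answer_key, perfect)

# Missing a single vulnerable case moves that category's J by 1/25.
_, one_miss = category_averaged_score(answer_key, perfect - {"sqli/case_000.go"})

# Minimal SARIF 2.1.0 fragment with one result.
sarif_log = {
    "version": "2.1.0",
    "runs": [{"results": [{"locations": [{"physicalLocation": {
        "artifactLocation": {"uri": "sqli/case_000.go"}}}]}]}],
}
flagged_from_sarif = flagged_files(sarif_log)

print(flag_everything_score, perfect_score, one_miss["sqli"], flagged_from_sarif)
```

Under these assumptions, flagging everything scores 0.0 and a perfect tool scores 1.0. With 25 vulnerable cases in a category, one missed case changes the true-positive rate by 1/25, which is where the 4% per-case swing comes from.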

It focuses heavily on two main areas:

  • Chain Detection: 500 test cases that measure if a tool can correlate multiple low-severity findings across different files into a compound exploit path.
  • Adversarial Evasion: Tests to see if a tool can detect intentional concealment, like payloads hidden inside invisible Unicode characters or visual deception using Bidi overrides.
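As a rough illustration of the concealment techniques the Adversarial Evasion cases target, the Python sketch below scans source text for Unicode bidirectional control characters and other invisible format characters. It is an assumed, simplified detector, not part of the benchmark; the function name `find_hidden_characters` and the sample snippet are hypothetical. It will also fire on legitimate uses, such as right-to-left text or emoji joiners inside string literals.

```python
import unicodedata

# Explicit bidirectional control characters (embeddings, overrides, isolates, marks).
BIDI_CONTROLS = {
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,  # LRE, RLE, PDF, LRO, RLO
    0x2066, 0x2067, 0x2068, 0x2069,          # LRI, RLI, FSI, PDI
    0x200E, 0x200F, 0x061C,                  # LRM, RLM, ALM
}


def find_hidden_characters(source):
    """Return (line, column, codepoint, kind) for each suspicious character.

    Bidi controls are reported as "bidi-control". Any other character in
    Unicode general category Cf (format), such as zero-width spaces and
    joiners, is reported as "invisible-format".
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            code_point = ord(ch)
            if code_point in BIDI_CONTROLS:
                kind = "bidi-control"
            elif unicodedata.category(ch) == "Cf":
                kind = "invisible-format"
            else:
                continue
            findings.append((line_no, col, f"U+{code_point:04X}", kind))
    return findings


# Hypothetical snippet: a right-to-left override inside one string literal,
# and a zero-width space inside another.
sample = (
    'access = "user"\n'
    'if access != "admin\u202e":\n'
    'name = "a\u200bb"\n'
)

hidden = find_hidden_characters(sample)
for finding in hidden:
    print(finding)
```

A character-level scan like this only shows that something is hidden. Deciding whether the hidden content changes program behavior is the harder part that these test cases are meant to probe.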

Since there was no public ground truth for Go, Rust, Bash, PHP, and Ruby, I also built baseline vulnerability benchmarks for those languages as part of the suite, bringing the total to over 7,700 test cases.

Building ground truth at this scale as a solo developer is a massive undertaking, and right now I have a serious echo chamber problem. I am the student taking the exam, the master designing it, and the professor grading my own homework. It sucks, and I know I have blind spots in my test designs.

I am releasing this openly because imperfect ground truth that invites correction is more valuable than no ground truth at all. If you work in AppSec, build SAST engines, or just enjoy breaking logic, I would love your scrutiny. Finding my misclassifications and edge cases will make this infinitely more valuable for everyone.

Repo link: https://github.com/TheAuditorTool/sast-benchmark // ThailandJohn. TheAuditorTool Maintainer.
