VeilPhantom – Open-source on-device PII detection for AI pipelines

Author: nakaiwilliams

by nakaiwilliams · Mar 10, 2026 · 1 point · 0 comments


AI Analysis

●● Solid · Big Brain · Solve My Problem

Phonetic embeddings catch ASR-mangled names across cultures before the LLM sees them.

Strengths

• 22M-parameter PhoneticDeBERTa model trained on 72M words of meeting data.
• 6ms average overhead with a 97.1% F1 score on meeting transcripts.
• Token-direct pipeline preserves text structure while hiding real values, then rehydrates them locally.

Weaknesses

• Python-only SDK limits adoption in non-Python AI stacks.
• The PII redaction space already has established tools such as Microsoft Presidio.

Post Description

I built VeilPhantom, a Python SDK that detects and tokenizes PII before text reaches any LLM.

The problem: AI agents that process meetings, emails, and support tickets handle raw sensitive data. Names, salaries, and medical details all flow through cloud APIs.

The solution: detect PII on-device, replace it with tokens ([PERSON_1], [AMOUNT_1]), send only the tokenized text to the LLM, and rehydrate the response locally.
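To make the tokenize-then-rehydrate round trip concrete, here is a minimal sketch in plain Python. It does not use the veil-phantom API: the function names (`tokenize_pii`, `rehydrate`), the amount regex, and the `known_people` list are all hypothetical stand-ins for whatever detection the SDK actually performs.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumption: a toy pattern for dollar amounts; real detection is far broader.
AMOUNT_RE = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def tokenize_pii(text: str, known_people: Iterable[str] = ()) -> Tuple[str, Dict[str, str]]:
    """Replace detected values with numbered tokens; return text and token map."""
    mapping: Dict[str, str] = {}   # token -> original value (stays on-device)
    seen: Dict[tuple, str] = {}    # (kind, value) -> token, so repeats share a token
    counters: Dict[str, int] = {}

    def token_for(kind: str, value: str) -> str:
        if (kind, value) not in seen:
            counters[kind] = counters.get(kind, 0) + 1
            token = f"[{kind}_{counters[kind]}]"
            seen[(kind, value)] = token
            mapping[token] = value
        return seen[(kind, value)]

    # Hypothetical name detection: exact matches against a supplied list.
    for name in known_people:
        text = re.sub(re.escape(name), lambda m: token_for("PERSON", m.group(0)), text)
    text = AMOUNT_RE.sub(lambda m: token_for("AMOUNT", m.group(0)), text)
    return text, mapping


def rehydrate(text: str, mapping: Dict[str, str]) -> str:
    """Swap tokens in the model's reply back to the original values."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


safe_text, token_map = tokenize_pii(
    "Sarah Chen's salary is $120,000. Sarah Chen approved it.",
    known_people=["Sarah Chen"],
)
print(safe_text)  # [PERSON_1]'s salary is [AMOUNT_1]. [PERSON_1] approved it.

# Stand-in for an LLM reply that only ever saw tokens.
reply = "[PERSON_1] earns [AMOUNT_1]."
print(rehydrate(reply, token_map))  # Sarah Chen earns $120,000.
```

The token map never leaves the local process in this sketch, which is the property the post describes: the remote model sees only placeholders.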

Interesting finding: in benchmarks (98 scenarios, 8 verticals, Claude Haiku), accuracy went up with PII redaction, from 91.5% to 93.3%. Token-structured input seems to help models parse arguments more reliably.

Technical details:
- Shade V7: 22M param PhoneticDeBERTa (DeBERTa-v3-xsmall + Double Metaphone embeddings)
- Trained on 72M words of meeting/business data
- 7 detection layers (NER, gazetteers, regex, NLP, contextual)
- 19 PII token types
- 6ms average overhead
- 97.1% F1 on meeting transcripts
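The post does not say how the detection layers are combined, so the following is only a sketch of one plausible approach: run each layer independently, then resolve overlaps by keeping the longest span. The two toy layers, the `Span` type, and the longest-match rule are all assumptions made for illustration, not VeilPhantom's implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    kind: str   # token type, e.g. PERSON or EMAIL
    layer: str  # which layer produced the span


def regex_layer(text: str) -> List[Span]:
    """Toy regex layer: email addresses only."""
    pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
    return [Span(m.start(), m.end(), "EMAIL", "regex") for m in re.finditer(pattern, text)]


def gazetteer_layer(text: str, names=("Sarah Chen", "Sarah")) -> List[Span]:
    """Toy gazetteer layer: exact, case-sensitive matches from a name list."""
    spans = []
    for name in names:
        for m in re.finditer(re.escape(name), text):
            spans.append(Span(m.start(), m.end(), "PERSON", "gazetteer"))
    return spans


def merge_layers(text: str, layers: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Collect spans from every layer; on overlap, keep the longest span."""
    candidates = [span for layer in layers for span in layer(text)]
    candidates.sort(key=lambda s: (-(s.end - s.start), s.start))
    chosen: List[Span] = []
    for span in candidates:
        if all(span.end <= c.start or span.start >= c.end for c in chosen):
            chosen.append(span)
    return sorted(chosen, key=lambda s: s.start)


text = "Sarah Chen (sarah.chen@example.com) joined."
spans = merge_layers(text, [regex_layer, gazetteer_layer])
found = [(s.kind, text[s.start:s.end]) for s in spans]
print(found)
```

Here the shorter gazetteer hit "Sarah" is discarded because it overlaps the longer "Sarah Chen" span. A production system might instead rank layers by confidence.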

The phonetic embeddings help catch ASR-mangled names across cultures: "Nkosinathi" transcribed as "Ink Casino Thea" still gets detected.
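For intuition about why phonetic features help here, the sketch below uses a deliberately crude consonant-skeleton key. It is not Double Metaphone and not the learned embeddings the post describes; `consonant_key` and `find_name_spans` are hypothetical helpers that only show how a sound-based key can line up a name with its ASR-mangled form.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: treat a few similar-sounding consonants as equivalent.
_EQUIVALENTS = str.maketrans({"c": "k", "q": "k", "z": "s"})


def consonant_key(text: str) -> str:
    """Crude phonetic key: letters only, vowels and h/w/y dropped, repeats collapsed."""
    letters = re.sub(r"[^a-z]", "", text.lower()).translate(_EQUIVALENTS)
    consonants = re.sub(r"[aeiouyhw]", "", letters)
    return re.sub(r"(.)\1+", r"\1", consonants)


def find_name_spans(text: str, known_names: Iterable[str], max_words: int = 3) -> List[Tuple[str, str]]:
    """Slide a window of 1..max_words words and match keys against known names."""
    keys = {consonant_key(name): name for name in known_names}
    keys.pop("", None)  # ignore names that reduce to an empty key
    words = text.split()
    hits = []
    for start in range(len(words)):
        for size in range(1, max_words + 1):
            if start + size > len(words):
                break
            chunk = " ".join(words[start:start + size])
            name = keys.get(consonant_key(chunk))
            if name is not None:
                hits.append((chunk, name))
    return hits


print(consonant_key("Nkosinathi"))       # nksnt
print(consonant_key("Ink Casino Thea"))  # nksnt
hits = find_name_spans("Please send it to Ink Casino Thea today", ["Nkosinathi"])
print(hits)  # [('Ink Casino Thea', 'Nkosinathi')]
```

A key this coarse would produce many false positives on real transcripts, which is presumably one reason to feed phonetic information into a trained model rather than match on it directly.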

pip install veil-phantom

Docs + benchmarks: https://helloveil.com/sdk
GitHub: https://github.com/helloveil/veil-phantom

Licensed under Apache 2.0. Happy to answer questions about the architecture or training approach.