GitHub Repository

Open research infrastructure for reproducible data preparation

2 stars · Python

LexPrep – reproducible lexical preprocessing for research experiments

by Sajjad_m · Feb 28, 2026 · 2 points · 0 comments

AI Analysis

●● Solid · Solve My Problem · Niche Gem

Reproducible wordlist preprocessing with automatic manifest output; it is not a text-analysis tool.

Strengths
  • Explicitly designed for wordlists, not running text — fills a real gap between general NLP tools and manual preprocessing.
  • Reproducibility-first: every command outputs a ZIP with data + run_manifest.json for research audit trails.
  • Multilingual support (Persian, English, Japanese) with language-specific backends (PersianG2p, g2p-en, Stanza, UniDic).
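To make the "ZIP with data + run_manifest.json" idea concrete, here is a minimal sketch of that pattern using only the Python standard library. It is not LexPrep's actual code or manifest schema: the function names (`preprocess`, `build_bundle`), the file name `words.txt`, the manifest fields, and the normalization step (trim, lowercase, de-duplicate) are all assumptions chosen for illustration.

```python
import hashlib
import io
import json
import zipfile
from datetime import datetime, timezone


def preprocess(words):
    # Hypothetical normalization: trim, lowercase, drop blanks and
    # duplicates while keeping first-seen order.
    seen = set()
    out = []
    for w in words:
        w = w.strip().lower()
        if w and w not in seen:
            seen.add(w)
            out.append(w)
    return out


def build_bundle(words, params):
    """Return ZIP bytes holding the processed wordlist and a run manifest."""
    processed = preprocess(words)
    data = "\n".join(processed).encode("utf-8")
    raw = "\n".join(words).encode("utf-8")
    # Assumed manifest layout: enough to audit what went in and what came out.
    manifest = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "input_sha256": hashlib.sha256(raw).hexdigest(),
        "output_sha256": hashlib.sha256(data).hexdigest(),
        "n_input": len(words),
        "n_output": len(processed),
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("words.txt", data)
        zf.writestr(
            "run_manifest.json",
            json.dumps(manifest, indent=2, sort_keys=True),
        )
    return buf.getvalue()
```

Hashing both the input and the output lets a later reader check that a given stimulus list really came from the recorded run and parameters, which is the kind of audit trail the review describes.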
Weaknesses
  • Crowded NLP tooling space: researchers already use spaCy, Stanza, or language-specific wrappers, and it is unclear why this is better than composing those.
  • Only 2 GitHub stars and minimal documentation on why wordlist-specific preprocessing was hard before this.
Target Audience

Linguistic researchers, psycholinguists, cognitive scientists running controlled experiments with word stimuli.

Similar To

spaCy · Stanza · g2p-en

Post Description

I built this to solve reproducibility issues in research pipelines. Would love feedback on architecture.

If you find it interesting, a GitHub star would help visibility.
