Generate SFT and DPO datasets for tool-calling LoRA fine-tuning
SHA-256 deterministic RNG beats Python's built-in hash() for reproducible dataset generation.
Generate conversational, tool-calling, structured-output, and preference datasets — easily and at scale
Composable YAML-to-dataset pipeline for LLM fine-tuning when Distilabel already exists.
ML engineers building fine-tuning datasets
Distilabel · Argilla · CleanLab
Finally replaces the 25-year-old Enron corpus with deterministic org simulation.
Synthetic rare-defect dataset solves real validation gap, but relies on closed Silera tool.
Regex-only PII detection with zero dependencies when Presidio already exists.
Claude Code skill integration is nice, but Faker already generates dirty data.
Yet another synthetic data tool when Faker and Mockaroo already exist.
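
The SHA-256 versus Python hash point above can be illustrated with a minimal sketch. Python's built-in hash() on str and bytes is salted per process unless PYTHONHASHSEED is set, so seeds derived from it change between runs, while a SHA-256 digest of the same key does not. The helper names `seeded_rng` and `sample_tool`, the key layout, and the tool list are hypothetical and not taken from any of the listed projects.

```python
import hashlib
import random


def seeded_rng(*parts: str) -> random.Random:
    """Return a Random instance seeded from the SHA-256 digest of the key parts.

    Hypothetical helper: the same parts give the same seed in every process,
    which is not true of the built-in hash() on strings.
    """
    # Join with a separator so ("a", "b") and ("ab",) produce different keys.
    key = "\x1f".join(parts).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as a 64-bit integer seed.
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed)


def sample_tool(record_id: str, tools: list[str]) -> str:
    """Pick a tool for one record, independent of generation order."""
    rng = seeded_rng("tool-choice", record_id)
    return rng.choice(tools)


# Assumed example data, for illustration only.
TOOLS = ["search", "calculator", "calendar"]
first = [sample_tool(f"rec-{i}", TOOLS) for i in range(5)]
second = [sample_tool(f"rec-{i}", TOOLS) for i in range(5)]
```

Because each record gets its own generator keyed on its identifier, regenerating a single record or reordering the pipeline should leave every other record's sampled values unchanged.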