Generate scaled synthetic datasets for RAG evaluation
So I built this: a tool that generates complete RAG evaluation datasets from a single text prompt. Fresh synthetic data at any scale you need.
This lets you test what actually matters: retrieval behavior on fresh data your system has never seen.
$ dataset-factory generate \
    --prompt "A gold rush town in the Yukon during the 1890s" \
    --documents 1000 \
    --queries 100 \
    --output output/goldrush

# Output:
# ✓ 1,000 unique documents (268-12,631 tokens)
# ✓ 100 queries with ground truth
# ✓ Rich metadata: settlements, roles, dates, activities
# ✓ Cost tracking: $0.46 with Groq
Generation runs in four phases. First, the LLM analyzes your prompt and creates a schema: document types, metadata fields, value distributions.
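To make that concrete, here is a rough sketch of what such a schema could look like for the gold rush prompt. The shape and field names are my own illustration, not the tool's actual output format.

```python
# Hypothetical schema for the Yukon gold rush prompt -- illustrative only,
# not dataset-factory's real output format.
schema = {
    "document_types": ["prospector journal", "claim registry entry",
                       "saloon ledger", "newspaper report", "assay office audit"],
    "metadata_fields": {
        "settlement": {"type": "categorical",
                       "values": ["Dawson City", "Fortymile", "Bonanza Creek"],
                       "distribution": "zipfian"},
        "role":       {"type": "categorical",
                       "values": ["prospector", "merchant", "mountie", "assayer"],
                       "distribution": "uniform"},
        "date":       {"type": "temporal", "range": ["1896-08-01", "1899-12-31"]},
        "gold_oz":    {"type": "numerical", "min": 0, "max": 500},
    },
}
```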
Next it generates roughly 2,000 words of domain context: history, entities, terminology, relationships. That context is shared by every document.
Each doc gets random metadata drawn from the config, and the LLM generates unique content for it. No templates. Five prompt variations keep the outputs diverse.
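A minimal sketch of that per-document loop, assuming a generic `llm_complete(system, user, temperature)` helper and the hypothetical schema above; none of these names come from dataset-factory itself.

```python
import random

# Hypothetical prompt variations -- the real tool uses five of its own.
PROMPT_VARIATIONS = [
    "Write a {doc_type} set in {settlement}, dated {date}, from the point of view of a {role}.",
    "Draft a {doc_type} recording events in {settlement} around {date}; the author is a {role}.",
    "Produce a {doc_type} from {settlement}, {date}, written by a {role}.",
]

def generate_document(schema, domain_context, llm_complete):
    fields = schema["metadata_fields"]
    # Sample metadata from the schema (a real sampler would honor the configured distributions).
    metadata = {
        "doc_type": random.choice(schema["document_types"]),
        "settlement": random.choice(fields["settlement"]["values"]),
        "role": random.choice(fields["role"]["values"]),
        "date": "1898-06-14",  # placeholder; draw from the temporal range in practice
    }
    # A different prompt phrasing each time, so documents don't all sound alike.
    prompt = random.choice(PROMPT_VARIATIONS).format(**metadata)
    text = llm_complete(system=domain_context, user=prompt, temperature=1.0)
    return {"metadata": metadata, "text": text}
```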
Finally, it analyzes dataset statistics, picks filter selectivities, and generates queries from actual document content, so every query comes with known ground truth.
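Roughly what that could look like, under the same assumptions (helper names and record shapes are illustrative, not the tool's internals):

```python
import random

def generate_query(documents, llm_complete, target_selectivity=0.01):
    # Pick a categorical filter whose match rate is closest to the target selectivity.
    values = {d["metadata"]["settlement"] for d in documents}
    def rate(v):
        return sum(d["metadata"]["settlement"] == v for d in documents) / len(documents)
    value = min(values, key=lambda v: abs(rate(v) - target_selectivity))
    matching = [d for d in documents if d["metadata"]["settlement"] == value]
    # Ask for a question answerable only from one specific matching document.
    source = random.choice(matching)
    question = llm_complete(
        system="",
        user=f"Write one question answerable only from this text:\n{source['text'][:2000]}",
        temperature=0.3,
    )
    return {
        "query": question,
        "filter": {"settlement": value},
        "ground_truth": source["metadata"],  # we know exactly which document answers it
    }
```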
400-40,000 token documents. Short reports to comprehensive audits. Realistic length distributions.
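One plausible way to get that kind of long-tailed spread of lengths; this is my illustration, and the tool may sample lengths differently.

```python
import random

def sample_target_length(min_tokens=400, max_tokens=40_000):
    # A log-normal draw gives many short documents and a thin tail of very long
    # ones, which looks more like a real corpus than a uniform spread would.
    length = int(random.lognormvariate(mu=7.5, sigma=1.0))  # median around 1,800 tokens
    return max(min_tokens, min(max_tokens, length))
```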
Temporal, categorical, numerical, hierarchical fields. Zipfian and uniform distributions. Precise filter control.
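Zipfian just means a few values dominate while the rest form a long tail. A tiny self-contained sketch of rank-weighted sampling, not the tool's actual sampler:

```python
import random

def zipfian_choice(values, s=1.1):
    # Weight values by 1 / rank**s: the first value dominates, the tail thins out fast.
    weights = [1 / rank**s for rank in range(1, len(values) + 1)]
    return random.choices(values, weights=weights, k=1)[0]

settlements = ["Dawson City", "Fortymile", "Bonanza Creek", "Grand Forks"]
print(zipfian_choice(settlements))  # "Dawson City" far more often than the others
```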
Control selectivity from 0.1% (ultra-specific) to 10%+ (broad). Test pre-filter vs post-filter performance.
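Selectivity is just the fraction of the corpus a metadata filter matches. A quick way to check it on a generated dataset; the `documents.jsonl` path and record shape are assumptions on my part:

```python
import json

def filter_selectivity(jsonl_path, field, value):
    # Fraction of documents whose metadata matches the filter.
    total = matches = 0
    with open(jsonl_path) as f:
        for line in f:
            doc = json.loads(line)
            total += 1
            matches += doc["metadata"].get(field) == value
    return matches / total

# ~0.001 is an ultra-specific filter, ~0.1 and up is broad
print(filter_selectivity("output/goldrush/documents.jsonl", "settlement", "Fortymile"))
```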
Real-time cost monitoring per phase. Detailed breakdowns. Works across resume sessions.
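A toy version of that bookkeeping. The prices are placeholders rather than any provider's real rates, and the file layout is my invention:

```python
import json, os

PRICE_PER_M_TOKENS = {"input": 0.05, "output": 0.08}  # placeholder rates, not Groq's

class CostTracker:
    def __init__(self, path="costs.json"):
        self.path = path
        # Reload previous totals so a resumed run keeps counting where it left off.
        self.totals = json.load(open(path)) if os.path.exists(path) else {}

    def add(self, phase, input_tokens, output_tokens):
        cost = (input_tokens * PRICE_PER_M_TOKENS["input"]
                + output_tokens * PRICE_PER_M_TOKENS["output"]) / 1_000_000
        self.totals[phase] = round(self.totals.get(phase, 0.0) + cost, 6)
        with open(self.path, "w") as f:
            json.dump(self.totals, f)
        return cost
```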
Pause and resume at any time. Streams to JSONL. Memory efficient at any scale.
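The core trick is appending one record per line and counting existing lines on restart. A minimal sketch, with file names and helpers assumed rather than taken from the tool:

```python
import json, os

def stream_documents(generate_one, n_docs, out_path="documents.jsonl"):
    # Count lines already written so an interrupted run resumes where it stopped.
    done = 0
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = sum(1 for _ in f)
    with open(out_path, "a") as f:
        for _ in range(done, n_docs):
            f.write(json.dumps(generate_one()) + "\n")  # one record out, nothing held in memory
```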
Groq, Gemini, OpenAI, Anthropic. Smart rate limiting. Auto-concurrency adjustment.
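The usual shape of that, sketched with asyncio; the exception class is a stand-in for whatever your provider's SDK raises, and none of this is dataset-factory's actual implementation:

```python
import asyncio, random

class RateLimitError(Exception):
    """Stand-in for the 429 exception your provider's SDK actually raises."""

async def call_with_backoff(request, semaphore, max_retries=5):
    # Cap in-flight requests with a semaphore; back off exponentially on rate limits.
    async with semaphore:
        for attempt in range(max_retries):
            try:
                return await request()
            except RateLimitError:
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError("rate limited too many times; lower the concurrency")

# Shrink the semaphore (e.g. from 8 to 4) if you keep hitting provider limits.
semaphore = asyncio.Semaphore(8)
```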
Using an LLM to generate eval data for LLM systems is weird. Hallucination is the goal here, which feels like an anti-pattern.
Biggest risk: Similar documents. Even with high temperature and prompt variations, you might get semantically identical documents. A hundred prospector journals that all sound the same.
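One way to sanity-check a generated corpus for this (my suggestion, not a built-in feature) is to look at pairwise embedding similarity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def near_duplicate_rate(texts, threshold=0.95):
    # Share of documents that have at least one suspiciously similar neighbor.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, 0.0)
    return float((sims > threshold).any(axis=1).mean())

# A rate above a few percent suggests the prompt variations aren't buying much diversity.
```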
Internal consistency: No guarantee the LLM maintains coherent facts across thousands of documents. It might contradict itself.
But: As long as you use the same dataset to compare multiple systems, the comparison is still fair. Weird artifacts affect all systems equally. You're measuring relative performance, not absolute quality on some perfect benchmark.
# Historical
$ --prompt "Yukon gold rush town during the 1890s"

# Corporate
$ --prompt "Dystopian tech megacorp with surveillance and AI incidents"

# Scientific
$ --prompt "Biomedical research papers and clinical trials"

# Legal
$ --prompt "Legal contracts and case law from various jurisdictions"

# E-commerce
$ --prompt "Product listings with reviews and specifications"
# Install
$ uv pip install dataset-factory

# Set API key (supports Groq, Gemini, OpenAI, Anthropic)
$ echo "GROQ_API_KEY=your_key" > .env
$ # or GEMINI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY

# Generate with your chosen provider
$ dataset-factory generate \
    --prompt "your domain description" \
    --documents 1000 \
    --queries 100 \
    --output output/my_dataset