Generate scaled synthetic datasets for RAG evaluation
So I built this: a tool that generates complete RAG evaluation datasets from a single text prompt. Fresh synthetic data at any scale you need.
This lets you test what actually matters: retrieval behavior on fresh data your system has never seen.
$ dataset-factory generate \
    --prompt "A gold rush town in the Yukon during the 1890s" \
    --documents 1000 \
    --queries 100 \
    --output output/goldrush

# Output:
# ✓ 1,000 unique documents (268-12,631 tokens)
# ✓ 100 queries with ground truth
# ✓ Rich metadata: settlements, roles, dates, activities
# ✓ Cost tracking: $0.46 with Groq
Generation runs in four phases. First, the LLM analyzes your prompt and creates a schema: document types, metadata fields, value distributions.
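To make that concrete, here is a rough sketch of what such a schema could look like for the gold rush prompt. The shape and field names are my own illustration, not the tool's actual output format.

```python
# Hypothetical schema for the Yukon gold rush prompt -- illustrative only,
# not dataset-factory's real output format.
schema = {
    "document_types": ["prospector journal", "claim registry entry",
                       "saloon ledger", "newspaper report", "assay office audit"],
    "metadata_fields": {
        "settlement": {"type": "categorical",
                       "values": ["Dawson City", "Fortymile", "Bonanza Creek"],
                       "distribution": "zipfian"},
        "role":       {"type": "categorical",
                       "values": ["prospector", "merchant", "mountie", "assayer"],
                       "distribution": "uniform"},
        "date":       {"type": "temporal", "range": ["1896-08-01", "1899-12-31"]},
        "gold_oz":    {"type": "numerical", "min": 0, "max": 500},
    },
}
```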
Next it generates roughly 2,000 words of domain context: history, entities, terminology, relationships. That context is shared by every document.
Each doc gets random metadata drawn from the config, and the LLM generates unique content for it. No templates. Five prompt variations keep the outputs diverse.
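A minimal sketch of that per-document loop, assuming a generic `llm_complete(system, user, temperature)` helper and the hypothetical schema above; none of these names come from dataset-factory itself.

```python
import random

# Hypothetical prompt variations -- the real tool uses five of its own.
PROMPT_VARIATIONS = [
    "Write a {doc_type} set in {settlement}, dated {date}, from the point of view of a {role}.",
    "Draft a {doc_type} recording events in {settlement} around {date}; the author is a {role}.",
    "Produce a {doc_type} from {settlement}, {date}, written by a {role}.",
]

def generate_document(schema, domain_context, llm_complete):
    fields = schema["metadata_fields"]
    # Sample metadata from the schema (a real sampler would honor the configured distributions).
    metadata = {
        "doc_type": random.choice(schema["document_types"]),
        "settlement": random.choice(fields["settlement"]["values"]),
        "role": random.choice(fields["role"]["values"]),
        "date": "1898-06-14",  # placeholder; draw from the temporal range in practice
    }
    # A different prompt phrasing each time, so documents don't all sound alike.
    prompt = random.choice(PROMPT_VARIATIONS).format(**metadata)
    text = llm_complete(system=domain_context, user=prompt, temperature=1.0)
    return {"metadata": metadata, "text": text}
```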
Finally, it analyzes dataset statistics, picks filter selectivities, and generates queries from actual document content, so every query comes with known ground truth.
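Roughly what that could look like, under the same assumptions (helper names and record shapes are illustrative, not the tool's internals):

```python
import random

def generate_query(documents, llm_complete, target_selectivity=0.01):
    # Pick a categorical filter whose match rate is closest to the target selectivity.
    values = {d["metadata"]["settlement"] for d in documents}
    def rate(v):
        return sum(d["metadata"]["settlement"] == v for d in documents) / len(documents)
    value = min(values, key=lambda v: abs(rate(v) - target_selectivity))
    matching = [d for d in documents if d["metadata"]["settlement"] == value]
    # Ask for a question answerable only from one specific matching document.
    source = random.choice(matching)
    question = llm_complete(
        system="",
        user=f"Write one question answerable only from this text:\n{source['text'][:2000]}",
        temperature=0.3,
    )
    return {
        "query": question,
        "filter": {"settlement": value},
        "ground_truth": source["metadata"],  # we know exactly which document answers it
    }
```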
400-40,000 token documents. Short reports to comprehensive audits. Realistic length distributions.
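One plausible way to get that kind of long-tailed spread of lengths; this is my illustration, and the tool may sample lengths differently.

```python
import random

def sample_target_length(min_tokens=400, max_tokens=40_000):
    # A log-normal draw gives many short documents and a thin tail of very long
    # ones, which looks more like a real corpus than a uniform spread would.
    length = int(random.lognormvariate(mu=7.5, sigma=1.0))  # median around 1,800 tokens
    return max(min_tokens, min(max_tokens, length))
```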
Temporal, categorical, numerical, hierarchical fields. Zipfian and uniform distributions. Precise filter control.
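Zipfian just means a few values dominate while the rest form a long tail. A tiny self-contained sketch of rank-weighted sampling, not the tool's actual sampler:

```python
import random

def zipfian_choice(values, s=1.1):
    # Weight values by 1 / rank**s: the first value dominates, the tail thins out fast.
    weights = [1 / rank**s for rank in range(1, len(values) + 1)]
    return random.choices(values, weights=weights, k=1)[0]

settlements = ["Dawson City", "Fortymile", "Bonanza Creek", "Grand Forks"]
print(zipfian_choice(settlements))  # "Dawson City" far more often than the others
```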
Control selectivity from 0.1% (ultra-specific) to 10%+ (broad). Test pre-filter vs post-filter performance.
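Selectivity is just the fraction of the corpus a metadata filter matches. A quick way to check it on a generated dataset; the `documents.jsonl` path and record shape are assumptions on my part:

```python
import json

def filter_selectivity(jsonl_path, field, value):
    # Fraction of documents whose metadata matches the filter.
    total = matches = 0
    with open(jsonl_path) as f:
        for line in f:
            doc = json.loads(line)
            total += 1
            matches += doc["metadata"].get(field) == value
    return matches / total

# ~0.001 is an ultra-specific filter, ~0.1 and up is broad
print(filter_selectivity("output/goldrush/documents.jsonl", "settlement", "Fortymile"))
```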
Real-time cost monitoring per phase. Detailed breakdowns. Works across resume sessions.
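A toy version of that bookkeeping. The prices are placeholders rather than any provider's real rates, and the file layout is my invention:

```python
import json, os

PRICE_PER_M_TOKENS = {"input": 0.05, "output": 0.08}  # placeholder rates, not Groq's

class CostTracker:
    def __init__(self, path="costs.json"):
        self.path = path
        # Reload previous totals so a resumed run keeps counting where it left off.
        self.totals = json.load(open(path)) if os.path.exists(path) else {}

    def add(self, phase, input_tokens, output_tokens):
        cost = (input_tokens * PRICE_PER_M_TOKENS["input"]
                + output_tokens * PRICE_PER_M_TOKENS["output"]) / 1_000_000
        self.totals[phase] = round(self.totals.get(phase, 0.0) + cost, 6)
        with open(self.path, "w") as f:
            json.dump(self.totals, f)
        return cost
```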
Pause and resume at any time. Streams to JSONL. Memory efficient at any scale.
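The core trick is appending one record per line and counting existing lines on restart. A minimal sketch, with file names and helpers assumed rather than taken from the tool:

```python
import json, os

def stream_documents(generate_one, n_docs, out_path="documents.jsonl"):
    # Count lines already written so an interrupted run resumes where it stopped.
    done = 0
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = sum(1 for _ in f)
    with open(out_path, "a") as f:
        for _ in range(done, n_docs):
            f.write(json.dumps(generate_one()) + "\n")  # one record out, nothing held in memory
```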
Groq, Gemini, OpenAI, Anthropic. Smart rate limiting. Auto-concurrency adjustment.
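The usual shape of that, sketched with asyncio; the exception class is a stand-in for whatever your provider's SDK raises, and none of this is dataset-factory's actual implementation:

```python
import asyncio, random

class RateLimitError(Exception):
    """Stand-in for the 429 exception your provider's SDK actually raises."""

async def call_with_backoff(request, semaphore, max_retries=5):
    # Cap in-flight requests with a semaphore; back off exponentially on rate limits.
    async with semaphore:
        for attempt in range(max_retries):
            try:
                return await request()
            except RateLimitError:
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError("rate limited too many times; lower the concurrency")

# Shrink the semaphore (e.g. from 8 to 4) if you keep hitting provider limits.
semaphore = asyncio.Semaphore(8)
```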
Using an LLM to generate eval data for LLM systems is weird. Hallucination is the goal here, which feels like an anti-pattern.
Biggest risk: Similar documents. Even with high temperature and prompt variations, you might get semantically identical documents. A hundred prospector journals that all sound the same.
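One way to sanity-check a generated corpus for this (my suggestion, not a built-in feature) is to look at pairwise embedding similarity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def near_duplicate_rate(texts, threshold=0.95):
    # Share of documents that have at least one suspiciously similar neighbor.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, 0.0)
    return float((sims > threshold).any(axis=1).mean())

# A rate above a few percent suggests the prompt variations aren't buying much diversity.
```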
Internal consistency: No guarantee the LLM maintains coherent facts across thousands of documents. It might contradict itself.
But: As long as you use the same dataset to compare multiple systems, the comparison is still fair. Weird artifacts affect all systems equally. You're measuring relative performance, not absolute quality on some perfect benchmark.
# Historical
$ --prompt "Yukon gold rush town during the 1890s"

# Corporate
$ --prompt "Dystopian tech megacorp with surveillance and AI incidents"

# Scientific
$ --prompt "Biomedical research papers and clinical trials"

# Legal
$ --prompt "Legal contracts and case law from various jurisdictions"

# E-commerce
$ --prompt "Product listings with reviews and specifications"
# Install
$ uv pip install dataset-factory

# Set API key (supports Groq, Gemini, OpenAI, Anthropic)
$ echo "GROQ_API_KEY=your_key" > .env
$ # or GEMINI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY

# Generate with your chosen provider
$ dataset-factory generate \
    --prompt "your domain description" \
    --documents 1000 \
    --queries 100 \
    --output output/my_dataset