Technical · Research · NLP

The Technical Challenges of Training Dutch LLMs

openoranje Team · November 15, 2024

Training language models for Dutch presents unique challenges that differ from English-centric approaches. In this post, we'll explore these challenges and share our solutions.

The Data Challenge

Dutch is spoken by approximately 25 million people, making it a "mid-resource" language. While there's substantial Dutch text available, it's significantly less than English data.

Quality Over Quantity

We've found that data quality matters more than quantity for smaller models. Our curation pipeline includes:

  1. Deduplication: Removing near-duplicate content
  2. Quality filtering: Using perplexity-based filtering
  3. Toxicity removal: Ensuring safe model outputs
  4. Domain balancing: Maintaining diverse topic coverage
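The steps above can be sketched in a few lines. This is a minimal toy version with hypothetical thresholds: real pipelines typically use MinHash-style near-deduplication and perplexity under a trained reference model, whereas here deduplication is exact-match on normalized text and the quality proxy is a simple junk-character ratio.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical docs hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(docs):
    """Exact dedup on normalized text (stand-in for MinHash near-dedup)."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def quality_ok(doc, max_junk=0.5):
    """Toy quality proxy: fraction of non-alphabetic, non-space characters.
    A real pipeline would threshold perplexity under a reference LM."""
    if not doc:
        return False
    junk = sum(1 for c in doc if not (c.isalpha() or c.isspace()))
    return junk / len(doc) < max_junk

def curate(docs):
    return [d for d in dedupe(docs) if quality_ok(d)]

docs = ["Dit is een zin.", "dit  is een zin.", "@@@###!!!"]
print(curate(docs))  # -> ['Dit is een zin.']
```

The same filter-chain structure extends naturally to the toxicity and domain-balancing stages: each is another predicate or reweighting pass over the kept documents.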

Tokenization Matters

Standard tokenizers trained on English don't handle Dutch well. Consider compound words like "arbeidsongeschiktheidsverzekering" (disability insurance)—a single word that English tokenizers split into many tokens.

# Poor tokenization example
english_tokenizer.encode("arbeidsongeschiktheidsverzekering")
# Returns: ['arbe', 'ids', 'on', 'gesch', 'ikt', 'he', 'ids', 'ver', 'zeker', 'ing']

# Our Dutch-optimized tokenizer
dutch_tokenizer.encode("arbeidsongeschiktheidsverzekering")
# Returns: ['arbeids', 'ongeschiktheids', 'verzekering']

Better tokenization means:

  • Faster inference
  • Better context utilization
  • Improved model understanding
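One way to quantify these gains is tokenizer fertility, the average number of tokens per word. Here is a small sketch with two toy stand-ins (a fixed-width splitter versus a morpheme-aware lookup; neither is our actual tokenizer) applied to the compound word from above:

```python
def fertility(tokenize, words):
    """Average tokens per word: lower means more efficient encoding."""
    return sum(len(tokenize(w)) for w in words) / len(words)

def naive_tokenize(word):
    """Toy stand-in for an English tokenizer: blind 4-character chunks."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]

# Hypothetical morpheme table standing in for a Dutch-aware tokenizer.
MORPHEMES = {
    "arbeidsongeschiktheidsverzekering":
        ["arbeids", "ongeschiktheids", "verzekering"],
}

def dutch_tokenize(word):
    return MORPHEMES.get(word, [word])

word = ["arbeidsongeschiktheidsverzekering"]
print(fertility(naive_tokenize, word))  # -> 9.0
print(fertility(dutch_tokenize, word))  # -> 3.0
```

A 3x reduction in tokens per word translates directly into the benefits listed above: fewer tokens to generate at inference time, and three times as many words fitting into the same context window.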

Architecture Decisions

For edge inference, model size is critical. We've made several architectural choices:

Decision                | Rationale
------------------------|----------------------------------
1B parameters           | Runs on most consumer GPUs
Grouped-query attention | Reduces memory bandwidth
SwiGLU activation       | Better performance per parameter
RoPE embeddings         | Good length generalization
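To see how these choices fit a ~1B budget, here is a back-of-the-envelope parameter count for a hypothetical configuration (the numbers are illustrative, not our released architecture). Note how grouped-query attention shrinks the K/V projections relative to the 32 query heads:

```python
# Illustrative config (assumed for this sketch, not the actual model):
vocab, d_model, n_layers = 32_000, 2048, 22
n_heads, n_kv_heads, head_dim = 32, 4, 64
ffn_dim = 5632

embed = vocab * d_model                         # tied input/output embeddings
q_proj = d_model * n_heads * head_dim
kv_proj = 2 * d_model * n_kv_heads * head_dim   # GQA: 4 KV heads, not 32
o_proj = n_heads * head_dim * d_model
mlp = 3 * d_model * ffn_dim                     # SwiGLU: gate, up, down
per_layer = q_proj + kv_proj + o_proj + mlp

total = embed + n_layers * per_layer
print(f"{total / 1e9:.2f}B parameters")  # -> 1.03B
```

With full multi-head attention the K/V projections would be 8x larger, and at inference time the KV cache shrinks by the same factor, which is what reduces memory bandwidth on consumer GPUs.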

Evaluation

We evaluate our models on Dutch-specific benchmarks:

  • Dutch ARC: Reasoning in Dutch
  • HellaSwag-NL: Common sense reasoning
  • MMLU-NL: World knowledge in Dutch
  • SQuAD-NL: Reading comprehension
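Most of these benchmarks are multiple-choice, and a common scoring scheme is to pick the option to which the model assigns the highest likelihood. A minimal sketch of such a harness, with a toy scoring function standing in for a real model's log-likelihood:

```python
def evaluate_mc(items, score_fn):
    """Accuracy on multiple-choice items: pick the highest-scoring option."""
    correct = 0
    for question, options, answer_idx in items:
        scores = [score_fn(question, opt) for opt in options]
        if scores.index(max(scores)) == answer_idx:
            correct += 1
    return correct / len(items)

def toy_score(question, option):
    """Crude substring-overlap stand-in for a model log-likelihood."""
    return sum(1 for w in option.split() if w in question)

# One toy Dutch item: "The capital of the Netherlands is ..."
items = [
    ("De hoofdstad van Nederland is",
     ["Amsterdam ligt in Nederland", "Parijs"], 0),
]
print(evaluate_mc(items, toy_score))  # -> 1.0
```

The harness itself is benchmark-agnostic; only the item loader and the scoring function change between Dutch ARC, HellaSwag-NL, and MMLU-NL.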

Lessons Learned

  1. Don't just translate: Train on native Dutch content
  2. Tokenizer is crucial: Build a Dutch-specific tokenizer
  3. Quality > quantity: Curate your data carefully
  4. Benchmark appropriately: Use Dutch-specific evaluations

Open Research

We're committed to sharing our findings. Expect detailed technical reports with each model release, including:

  • Training logs and hyperparameters
  • Ablation studies
  • Failure cases and limitations

Questions? Reach out at research@openoranje.nl