Technical · Research · NLP

The Technical Challenges of Training Dutch LLMs

openoranje Team · November 15, 2024

Training language models for Dutch presents unique challenges that differ from English-centric approaches. In this post, we'll explore these challenges and share our solutions.

The Data Challenge

Dutch is spoken by approximately 25 million people, making it a "mid-resource" language. While there's substantial Dutch text available, it's significantly less than English data.

Quality Over Quantity

We've found that data quality matters more than quantity for smaller models. Our curation pipeline includes:

  1. Deduplication: Removing near-duplicate content
  2. Quality filtering: Using perplexity-based filtering
  3. Toxicity removal: Ensuring safe model outputs
  4. Domain balancing: Maintaining diverse topic coverage
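The steps above can be sketched in a few lines. This is a minimal toy version with hypothetical thresholds: real pipelines typically use MinHash-style near-deduplication and perplexity under a trained reference model, whereas here deduplication is exact-match on normalized text and the quality proxy is a simple junk-character ratio.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical docs hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedupe(docs):
    """Exact dedup on normalized text (stand-in for MinHash near-dedup)."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def quality_ok(doc, max_junk=0.5):
    """Toy quality proxy: fraction of non-alphabetic, non-space characters.
    A real pipeline would threshold perplexity under a reference LM."""
    if not doc:
        return False
    junk = sum(1 for c in doc if not (c.isalpha() or c.isspace()))
    return junk / len(doc) < max_junk

def curate(docs):
    return [d for d in dedupe(docs) if quality_ok(d)]

docs = ["Dit is een zin.", "dit  is een zin.", "@@@###!!!"]
print(curate(docs))  # -> ['Dit is een zin.']
```

The same filter-chain structure extends naturally to the toxicity and domain-balancing stages: each is another predicate or reweighting pass over the kept documents.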

Tokenization Matters

Standard tokenizers trained on English don't handle Dutch well. Consider compound words like "arbeidsongeschiktheidsverzekering" (disability insurance)—a single word that English tokenizers split into many tokens.

# Poor tokenization example
english_tokenizer.encode("arbeidsongeschiktheidsverzekering")
# Returns: ['arbe', 'ids', 'on', 'gesch', 'ikt', 'he', 'ids', 'ver', 'zeker', 'ing']

# Our Dutch-optimized tokenizer
dutch_tokenizer.encode("arbeidsongeschiktheidsverzekering")
# Returns: ['arbeids', 'ongeschiktheids', 'verzekering']

Better tokenization means:

  • Faster inference
  • Better context utilization
  • Improved model understanding
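One way to quantify these gains is tokenizer fertility, the average number of tokens per word. Here is a small sketch with two toy stand-ins (a fixed-width splitter versus a morpheme-aware lookup; neither is our actual tokenizer) applied to the compound word from above:

```python
def fertility(tokenize, words):
    """Average tokens per word: lower means more efficient encoding."""
    return sum(len(tokenize(w)) for w in words) / len(words)

def naive_tokenize(word):
    """Toy stand-in for an English tokenizer: blind 4-character chunks."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]

# Hypothetical morpheme table standing in for a Dutch-aware tokenizer.
MORPHEMES = {
    "arbeidsongeschiktheidsverzekering":
        ["arbeids", "ongeschiktheids", "verzekering"],
}

def dutch_tokenize(word):
    return MORPHEMES.get(word, [word])

word = ["arbeidsongeschiktheidsverzekering"]
print(fertility(naive_tokenize, word))  # -> 9.0
print(fertility(dutch_tokenize, word))  # -> 3.0
```

A 3x reduction in tokens per word translates directly into the benefits listed above: fewer tokens to generate at inference time, and three times as many words fitting into the same context window.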

Architecture Decisions

For edge inference, model size is critical. We've made several architectural choices:

Decision                | Rationale
------------------------|----------------------------------
1B parameters           | Runs on most consumer GPUs
Grouped-query attention | Reduces memory bandwidth
SwiGLU activation       | Better performance per parameter
RoPE embeddings         | Good length generalization
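To see how these choices fit a ~1B budget, here is a back-of-the-envelope parameter count for a hypothetical configuration (the numbers are illustrative, not our released architecture). Note how grouped-query attention shrinks the K/V projections relative to the 32 query heads:

```python
# Illustrative config (assumed for this sketch, not the actual model):
vocab, d_model, n_layers = 32_000, 2048, 22
n_heads, n_kv_heads, head_dim = 32, 4, 64
ffn_dim = 5632

embed = vocab * d_model                         # tied input/output embeddings
q_proj = d_model * n_heads * head_dim
kv_proj = 2 * d_model * n_kv_heads * head_dim   # GQA: 4 KV heads, not 32
o_proj = n_heads * head_dim * d_model
mlp = 3 * d_model * ffn_dim                     # SwiGLU: gate, up, down
per_layer = q_proj + kv_proj + o_proj + mlp

total = embed + n_layers * per_layer
print(f"{total / 1e9:.2f}B parameters")  # -> 1.03B
```

With full multi-head attention the K/V projections would be 8x larger, and at inference time the KV cache shrinks by the same factor, which is what reduces memory bandwidth on consumer GPUs.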

Evaluation

We evaluate our models on Dutch-specific benchmarks:

  • Dutch ARC: Reasoning in Dutch
  • HellaSwag-NL: Common sense reasoning
  • MMLU-NL: World knowledge in Dutch
  • SQuAD-NL: Reading comprehension
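Most of these benchmarks are multiple-choice, and a common scoring scheme is to pick the option to which the model assigns the highest likelihood. A minimal sketch of such a harness, with a toy scoring function standing in for a real model's log-likelihood:

```python
def evaluate_mc(items, score_fn):
    """Accuracy on multiple-choice items: pick the highest-scoring option."""
    correct = 0
    for question, options, answer_idx in items:
        scores = [score_fn(question, opt) for opt in options]
        if scores.index(max(scores)) == answer_idx:
            correct += 1
    return correct / len(items)

def toy_score(question, option):
    """Crude substring-overlap stand-in for a model log-likelihood."""
    return sum(1 for w in option.split() if w in question)

# One toy Dutch item: "The capital of the Netherlands is ..."
items = [
    ("De hoofdstad van Nederland is",
     ["Amsterdam ligt in Nederland", "Parijs"], 0),
]
print(evaluate_mc(items, toy_score))  # -> 1.0
```

The harness itself is benchmark-agnostic; only the item loader and the scoring function change between Dutch ARC, HellaSwag-NL, and MMLU-NL.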

Lessons Learned

  1. Don't just translate: Train on native Dutch content
  2. Tokenizer is crucial: Build a Dutch-specific tokenizer
  3. Quality > quantity: Curate your data carefully
  4. Benchmark appropriately: Use Dutch-specific evaluations

Open Research

We're committed to sharing our findings. Expect detailed technical reports with each model release, including:

  • Training logs and hyperparameters
  • Ablation studies
  • Failure cases and limitations

Questions? Reach out at research@openoranje.nl