# Running Dutch LLMs on Edge Devices

openoranje Team · November 1, 2024
One of our core missions is enabling Dutch AI on edge devices. This guide walks you through deploying openoranje models locally.
## Why Edge Inference?
Running AI locally provides significant benefits:
- Privacy: Data never leaves your device
- Latency: No network round-trip delays
- Availability: Works offline
- Cost: No API fees
## Hardware Requirements
Our models are designed for consumer hardware:
| Model | Min RAM | Recommended GPU | CPU-only |
|---|---|---|---|
| Oranje-1B | 4GB | 4GB VRAM | Yes |
| Oranje-3B | 8GB | 8GB VRAM | Slow |
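A quick way to see which row of the table applies to your machine is to ask PyTorch how much GPU memory is available. This is only a convenience check, assuming you already have PyTorch installed; it is not required for any of the steps below.

```python
import torch

# Report GPU memory so you can pick a model size from the table above
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; see the CPU-only column above")
```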
## Quick Start

### Using llama.cpp
The fastest way to get started is with llama.cpp:
```bash
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download our GGUF model
wget https://huggingface.co/openoranje/oranje-1b-gguf/resolve/main/oranje-1b-q4_k_m.gguf

# Run inference
./main -m oranje-1b-q4_k_m.gguf -p "Amsterdam is de hoofdstad van"
```
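If you prefer to stay in Python, the community llama-cpp-python bindings can load the same GGUF file. A minimal sketch, assuming the model file from the download step above and illustrative generation settings:

```python
from llama_cpp import Llama

# Load the quantized GGUF downloaded above (n_ctx sets the context window)
llm = Llama(model_path="oranje-1b-q4_k_m.gguf", n_ctx=2048)

# The call returns a completion dict; the generated text lives under choices[0]["text"]
result = llm("Amsterdam is de hoofdstad van", max_tokens=32)
print(result["choices"][0]["text"])
```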
### Using Transformers
For Python integration:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model (automatically uses GPU if available)
model = AutoModelForCausalLM.from_pretrained(
    "openoranje/oranje-1b",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openoranje/oranje-1b")

# Generate text
def generate(prompt: str, max_tokens: int = 100) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Try it out
print(generate("De Nederlandse cultuur is"))
```
## Quantization
For smaller devices, use quantized models:
- Q4_K_M: Best balance of size and quality
- Q5_K_M: Higher quality, slightly larger
- Q8_0: Near full-precision quality
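If you start from a full-precision GGUF export, llama.cpp ships a quantization tool that converts it to any of the levels above. A minimal sketch: the input filename is an assumption, and depending on your llama.cpp version the binary is named `quantize` or `llama-quantize`.

```bash
# Convert a float16 GGUF to Q4_K_M (binary name varies between llama.cpp versions)
./llama-quantize oranje-1b-f16.gguf oranje-1b-q4_k_m.gguf Q4_K_M
```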
## Mobile Deployment
We're working on mobile SDKs for:
- iOS (Core ML)
- Android (NNAPI)
- React Native
## Privacy Best Practices
When building privacy-preserving applications:
- Process locally: Never send raw text to servers
- Minimize storage: Don't log user inputs
- Be transparent: Tell users data stays local
- Secure the model: Prevent extraction attacks
### Example: Private Note Summarization
```python
def summarize_notes(notes: list[str]) -> str:
    """Summarize notes locally; data never leaves the device."""
    combined = "\n".join(notes)
    prompt = f"Vat de volgende notities samen:\n{combined}\n\nSamenvatting:"
    return generate(prompt, max_tokens=150)
```
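Calling it might look like this; the notes are made-up examples, written in Dutch since the model is tuned for Dutch:

```python
notes = [
    "Standup: release gepland voor vrijdag.",   # "Standup: release planned for Friday."
    "Klant vraagt om export naar PDF.",         # "Customer asks for PDF export."
    "Bug in de inlogpagina is opgelost.",       # "Bug on the login page is fixed."
]
print(summarize_notes(notes))
```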
## Performance Tips
- Use quantization for 2-4x speedup
- Batch requests when possible
- Cache KV states for chat applications
- Profile memory to avoid OOM errors
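As a sketch of the batching tip, you can tokenize several prompts at once and run a single `generate` call, reusing the `model` and `tokenizer` loaded earlier. Causal LMs need left padding for batched generation, and the prompts here are purely illustrative:

```python
# Batch several prompts into a single generate call
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS if no pad token is defined

prompts = ["Amsterdam is", "Rotterdam is", "Utrecht is"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=30, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```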
Need help? Join our Discord community.