# Running Dutch LLMs on Edge Devices

openoranje Team · November 1, 2024
One of our core missions is enabling Dutch AI on edge devices. This guide walks you through deploying openoranje models locally.
## Why Edge Inference?
Running AI locally provides significant benefits:
- Privacy: Data never leaves your device
- Latency: No network round-trip delays
- Availability: Works offline
- Cost: No API fees
## Hardware Requirements
Our models are designed for consumer hardware:
| Model | Min RAM | Recommended GPU | CPU-only |
|---|---|---|---|
| Oranje-1B | 4GB | 4GB VRAM | Yes |
| Oranje-3B | 8GB | 8GB VRAM | Slow |
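A quick way to see which row of the table applies to your machine is to ask PyTorch how much GPU memory is available. This is only a convenience check, assuming you already have PyTorch installed; it is not required for any of the steps below.

```python
import torch

# Report GPU memory so you can pick a model size from the table above
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; see the CPU-only column above")
```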
## Quick Start

### Using llama.cpp
The fastest way to get started is with llama.cpp:
```bash
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download our GGUF model
wget https://huggingface.co/openoranje/oranje-1b-gguf/resolve/main/oranje-1b-q4_k_m.gguf

# Run inference
./main -m oranje-1b-q4_k_m.gguf -p "Amsterdam is de hoofdstad van"
```
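If you prefer to stay in Python, the community llama-cpp-python bindings can load the same GGUF file. A minimal sketch, assuming the model file from the download step above and illustrative generation settings:

```python
from llama_cpp import Llama

# Load the quantized GGUF downloaded above (n_ctx sets the context window)
llm = Llama(model_path="oranje-1b-q4_k_m.gguf", n_ctx=2048)

# The call returns a completion dict; the generated text lives under choices[0]["text"]
result = llm("Amsterdam is de hoofdstad van", max_tokens=32)
print(result["choices"][0]["text"])
```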
### Using Transformers
For Python integration:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model (automatically uses GPU if available)
model = AutoModelForCausalLM.from_pretrained(
    "openoranje/oranje-1b",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openoranje/oranje-1b")

# Generate text
def generate(prompt: str, max_tokens: int = 100) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Try it out
print(generate("De Nederlandse cultuur is"))
```
## Quantization
For smaller devices, use quantized models:
- Q4_K_M: Best balance of size and quality
- Q5_K_M: Higher quality, slightly larger
- Q8_0: Near full-precision quality
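If you start from a full-precision GGUF export, llama.cpp ships a quantization tool that converts it to any of the levels above. A minimal sketch: the input filename is an assumption, and depending on your llama.cpp version the binary is named `quantize` or `llama-quantize`.

```bash
# Convert a float16 GGUF to Q4_K_M (binary name varies between llama.cpp versions)
./llama-quantize oranje-1b-f16.gguf oranje-1b-q4_k_m.gguf Q4_K_M
```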
## Mobile Deployment
We're working on mobile SDKs for:
- iOS (Core ML)
- Android (NNAPI)
- React Native
## Privacy Best Practices
When building privacy-preserving applications:
- Process locally: Never send raw text to servers
- Minimize storage: Don't log user inputs
- Be transparent: Tell users data stays local
- Secure the model: Prevent extraction attacks
### Example: Private Note Summarization
```python
def summarize_notes(notes: list[str]) -> str:
    """Summarize notes locally; data never leaves the device."""
    combined = "\n".join(notes)
    prompt = f"Vat de volgende notities samen:\n{combined}\n\nSamenvatting:"
    return generate(prompt, max_tokens=150)
```
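Calling it might look like this; the notes are made-up examples, written in Dutch since the model is tuned for Dutch:

```python
notes = [
    "Standup: release gepland voor vrijdag.",   # "Standup: release planned for Friday."
    "Klant vraagt om export naar PDF.",         # "Customer asks for PDF export."
    "Bug in de inlogpagina is opgelost.",       # "Bug on the login page is fixed."
]
print(summarize_notes(notes))
```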
## Performance Tips
- Use quantization for 2-4x speedup
- Batch requests when possible
- Cache KV states for chat applications
- Profile memory to avoid OOM errors
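As a sketch of the batching tip, you can tokenize several prompts at once and run a single `generate` call, reusing the `model` and `tokenizer` loaded earlier. Causal LMs need left padding for batched generation, and the prompts here are purely illustrative:

```python
# Batch several prompts into a single generate call
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # fall back to EOS if no pad token is defined

prompts = ["Amsterdam is", "Rotterdam is", "Utrecht is"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**batch, max_new_tokens=30, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```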
Need help? Join our Discord community.