Tutorial · Edge · Privacy

Running Dutch LLMs on Edge Devices

openoranje Team · November 1, 2024

One of our core missions is enabling Dutch AI on edge devices. This guide walks you through deploying openoranje models locally.

Why Edge Inference?

Running AI locally provides significant benefits:

  • Privacy: Data never leaves your device
  • Latency: No network round-trip delays
  • Availability: Works offline
  • Cost: No API fees

Hardware Requirements

Our models are designed for consumer hardware:

Model        Min RAM   Recommended GPU   CPU-only
Oranje-1B    4 GB      4 GB VRAM         Yes
Oranje-3B    8 GB      8 GB VRAM         Slow
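
Not sure where your machine lands? A quick PyTorch check (a minimal sketch; it only inspects the first CUDA device) reports your available VRAM:

import torch

if torch.cuda.is_available():
    # Name and total VRAM of the first GPU
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA GPU detected; expect CPU-only (slower) inference")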

Quick Start

Using llama.cpp

The fastest way to get started is with llama.cpp:

# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Download our GGUF model
wget https://huggingface.co/openoranje/oranje-1b-gguf/resolve/main/oranje-1b-q4_k_m.gguf

# Run inference
./llama-cli -m oranje-1b-q4_k_m.gguf -p "Amsterdam is de hoofdstad van"
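
Prefer an HTTP interface? llama.cpp also ships llama-server, which serves the model behind an OpenAI-compatible endpoint on localhost (a minimal sketch; exact flags can vary between builds):

# Serve the model locally on port 8080
./llama-server -m oranje-1b-q4_k_m.gguf --port 8080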

Using Transformers

For Python integration:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model (automatically uses GPU if available)
model = AutoModelForCausalLM.from_pretrained(
    "openoranje/oranje-1b",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openoranje/oranje-1b")

# Generate text
def generate(prompt: str, max_tokens: int = 100) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Try it out
print(generate("De Nederlandse cultuur is"))

Quantization

For smaller devices, use quantized models:

  • Q4_K_M: Best balance of size and quality
  • Q5_K_M: Higher quality, slightly larger
  • Q8_0: Near full-precision quality
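
You can also produce these files yourself instead of downloading them: llama.cpp includes a conversion script and a quantization tool. A sketch, assuming the Hugging Face weights are already downloaded to a local directory ./oranje-1b (the paths here are illustrative):

# Convert Hugging Face weights to full-precision GGUF
python convert_hf_to_gguf.py ./oranje-1b --outfile oranje-1b-f16.gguf --outtype f16

# Quantize down to Q4_K_M
./llama-quantize oranje-1b-f16.gguf oranje-1b-q4_k_m.gguf Q4_K_M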

Mobile Deployment

We're working on mobile SDKs for:

  • iOS (Core ML)
  • Android (NNAPI)
  • React Native

Privacy Best Practices

When building privacy-preserving applications:

  1. Process locally: Never send raw text to servers
  2. Minimize storage: Don't log user inputs
  3. Be transparent: Tell users data stays local
  4. Secure the model: Prevent extraction attacks

Example: Private Note Summarization

def summarize_notes(notes: list[str]) -> str:
    """Summarize notes locally—data never leaves the device."""
    combined = "\n".join(notes)
    prompt = f"Vat de volgende notities samen:\n{combined}\n\nSamenvatting:"
    return generate(prompt, max_tokens=150)
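
For example, with a couple of illustrative notes (everything runs on-device):

notes = [
    "Standup verplaatst naar 09:30",
    "Release-checklist voor vrijdag afronden",
]
print(summarize_notes(notes))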

Performance Tips

  1. Use quantization for 2-4x speedup
  2. Batch requests when possible
  3. Cache KV states for chat applications (see the sketch after this list)
  4. Profile memory to avoid OOM errors
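
On tip 3: with Transformers you can carry the attention cache across turns instead of re-encoding the whole history each time. A minimal greedy-decoding sketch using the standard past_key_values mechanism (the helper name is ours, and it reuses the model and tokenizer loaded above):

import torch

@torch.no_grad()
def generate_cached(input_ids, max_new_tokens=50):
    """Greedy decoding that reuses the KV cache between steps."""
    past = None
    next_ids = input_ids
    generated = []
    for _ in range(max_new_tokens):
        out = model(input_ids=next_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values              # carry attention state forward
        next_ids = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_ids)
        if next_ids.item() == tokenizer.eos_token_id:  # assumes batch size 1
            break
    return torch.cat(generated, dim=-1)

For a chat application, keep past alive between user turns and feed the model only the newly appended tokens.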

Need help? Join our Discord community.