
Divide, Round, Multiply: The Surprisingly Simple Math That Shrinks LLMs

Sarthak Anand · February 26, 2026

Running a 7-billion parameter model on your laptop sounds impossible — until you understand quantization. This post walks through the actual mathematics behind it: how we represent numbers, how we compress them, and what we lose (and gain) in the process.

How Do We Represent Numbers?

Before we can shrink a model, we need to understand what its weights actually look like in memory. Every weight in a neural network is a floating point number, and the way we encode it determines both its precision and its memory cost.

Float32 — The Gold Standard

Float32 uses 4 bytes (32 bits) per number, split into three parts:

  • 1 bit for the sign
  • 8 bits for the exponent
  • 23 bits for the mantissa (the fractional precision)

The formula:

(-1)^sign × 2^(exponent - 127) × 1.mantissa

This gives a range of roughly [-3.4 × 10^38, 3.4 × 10^38] — enormous range with high precision. It's the gold standard for training, but storing every weight at 4 bytes adds up fast. A 7B parameter model in FP32 needs ~28GB of memory.
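The bit layout above can be inspected directly; this short sketch uses Python's `struct` module to unpack a float32 into its sign, exponent, and mantissa fields:

```python
import struct

def float32_bits(x: float):
    # Reinterpret the 4 bytes of a float32 as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF         # 23 bits, with an implicit leading 1
    return sign, exponent, mantissa

sign, exp, man = float32_bits(1.5)
# 1.5 = (-1)^0 * 2^(127 - 127) * 1.5, so the mantissa fraction is 0.5
print(sign, exp - 127, 1 + man / 2**23)  # 0 0 1.5
```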

Float16 — Half Precision

Float16 cuts the size in half (2 bytes) by using:

  • 1 bit for sign
  • 5 bits for exponent
  • 10 bits for mantissa

The formula:

(-1)^sign × 2^(exponent - 15) × 1.mantissa

The range shrinks dramatically to [-6.55 × 10^4, 6.55 × 10^4], but for inference (not training), this is usually fine. The same 7B model now fits in ~14GB.

BFloat16 — The Best of Both Worlds

BFloat16 (Brain Float) is a clever compromise, also 2 bytes:

  • 1 bit for sign
  • 8 bits for exponent (same as FP32!)
  • 7 bits for mantissa

The formula:

(-1)^sign × 2^(exponent - 127) × 1.mantissa

By keeping the 8-bit exponent, BFloat16 preserves the same range as FP32 ([-3.4 × 10^38, 3.4 × 10^38]) while sacrificing some decimal precision. This means fewer overflow/underflow issues during training, which is why Google designed it and why it's become the default for modern model training.
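Because BFloat16 is simply the top 16 bits of the float32 encoding, a round trip through it can be sketched in pure Python. This is a minimal sketch that truncates the low mantissa bits (hardware conversions usually round to nearest instead):

```python
import struct

def to_bfloat16_bits(x: float) -> int:
    # BFloat16 is the top 16 bits of the float32 encoding:
    # 1 sign + 8 exponent + 7 mantissa bits.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16

def from_bfloat16_bits(b16: int) -> float:
    # Pad the dropped 16 mantissa bits with zeros and reinterpret as float32.
    return struct.unpack(">f", struct.pack(">I", b16 << 16))[0]

print(from_bfloat16_bits(to_bfloat16_bits(3.14159)))  # 3.140625: range kept, precision lost
```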

INT4 — The Extreme End

At 4 bits (half a byte), we leave floating point behind entirely. INT4 gives us just 2^4 = 16 unique values, ranging from -8 to 7. That's it. But the memory savings are massive — a 7B model fits in roughly 3.5GB.

The question is: how do we map rich floating point weights into just 16 integers without destroying the model?
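The per-format sizes quoted so far come from simple arithmetic; a quick sketch (using 1 GB = 10^9 bytes, counting weights only and ignoring scale factors, activations, and KV-cache):

```python
# Approximate model size: parameters × bytes per weight.
params = 7_000_000_000
bytes_per_weight = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_weight.items():
    print(f"{fmt}: {params * nbytes / 1e9:.1f} GB")  # FP32 → 28.0, INT4 → 3.5
```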

What Is Quantization?

Quantization is the process of converting model weights from high-precision floating point numbers (like FP32) to lower-precision representations (like INT4). Think of it as controlled rounding.

Consider a small weight matrix:

Original Weights (FP32) — Memory: 9 × 4 = 36 bytes

 0.66   -0.64    0.553
-0.55    0.3    -0.1
 0.5     0.8     0.9

Quantized Weights (INT4) — Memory: 9 × 0.5 + 4 = 8.5 bytes (+ scale factor)

 5   -5    4
-4    2   -1
 4    6    7

That's a ~4x memory reduction from a single operation. The key is the scale factor that lets us convert back and forth.

The Math, Step by Step

Let's walk through exactly how symmetric quantization works with a concrete example. Take four FP32 weights:

[0.7234567, -0.3456789, 1.2345678, -0.0123456]

These occupy 4 × 4 = 16 bytes.

Step 1: Find the Scale Factor

For INT4 with a symmetric range, we use [-7, 7] (leaving -8 unused so the range is symmetric around zero).

First, find the absolute maximum:

|x_max| = max(|0.7234567|, |-0.3456789|, |1.2345678|, |-0.0123456|)
        = 1.2345678

Then compute the scale factor:

scale = |x_max| / 7
      = 1.2345678 / 7
      = 0.1763668

Step 2: Quantize

Divide each weight by the scale factor and round to the nearest integer:

q1 = round(0.7234567  / 0.1763668) = round(4.10)  = 4
q2 = round(-0.3456789 / 0.1763668) = round(-1.96) = -2
q3 = round(1.2345678  / 0.1763668) = round(7.00)  = 7
q4 = round(-0.0123456 / 0.1763668) = round(-0.07) = 0

Our quantized weights: [4, -2, 7, 0]

Memory: 4 integers × 0.5 bytes + 1 scale factor × 4 bytes = 6 bytes (down from 16).
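Steps 1 and 2 can be reproduced in a few lines of NumPy. A minimal sketch, using int8 as a stand-in for INT4 storage (real kernels pack two 4-bit values per byte):

```python
import numpy as np

weights = np.array([0.7234567, -0.3456789, 1.2345678, -0.0123456], dtype=np.float32)

# Step 1: one scale factor for the whole tensor, targeting the symmetric range [-7, 7].
scale = np.abs(weights).max() / 7

# Step 2: divide by the scale and round to the nearest integer.
q = np.round(weights / scale).astype(np.int8)

print(scale)  # ≈ 0.1763668
print(q)      # [ 4 -2  7  0]
```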

Step 3: Dequantize (at Inference Time)

When we need to use these weights, we multiply back by the scale factor:

w1 = 4  × 0.1763668 = 0.7054672
w2 = -2 × 0.1763668 = -0.3527336
w3 = 7  × 0.1763668 = 1.2345676
w4 = 0  × 0.1763668 = 0.0000000

Step 4: Measure the Error

Comparing original vs. dequantized:

Original      Dequantized    Error
 0.7234567     0.7054672     0.0179895
-0.3456789    -0.3527336     0.0070547
 1.2345678     1.2345676     0.0000002
-0.0123456     0.0000000     0.0123456

The largest weight (1.2345678) maps perfectly to 7 — zero error. Smaller weights lose more relative precision. The weight -0.0123456 collapses entirely to zero. This is the fundamental tradeoff: values near zero lose the most relative precision.
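Steps 3 and 4 are just as short; this sketch recomputes the scale and quantized values from steps 1 and 2, then measures the round-trip error:

```python
import numpy as np

weights = np.array([0.7234567, -0.3456789, 1.2345678, -0.0123456])
scale = np.abs(weights).max() / 7               # Step 1
q = np.round(weights / scale).astype(int)       # Step 2: [4, -2, 7, 0]

# Step 3: dequantize by multiplying back by the scale factor.
dequantized = q * scale

# Step 4: absolute round-trip error per weight.
error = np.abs(weights - dequantized)
for w, d, e in zip(weights, dequantized, error):
    print(f"{w:+.7f} -> {d:+.7f}  error {e:.7f}")
```

The largest weight maps back almost exactly, while the smallest collapses to zero, matching the table above.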

Inference in Practice

During actual inference, the process looks like this:

  1. Weights are stored in quantized INT4 format with their scale factors
  2. An input arrives in full precision
  3. Weights are dequantized on-the-fly using the scale factor
  4. Matrix multiplication proceeds normally with the dequantized weights
  5. Results are returned in full precision

The dequantization adds a small overhead, but the massive reduction in memory bandwidth (loading 0.5 bytes instead of 4 bytes per weight) more than compensates.
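The five-step flow above can be sketched as a toy linear layer. `quantized_linear` is an illustrative name, not a real library call, and a production kernel would fuse the dequantization into the matmul instead of materializing the full-precision weight matrix:

```python
import numpy as np

def quantized_linear(x, q_weights, scale):
    # Step 3: dequantize on the fly using the stored scale factor.
    w = q_weights.astype(np.float32) * scale
    # Step 4: ordinary full-precision matrix multiplication.
    return x @ w

x = np.array([[1.0, 2.0]], dtype=np.float32)     # step 2: full-precision input
q = np.array([[4, -2], [7, 0]], dtype=np.int8)   # step 1: stored integer weights
y = quantized_linear(x, q, 0.1763668)            # step 5: full-precision result
print(y)
```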

The Tradeoffs

Type   Precision       Speedup   Memory Reduction   Performance Impact
FP16   16-bit float    ~1.5x     ~50%               Little
BF16   16-bit float    ~1.5x     ~50%               Very little
INT8   8-bit integer   ~2x       ~75%               Minor degradation
INT4   4-bit integer   ~3-4x     ~90%               Noticeable drop, but usable

Types of Quants: Not All 4-bit Is Equal

If you've downloaded GGUF models from Hugging Face, you've seen labels like Q4_K_S, Q4_K_M, and Q4_K_L. These are all 4-bit quantizations but with different strategies:

  • Q4_K_L — Lower quality loss, less memory savings (keeps more layers at higher precision)
  • Q4_K_M — Medium quality loss, medium memory savings (the sweet spot for most users)
  • Q4_K_S — Higher quality loss, maximum memory savings (most aggressive compression)

The "K" stands for k-quants, a technique that uses different bit-widths for different layers based on their sensitivity. Attention layers and the first/last layers tend to get higher precision, while less sensitive feed-forward layers get compressed more aggressively.

Quantization Is Not Just for Local/CPU Inference

A common misconception is that quantization is only useful for running models on CPUs or laptops. In reality, cloud GPU providers also serve quantized models:

  • Nebius serves models in FP4 (4-bit float) for cost efficiency
  • Together.ai serves models in FP8 (8-bit float) for a balance of speed and quality

Quantization reduces GPU memory requirements too, meaning you can serve larger models on smaller (cheaper) GPUs, or fit larger batch sizes on the same hardware.

Quantization Can Break Things

Quantization is not free. A real example with Qwen2-VL-2B-Instruct, a vision-language model describing an image of a train:

Variant                      Description Output                                                    Size     Correct?
16-bit                       "The image shows a train traveling on tracks."                        4.11GB   Yes
Default 4-bit (all layers)   "The image depicts a vibrant and colorful scene of a coastal area."   1.36GB   No
Unsloth quant                "The image shows a train traveling on tracks."                        1.81GB   Yes

Naive 4-bit quantization across all layers caused the model to hallucinate completely. A smarter quantization strategy (like Unsloth's, which preserves precision in critical layers) maintained correctness at only a modest size increase.

Why Attention Is Memory-Bound, Not Compute-Bound

A key insight for understanding why quantization helps so much: LLM inference is memory-bound, not compute-bound.

Modern GPUs have enormous computational throughput (hundreds of TFLOPS), but moving data from DRAM to the compute cores is slow by comparison. During attention:

  • The KV-cache grows linearly with sequence length
  • Every token generation requires reading the entire KV-cache from memory
  • The compute (matrix multiplications) finishes faster than the data can be loaded

This means the bottleneck is memory bandwidth — how fast we can feed data to the GPU cores. Quantization directly attacks this bottleneck: smaller weights mean less data to move, which means the GPU spends less time waiting for memory reads.

On a GPU like the NVIDIA Tesla V100, the memory hierarchy looks like:

  • Register file (64K × 4B) — fastest, local to each streaming multiprocessor
  • Shared memory / L1 cache (128KB) — fast, shared within an SM
  • L2 cache (6MB) — slower, shared across all SMs
  • Global memory / DRAM (32GB) — slowest, where model weights live

Every weight must travel from DRAM through this hierarchy to reach the compute cores. Cutting weight size from 4 bytes to 0.5 bytes means 8x less data moving through this pipeline per inference step.

Running Models Locally

With quantization, running LLMs locally is now practical. Tools like jan.ai make it straightforward to download and run quantized models on consumer hardware. A Q4_K_M quantized 7B model runs comfortably on a MacBook with 16GB of RAM.

Wrapping Up

Quantization is one of those techniques where the math is surprisingly simple but the impact is enormous. The core idea — divide by a scale factor, round to integers, multiply back when needed — enables a 4-8x reduction in model size with remarkably little quality loss.

The key takeaways:

  1. Floating point formats trade range for precision — FP32 (4 bytes), FP16 (2 bytes), BF16 (2 bytes with FP32's range), and INT4 (0.5 bytes with just 16 values)
  2. Symmetric quantization uses a single scale factor: scale = max(|weights|) / max_int
  3. Not all bits are equal — smart quantization (K-quants) gives more precision to sensitive layers
  4. The real win is memory bandwidth — smaller weights mean faster data transfer to compute cores, which is the actual bottleneck in LLM inference
  5. Quantization works everywhere — from your laptop to cloud GPU providers

The next time you download a Q4_K_M model and run it on your machine, you'll know exactly what those numbers mean and why it works.


Questions or thoughts? Reach out on GitHub or Twitter!