Project Optimum
Multi-tier quantization strategy for running frontier LLMs locally on Apple Silicon. The thesis: allocating 3-4 different bit-widths across model layers based on sensitivity analysis dramatically outperforms the industry-standard 1-2 tier approach.
Why 1-2 Tiers Isn't Enough
Current tools default to uniform quantization (every layer at 4-bit) or at most two tiers (4-bit plus 6-bit for sensitive layers). This leaves significant quality on the table. Research from ScaleBITS, LLM-MQ, and MoPEQ consistently shows that finer-grained precision allocation yields a markedly better quality/size Pareto frontier.
Key insight: at the same average bits-per-weight, multi-tier allocation preserves substantially more quality. The gap widens at lower averages (the 2-3 bit range), where uniform quantization collapses but mixed precision holds up. The reason: a small number of critical layers are disproportionately sensitive, so giving them higher precision costs almost nothing in total size but preserves most of the model's capability.
Transformer Layer Sensitivity Map
Not all parts of a transformer are created equal. Research consistently shows embeddings, first/last layers, attention V projections, and MLP down-projections are most sensitive to quantization. Middle layers and less-used MoE experts can be compressed aggressively.
The 4-Tier Quantization Strategy
Allocate 4 precision levels across the model based on measured sensitivity. The key insight: Tier 1 (highest precision) covers only ~5% of parameters but preserves the critical information pathways. The memory savings come from aggressively compressing the 30% of parameters in Tier 4.
Preservation Layer (Tier 1, 8-bit)
token_embd, lm_head, layer_0, layer_47. Embeddings, output projection, and the first and last transformer blocks. These form the model's interface with language; any error here cascades through every token.
Sensitive Computation (Tier 2, 6-bit)
v_proj, down_proj, plus the top-50 MoE experts. Value projections carry the information content; down-projections are the MLP bottleneck; high-frequency experts in MoE models see the most traffic.
Standard Compression (Tier 3, 4-bit)
q_proj, k_proj, o_proj, up_proj, gate_proj, plus mid-tier experts. These projections are less sensitive: attention query/key patterns are redundant and tolerant of noise. Standard 4-bit quantization here is near-lossless.
Aggressive Compression (Tier 4, 2-2.5 bit)
Everything else: remaining middle-layer tensors and rarely-routed MoE experts. These parameters contribute least to output quality per Hessian analysis. With importance-weighted quantization (imatrix), even 2-bit preserves function.
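The four tiers can be sketched as an ordered set of pattern rules, first match wins. A minimal sketch: the tensor-name patterns below assume Hugging Face-style naming for a 48-block model and are illustrative only; real names depend on the model, and MoE expert tiers (which can't be expressed as a static regex) would come from the measured sensitivity map instead.

```python
import re

# Ordered tier rules: first match wins. Patterns are illustrative and
# assume Hugging Face-style tensor names for a 48-block model.
TIER_RULES = [
    (r"(token_embd|embed_tokens|lm_head)", 8),        # Tier 1: preservation
    (r"layers\.(0|47)\.", 8),                         # Tier 1: first/last blocks
    (r"(v_proj|down_proj)", 6),                       # Tier 2: sensitive computation
    (r"(q_proj|k_proj|o_proj|up_proj|gate_proj)", 4), # Tier 3: standard
]
DEFAULT_BITS = 2  # Tier 4: aggressive (everything else)

def bits_for(tensor_name: str) -> int:
    """Return the target bit-width for a tensor name."""
    for pattern, bits in TIER_RULES:
        if re.search(pattern, tensor_name):
            return bits
    return DEFAULT_BITS
```

Note that the layer-0/47 rule sits above the projection rules, so a v_proj inside the first block still lands in Tier 1.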
The math, assuming a ~50B-parameter model: 5% at 8-bit (~2.5GB) + 15% at 6-bit (~5.6GB) + 50% at 4-bit (~12.5GB) + 30% at 2.5-bit (~4.7GB) = ~4.05 avg bits/weight → ~25.3GB. That fits comfortably in the Mac Studio's 96GB with plenty of headroom for KV cache and activations. The Mini's 24GB is tighter: the same strategy needs a lower average (shifting more layers into Tiers 3-4) or a smaller base model.
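The tier arithmetic is easy to sanity-check in a few lines. The ~50B parameter count is an assumption for illustration; the 5/15/50/30 split at 8/6/4/2.5 bits gives ~4.05 average bits per weight regardless of model size.

```python
# Sanity check of the tier mix arithmetic. The ~50B parameter count is
# an assumption for illustration, not a measured model size.
tiers = [          # (fraction of params, bits per weight)
    (0.05, 8.0),   # Tier 1: preservation
    (0.15, 6.0),   # Tier 2: sensitive computation
    (0.50, 4.0),   # Tier 3: standard compression
    (0.30, 2.5),   # Tier 4: aggressive compression
]
n_params = 50e9

avg_bits = sum(frac * bits for frac, bits in tiers)
total_gb = n_params * avg_bits / 8 / 1e9

print(f"average bits/weight: {avg_bits:.2f}")    # 4.05
print(f"total weight memory: {total_gb:.1f} GB") # 25.3
```

Weight memory alone is not the whole budget: KV cache and activations come on top, which is why the headroom on the 96GB Studio matters.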
Sensitivity Analysis Methods
To determine which layers deserve higher precision, we need to measure sensitivity. Four main approaches exist — they trade accuracy for compute cost. Our recommendation: gradient Taylor approximation + activation magnitude gives 90% of the accuracy at 10% of the compute cost.
| Method | How It Works | Compute Cost | Accuracy | Best For | Verdict |
|---|---|---|---|---|---|
| Hessian Trace (GPTQ, SqueezeLLM, QuIP#) | Measures curvature of the loss landscape per layer. Second-order information reveals exactly how much each weight perturbation affects output. | Very high: full Hessian diagonal per layer; hours on GPU. | Highest | Research, one-time calibration | Accurate but slow |
| Gradient Taylor (LLM-MQ) | First-order Taylor approximation of output loss. Accumulates gradient magnitudes per layer during a calibration pass. | Medium: single forward+backward pass per calibration sample. | Good | Layer-level ranking, practical use | Recommended |
| Activation Magnitude (AWQ, SmoothQuant) | Tracks activation distribution and outliers per layer. Layers with larger activations and more outliers are more sensitive. | Low: forward pass only, cheap statistics. | Decent | Quick estimates, outlier detection | Recommended |
| MoPEQ Hybrid (MoPEQ, 2025) | Combines Hessian trace with activation frequency. K-means clusters experts by combined importance. Model-wise (not layer-wise) allocation. | High: Hessian per expert plus clustering. | Highest for MoE | MoE models specifically | MoE-specific |
| imatrix (llama.cpp) | Importance matrix computed from calibration data. Records per-weight importance based on activation statistics. Guides K-quant allocation. | Medium: forward pass over a calibration corpus. | Good | GGUF quantization, practical tool | Battle-tested |
Our approach: Gradient Taylor + Activation Magnitude, combined. Run a calibration dataset through the model, computing both gradient-based sensitivity and activation statistics in a single pass. Use the combined score to rank every layer (and for MoE, every expert). Feed the rankings into an ILP solver that allocates bit-widths under a memory budget constraint. This gives us ~90% of Hessian accuracy at a fraction of the compute, and runs feasibly on Apple Silicon without needing A100s.
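The score combination can be sketched simply. This is a sketch under assumptions: the min-max normalization and the equal-weight blend (alpha = 0.5) are our choices, not a published formula, and the inputs stand in for per-layer accumulated first-order Taylor terms (|grad × weight|) and activation statistics from the calibration pass.

```python
# Sketch: blend per-layer gradient-Taylor scores with activation
# statistics into one sensitivity ranking. The normalization scheme
# and alpha weighting are assumptions, not a published formula.
def combined_sensitivity(taylor: dict, act_stats: dict, alpha: float = 0.5) -> dict:
    """Blend min-max-normalized Taylor and activation scores per layer."""
    def normalize(d):
        lo, hi = min(d.values()), max(d.values())
        span = (hi - lo) if hi > lo else 1.0
        return {k: (v - lo) / span for k, v in d.items()}

    t, a = normalize(taylor), normalize(act_stats)
    return {k: alpha * t[k] + (1 - alpha) * a[k] for k in taylor}

# Toy example with made-up per-layer statistics.
taylor = {"layers.0": 9.1, "layers.1": 2.3, "layers.2": 8.7}
act    = {"layers.0": 4.0, "layers.1": 1.1, "layers.2": 2.2}
scores = combined_sensitivity(taylor, act)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The ranking (not the absolute score) is what the allocation optimizer consumes, so the normalization only needs to make the two signals comparable.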
Implementation Roadmap
Five phases from research to production. Mercer handles the heavy coding; Bessemer coordinates and validates. Target: have a working 4-tier quantized model deployed on the Mac Studio within 2 weeks.
Sensitivity Profiler Mercer
Build a Python tool that runs calibration data through a model and outputs per-layer (and per-expert for MoE) sensitivity scores. Uses gradient Taylor + activation magnitude. Outputs a JSON sensitivity map.
Allocation Optimizer Mercer
Integer Linear Programming solver that takes sensitivity scores + memory budget and outputs optimal per-layer bit allocation. Supports 4 tiers (2/3/4/6/8 bit). Constrained to stay within target memory.
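The constraint structure can be illustrated without a full ILP solver. A minimal greedy stand-in, assuming per-tensor parameter counts and sensitivity scores as inputs: start every tensor at the lowest tier, then sweep in sensitivity order, upgrading one tier at a time while the memory budget allows. A real implementation would hand the same variables and constraint to an ILP solver (e.g. PuLP) for an optimal allocation.

```python
# Greedy stand-in for the ILP allocator; it captures the budget
# constraint but does not guarantee optimality. Inputs are assumed:
# params maps tensor name -> parameter count, sensitivity maps
# tensor name -> combined sensitivity score.
BITS = [2, 4, 6, 8]  # allowed tiers, ascending

def allocate(params: dict, sensitivity: dict, budget_bytes: float) -> dict:
    """Return tensor name -> bit-width, staying within budget_bytes."""
    tier = {name: 0 for name in params}                  # index into BITS
    used = sum(n * BITS[0] / 8 for n in params.values()) # baseline cost
    order = sorted(params, key=sensitivity.get, reverse=True)
    upgraded = True
    while upgraded:  # sweep until no further upgrade fits
        upgraded = False
        for name in order:  # most sensitive tensors upgrade first
            i = tier[name]
            if i + 1 < len(BITS):
                extra = params[name] * (BITS[i + 1] - BITS[i]) / 8
                if used + extra <= budget_bytes:
                    tier[name] = i + 1
                    used += extra
                    upgraded = True
    return {name: BITS[i] for name, i in tier.items()}
```

With a tight budget the least sensitive tensors are the ones left at 2-bit, which is exactly the Tier 4 behavior described above.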
Quantization Pipeline Mercer + Bessemer
Two parallel paths: (A) MLX: extend quant_predicate to support 4 tiers via custom callable. (B) llama.cpp: generate --tensor-type regex from allocation.json. Both consume the same allocation output.
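For path (B), the generator is mostly string formatting. A minimal sketch: the bits-to-GGUF-type mapping and the `pattern=TYPE` argument form follow recent llama.cpp `llama-quantize` builds, but both should be verified against the actual version in use; the allocation dict here is an illustrative stand-in for allocation.json.

```python
# Sketch: turn an allocation map (tensor-name pattern -> bits) into
# llama-quantize --tensor-type arguments. The flag syntax and GGUF
# type names follow recent llama.cpp builds; verify against yours.
GGUF_TYPE = {8: "q8_0", 6: "q6_K", 4: "q4_K", 2: "q2_K"}

def tensor_type_args(allocation: dict) -> list:
    """Return a CLI argument list for llama-quantize."""
    args = []
    for pattern, bits in sorted(allocation.items()):
        args += ["--tensor-type", f"{pattern}={GGUF_TYPE[bits]}"]
    return args

# Illustrative allocation, not real optimizer output.
alloc = {"attn_v": 6, "ffn_down": 6, "token_embd": 8, "ffn_gate": 4}
args = tensor_type_args(alloc)
```

Because both backends consume the same allocation output, the MLX path only needs a different adapter over the identical dict.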
Validation Bessemer
Run v5 eval harness against three variants: uniform 4-bit, 2-tier (existing mixed_4_6), and our 4-tier custom. Compare quality scores, memory footprint, tokens/sec, and time-to-first-token. Also perplexity on held-out text.
Production Deploy Bessemer
Deploy winning quantization to Mac Studio. Update mlx-server-tuned.py with new profile. Update Thurin docker-compose upstream. Monitor inference quality and latency in production for 48 hours.
Current Tool Landscape
No turnkey 3-4 tier quantization tool exists today. The pieces are there — MLX has the hooks, llama.cpp has the flexibility, academia has the algorithms. We just need to glue them together.
MLX quant_predicate
Built-in recipes: mixed_2_6, mixed_3_4, mixed_3_6, mixed_4_6. Custom callable for per-layer decisions. Supports 2-8 bit weights with configurable group size.
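A custom 4-tier callable might look like the sketch below. Caveats: our understanding is that recent mlx-lm invokes the predicate per layer and accepts either a bool or a dict of quantization kwargs such as {"bits", "group_size"}; the exact signature should be verified against the installed mlx-lm version, and ALLOCATION here is a hypothetical precomputed path-to-bits map from the optimizer.

```python
# Sketch of a 4-tier quant predicate for MLX quantization. The
# (path, module)-style calling convention and the {"bits", "group_size"}
# return dict are assumptions about mlx-lm's API; verify before use.
ALLOCATION = {"model.layers.0.self_attn.v_proj": 8}  # illustrative entry

def four_tier_predicate(path, module, default_bits=4, group_size=64):
    """Look up the optimizer's allocation; fall back to Tier 3 (4-bit)."""
    bits = ALLOCATION.get(path, default_bits)
    return {"bits": bits, "group_size": group_size}
```

This is the same adapter pattern as the llama.cpp path: the predicate is a thin lookup over the shared allocation output.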
MLX dynamic_quant
Takes sensitivity JSON + target-bpw. Allocates high/low bits per layer. Currently only supports 2 levels (--low-bits, --high-bits). Could be extended to support 3-4 tiers.
llama.cpp --tensor-type
Regex-based per-tensor type selection. Combined with imatrix for importance-weighted quantization. Supports arbitrary mixed quant types (Q2_K through Q8_0) per tensor.
Academic Methods
ScaleBITS: 1-8 bit search with ILP. LLM-MQ: gradient-based layer allocation. MoPEQ: MoE expert-level quant. SliM-LLM: salience-driven groups. Most are paper-only, no turnkey tools.
The Gap — What We Build
The missing piece is a pipeline that connects sensitivity analysis to multi-tier allocation to MLX/GGUF quantization. llama.cpp --tensor-type is the most flexible backend (supports arbitrary per-tensor types via regex). MLX's quant_predicate callable can implement 4 tiers with a custom function. We build the orchestrator: profiler → optimizer → quantizer config generator.