ECP Internal Research

Project Optimum

Multi-tier quantization strategy for running frontier LLMs locally on Apple Silicon. The thesis: allocating 3-4 different bit-widths across model layers based on sensitivity analysis dramatically outperforms the industry-standard 1-2 tier approach.

March 2026 — Mac Studio M3 Ultra 96GB · Mac Mini M4 Pro 24GB · MLX + llama.cpp

64.7% 4-Tier Accuracy · 35.6% Uniform Accuracy · ~37% Memory Saved vs 4-bit · 3-4 Optimal Tier Count
1 The Problem

Why 1-2 Tiers Aren't Enough

Current tools default to uniform quantization (every layer at 4-bit) or at most two tiers (4-bit + 6-bit for sensitive layers). This is leaving significant quality on the table. Research from ScaleBITS, LLM-MQ, and MoPEQ consistently shows that finer-grained precision allocation creates a dramatically better quality/size Pareto frontier.

Uniform 2-bit: 35.6% (all layers at the same precision)
2-Tier Mixed: ~52% (high/low bit allocation)
4-Tier Mixed: 64.7% (ScaleBITS, same average bits)

Key insight: At the same average bits-per-weight, multi-tier allocation preserves dramatically more quality. The gap widens at lower average bit-widths (2-3 bit range) where uniform quantization collapses but mixed-precision maintains performance. This is because a small number of critical layers are disproportionately sensitive — giving them higher precision costs almost nothing in total size but preserves most of the model's capability.
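The "same average bits" bookkeeping is easy to make concrete. A minimal helper, with an illustrative multi-tier allocation that matches a uniform 3-bit budget exactly:

```python
def avg_bits(tiers):
    """Parameter-weighted average bits per weight for a tier allocation.
    tiers: list of (fraction_of_params, bits) pairs summing to 1.0."""
    assert abs(sum(f for f, _ in tiers) - 1.0) < 1e-9
    return sum(f * b for f, b in tiers)

# Illustrative shares: protect 5% at 8-bit and 15% at 6-bit, pay for it
# by pushing the insensitive bulk to 2-bit. Same budget as uniform 3-bit.
mixed = [(0.05, 8), (0.15, 6), (0.05, 4), (0.75, 2)]
budget = avg_bits(mixed)  # 3.0 bits/weight
```

Two allocations with identical `avg_bits` produce identical file sizes; the quality gap between them is entirely a function of where the high-precision fraction lands.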

2 Layer Analysis

Transformer Layer Sensitivity Map

Not all parts of a transformer are created equal. Research consistently shows embeddings, first/last layers, attention V projections, and MLP down-projections are most sensitive to quantization. Middle layers and less-used MoE experts can be compressed aggressively.

Scale: Critical > High > Medium > Low

Layers           Q proj  K proj  V proj  O proj  Up/Gate  Down
Layer 0-1        High    High    Crit    High    High     Crit
Layer 2-5        Med     Med     High    Med     Med      High
Layer 6-40       Low     Low     Med     Low     Low      Med
Layer 41-46      Med     Med     High    Med     Med      High
Layer 47 (last)  High    High    Crit    High    High     Crit

Embed / LM head (token_embd + lm_head): Critical across the board
MoE Expert Sensitivity (Qwen3-Next-80B: 512 experts, 10 active/token)
Top 50 experts: high activation frequency, keep at 6-bit
Middle 300 experts: standard activation, 4-bit is fine
Bottom 162 experts: rarely activated, 2-3 bit aggressive compression
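The expert split above reduces to ranking experts by calibration-time activation counts and mapping rank to bit-width. A sketch (cut-offs taken from the split above; random counts stand in for real calibration statistics):

```python
import random

def tier_experts(activation_counts, bits=(6, 4, 3)):
    """Assign bit-widths to MoE experts by activation frequency.
    Rank cut-offs (50 / 300 / rest) mirror the split sketched above;
    they are illustrative, not measured."""
    ranked = sorted(range(len(activation_counts)),
                    key=lambda i: activation_counts[i], reverse=True)
    alloc = {}
    for rank, idx in enumerate(ranked):
        if rank < 50:
            alloc[idx] = bits[0]      # hot experts: 6-bit
        elif rank < 350:
            alloc[idx] = bits[1]      # standard experts: 4-bit
        else:
            alloc[idx] = bits[2]      # cold experts: 2-3 bit
    return alloc

# Fake calibration counts for 512 experts (real counts come from a
# forward pass over the calibration corpus, tallying router selections).
random.seed(0)
counts = [random.randint(0, 10_000) for _ in range(512)]
alloc = tier_experts(counts)
```

In practice the counts would be accumulated from the router's top-k selections during the sensitivity-profiling pass, so this costs nothing extra.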
3 Strategy

The 4-Tier Quantization Strategy

Allocate 4 precision levels across the model based on measured sensitivity. The key insight: Tier 1 (highest precision) covers only ~5% of parameters but preserves the critical information pathways. The memory savings come from aggressively compressing the 30% of parameters in Tier 4.

Tier 1: Preservation Layer (8-bit, ~5% of params)
Tensors: token_embd, lm_head, layer_0, layer_47
Embeddings, output projection, first and last transformer blocks. These form the model's interface with language; any error here cascades through every token.

Tier 2: Sensitive Computation (6-bit, ~15% of params)
Tensors: v_proj, down_proj, plus the top-50 MoE experts
Value projections carry information content. Down-projections are the MLP bottleneck. High-frequency experts in MoE models see the most traffic.

Tier 3: Standard Compression (4-bit, ~50% of params)
Tensors: q_proj, k_proj, o_proj, up_proj, gate_proj, plus mid-tier experts
These projections are less sensitive; attention query/key patterns are redundant and tolerant of noise. Standard 4-bit quantization here is near-lossless.

Tier 4: Aggressive Compression (2-3 bit, ~30% of params)
Tensors: middle-layer bulk (layers 8-38), low-frequency MoE experts (bottom 162)
These parameters contribute least to output quality per Hessian analysis. With importance-weighted quantization (imatrix), even 2-bit preserves function.
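The four tiers can be expressed as a small rule table keyed on tensor names. GGUF-style names (token_embd, output, blk.N.*) are assumed here as placeholders; the exact names vary by architecture, so treat the regexes as illustrative:

```python
import re

# Rule order matters: Tier 1 overrides everything, then the aggressive
# middle-layer bulk, then the sensitive projections, then the default.
TIER_RULES = [
    (8, re.compile(r"^(token_embd|output)\.|^blk\.(0|47)\.")),  # Tier 1
    (3, re.compile(r"^blk\.(8|9|[12]\d|3[0-8])\.")),            # Tier 4: layers 8-38
    (6, re.compile(r"attn_v|ffn_down")),                         # Tier 2
]
DEFAULT_BITS = 4                                                 # Tier 3

def bits_for(tensor_name):
    """Map a tensor name to its tier's bit-width."""
    for bits, pattern in TIER_RULES:
        if pattern.search(tensor_name):
            return bits
    return DEFAULT_BITS
```

The same table can drive both backends: MLX consumes it through a custom quant_predicate, llama.cpp through generated --tensor-type arguments.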
Memory Footprint (Qwen3-Next-80B, 512 experts)
FP16: 160 GB
Uniform 4-bit: 40 GB
2-Tier (4+6): ~35 GB
4-Tier Optimum: ~25 GB

The math (80B params): taking the tier shares literally, 0.05×8 + 0.15×6 + 0.50×4 + 0.30×2.5 ≈ 4.05 average bits/weight, i.e. ~40.5GB of weights. Getting to the ~25GB target (~2.5 average bits/weight) leans on the MoE structure: most of Qwen3-Next-80B's parameters sit in expert MLPs, so pushing the rarely-activated experts into Tier 4 moves well over 30% of the raw parameter count down to 2-3 bit. Either footprint fits comfortably in the Mac Studio's 96GB, leaving ample headroom for KV cache and activations; the Mini's 24GB calls for a smaller model quantized with the same strategy at slightly higher average bits (~3.8).
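The tier arithmetic can be sanity-checked in a few lines (80B parameters; quantization metadata such as group scales, typically a few percent of extra size, is ignored):

```python
def weight_gb(n_params, tiers):
    """Weight storage for a tier allocation.
    tiers: list of (fraction_of_params, bits) pairs.
    Returns (gigabytes, average bits/weight); ignores scale/zero-point
    metadata, which adds a few percent in practice."""
    avg = sum(frac * bits for frac, bits in tiers)
    return n_params * avg / 8 / 1e9, avg

# Shares exactly as stated in the tier cards above.
stated = [(0.05, 8), (0.15, 6), (0.50, 4), (0.30, 2.5)]
gb, avg = weight_gb(80e9, stated)   # ~40.5 GB at ~4.05 avg bits/weight
```

Rerunning with heavier Tier 4 shares shows how quickly the footprint drops once the MoE expert bulk moves to 2-3 bit.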

4 Methods

Sensitivity Analysis Methods

To determine which layers deserve higher precision, we need to measure sensitivity. Five main approaches exist, trading accuracy against compute cost. Our recommendation: gradient Taylor approximation plus activation magnitude gives ~90% of the accuracy at ~10% of the compute cost.

Hessian Trace (GPTQ, SqueezeLLM, QuIP#)
How it works: measures the curvature of the loss landscape per layer; second-order information reveals exactly how much each weight perturbation affects output.
Cost: very high (full Hessian diagonal per layer, hours on GPU) · Accuracy: highest
Best for: research, one-time calibration · Verdict: accurate but slow

Gradient Taylor (LLM-MQ)
How it works: first-order Taylor approximation of the output loss; accumulates gradient magnitudes per layer during a calibration pass.
Cost: medium (single forward+backward pass per calibration sample) · Accuracy: good
Best for: layer-level ranking, practical use · Verdict: recommended

Activation Magnitude (AWQ, SmoothQuant)
How it works: tracks activation distributions and outliers per layer; layers with larger activations and more outliers are more sensitive.
Cost: low (forward pass only, cheap statistics) · Accuracy: decent
Best for: quick estimates, outlier detection · Verdict: recommended

MoPEQ Hybrid (MoPEQ, 2025)
How it works: combines Hessian trace with activation frequency; k-means clusters experts by combined importance. Model-wise (not layer-wise) allocation.
Cost: high (Hessian per expert plus clustering) · Accuracy: highest for MoE
Best for: MoE models specifically · Verdict: MoE-specific

imatrix (llama.cpp)
How it works: importance matrix computed from calibration data; records per-weight importance from activation statistics and guides K-quant allocation.
Cost: medium (forward pass over a calibration corpus) · Accuracy: good
Best for: GGUF quantization, practical tooling · Verdict: battle-tested
Our Approach

Gradient Taylor + Activation Magnitude (combined). Run a calibration dataset through the model, compute both gradient-based sensitivity and activation statistics in a single pass. Use the combined score to rank every layer (and for MoE, every expert). Feed rankings into an ILP solver that allocates bit-widths under a memory budget constraint. This gives us ~90% of Hessian accuracy at a fraction of the compute, and runs feasibly on Apple Silicon without needing A100s.
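A toy sketch of the combined score on a single linear layer, using numpy: the Taylor term scores |gradient × weight| under a squared-error probe loss, and the activation term adds a cheap magnitude-plus-outlier statistic. Shapes, layer names, and the equal weighting are all illustrative stand-ins for a real calibration pass:

```python
import numpy as np

rng = np.random.default_rng(0)

def taylor_sensitivity(W, x, t):
    """First-order Taylor score for a toy layer y = W @ x under the probe
    loss L = 0.5 * ||W @ x - t||^2: |dL| ~= sum|g * W| with g = dL/dW."""
    g = np.outer(W @ x - t, x)          # analytic gradient of L w.r.t. W
    return float(np.abs(g * W).sum())

def activation_score(x):
    """Cheap activation statistic: mean magnitude plus outlier fraction."""
    ax = np.abs(x)
    return float(ax.mean() + (ax > 3 * ax.mean()).mean())

# One calibration sample per layer; real profiling averages over a corpus.
names = ["blk.0", "blk.1", "blk.2"]
scores = {}
for name in names:
    W = rng.normal(size=(16, 16))
    x, t = rng.normal(size=16), rng.normal(size=16)
    # Combined score: equal weighting here, tuned in practice.
    scores[name] = taylor_sensitivity(W, x, t) + activation_score(x)
ranking = sorted(names, key=scores.get, reverse=True)
```

Both statistics come out of the same forward(+backward) pass, which is why the combination stays cheap enough to run on Apple Silicon.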

5 Execution

Implementation Roadmap

Five phases from research to production. Mercer handles the heavy coding; Bessemer coordinates and validates. Target: have a working 4-tier quantized model deployed on the Mac Studio within 2 weeks.

Phase 1: Sensitivity Profiler (Mercer)

Build a Python tool that runs calibration data through a model and outputs per-layer (and per-expert for MoE) sensitivity scores. Uses gradient Taylor + activation magnitude. Outputs a JSON sensitivity map.

Deliverables: sensitivity_profiler.py · calibration dataset · sensitivity.json output
Phase 2: Allocation Optimizer (Mercer)

Integer Linear Programming solver that takes sensitivity scores + memory budget and outputs optimal per-layer bit allocation. Supports 4 tiers (2/3/4/6/8 bit). Constrained to stay within target memory.

Deliverables: allocator.py (scipy/PuLP) · allocation.json output · Pareto frontier visualization
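The roadmap names an ILP solver (scipy/PuLP); a dependency-free greedy stand-in captures the same objective: start every layer at the lowest tier, then buy the upgrades with the best sensitivity gain per extra bit until the budget is spent. Layer names and sizes below are made up for illustration:

```python
def allocate_bits(layers, budget_bits, tiers=(2, 4, 6, 8)):
    """Greedy stand-in for the ILP allocator.
    layers: dict name -> (n_params, sensitivity_score)
    budget_bits: total bit budget across all weights."""
    alloc = {name: tiers[0] for name in layers}
    spent = sum(n * tiers[0] for n, _ in layers.values())
    while True:
        best = None
        for name, (n, sens) in layers.items():
            i = tiers.index(alloc[name])
            if i + 1 == len(tiers):
                continue                       # already at the top tier
            extra = n * (tiers[i + 1] - tiers[i])
            if spent + extra > budget_bits:
                continue                       # upgrade doesn't fit
            gain = sens / extra                # sensitivity per extra bit
            if best is None or gain > best[0]:
                best = (gain, name, extra, tiers[i + 1])
        if best is None:
            return alloc
        _, name, extra, bits = best
        alloc[name], spent = bits, spent + extra

# Hypothetical sensitivity map; budget equals uniform 4-bit for 130 params.
demo = {"embeddings": (10, 100.0), "middle_bulk": (100, 1.0), "v_proj": (20, 30.0)}
plan = allocate_bits(demo, budget_bits=4 * 130)
```

A real ILP additionally handles tier-count constraints and ties allocation to measured quality deltas, but the greedy version is a useful baseline and a cross-check on the solver's output.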
Phase 3: Quantization Pipeline (Mercer + Bessemer)

Two parallel paths: (A) MLX: extend quant_predicate to support 4 tiers via custom callable. (B) llama.cpp: generate --tensor-type regex from allocation.json. Both consume the same allocation output.

Deliverables: mlx_4tier_quant.py · gguf_4tier_quant.sh · quantized model files
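The llama.cpp side of the pipeline is mostly string assembly from allocation.json. The --tensor-type PATTERN=TYPE flag syntax and the bits-to-GGML-type mapping below are assumptions to verify against the installed llama.cpp build:

```python
# Assumed mapping from tier bit-widths to GGML quant type names.
BITS_TO_GGUF = {8: "Q8_0", 6: "Q6_K", 4: "Q4_K", 3: "Q3_K", 2: "Q2_K"}

def tensor_type_flags(allocation):
    """Expand a {tensor_regex: bits} allocation into llama-quantize
    --tensor-type arguments."""
    flags = []
    for pattern, bits in allocation.items():
        flags += ["--tensor-type", f"{pattern}={BITS_TO_GGUF[bits]}"]
    return flags

# Hypothetical allocation mirroring the tier map above.
alloc = {
    "token_embd|output": 8,
    "attn_v|ffn_down": 6,
    r"blk\.(8|9|[12]\d|3[0-8])\.": 3,
}
cmd = ["llama-quantize", *tensor_type_flags(alloc), "in.gguf", "out.gguf", "Q4_K_M"]
```

The trailing base type (Q4_K_M here) covers every tensor the regexes don't claim, which is exactly Tier 3's role.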
Phase 4: Validation (Bessemer)

Run v5 eval harness against three variants: uniform 4-bit, 2-tier (existing mixed_4_6), and our 4-tier custom. Compare quality scores, memory footprint, tokens/sec, and time-to-first-token. Also perplexity on held-out text.

Deliverables: eval comparison report · perplexity benchmarks · memory/latency profiles
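The perplexity leg of validation reduces to one formula: exp of the negative mean per-token log-probability over the held-out text. A minimal helper (the log-prob values below are made up; what matters is the delta between variants on identical text):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    ppl = exp(-mean(log p)). Lower is better; compare variants on the
    exact same held-out token sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative scores for the same four tokens under two variants.
fp16_lp = [-2.1, -0.4, -1.3, -3.0]
four_tier_lp = [-2.2, -0.5, -1.3, -3.1]
delta = perplexity(four_tier_lp) - perplexity(fp16_lp)
```

A well-chosen 4-tier allocation should keep this delta small while uniform low-bit quantization blows it up, which is the headline comparison for the eval report.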
Phase 5: Production Deploy (Bessemer)

Deploy winning quantization to Mac Studio. Update mlx-server-tuned.py with new profile. Update Thurin docker-compose upstream. Monitor inference quality and latency in production for 48 hours.

Deliverables: mlx-server-tuned.py profile · launchd service update · production monitoring
6 Landscape

Current Tool Landscape

No turnkey 3-4 tier quantization tool exists today. The pieces are there — MLX has the hooks, llama.cpp has the flexibility, academia has the algorithms. We just need to glue them together.

MLX quant_predicate

Framework — Apple Silicon Native

Built-in recipes: mixed_2_6, mixed_3_4, mixed_3_6, mixed_4_6. Custom callable for per-layer decisions. Supports 2-8 bit weights with configurable group size.

Capabilities: per-layer control · custom callable · 2 tiers only (built-in) · no sensitivity analysis
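A 4-tier quant_predicate can be a plain function. The (path, module, config) signature returning a dict of bits and group_size, and the Hugging Face-style model.layers.N paths, are assumptions to verify against the installed mlx-lm version:

```python
def four_tier_predicate(path, module, config):
    """Per-layer quantization decision for mlx-lm's quant_predicate hook
    (assumed signature; returns a per-module quant config dict)."""
    # Tier 1: embeddings, output head, first and last blocks.
    if ("embed" in path or "lm_head" in path
            or path.startswith(("model.layers.0.", "model.layers.47."))):
        return {"bits": 8, "group_size": 64}
    # Tier 4: middle-layer bulk (layers 8-38).
    layer = path.split("model.layers.")[-1].split(".")[0]
    if layer.isdigit() and 8 <= int(layer) <= 38:
        return {"bits": 3, "group_size": 64}
    # Tier 2: value and down projections elsewhere.
    if "v_proj" in path or "down_proj" in path:
        return {"bits": 6, "group_size": 64}
    # Tier 3: everything else.
    return {"bits": 4, "group_size": 64}

# Hypothetical usage (check mlx_lm.convert's parameters first):
# from mlx_lm import convert
# convert("some/model", "out-4tier", quantize=True,
#         quant_predicate=four_tier_predicate)
```

Because the predicate is pure logic over the path string, it can consume allocation.json directly instead of hard-coding the tiers, which keeps MLX and GGUF outputs in lockstep.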

MLX dynamic_quant

CLI Tool — Sensitivity-Aware

Takes sensitivity JSON + target-bpw. Allocates high/low bits per layer. Currently only supports 2 levels (--low-bits, --high-bits). Could be extended to support 3-4 tiers.

Capabilities: sensitivity input · target-bpw budget · 2 tiers (high/low) · extendable

llama.cpp --tensor-type

Quantization Tool — Most Flexible

Regex-based per-tensor type selection. Combined with imatrix for importance-weighted quantization. Supports arbitrary mixed quant types (Q2_K through Q8_0) per tensor.

Capabilities: per-tensor control · imatrix support · regex matching · unlimited tiers

Academic Methods

Research — State of the Art

ScaleBITS: 1-8 bit search with ILP. LLM-MQ: gradient-based layer allocation. MoPEQ: MoE expert-level quant. SliM-LLM: salience-driven groups. Most are paper-only, no turnkey tools.

Capabilities: optimal allocation · 3-4+ tiers · no turnkey tooling · code available (some)

The Gap — What We Build

The missing piece is a pipeline that connects sensitivity analysis to multi-tier allocation to MLX/GGUF quantization. llama.cpp --tensor-type is the most flexible backend (supports arbitrary per-tensor types via regex). MLX's quant_predicate callable can implement 4 tiers with a custom function. We build the orchestrator: profiler → optimizer → quantizer config generator.

Sources & Citations
ScaleBITS — Hardware-Aligned Mixed-Precision (2026) — arxiv.org/abs/2602.17698
LLM-MQ — Mixed-Precision Quantization for LLMs — Tsinghua/NicSEFC
MoPEQ — Mixture of Mixed Precision Experts (2025) — arxiv.org/abs/2509.02512
SliM-LLM — Salience-Driven Mixed-Precision — ICML 2025
Mixed-Precision Quantization Survey — arxiv.org/abs/2510.16805
AWQ — Activation-Aware Weight Quantization — arxiv.org/abs/2306.00978
GPTQ — Post-Training Quantization — arxiv.org/abs/2210.17323
QuIP# — Coherence-Inducing Quantization — arxiv.org/abs/2307.13304
ResQ — Low-Rank Residual Mixed-Precision (2025) — ICLR 2025
Cocktail — Chunk-Adaptive Mixed-Precision (2025) — arxiv.org/abs/2503.23294
MLX-LM Learned Quants — github.com/ml-explore/mlx-lm
llama.cpp Quantize README — github.com/ggml-org/llama.cpp
WWDC 2025 — MLX on Apple Silicon — developer.apple.com
Exa Research (2x Pro) — 101+157 pages crawled, 100+ citations