Project Optimum
Multi-tier quantization strategy for running frontier LLMs locally on Apple Silicon. The thesis: allocating 3-4 different bit-widths across model layers based on sensitivity analysis dramatically outperforms the industry-standard 1-2 tier approach.
Why 1-2 Tiers Isn't Enough
Current tools default to uniform quantization (every layer at 4-bit) or at most two tiers (4-bit plus 6-bit for sensitive layers). This leaves significant quality on the table. Research from ScaleBITS, LLM-MQ, and MoPEQ consistently shows that finer-grained precision allocation yields a markedly better quality/size Pareto frontier.
Key insight: at the same average bits-per-weight, multi-tier allocation preserves substantially more quality. The gap widens at lower averages (the 2-3 bit range), where uniform quantization collapses but mixed precision holds up. The reason: a small number of critical layers are disproportionately sensitive, so giving them higher precision costs almost nothing in total size but preserves most of the model's capability.
Transformer Layer Sensitivity Map
Not all parts of a transformer are created equal. Research consistently shows embeddings, first/last layers, attention V projections, and MLP down-projections are most sensitive to quantization. Middle layers and less-used MoE experts can be compressed aggressively.
The 4-Tier Quantization Strategy
Allocate 4 precision levels across the model based on measured sensitivity. The key insight: Tier 1 (highest precision) covers only ~5% of parameters but preserves the critical information pathways. The memory savings come from aggressively compressing the 30% of parameters in Tier 4.
Preservation Layer (Tier 1, 8-bit)
token_embd, lm_head, layer_0, layer_47. Embeddings, output projection, and the first and last transformer blocks. These form the model's interface with language; any error here cascades through every token.
Sensitive Computation (Tier 2, 6-bit)
v_proj, down_proj, plus the top-50 MoE experts. Value projections carry the information content; down-projections are the MLP bottleneck; high-frequency experts in MoE models see the most traffic.
Standard Compression (Tier 3, 4-bit)
q_proj, k_proj, o_proj, up_proj, gate_proj, plus mid-tier experts. These projections are less sensitive: attention query/key patterns are redundant and tolerant of noise. Standard 4-bit quantization here is near-lossless.
Aggressive Compression (Tier 4, 2-2.5 bit)
Everything else: remaining middle-layer tensors and rarely-routed MoE experts. These parameters contribute least to output quality per Hessian analysis. With importance-weighted quantization (imatrix), even 2-bit preserves function.
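The four tiers can be sketched as an ordered set of pattern rules, first match wins. A minimal sketch: the tensor-name patterns below assume Hugging Face-style naming for a 48-block model and are illustrative only; real names depend on the model, and MoE expert tiers (which can't be expressed as a static regex) would come from the measured sensitivity map instead.

```python
import re

# Ordered tier rules: first match wins. Patterns are illustrative and
# assume Hugging Face-style tensor names for a 48-block model.
TIER_RULES = [
    (r"(token_embd|embed_tokens|lm_head)", 8),        # Tier 1: preservation
    (r"layers\.(0|47)\.", 8),                         # Tier 1: first/last blocks
    (r"(v_proj|down_proj)", 6),                       # Tier 2: sensitive computation
    (r"(q_proj|k_proj|o_proj|up_proj|gate_proj)", 4), # Tier 3: standard
]
DEFAULT_BITS = 2  # Tier 4: aggressive (everything else)

def bits_for(tensor_name: str) -> int:
    """Return the target bit-width for a tensor name."""
    for pattern, bits in TIER_RULES:
        if re.search(pattern, tensor_name):
            return bits
    return DEFAULT_BITS
```

Note that the layer-0/47 rule sits above the projection rules, so a v_proj inside the first block still lands in Tier 1.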
The math, assuming a ~50B-parameter model: 5% at 8-bit (~2.5GB) + 15% at 6-bit (~5.6GB) + 50% at 4-bit (~12.5GB) + 30% at 2.5-bit (~4.7GB) = ~4.05 avg bits/weight → ~25.3GB. That fits comfortably in the Mac Studio's 96GB with plenty of headroom for KV cache and activations. The Mini's 24GB is tighter: the same strategy needs a lower average (shifting more layers into Tiers 3-4) or a smaller base model.
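The tier arithmetic is easy to sanity-check in a few lines. The ~50B parameter count is an assumption for illustration; the 5/15/50/30 split at 8/6/4/2.5 bits gives ~4.05 average bits per weight regardless of model size.

```python
# Sanity check of the tier mix arithmetic. The ~50B parameter count is
# an assumption for illustration, not a measured model size.
tiers = [          # (fraction of params, bits per weight)
    (0.05, 8.0),   # Tier 1: preservation
    (0.15, 6.0),   # Tier 2: sensitive computation
    (0.50, 4.0),   # Tier 3: standard compression
    (0.30, 2.5),   # Tier 4: aggressive compression
]
n_params = 50e9

avg_bits = sum(frac * bits for frac, bits in tiers)
total_gb = n_params * avg_bits / 8 / 1e9

print(f"average bits/weight: {avg_bits:.2f}")    # 4.05
print(f"total weight memory: {total_gb:.1f} GB") # 25.3
```

Weight memory alone is not the whole budget: KV cache and activations come on top, which is why the headroom on the 96GB Studio matters.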
Sensitivity Analysis Methods
To determine which layers deserve higher precision, we need to measure sensitivity. Four main approaches exist — they trade accuracy for compute cost. Our recommendation: gradient Taylor approximation + activation magnitude gives 90% of the accuracy at 10% of the compute cost.
| Method | How It Works | Compute Cost | Accuracy | Best For | Verdict |
|---|---|---|---|---|---|
| Hessian Trace (GPTQ, SqueezeLLM, QuIP#) | Measures curvature of the loss landscape per layer. Second-order information reveals exactly how much each weight perturbation affects output. | Very high: full Hessian diagonal per layer; hours on GPU. | Highest | Research, one-time calibration | Accurate but slow |
| Gradient Taylor (LLM-MQ) | First-order Taylor approximation of output loss. Accumulates gradient magnitudes per layer during a calibration pass. | Medium: single forward+backward pass per calibration sample. | Good | Layer-level ranking, practical use | Recommended |
| Activation Magnitude (AWQ, SmoothQuant) | Tracks activation distribution and outliers per layer. Layers with larger activations and more outliers are more sensitive. | Low: forward pass only, cheap statistics. | Decent | Quick estimates, outlier detection | Recommended |
| MoPEQ Hybrid (MoPEQ, 2025) | Combines Hessian trace with activation frequency. K-means clusters experts by combined importance. Model-wise (not layer-wise) allocation. | High: Hessian per expert plus clustering. | Highest for MoE | MoE models specifically | MoE-specific |
| imatrix (llama.cpp) | Importance matrix computed from calibration data. Records per-weight importance based on activation statistics. Guides K-quant allocation. | Medium: forward pass over a calibration corpus. | Good | GGUF quantization, practical tool | Battle-tested |
Our approach: Gradient Taylor + Activation Magnitude, combined. Run a calibration dataset through the model, computing both gradient-based sensitivity and activation statistics in a single pass. Use the combined score to rank every layer (and for MoE, every expert). Feed the rankings into an ILP solver that allocates bit-widths under a memory budget constraint. This gives us ~90% of Hessian accuracy at a fraction of the compute, and runs feasibly on Apple Silicon without needing A100s.
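The score combination can be sketched simply. This is a sketch under assumptions: the min-max normalization and the equal-weight blend (alpha = 0.5) are our choices, not a published formula, and the inputs stand in for per-layer accumulated first-order Taylor terms (|grad × weight|) and activation statistics from the calibration pass.

```python
# Sketch: blend per-layer gradient-Taylor scores with activation
# statistics into one sensitivity ranking. The normalization scheme
# and alpha weighting are assumptions, not a published formula.
def combined_sensitivity(taylor: dict, act_stats: dict, alpha: float = 0.5) -> dict:
    """Blend min-max-normalized Taylor and activation scores per layer."""
    def normalize(d):
        lo, hi = min(d.values()), max(d.values())
        span = (hi - lo) if hi > lo else 1.0
        return {k: (v - lo) / span for k, v in d.items()}

    t, a = normalize(taylor), normalize(act_stats)
    return {k: alpha * t[k] + (1 - alpha) * a[k] for k in taylor}

# Toy example with made-up per-layer statistics.
taylor = {"layers.0": 9.1, "layers.1": 2.3, "layers.2": 8.7}
act    = {"layers.0": 4.0, "layers.1": 1.1, "layers.2": 2.2}
scores = combined_sensitivity(taylor, act)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The ranking (not the absolute score) is what the allocation optimizer consumes, so the normalization only needs to make the two signals comparable.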
Implementation Roadmap
Five phases from research to production. Mercer handles the heavy coding; Bessemer coordinates and validates. Target: have a working 4-tier quantized model deployed on the Mac Studio within 2 weeks.
Sensitivity Profiler Mercer
Build a Python tool that runs calibration data through a model and outputs per-layer (and per-expert for MoE) sensitivity scores. Uses gradient Taylor + activation magnitude. Outputs a JSON sensitivity map.
Allocation Optimizer Mercer
Integer Linear Programming solver that takes sensitivity scores + memory budget and outputs optimal per-layer bit allocation. Supports 4 tiers (2/3/4/6/8 bit). Constrained to stay within target memory.
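The constraint structure can be illustrated without a full ILP solver. A minimal greedy stand-in, assuming per-tensor parameter counts and sensitivity scores as inputs: start every tensor at the lowest tier, then sweep in sensitivity order, upgrading one tier at a time while the memory budget allows. A real implementation would hand the same variables and constraint to an ILP solver (e.g. PuLP) for an optimal allocation.

```python
# Greedy stand-in for the ILP allocator; it captures the budget
# constraint but does not guarantee optimality. Inputs are assumed:
# params maps tensor name -> parameter count, sensitivity maps
# tensor name -> combined sensitivity score.
BITS = [2, 4, 6, 8]  # allowed tiers, ascending

def allocate(params: dict, sensitivity: dict, budget_bytes: float) -> dict:
    """Return tensor name -> bit-width, staying within budget_bytes."""
    tier = {name: 0 for name in params}                  # index into BITS
    used = sum(n * BITS[0] / 8 for n in params.values()) # baseline cost
    order = sorted(params, key=sensitivity.get, reverse=True)
    upgraded = True
    while upgraded:  # sweep until no further upgrade fits
        upgraded = False
        for name in order:  # most sensitive tensors upgrade first
            i = tier[name]
            if i + 1 < len(BITS):
                extra = params[name] * (BITS[i + 1] - BITS[i]) / 8
                if used + extra <= budget_bytes:
                    tier[name] = i + 1
                    used += extra
                    upgraded = True
    return {name: BITS[i] for name, i in tier.items()}
```

With a tight budget the least sensitive tensors are the ones left at 2-bit, which is exactly the Tier 4 behavior described above.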
Quantization Pipeline Mercer + Bessemer
Two parallel paths: (A) MLX: extend quant_predicate to support 4 tiers via custom callable. (B) llama.cpp: generate --tensor-type regex from allocation.json. Both consume the same allocation output.
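For path (B), the generator is mostly string formatting. A minimal sketch: the bits-to-GGUF-type mapping and the `pattern=TYPE` argument form follow recent llama.cpp `llama-quantize` builds, but both should be verified against the actual version in use; the allocation dict here is an illustrative stand-in for allocation.json.

```python
# Sketch: turn an allocation map (tensor-name pattern -> bits) into
# llama-quantize --tensor-type arguments. The flag syntax and GGUF
# type names follow recent llama.cpp builds; verify against yours.
GGUF_TYPE = {8: "q8_0", 6: "q6_K", 4: "q4_K", 2: "q2_K"}

def tensor_type_args(allocation: dict) -> list:
    """Return a CLI argument list for llama-quantize."""
    args = []
    for pattern, bits in sorted(allocation.items()):
        args += ["--tensor-type", f"{pattern}={GGUF_TYPE[bits]}"]
    return args

# Illustrative allocation, not real optimizer output.
alloc = {"attn_v": 6, "ffn_down": 6, "token_embd": 8, "ffn_gate": 4}
args = tensor_type_args(alloc)
```

Because both backends consume the same allocation output, the MLX path only needs a different adapter over the identical dict.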
Validation Bessemer
Run v5 eval harness against three variants: uniform 4-bit, 2-tier (existing mixed_4_6), and our 4-tier custom. Compare quality scores, memory footprint, tokens/sec, and time-to-first-token. Also perplexity on held-out text.
Production Deploy Bessemer
Deploy winning quantization to Mac Studio. Update mlx-server-tuned.py with new profile. Update Thurin docker-compose upstream. Monitor inference quality and latency in production for 48 hours.
Current Tool Landscape
No turnkey 3-4 tier quantization tool exists today. The pieces are there — MLX has the hooks, llama.cpp has the flexibility, academia has the algorithms. We just need to glue them together.
MLX quant_predicate
Built-in recipes: mixed_2_6, mixed_3_4, mixed_3_6, mixed_4_6. Custom callable for per-layer decisions. Supports 2-8 bit weights with configurable group size.
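A custom 4-tier callable might look like the sketch below. Caveats: our understanding is that recent mlx-lm invokes the predicate per layer and accepts either a bool or a dict of quantization kwargs such as {"bits", "group_size"}; the exact signature should be verified against the installed mlx-lm version, and ALLOCATION here is a hypothetical precomputed path-to-bits map from the optimizer.

```python
# Sketch of a 4-tier quant predicate for MLX quantization. The
# (path, module)-style calling convention and the {"bits", "group_size"}
# return dict are assumptions about mlx-lm's API; verify before use.
ALLOCATION = {"model.layers.0.self_attn.v_proj": 8}  # illustrative entry

def four_tier_predicate(path, module, default_bits=4, group_size=64):
    """Look up the optimizer's allocation; fall back to Tier 3 (4-bit)."""
    bits = ALLOCATION.get(path, default_bits)
    return {"bits": bits, "group_size": group_size}
```

This is the same adapter pattern as the llama.cpp path: the predicate is a thin lookup over the shared allocation output.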
MLX dynamic_quant
Takes sensitivity JSON + target-bpw. Allocates high/low bits per layer. Currently only supports 2 levels (--low-bits, --high-bits). Could be extended to support 3-4 tiers.
llama.cpp --tensor-type
Regex-based per-tensor type selection. Combined with imatrix for importance-weighted quantization. Supports arbitrary mixed quant types (Q2_K through Q8_0) per tensor.
Academic Methods
ScaleBITS: 1-8 bit search with ILP. LLM-MQ: gradient-based layer allocation. MoPEQ: MoE expert-level quant. SliM-LLM: salience-driven groups. Most are paper-only, no turnkey tools.
The Gap — What We Build
The missing piece is a pipeline that connects sensitivity analysis to multi-tier allocation to MLX/GGUF quantization. llama.cpp --tensor-type is the most flexible backend (supports arbitrary per-tensor types via regex). MLX's quant_predicate callable can implement 4 tiers with a custom function. We build the orchestrator: profiler → optimizer → quantizer config generator.