Converting Llama-3.2-1B to MLX — my notes + gotchas

OP · halee

Ported L3.2-1B today. Dumping what actually worked vs what the docs imply.

Old mlx-lm (<0.12) chokes on L3.2 rope scaling, so update first.
Tokenizer chat template needs a {%- if loop.last ... %} branch or it silently drops the assistant turn.
Quantize to q4 via -q --q-bits 4. Memory on an M1 Air 16GB is fine; the q4 model fits under 800MB.
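
A back-of-envelope check on the "under 800MB" figure, as a minimal sketch rather than a measurement. It assumes roughly 1.24e9 parameters for the 1B model, a quantization group size of 64, and one fp16 scale plus one fp16 bias stored per group; none of those numbers come from the notes above, and it ignores the few tensors (such as norms) that stay unquantized.

```python
def quantized_size_mb(n_params: float, bits: int = 4, group_size: int = 64) -> float:
    """Rough size of group-quantized weights in MB (1 MB = 1e6 bytes)."""
    # Assumption: every group of `group_size` weights carries one fp16 scale
    # and one fp16 bias, i.e. 32 extra bits shared across the group.
    overhead_bits_per_weight = 2 * 16 / group_size
    bits_per_weight = bits + overhead_bits_per_weight
    return n_params * bits_per_weight / 8 / 1e6


def fp16_size_mb(n_params: float) -> float:
    """Size of the same weights stored as plain fp16, in MB."""
    return n_params * 2 / 1e6


N_PARAMS = 1.24e9  # assumed parameter count, not taken from the thread

if __name__ == "__main__":
    print(f"fp16: {fp16_size_mb(N_PARAMS):.1f} MB")      # 2480.0 MB
    print(f"q4:   {quantized_size_mb(N_PARAMS):.1f} MB")  # 697.5 MB
```

Under those assumptions q4 works out to about 4.5 bits per weight, which lands comfortably below 800MB. Actual usage at runtime will be higher once the KV cache and activations are counted.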

PRs welcome if you've seen weirder edge cases.

tito · 2026-04-21 01:35

Nice writeup. I hit the chat template bug too — took me two hours. Should be fixed in mlx-lm 0.19 per the issue tracker.
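
For anyone scripting the conversion, here is a minimal sketch of a version guard built on the two thresholds mentioned in this thread (0.12 for rope scaling, 0.19 for the chat template fix). The helper names are made up for illustration, and the package name "mlx-lm" is assumed to match what pip installed. Versions are compared as integer tuples because plain string comparison would rank "0.9" above "0.12".

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.19.2' into (0, 19, 2); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # e.g. 'dev0' or 'rc1' suffixes are ignored in this sketch
        parts.append(int(piece))
    return tuple(parts)


def at_least(installed: str, minimum: str) -> bool:
    """True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_mlx_lm_version():
    """Return the installed mlx-lm version string, or None if not installed."""
    try:
        return metadata.version("mlx-lm")  # assumed distribution name
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_mlx_lm_version()
    if found is None:
        print("mlx-lm is not installed")
    else:
        # Thresholds are the ones reported in this thread, not verified here.
        print("rope scaling ok:", at_least(found, "0.12"))
        print("chat template fix:", at_least(found, "0.19"))
```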

gunmetal · 2026-04-21 01:35

seconded. saving this.

prism · 2026-04-21 01:35

Adding: with grouped-query attention at q4, setting cache bits=2 can squeeze another 20% off. But the quality hit on reasoning is noticeable.
