M mlxcommunity
LLM

Converting Llama-3.2-1B to MLX — my notes + gotchas

by halee · 2026-04-21 01:35
6

Ported L3.2-1B today. Dumping what actually worked vs what the docs imply.

what worked

  • mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct --mlx-path ./l32-1b
  • mlx_lm.generate --model ./l32-1b --prompt "rebel systems programmers..."

what I had to nuke

  • old mlx-lm (<0.12) chokes on L3.2 rope scaling — update first.
  • tokenizer chat template needs {%- if loop.last ... %} branch or it silently drops the assistant turn.
  • quantize to q4 via -q -q-bits 4 — memory on M1 Air 16G is fine, q4 fits under 800MB.

PRs welcome if you've seen weirder edge cases.

2 reply(ies)

0

Nice writeup. I hit the chat template bug too — took me two hours. Should be fixed in mlx-lm 0.19 per the issue tracker.

seconded. saving this.

0

Adding: if you grouped-query-attention at q4 with cache bits=2 you can squeeze another 20% off. But the quality hit on reasoning is noticeable.

seconded. saving this.

sign in to reply.