Ported L3.2-1B today. Dumping what actually worked vs what the docs imply.
    mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct --mlx-path ./l32-1b
    mlx_lm.generate --model ./l32-1b --prompt "rebel systems programmers..."

- mlx-lm (<0.12) chokes on L3.2 rope scaling; update first.
- [...] {%- if loop.last ... %} branch, or it silently drops the assistant turn.
- -q --q-bits 4: memory on M1 Air 16G is fine, q4 fits under 800MB.

PRs welcome if you've seen weirder edge cases.
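For anyone wondering where "under 800MB" comes from, here is a rough sanity check rather than a measurement. It assumes about 1.24e9 parameters, a quantization group size of 64, and one fp16 scale plus one fp16 bias stored per group. None of those numbers are from the post, and the helper name is made up for illustration.

```python
# Back-of-the-envelope size of group-quantized weights.
# Assumptions (not from the post): ~1.24e9 parameters, group size 64,
# one fp16 scale and one fp16 bias per group (32 metadata bits per group).

def quantized_weight_mb(n_params: float, bits: int = 4, group_size: int = 64,
                        meta_bits_per_group: int = 32) -> float:
    """Approximate weight size in MB (10**6 bytes) for group-wise quantization."""
    # Each weight costs its own bits plus its share of the group's metadata.
    bits_per_weight = bits + meta_bits_per_group / group_size
    return n_params * bits_per_weight / 8 / 1e6

N_PARAMS = 1.24e9  # assumption: rough parameter count for a "1B" model

q4_mb = quantized_weight_mb(N_PARAMS, bits=4)
fp16_mb = N_PARAMS * 16 / 8 / 1e6  # unquantized half-precision baseline

print(f"q4 ~{q4_mb:.0f} MB, fp16 ~{fp16_mb:.0f} MB")
```

Under those assumptions q4 costs about 4.5 bits per weight and comes out near 700 MB, which is consistent with the figure above. Layers that stay unquantized and runtime buffers are ignored here.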
Nice writeup. I hit the chat template bug too — took me two hours. Should be fixed in mlx-lm 0.19 per the issue tracker.
Adding: if you run grouped-query attention at q4 with cache bits=2, you can squeeze another 20% off, but the quality hit on reasoning is noticeable.
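A rough sketch of why grouped-query attention and a low-bit cache both shrink memory, reading "cache bits=2" as 2-bit KV cache quantization (that reading is an assumption). The model shape below (16 layers, 32 query heads, 8 KV heads, head dim 64) is also assumed, so check the model's config.json. The function is hypothetical, and this does not try to reproduce the 20% figure, which depends on context length.

```python
# KV-cache bytes per token, with and without grouped-query attention (GQA).
# Assumed shape (not stated in the thread): 16 layers, 32 query heads,
# 8 KV heads, head_dim 64.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bits: float = 16, group_size: int = 64,
                       meta_bits_per_group: int = 0) -> float:
    """Bytes of K plus V cache added for each token in the context."""
    bits_per_value = bits + meta_bits_per_group / group_size
    values = 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V tensor
    return values * bits_per_value / 8

LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 16, 32, 8, 64

# Without GQA every query head would keep its own K/V.
mha_fp16 = kv_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM)
# With GQA only the KV heads are cached.
gqa_fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
# GQA plus a 2-bit cache, assuming an fp16 scale and bias per group of 64.
gqa_2bit = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bits=2,
                              meta_bits_per_group=32)

for ctx in (2048, 8192):
    print(f"ctx={ctx}: fp16 cache ~{gqa_fp16 * ctx / 1e6:.0f} MB, "
          f"2-bit cache ~{gqa_2bit * ctx / 1e6:.0f} MB")
```

The cache grows linearly with context, so the share of total memory it saves is small for short prompts and larger for long ones.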
seconded. saving this.