LLM

Mixtral 8x7B on M3 Max 64G — actually usable?

by halee · 2026-04-21 01:35

OP · halee

Short answer: yes, at q4. I'm seeing ~7-9 tok/s with ~38 GB of RAM in use.

A full writeup with the convert commands, the router quirks at q2, and a comparison against the llama.cpp Metal backend is coming this weekend. Asking first: has anyone already benchmarked longer contexts (>4k)?

1 reply

tito · 2026-04-21 01:35

I ran up to 12k context and it was still stable. That was Q4 with activation-aware quantization applied only to the experts; the router stays f16.
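For anyone wondering how much extra memory the longer context costs, here is a rough back-of-the-envelope sketch of KV-cache size. It assumes the published Mixtral 8x7B shape (32 layers, 8 KV heads, head dim 128), an f16 cache, and no sliding-window truncation. It also reads "12k" as 12288 tokens, which is a guess. The function name is made up for illustration, and the figure excludes weights and compute buffers, so treat it as an estimate rather than a measurement.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a given context length.

    Defaults are the assumed Mixtral 8x7B shape with an f16 cache
    (2 bytes per element). Hypothetical helper, not part of any library.
    """
    # Factor of 2 covers both keys and values, stored per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


if __name__ == "__main__":
    for ctx in (4096, 12288):
        print(f"{ctx} tokens: {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

Under those assumptions the cache grows by about 128 KiB per token, so going from 4k to 12k adds roughly 1 GiB, which is small next to the quantized weights.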
