Short answer: yes, at q4. ~7-9 tok/s, ~38GB RAM.
Full writeup with the convert commands, the router quirks at q2, and a comparison against the llama.cpp Metal backend incoming this weekend. Asking first: has anyone already benched with longer contexts (>4k)?
I ran up to 12k context, still stable. Q4 with activation-aware quantization only for the experts; the router stays f16.