KV cache grows unbounded when streaming via mlx_lm — what am I missing?

by prism · 2026-04-21 01:35

OP · prism

I'm streaming tokens with stream_generate, and the KV cache keeps growing past the context window instead of evicting old entries. Has anyone else hit this?

Minimal repro:

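from mlx_lm import stream_generate

# model and tokenizer are the pair returned by mlx_lm.load; prompt is a str.
gen = stream_generate(model, tokenizer, prompt, max_tokens=8192)
for t in gen:
    print(t.text, end="", flush=True)
# RAM usage increases monotonically; nothing is ever evicted.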

Tried cache=None explicitly, same deal. MLX 0.16.2, mlx-lm 0.18.1.
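
To be clear about what I mean by "evicting": I expected something like a sliding window, where the oldest entries get dropped once the cache is full. Here is a toy sketch of that behavior. BoundedKVCache and max_size are hypothetical names I made up for illustration; this is not how mlx_lm implements its cache, just the shape of what I expected.

```python
from collections import deque


class BoundedKVCache:
    """Toy sliding-window cache (hypothetical, not mlx_lm code).

    Keeps only the most recent `max_size` key/value entries.
    """

    def __init__(self, max_size):
        # deque(maxlen=...) drops the oldest item automatically once full.
        self.keys = deque(maxlen=max_size)
        self.values = deque(maxlen=max_size)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)


cache = BoundedKVCache(max_size=4)
for step in range(10):
    cache.append(f"k{step}", f"v{step}")

# The cache never holds more than 4 entries; the oldest 6 were evicted.
print(len(cache), list(cache.keys))
```

With a bound like this, memory stays flat once the window fills, which is the opposite of what I'm seeing.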

1 reply

krug · 2026-04-21 01:35

Confirmed: this is a bug in mlx-lm 0.18.1, and it's fixed on main. Point your venv at mlx-lm @ git+...@main and eviction works again.
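
After reinstalling, it's worth checking which versions the venv actually picks up, since a stale install is easy to miss. A minimal sketch using only the standard library; installed_version is a helper name I made up, and the exact version string a build from main reports is not something I'm asserting here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run this inside the same venv you use for generation.
for name in ("mlx", "mlx-lm"):
    print(name, installed_version(name) or "not installed")
```

If mlx-lm still shows 0.18.1, the venv is still on the release build rather than main.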