Dev.to · 5d ago · 1 min read

I Fixed My LLM OOM Crashes by Shrinking the Draft Model (Speculative Decoding on Real Hardware)

The fix was swapping a 4B draft model for a 0.6B one in my speculative decoding config. That's the whole punchline. But the path there touched every assumption I had about how spec decode interacts with VRAM budgets on consumer hardware, so here's the full story.

TL;DR

| Change | Result |
| --- | --- |
| 4B draft → 0.6B draft | ~2 GiB saved, same MoE throughput |
| Embedding parallelism 16 → 8 | ~8 GiB freed |
| Combined | Dropped from ~97 GiB to ~87.7 GiB, no more OOM |

Spec decode isn't free. You're paying VRAM for both models sim
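The TL;DR arithmetic can be sanity-checked with a few lines of Python. This is only a sketch: the function name `projected_usage_gib` and the dictionary labels are made up for illustration, and the inputs are the rounded figures quoted in the table, so the result is expected to land near the reported ~87.7 GiB rather than on it.

```python
from typing import Iterable


def projected_usage_gib(baseline_gib: float, savings_gib: Iterable[float]) -> float:
    """Subtract each estimated saving from the baseline VRAM usage (all in GiB)."""
    return baseline_gib - sum(savings_gib)


# Rounded figures taken from the TL;DR table; the labels are illustrative only.
baseline_gib = 97.0
savings_gib = {
    "draft model 4B -> 0.6B": 2.0,
    "embedding parallelism 16 -> 8": 8.0,
}

projected = projected_usage_gib(baseline_gib, savings_gib.values())
reported = 87.7  # combined usage quoted in the TL;DR

print(f"projected: {projected:.1f} GiB")
print(f"reported:  {reported:.1f} GiB")
print(f"gap:       {reported - projected:+.1f} GiB")
```

The small gap between the projected and reported totals is what you would expect from summing approximate ("~") per-change savings; measured usage on the actual hardware is the number to trust.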
