Dev.to · 1 min read
The Memory Bandwidth Gap Is 49x and Growing — Why...
The Wall I Hit on an RTX 4060 Was a Bandwidth Wall

Running Qwen3.5-9B on an RTX 4060 8GB gets you about 40 tok/s. Perfectly usable for a reasoning model. But scale up the model size and the numbers crater: 27B drops to 15 tok/s, and 32B at Q4 quantization barely holds 10 tok/s.

The bottleneck isn't GPU compute. It's memory bandwidth. LLM inference — especially the token generation phase — is rate-limited by how fast model weights can be read out of VRAM. The RTX 4060's GDDR6 bandwidth is 272 GB/s. A
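A quick way to see why bandwidth dominates: during decode, every model weight must be streamed from VRAM once per generated token, so bandwidth divided by weight size gives a hard ceiling on tok/s. Below is a minimal back-of-envelope sketch; the 272 GB/s figure comes from the text, while the bytes-per-parameter values and the "one full weight read per token" simplification (ignoring KV cache, activations, and quantization overhead) are assumptions for illustration.

```python
# Roofline estimate for LLM decode: tok/s ceiling = bandwidth / bytes read per token.
# Assumes memory-bound decode where all weights are read once per token.

def max_tokens_per_sec(params_billion: float,
                       bytes_per_param: float,
                       bandwidth_gbps: float = 272.0) -> float:
    """Upper bound on decode throughput in tokens/second."""
    weight_gb = params_billion * bytes_per_param  # GB streamed per token
    return bandwidth_gbps / weight_gb

# Q4 quantization ~ 0.5 bytes per parameter (a rough assumption)
for size in (9, 27, 32):
    print(f"{size}B @ Q4: ~{max_tokens_per_sec(size, 0.5):.0f} tok/s ceiling")
```

Under these assumptions the ceilings come out around 60, 20, and 17 tok/s for 9B, 27B, and 32B at Q4, consistent in shape with the observed 40 / 15 / 10 tok/s once real-world overhead (KV cache reads, dequantization, kernel launch latency) is accounted for.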