Justy: Okay, so this piece is about how LLM servers actually schedule requests without wasting GPU time on padding. Which sounds boring until you realize that if you're running any kind of inference service at scale, this is literally the difference between your hardware being useful and your hardware being expensive and idle.
Cody: Right. And the article's doing something smart—it starts with static batching, which is the intuitive but wrong way, then shows you why continuous batching fixes it.
Justy: So static batching is… you group requests into fixed batches, each batch waits for its slowest request to finish, then the next batch starts. That's it.
Cody: Exactly. And they use a concrete example: three requests in a batch. One needs six tokens, one needs fifty, one needs three hundred. The GPU decodes all of them token by token, but the six-token request finishes after step six. Its slot is still there, still active, still burning cycles on padding tokens until the three-hundred-token request finishes at step three hundred, 294 steps later.
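The arithmetic in that example can be sketched in a few lines. This is a toy calculation, not the article's code: the function name static_batch_waste is made up for illustration, and it assumes one decode step per generated token, with every slot held until the longest request in the batch finishes.

```python
def static_batch_waste(token_counts):
    """Count useful vs. wasted slot-steps for one static batch (toy model)."""
    steps = max(token_counts)                   # batch runs until the longest request ends
    total_slot_steps = steps * len(token_counts)
    useful = sum(token_counts)                  # steps that produced a real token
    wasted = total_slot_steps - useful          # steps spent on padding
    return useful, wasted, useful / total_slot_steps


# The three requests from the example: 6, 50, and 300 tokens.
useful, wasted, utilization = static_batch_waste([6, 50, 300])
print(useful, wasted, round(utilization, 3))
```

Under these assumptions, 544 of the 900 slot-steps go to padding, so only about 40 percent of the batch's decode work produces tokens anyone asked for.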
Justy: That is so inefficient.
Cody: It's bad. And they include the actual code: a static_batching function using Hugging Face transformers, with a batch size of three and six requests ranging from thirty to three hundred tokens. The output shows exactly what happens: all three slots wait at a barrier until the longest finishes, then the next wave starts.
Justy: Mm-hm. So the fix is continuous batching.
Cody: The fix is to not have a batch barrier at all. The moment request A finishes at step six, you pull a new request into that slot. While requests B and C are still decoding, you're already prefilling a new request that arrived. No padding, no idle slots, no waiting.
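A minimal simulation makes the difference between the barrier and the refill visible. This is a sketch under stated assumptions, not the article's implementation: the two function names and the six request lengths are hypothetical, each request is reduced to a count of remaining tokens, and prefill cost is ignored so that one loop iteration stands for one decode step.

```python
from collections import deque


def static_steps(lengths, slots):
    """Decode steps when each wave of `slots` requests waits for its slowest member."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths, slots):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    active = []      # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        # Refill any free slots from the queue; there is no batch barrier.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # One decode step yields one token per active request;
        # requests that reach zero remaining tokens leave the active set.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


# Hypothetical workload: six requests, three slots.
lengths = [30, 300, 60, 120, 90, 240]
print("static:", static_steps(lengths, 3))
print("continuous:", continuous_steps(lengths, 3))
```

With this made-up workload the static schedule takes 540 decode steps and the continuous one takes 390. The exact gap depends entirely on the mix of request lengths and on arrival order.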
Justy: And the way you make that work is ragged batching—you don't pad everything to the same length. You let each request keep its own length in the batch.
Cody: Right. You have to track which tokens belong to which request, but that's not expensive. The KV cache is already per-request anyway. Once you're in the decode loop, each step is just one forward pass that produces one new token for each active request, and you can dynamically add and remove requests from the active set without any barrier.
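One common way to track which tokens belong to which request is to pack all sequences into a single flat buffer and keep a list of start offsets. The sketch below shows that bookkeeping with plain Python lists. It is an illustration only: the helper names pack_ragged and unpack and the token IDs are made up, and real engines do the equivalent on GPU tensors with attention kernels that understand the offsets.

```python
from itertools import accumulate


def pack_ragged(sequences):
    """Concatenate variable-length sequences and record where each one begins."""
    flat = [tok for seq in sequences for tok in seq]
    # offsets[i] is the start of sequence i; offsets[-1] is the total token count.
    offsets = [0] + list(accumulate(len(seq) for seq in sequences))
    return flat, offsets


def unpack(flat, offsets, i):
    """Recover request i's tokens from the packed buffer."""
    return flat[offsets[i]:offsets[i + 1]]


# Three requests with made-up token IDs and lengths 2, 4, and 1.
seqs = [[11, 12], [21, 22, 23, 24], [31]]
flat, offsets = pack_ragged(seqs)

padded_cells = len(seqs) * max(len(s) for s in seqs)  # cells a padded batch would need
print(offsets, len(flat), padded_cells)
print(unpack(flat, offsets, 1))
```

Here the packed buffer holds 7 tokens where a padded 3-by-4 layout would hold 12 cells, and the offsets list is the only extra state needed to find each request again.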
Justy: So the implementation is actually not that hard?
Cody: The concept is simple. That's why vLLM and SGLang and other modern frameworks do this by default now.
Justy: And the reason I care about this is that if you're building any LLM service, you either implement continuous batching or your GPU is sitting idle waiting for slow requests. That's not a theoretical loss. That's real throughput, real cost.
Cody: Exactly. If you're serving hundreds of concurrent users, heterogeneous request lengths are the norm. One user asks for three tokens, another asks for five hundred. Static batching forces you to pick a batch size and a max length, and you're always padding short requests or truncating long ones or both.
Justy: Okay, so one thing I want to push on: the article assumes you have a GPU with enough memory to hold multiple KV caches at once. If memory is tight, continuous batching becomes a trade-off, right?
Cody: Yeah, that's real. KV cache size grows linearly with both sequence length and batch size. But the article assumes you're already batching, so you're already paying that memory cost. Continuous batching just lets you use that memory more efficiently by not padding.
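To put rough numbers on that memory cost, here is a back-of-the-envelope estimate. It uses the standard accounting for a transformer with ordinary attention: one key vector and one value vector per token, per layer, per KV head. The model dimensions below (32 layers, 32 KV heads, head size 128, 16-bit values) are assumptions chosen for illustration and do not come from the article.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token


# Hypothetical model shape, 16-bit cache entries.
shape = dict(layers=32, kv_heads=32, head_dim=128)
MIB = 2 ** 20

lengths = [6, 50, 300]
padded_tokens = len(lengths) * max(lengths)   # every slot reserved to the longest length
actual_tokens = sum(lengths)                  # only the tokens that really exist

print(kv_cache_bytes(1, **shape) / MIB)              # per-token cost in MiB
print(kv_cache_bytes(padded_tokens, **shape) / MIB)  # padded batch
print(kv_cache_bytes(actual_tokens, **shape) / MIB)  # unpadded batch
```

With these assumed dimensions each token costs 0.5 MiB of cache, so the padded batch reserves 450 MiB while the tokens that actually exist need 178 MiB. Models that share KV heads across query heads would have a smaller per-token cost.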
Justy: Fair. And for the people who should care about this—that's anyone running inference at scale. Not just researchers, not just hobbyists, but anyone actually serving requests.
Cody: If you're using a hosted inference API like OpenAI's or Anthropic's, they're almost certainly already doing something like this. If you're running your own stack, understanding this is kind of essential. You can't tune or debug what you don't understand.
Justy: Mm-hm. So if someone's building an LLM API or optimizing their inference costs, this is required reading.
Cody: Absolutely. And if you're not building inference yourself but you're evaluating services, understanding this is how you ask smart questions about throughput and latency guarantees.
Justy: Yeah. Like, "Do you use continuous batching?" is now a fair question to ask a vendor.
Cody: If they don't, you know they're leaving money on the table.
Justy: Yep. Alright, so if you're shipping inference, read this. If you're not but you're curious how the infrastructure actually works, also read it. The code is there, the problem is real, and the solution is practical.