You send a prompt. You get a response. It feels almost instant — a second or two, maybe less. But between your code and that first streamed token, your request passes through multiple layers of infrastructure, gets transformed, routed, processed, and shaped before any "thinking" even begins.

Most explanations of how LLM APIs work jump straight to inference — attention mechanisms, transformers, token generation. That's like explaining how a car works by starting at the engine. What about the ignition? The fuel system? The road you chose to drive on?

As a developer building on these APIs, I kept noticing something: **the parts I actually control matter more to my output quality than the parts I don't.** The model's behavior isn't just decided during inference. It's shaped earlier — by prompt assembly, context selection, and tool availability. And those are exactly the parts most diagrams skip.

I'm not building inference engines. I'm building on top of them. This is the mental model I wish I had when I started.

![What Actually Happens When You Call an LLM API - Infographic](/assets/images/infog/what-actually-happens-when-you-call-an-llm-api.webp)

Let's walk through all seven stages, grouped into four phases: what you control, what the provider handles, what the model does, and what happens after.

## Phase 2: Infrastructure

### Section 2 — Infrastructure Entry

*"Your request is validated and routed before any AI runs."*

Once your assembled request leaves your code, it hits the provider's API gateway. This is standard web infrastructure — the same kind of gateway that sits in front of any large-scale API. Here, your request gets authenticated (is your API key valid?), and rate limits are enforced (have you exceeded your allowed requests per minute?). If you've ever received a `429 Too Many Requests` error, that's this layer doing its job.

After authentication, a load balancer routes your request to available compute resources.
Providers run large GPU clusters, and the load balancer decides which cluster or instance handles your specific call. This is why identical prompts can return with slightly different latency — the routing path may differ each time.

None of this is unique to LLMs. It's the same infrastructure pattern behind any cloud API at scale. But it's worth knowing it exists, because when things go wrong (timeouts, rate limits, inconsistent latency), the problem often lives here, not in the model itself.

### Section 3 — Preparation

*"Text becomes numbers. The system decides which model handles it."*

Before inference can begin, your text needs to be converted into something the model can process. This is **tokenization** — your prompt gets broken down into token IDs, which are numerical representations of words, subwords, or characters. Different providers use different tokenizers (you may have heard of BPE, SentencePiece, or similar approaches), and the exact way text gets split into tokens varies. This matters for one very practical reason: **token count determines your cost.** Every provider bills by the number of input and output tokens.

After tokenization, there's another layer that's rarely discussed publicly: **model routing.** If a provider offers multiple model sizes, or runs different versions of a model across their infrastructure, something needs to decide which specific model instance handles your request. The details of how providers route requests internally aren't publicly documented, and I won't pretend to know exactly how any specific provider does it. But knowing this layer exists helps explain why you might occasionally see slightly different behavior or response times on the same prompt and model.

## Phase 4: Output and Loop

### Section 5 — The Tool Loop

*"The model may pause, use tools, and continue reasoning."*

This is the section most LLM pipeline diagrams skip entirely. And in 2025, it might be the most important one to understand.
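Before walking through the mechanics, here is what "tool definitions" actually look like on the wire: JSON schemas attached to your request. This is a minimal sketch using an OpenAI-style payload shape; the model name, tool name, and fields are invented for illustration, and your provider's exact schema may differ.

```python
# Hypothetical chat-completion payload with one tool definition attached.
request = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # illustrative tool name
                "description": "Look up current weather for a city",
                "parameters": {  # JSON Schema describing the arguments
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
```

The `parameters` schema is what lets the model emit a well-formed tool call instead of free text.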
When you include tool definitions in your API request (Phase 1), you're giving the model the ability to do something other than generate text. It can choose to call a function, run a web search, query a database, or interact with an external service through protocols like MCP (Model Context Protocol).

When the model decides to use a tool, here's what happens:

1. Instead of generating a final text response, the model outputs a **tool call** — a structured request saying "I want to call this function with these arguments."
2. Your code (or the provider's infrastructure) **executes the tool** and gets a result.
3. The tool result gets **appended to the context**.
4. Inference runs **again** — the model now has the original prompt plus the tool result, and it continues generating from there.

This is the agentic loop. One user prompt can trigger multiple inference passes. The model might search the web, read the results, decide it needs more information, search again, and then finally compose a response. Each loop means more tokens processed, more latency, and more cost.

If you're building anything with tool use — and increasingly, most production LLM applications are — understanding this loop is essential. It's the difference between a straightforward API call and a complex chain of operations that can run up your token bill quickly. I learned this the expensive way.

### Section 6 — Post-Processing

*"Output is cleaned and validated before returning."*

Once the model finishes generating (because it reached a natural stopping point, hit your max token limit, or completed a tool loop), the raw output tokens go through post-processing. The tokens get converted back to readable text (detokenization). Safety classifiers and content moderation systems evaluate the output — every major provider runs these. If you've ever received a response that was blocked or flagged by a content filter, that's this stage at work.
For structured output requests (like JSON mode), validation happens here too. The output gets checked against the expected format.

There's a small but useful detail that lives in this stage: the `finish_reason` field in your API response. It tells you *why* generation stopped:

- `stop` — the model reached a natural ending
- `length` — it hit your max token limit
- `content_filter` — the output was blocked by safety systems
- `tool_use` — the model wants to call a tool (back to Section 5)

This field is your diagnostic tool. If your responses keep getting cut off, check the finish reason before assuming the model is broken.

### Section 7 — Response and Metrics

*"Every token is measured, logged, and billed."*

Your response arrives either as streaming chunks (Server-Sent Events) or as a complete JSON response, depending on how you configured the request. Every response includes usage metadata: how many tokens were in your prompt, how many tokens were generated, and the total. This is your billing data. Input and output tokens are priced separately, with output tokens typically costing more — check your provider's current pricing page for exact rates, as these change.

On the provider side, everything gets logged: which model handled the request, token counts, latency, finish reason, and safety flags. This feeds their monitoring dashboards, abuse detection systems, and capacity planning.

## Frequently Asked Questions

### What is prompt assembly?

Prompt assembly is the process of constructing the full input that gets sent to the model. It typically combines your system prompt, the conversation history, any retrieved context, and your tool definitions into the single request the model will see before inference begins.

### What is tokenization?

Tokenization is how text gets converted into numbers the model can process. Your prompt gets broken into small pieces called tokens — sometimes whole words, sometimes parts of words, sometimes individual characters. Each token maps to a numerical ID.
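To make this concrete, here is a toy greedy longest-match tokenizer. The vocabulary and IDs are invented; real BPE or SentencePiece tokenizers learn vocabularies of tens of thousands of entries from data.

```python
# Toy subword tokenizer: greedy longest-match against a tiny invented vocabulary.
VOCAB = {"un": 101, "believ": 202, "able": 303, "!": 404}

def tokenize(text: str) -> list[int]:
    ids, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("unbelievable!"))  # [101, 202, 303, 404]
```

One word plus punctuation became four billable tokens, which is why "how many words is my prompt?" and "how many tokens is my prompt?" are different questions.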
Different providers use different tokenization methods (BPE, SentencePiece, and others), which is why the same text can produce different token counts across different models. Token count matters because it directly determines your cost — providers bill by the number of tokens processed.

### What is a KV cache?

KV stands for key-value. During the prefill phase, when the model processes all your input tokens, it creates a compressed internal representation of your context. This is the KV cache. Think of it as the model's working memory for your prompt: it lets each new output token be generated without reprocessing the entire input from scratch.

### What is the prefill phase?

The prefill phase is the first step of inference. The model processes all your input tokens in parallel and builds the KV cache. This is why longer prompts mean a longer wait before you see the first output token — more input tokens means more work during prefill. The prefill phase happens once per inference pass. After it completes, the model moves to the decode phase.

### What is the decode loop?

The decode loop is how the model generates its response, one token at a time. After the prefill phase builds the KV cache, the model predicts the next most likely token, adds it to the output, and then uses that updated context to predict the token after that. This repeats until the model reaches a stopping point, hits your max token limit, or decides to call a tool. This autoregressive loop is why LLM responses can be streamed — each token is available as soon as it is generated.

### What are temperature, top_p, and top_k?

These are sampling parameters that control how the model picks each next token during the decode loop. Temperature adjusts randomness — lower values (like 0.1) make the model more predictable and focused, higher values (like 1.0) make it more varied and creative. Top_p (nucleus sampling) limits the model to choosing from the smallest set of tokens whose combined probability exceeds a threshold. Top_k limits the model to choosing from the k most likely tokens.
These are set in your API call and are documented in every provider's API reference.

### What is the tool loop?

When you give a model tool definitions in your API request, the model can choose to call a tool instead of generating a text response. When it does, the tool executes, the result gets added to the context, and inference runs again. This cycle can repeat multiple times — the model might call several tools before producing a final answer. This is the agentic loop, and it is why a single user prompt can trigger multiple inference passes, each adding latency and token cost.

### What is MCP?

MCP stands for Model Context Protocol. It is an open protocol that standardizes how models connect to external tools, data sources, and services through a common interface.

### What does finish_reason tell me?

The finish_reason field in your API response tells you why the model stopped generating. The common values are: `stop` (the model reached a natural ending), `length` (it hit your max token limit — your response may be incomplete), `content_filter` (the output was blocked by safety systems), and `tool_use` (the model wants to call a tool and is waiting for the result). This field is your first diagnostic check when something seems off with a response.

### What is RAG?

RAG stands for Retrieval Augmented Generation. It is a pattern where relevant documents are retrieved from an external source (a search index or vector database, for example) and added to the prompt, so the model can ground its answer in material it was never trained on.

### Why are output tokens more expensive than input tokens?

Input tokens are processed in parallel during the prefill phase — the model reads them all at once. Output tokens are generated sequentially in the decode loop — one at a time, each requiring a separate forward pass through the model. Sequential generation is more compute-intensive per token than parallel processing, which is why providers typically charge more for output tokens. Check your provider's pricing page for the exact rates.

### What is streaming and why does it exist?

Streaming delivers the model's output incrementally, token by token, as it is generated, rather than waiting for the complete response. Because the decode loop produces one token at a time anyway, each token can be sent the moment it exists. The practical benefit is perceived latency: users start reading the answer almost immediately instead of staring at a spinner.

### What is autoregressive generation?

Autoregressive means the model generates output one piece at a time, where each new piece depends on everything that came before it. The model predicts the next token, adds it to the sequence, then uses the updated sequence to predict the token after that. It cannot produce token N+1 until token N exists, which is why generation is sequential and why streaming the output is possible at all.
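The autoregressive loop can be sketched in a few lines, with `predict_next` standing in for a full forward pass through the model (here it just replays a canned script):

```python
def predict_next(context: list[str]) -> str:
    """Stand-in for a model forward pass: returns the next token from a canned script."""
    script = ["Hello", ",", " world", "<eos>"]
    return script[len(context) - 1]  # pretend the context determines the next token

def decode(prompt: list[str], max_tokens: int = 16) -> list[str]:
    context = list(prompt)
    output: list[str] = []
    while len(output) < max_tokens:  # hitting this limit -> finish_reason "length"
        token = predict_next(context)  # one forward pass per output token
        if token == "<eos>":  # natural stopping point -> finish_reason "stop"
            break
        output.append(token)  # in a streaming API, this token is sent immediately
        context.append(token)  # the new token feeds the next prediction
    return output

print(decode(["<prompt>"]))  # ['Hello', ',', ' world']
```

The structure is the point: each iteration depends on the token the previous iteration produced, which is why output is sequential and why each token can be streamed as soon as it exists.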