How LLMs Actually Work

You have probably used an AI assistant dozens of times this week. You type something, it responds, and it feels almost like talking to a person. But what is actually happening inside? Not at the marketing-brochure level — at the level of math and data structure. Understanding the mechanics does not make you an ML researcher, but it does make you a better user, a sharper critic, and a more effective builder. This article is that grounding.


The Core Idea: Next Token Prediction

An LLM (Large Language Model) is, at its heart, a very sophisticated next-word predictor. Given a sequence of words (or more precisely, tokens), it predicts what should come next. That sounds almost too simple to explain ChatGPT. But doing that prediction well — over billions of examples, across every domain of human knowledge — turns out to require a model that has learned something that looks a lot like understanding.

The key insight: if you train a model to predict the next word in every sentence ever written, it must implicitly learn grammar, facts, reasoning patterns, code syntax, rhetorical structure, and more. There is no other way to get good at the task. Prediction is the lever that unlocks everything else.
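
To make the idea of next-token prediction concrete, here is a deliberately tiny sketch: a bigram counter that predicts the next word from the previous word alone. It is nothing like a real LLM (the corpus, the function names, and the one-word context are all assumptions made for illustration), but the training signal is the same in spirit: look at what actually came next, and learn from it.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return (next_word, probability) for the most frequent follower, or None."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    best, freq = followers.most_common(1)[0]
    return best, freq / sum(followers.values())

# Toy corpus, invented for this example.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigram(corpus)
print(predict_next(model, "sat"))
print(predict_next(model, "the"))
```

A model like this can only ever parrot word pairs it has seen. The interesting part of an LLM is that it conditions on the whole preceding context rather than a single word, which is what the rest of this article builds toward.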


Tokens: Language Becomes Numbers

Before any of the interesting math happens, text has to become numbers. Models do not read words — they read tokens.

A token is a chunk of text, usually a word or part of a word. The tokenizer splits your input into these chunks and assigns each one a number from a fixed vocabulary. OpenAI's GPT models use tokenizers based on byte pair encoding (BPE), an algorithm originally developed for data compression and later adapted for NLP. Common words get their own token; rarer words get split into subword pieces.

"tokenization"  →  ["token", "ization"]       →  [3239, 1634]
"unbelievable"  →  ["un", "believ", "able"]   →  [443, 6301, 540]
"cat"           →  ["cat"]                    →  [9246]

A rough rule of thumb: one token ≈ 0.75 English words, or about 4 characters. The sentence you just read is around 20 tokens.
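
The splits shown above can be reproduced with a small sketch. Note the assumptions: this is a greedy longest-match lookup over a hand-written vocabulary, not real BPE (which learns its merge rules from data), and the `VOCAB` table simply reuses the illustrative pieces and IDs from the examples rather than any real tokenizer's vocabulary.

```python
# Hypothetical vocabulary: pieces and IDs mirror the examples in the text
# and are not taken from a real tokenizer.
VOCAB = {
    "token": 3239, "ization": 1634,
    "un": 443, "believ": 6301, "able": 540,
    "cat": 9246,
}

def tokenize(word, vocab=VOCAB):
    """Split a word into vocabulary pieces, always taking the longest match."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return pieces

def encode(word, vocab=VOCAB):
    """Map a word to the list of token IDs for its pieces."""
    return [vocab[piece] for piece in tokenize(word, vocab)]

print(tokenize("unbelievable"), encode("unbelievable"))
```

Real tokenizers also handle whitespace, punctuation, and raw bytes, so no input is ever "out of vocabulary"; this sketch raises an error instead.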

The context window is the maximum number of tokens a model can hold in its working memory at once, counting both your input and its output. GPT-3 had a 2,048-token window. Some current models, including Gemini 2.5 and certain Claude configurations, support windows of around one million tokens. This matters because the model cannot "see" anything outside its context window: it has no persistent memory beyond what you explicitly include.
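
As a back-of-the-envelope check, the 4-characters-per-token rule of thumb can be turned into a quick budget calculation. This is only an estimate (real tokenizers vary by language and content), and the function names and the 2,048-token window used in the example are assumptions for illustration.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the text; real tokenizers vary

def estimate_tokens(text):
    """Estimate the token count of a string from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def fits_in_context(prompt, max_output_tokens, context_window):
    """True if the estimated prompt plus the reserved output fits in the window."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window

prompt = "x" * 4000  # stand-in for roughly 1,000 tokens of text
print(estimate_tokens(prompt))
print(fits_in_context(prompt, max_output_tokens=1000, context_window=2048))
print(fits_in_context(prompt, max_output_tokens=1100, context_window=2048))
```

The point of reserving `max_output_tokens` is that the response shares the window with the input: a prompt that fills the window leaves the model no room to answer.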


Embeddings: Meaning as Geometry

A raw token ID (just a number) carries no meaning on its own. The first thing the model does is look up each token in an embedding table — a learned matrix that converts each token ID into a high-dimensional vector of floating-point numbers.

A typical embedding might have 4,096 dimensions. Each dimension captures some learned feature, though individual dimensions are rarely human-interpretable on their own. What matters is the geometric relationship between vectors: tokens with similar meanings end up near each other in this high-dimensional space.

The famous example comes from word embeddings such as word2vec: in a well-trained embedding space, king - man + woman ≈ queen. The geometry encodes analogy, at least approximately. Embeddings are not hand-crafted; they emerge entirely from training.
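
The analogy arithmetic is easy to try on a toy table. The vectors below are hand-picked two-dimensional stand-ins (one axis for "royalty", one for "gender"), which is exactly what real embeddings are not: real ones are learned and have thousands of dimensions. Treat this as a sketch of the geometry only.

```python
import math

# Hand-picked 2-D vectors for illustration: (royalty, gender).
# Real embeddings are learned during training, not written by hand.
EMBEDDINGS = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c, table=EMBEDDINGS):
    """Return the word whose vector is nearest to a - b + c, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [word for word in table if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(table[word], target))

print(analogy("king", "man", "woman"))
```

Excluding the three input words from the candidates is a common convention in analogy demos; without it, the nearest vector is often one of the inputs themselves.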

These embedding vectors are the form in which the model processes everything else. Your entire input becomes a sequence of vectors, one per token, each carrying a compressed representation of that token's meaning. The model then works on those vectors through a series of layers.


The Transformer: The Architecture That Changed Everything

Nearly all modern LLMs are built on the transformer architecture, introduced in the 2017 paper Attention Is All You Need by researchers at Google Brain (now part of Google DeepMind) and Google Research, along with a co-author from the University of Toronto. The eight authors, credited as equal contributors, include Ashish Vaswani, Noam Shazeer, and Jakob Uszkoreit. Most of them have since founded or joined AI startups, which tells you something about how foundational that paper was.

Before transformers, the dominant approach was recurrent neural networks (RNNs), which processed sequences one token at a time, left to right. They had a fundamental problem: they forgot things. By the time an RNN reached the end of a long sentence, the early tokens had faded from its effective memory.

Transformers solved this with a mechanism called attention.


Attention: Every Token Talks to Every Other Token

The attention mechanism allows every token in the input to "look at" every other token simultaneously, rather than processing them one by one. For each token, the model computes how relevant every other token is to understanding it, then blends information from those relevant tokens into an updated representation.

Concretely: when processing the word "it" in the sentence "The trophy didn't fit in the suitcase because it was too big", the model needs to figure out what "it" refers to. Attention lets "it" look across the whole sentence, score each word by relevance, and pull in context from "trophy" or "suitcase" accordingly. This is coreference resolution, and it happens implicitly through the attention scores — not through an explicit rule.

The formula for attention involves three learned projection matrices that turn each token's vector into a Query (Q), a Key (K), and a Value (V). The query represents what a token is looking for; the keys represent what each token has to offer; the values are the actual information transferred. The attention score between two tokens is the dot product of one token's query vector and the other's key vector, divided by the square root of the key dimension; a softmax then normalizes each token's scores into weights that sum to 1.
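
For readers who prefer code to notation, here is a minimal sketch of scaled dot-product attention in plain Python. Two simplifications are assumed: the query, key, and value vectors are passed in directly (in a real model they are produced by learned projections of the token vectors), and there is no causal mask preventing tokens from looking at later positions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights): one blended value vector per query, plus
    the attention weights that produced it.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, then normalize.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# One query that points the same way as the first key.
outputs, weights = attention(
    queries=[[10.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
print(weights[0])
```

Because the query aligns with the first key, nearly all of the weight lands on the first value vector. That is the "weighted lookup" intuition in miniature.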

You do not need to memorize the formula. What matters is the intuition: attention is a learned, dynamic, weighted lookup — every token queries every other token, and the model decides what to blend together, and those decisions change depending on context.

Multi-head attention runs this process in parallel across many independent "heads" — typically 32 or 64 in a large model — each learning to attend to different kinds of relationships. One head might focus on syntactic structure; another on long-range semantic dependencies; another on factual co-occurrence. The outputs of all heads are concatenated and projected back together.
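
The split-attend-concatenate pattern can be sketched as follows. This is a simplification under stated assumptions: each head just takes a slice of every token vector and attends over it, whereas a real model gives every head its own learned Q, K, and V projections and applies a learned output projection after concatenation. Those learned matrices are omitted here.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention; returns one blended vector per query."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def split_heads(vectors, num_heads):
    """Slice every vector into num_heads equal chunks, grouped by head."""
    size = len(vectors[0]) // num_heads
    return [[v[h * size:(h + 1) * size] for v in vectors] for h in range(num_heads)]

def multi_head_attention(x, num_heads):
    """Self-attention where each head sees its own slice of every token vector."""
    if len(x[0]) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    head_outputs = [attend(chunk, chunk, chunk) for chunk in split_heads(x, num_heads)]
    # Concatenate the per-head results back into one vector per token.
    return [[value for head in head_outputs for value in head[t]]
            for t in range(len(x))]

tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_attention(tokens, num_heads=2)
print(len(mixed), len(mixed[0]))
```

The output has the same shape as the input (three tokens, four numbers each), which is what lets these layers be stacked.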


The Full Model: Layers of Transformation

A transformer model stacks many of these attention layers one on top of another, typically 32 to 96 layers in a large model. Each attention layer is followed by a feed-forward network (a simple two-layer fully connected network applied to each token independently) that further transforms the representations.
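
The feed-forward step is simple enough to write out in full. The sketch below assumes a ReLU activation (used in the original transformer; many current models use other activations) and hand-picked weights, since the real ones are learned. The key property to notice is that each token is transformed on its own, with no information from its neighbors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def feed_forward(token, w1, b1, w2, b2):
    """Two-layer network: expand, apply ReLU, then project back down."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(w1, token), b1)]
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]

def apply_to_sequence(tokens, w1, b1, w2, b2):
    """Apply the same network to every token vector independently."""
    return [feed_forward(t, w1, b1, w2, b2) for t in tokens]

# Hand-picked weights for illustration: 2 -> 4 -> 2 dimensions.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
B2 = [0.5, -0.5]

sequence = [[1.0, -2.0], [3.0, 1.0], [1.0, -2.0]]
print(apply_to_sequence(sequence, W1, B1, W2, B2))
```

The first and third tokens are identical, so their outputs are identical too: position and context only enter through the attention layers, not through this step.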

After all the layers, the final vector for the last token in the sequence is passed through a projection layer that maps it to scores over the entire vocabulary — one score per token. A softmax turns those scores into probabilities, and the model samples from that distribution to pick the next token. That token is then appended to the input and the process repeats, one token at a time, until the model generates an end-of-sequence token or hits the length limit.
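
The last step, from vocabulary scores to a chosen token, can be sketched in a few lines. The three-word vocabulary and the scores are made up for the example; a real model produces one score for each of its tens of thousands of tokens.

```python
import math
import random

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, vocab, rng):
    """Pick one token at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "fish"]   # hypothetical three-token vocabulary
logits = [2.0, 1.0, 0.1]         # made-up scores from the projection layer
rng = random.Random(0)           # seeded so runs are repeatable

print([round(p, 3) for p in softmax(logits)])
print(sample_next_token(logits, vocab, rng))
```

Higher scores get exponentially more probability, but lower-scoring tokens can still be picked, which is why the same prompt can produce different responses.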

This step-by-step generation is called autoregressive decoding. The model does not write the whole response at once — it generates one token, feeds it back in, generates another, and so on. That is why streaming responses appear word by word.
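
The generate-append-repeat loop looks roughly like this. The `TOY_MODEL` table is a hypothetical stand-in for a trained network: it looks only at the last token, whereas a real model conditions on the entire sequence. The loop also picks the top token greedily to keep the example repeatable; sampling would slot in at the same place.

```python
# Hypothetical stand-in for a trained model: maps the most recent token
# to probabilities for the next one. A real model reads the whole sequence.
TOY_MODEL = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}

def generate(model, prompt, max_new_tokens=10, eos="</s>"):
    """Autoregressive decoding: predict a token, append it, and repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        distribution = model[tokens[-1]]
        next_token = max(distribution, key=distribution.get)  # greedy pick
        tokens.append(next_token)
        if next_token == eos:  # stop at the end-of-sequence token
            break
    return tokens

print(generate(TOY_MODEL, ["<s>"]))
print(generate(TOY_MODEL, ["<s>"], max_new_tokens=2))
```

Both stopping conditions from the text appear here: the end-of-sequence token ends the first call, and the length limit ends the second.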

Reasoning models are a significant variation on this pattern. Models like OpenAI's GPT-5 (which folded the earlier o-series reasoning line into a single model), Claude's extended thinking mode, and Gemini with thinking generate a long internal chain of reasoning tokens before producing their final answer. That internal scratchpad, sometimes thousands of tokens of working through the problem, is often hidden from the user or shown only in summary, but it becomes context for the final response. The mechanics are still autoregressive: reasoning tokens are generated one at a time like any others. What differs is that the model has been trained to spend tokens on deliberate reasoning before committing to an answer. For hard math, logic, and coding problems, this approach can substantially outperform standard generation, at the cost of higher latency and token count.


Training: How the Model Learns

Building a model requires three major stages.

Pretraining is the massive first phase. The model is initialized with random weights and trained on an enormous corpus of text — web pages, books, code, scientific papers, and more. For today's frontier models, training data runs into the tens of trillions of tokens and training runs on tens of thousands of GPUs for months. The training signal is simple: predict the next token, measure how wrong you were, and adjust the weights slightly to be less wrong next time. Repeat billions of times. This is gradient descent applied at extraordinary scale.
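
The predict, measure, adjust cycle can be shown at miniature scale. In this sketch the entire "model" is three adjustable scores for a three-token vocabulary, and training pushes probability toward the one correct next token. The learning rate, step count, and setup are assumptions chosen for illustration; a real model adjusts billions of weights through backpropagation rather than three scores directly.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(logits, target, learning_rate):
    """One gradient descent step on the cross-entropy loss."""
    probs = softmax(logits)
    loss = -math.log(probs[target])  # how wrong the prediction was
    # Gradient of the loss w.r.t. each logit: probability minus the one-hot target.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    updated = [x - learning_rate * g for x, g in zip(logits, grads)]
    return updated, loss

logits = [0.0, 0.0, 0.0]  # start with no preference among three tokens
losses = []
for step in range(50):
    logits, loss = train_step(logits, target=2, learning_rate=0.5)
    losses.append(loss)

print(round(losses[0], 4), round(losses[-1], 4))
print([round(p, 3) for p in softmax(logits)])
```

The loss starts at ln(3), the cost of guessing uniformly among three options, and falls as the scores shift toward the correct token.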

Fine-tuning comes after pretraining. A base model trained purely on next-token prediction is not automatically a good assistant — it might respond to a question by generating more questions, since that is common on the web. Fine-tuning on human-curated examples of good conversations shapes the model into something helpful.

Alignment training comes next: shaping the model to be helpful, honest, and safe rather than just statistically fluent. The best-known technique is RLHF (Reinforcement Learning from Human Feedback), popularized by OpenAI's InstructGPT work: human raters rank model outputs, a separate "reward model" learns to predict those rankings, and the main model is trained to maximize that reward signal. More recent approaches often replace or supplement RLHF. DPO (Direct Preference Optimization) skips the reward model entirely and optimizes directly on human preference pairs, which is simpler and often more stable. Anthropic's Constitutional AI uses a written set of principles and AI-generated critiques to guide alignment without relying solely on human raters. Most frontier models today use some combination of these techniques, plus proprietary variants. The shared goal is the same: a model that behaves the way humans actually want, not just one that produces plausible-sounding text.
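
To show why DPO needs no reward model, here is a sketch of its loss for a single preference pair, following the formulation in the DPO paper. The inputs are log-probabilities of whole responses under the model being trained (the policy) and under a frozen reference copy; the numbers in the example and the `beta` value are made up for illustration.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the log-probability of a full response. The loss falls
    when the policy raises the chosen response, relative to the reference,
    by more than it raises the rejected one.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(z)) written as log(1 + exp(-z))
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))
# Policy now favors the chosen response more than the reference does.
print(round(dpo_loss(-10.0, -18.0, -12.0, -15.0), 4))
```

The preference data enters the loss directly through the log-probabilities, so there is no separate model to train and no reinforcement learning loop to run.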


What This Means for How You Use AI

A few practical implications fall directly out of this architecture:

Hallucinations are not bugs; they are a structural feature. The model is always generating a probable next token given its context. If the true answer to a question is not strongly represented in its training data, it will confidently generate something plausible-sounding that is wrong. It has no built-in mechanism to say "I don't know" unless it was trained to do so explicitly.

Context is everything. The model has no memory outside the context window. Everything it knows about your conversation is in that window. Long conversations eventually push early context out. Repeating key constraints and background in long sessions is not redundant — it is necessary.
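
One way applications cope with this is to pin the important instructions and drop the oldest conversation turns first. The sketch below is a hypothetical helper, not any particular product's behavior, and it counts words as a crude stand-in for tokens.

```python
def count_words(text):
    """Crude token estimate for the example: one token per word."""
    return len(text.split())

def build_context(pinned, turns, budget, count_tokens=count_words):
    """Keep all pinned text plus as many of the most recent turns as fit."""
    used = sum(count_tokens(p) for p in pinned)
    if used > budget:
        raise ValueError("pinned text alone exceeds the token budget")
    kept = []
    for turn in reversed(turns):  # walk backward from the newest turn
        cost = count_tokens(turn)
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(turn)
        used += cost
    return list(pinned) + kept[::-1]  # restore chronological order

pinned = ["always answer in French"]
turns = ["one two three", "four five", "six seven eight nine"]
print(build_context(pinned, turns, budget=10))
```

With a budget of 10, the oldest turn is silently dropped while the pinned instruction survives. Anything you do not pin is at risk in the same way.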

Temperature controls creativity vs. reliability. Most model APIs expose a temperature parameter (commonly 0 to 1, with some APIs allowing values up to 2). At temperature 0, the model always picks the highest-probability next token, which makes output conservative and close to deterministic. At higher temperatures, lower-probability tokens get more chances, producing more varied and creative (but also more error-prone) output. For code generation, use low temperature. For brainstorming, use higher.
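
Mechanically, temperature is usually applied by dividing the scores by the temperature before the softmax. The sketch below shows that, with temperature 0 handled as a special case that puts all probability on the top token. The scores are made up, and individual APIs may implement the details differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; temperature 0 means greedy."""
    if temperature <= 0:
        # Treat 0 as "always pick the top-scoring token".
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Low temperatures sharpen the distribution toward the top token; high temperatures flatten it, giving unlikely tokens a real chance of being picked.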

Reasoning tokens improve answers. Asking a model to "think step by step" genuinely improves results because it forces the generation of intermediate reasoning tokens, which become context for subsequent tokens; the model effectively reasons with its own output. This technique, called chain-of-thought prompting, emerged from observing that models respond well to scaffolded reasoning in the context window. Dedicated reasoning models (GPT-5, Claude extended thinking, Gemini with thinking) automate and deepen this process; for complex analytical tasks, they are worth reaching for instead of a standard model.

Most frontier models are now multimodal. GPT-5, the Claude 4 family, and Gemini natively process images, documents, and in some cases audio and video alongside text. The core transformer architecture described here extends to other modalities by encoding them into the same token-and-embedding framework — images become patches, audio becomes spectrograms, and so on. The prediction mechanics are the same; the input representation is broader.
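
As an illustration of "images become patches", here is one common way to do it: cut the image into a grid of small squares and flatten each square into a vector, which can then be embedded like a token. This is a generic sketch of the idea; the article does not say how any specific model handles images, and the tiny 4x4 "image" is invented for the example.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D grid of pixel values into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Read the patch row by row into one flat list.
            patch = [image[row][col]
                     for row in range(top, top + patch_size)
                     for col in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A 4x4 grayscale "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
for patch in image_to_patches(image, patch_size=2):
    print(patch)
```

Each flattened patch plays the role that a token ID plays for text: it is the unit that gets mapped to an embedding vector and fed through the same stack of layers.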


Where to Go Deeper

Andrej Karpathy's YouTube series, particularly his Neural Networks: Zero to Hero playlist, is the best free resource for building this understanding from the ground up, including coding a small GPT from scratch in Python. His blog is equally worth reading.

Hugging Face hosts thousands of open-weight models and provides the Transformers library — the standard toolkit for working with LLMs in Python.

The original Attention Is All You Need paper is surprisingly readable for a research paper and worth at least skimming. Sebastian Raschka's newsletter and books are another excellent resource for practitioners who want mathematical depth without pure research-level abstraction.


LLMs are not magic, and they are not general intelligence. They are extraordinarily well-trained next-token predictors built on an elegant architecture. Understanding that framing does not diminish how useful they are — it makes you better at using them, building with them, and knowing when not to trust them. The model doing the prediction is the same whether it writes a poem or explains a bug, and the gap between "surprisingly capable" and "fundamentally reliable" is one worth keeping in mind.