# Glossary {.unnumbered}
## A
**Activation function**
: A nonlinear function applied element-wise to the output of a linear transformation. Common examples include ReLU, sigmoid, and tanh. Without activation functions, a stack of linear layers would collapse into a single linear transformation, no matter how many layers are stacked.
**Adam optimizer**
: An optimization algorithm that maintains exponentially decaying averages of past gradients (momentum) and past squared gradients (adaptive learning rates). Combines the benefits of momentum and RMSprop.
**Attention**
: A mechanism for computing weighted combinations of values based on the relevance between queries and keys. Allows models to focus on different parts of the input dynamically.
**Attention weights**
: The probabilities computed by applying softmax to attention scores. Each weight indicates how much a position attends to another position. Weights sum to 1 across attended positions.
**Autoregressive**
: A generation strategy where each token is predicted based only on previously generated tokens. The model generates one token at a time, feeding each output back as input for the next prediction.
## B
**Backpropagation**
: An algorithm for computing gradients of a loss function with respect to network parameters by applying the chain rule backward through the computation graph.
**Basis**
: A set of linearly independent vectors that span a vector space. Any vector in the space can be uniquely expressed as a linear combination of basis vectors.
**Batch normalization**
: A technique that normalizes activations across the batch dimension. Contrast with layer normalization, which normalizes across the feature dimension.
**BERT**
: Bidirectional Encoder Representations from Transformers. An encoder-only transformer trained with masked language modeling, where the model predicts randomly masked tokens using bidirectional context.
**Bias**
: A learnable constant term added to the weighted sum in a neuron or linear layer. Allows the model to shift the activation function's input.
**BPE (Byte-Pair Encoding)**
: A subword tokenization algorithm that starts from individual characters and iteratively merges the most frequent adjacent pair of symbols into a new symbol. Balances vocabulary size with the ability to represent any text.
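A rough sketch of a single BPE merge step on a toy corpus; the helper names and example words are invented for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a toy corpus of (symbol tuple -> frequency)."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
corpus = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 5}
pair = most_frequent_pair(corpus)   # ('l', 'o'), tied with ('o', 'w') at count 8
print(pair, merge(corpus, pair))
```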
## C
**Causal masking**
: A masking scheme that prevents positions from attending to future positions. Implemented by setting attention scores to negative infinity for future positions before softmax.
**Chain rule**
: A calculus rule for computing derivatives of composite functions: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$. The foundation of backpropagation.
**Chinchilla scaling**
: The finding that compute-optimal training requires scaling parameters and training tokens equally, roughly 20 tokens per parameter.
**Cross-attention**
: Attention where queries come from one sequence and keys/values come from another. Used in encoder-decoder models for the decoder to attend to encoder outputs.
**Cross-entropy loss**
: A loss function measuring the difference between predicted probabilities and true labels: $-\sum_i y_i \log(p_i)$. The standard loss for classification tasks.
## D
**Decoder**
: In transformers, an architecture component using causal (masked) self-attention to prevent information flow from future tokens. Decoder-only models (like GPT) use this architecture throughout.
**Derivative**
: The instantaneous rate of change of a function. For $f(x)$, the derivative $\frac{df}{dx}$ measures how much $f$ changes per unit change in $x$.
**Dimension**
: The number of components in a vector, or the number of rows/columns in a matrix. In transformers, $d_{model}$ typically refers to the embedding dimension.
**Dot product**
: The sum of element-wise products of two vectors: $\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i$. Measures similarity when vectors are normalized.
**Dropout**
: A regularization technique that randomly sets activations to zero during training with probability $p$. Prevents overfitting by reducing co-adaptation.
## E
**Eigenvalue**
: A scalar $\lambda$ such that $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ for some nonzero vector $\mathbf{v}$. Eigenvalues characterize how a matrix stretches space along certain directions.
**Eigenvector**
: A nonzero vector $\mathbf{v}$ such that $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ for some scalar $\lambda$. The matrix only scales (doesn't rotate) eigenvectors.
**Embedding**
: A learned mapping from discrete tokens to continuous vectors. The embedding matrix $\mathbf{E} \in \mathbb{R}^{V \times d}$ stores one $d$-dimensional vector per vocabulary token.
**Emergent capabilities**
: Abilities that appear suddenly at specific model scales, not present in smaller models. Examples include arithmetic, chain-of-thought reasoning, and in-context learning.
**Encoder**
: In transformers, an architecture component using bidirectional self-attention where all positions can attend to all other positions. Encoder-only models (like BERT) use this architecture throughout.
**Expectation**
: The probability-weighted average of a random variable's values: $\mathbb{E}[X] = \sum_x x \cdot P(X=x)$ for discrete variables.
## F
**Feed-forward network (FFN)**
: A position-wise neural network in transformer blocks, typically two linear layers with a nonlinearity: $\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2$.
**Fine-tuning**
: Training a pretrained model on task-specific data, typically with a smaller learning rate. Adapts general knowledge to specific applications.
**Forward propagation**
: Computing the output of a neural network by passing inputs through each layer sequentially, applying weights, biases, and activation functions.
## G
**Gradient**
: The vector of partial derivatives of a function with respect to all its inputs: $\nabla f = [\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}]$. Points in the direction of steepest ascent.
**Gradient descent**
: An optimization algorithm that iteratively updates parameters in the negative gradient direction: $\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$.
**GPT**
: Generative Pre-trained Transformer. A decoder-only transformer trained with causal language modeling to predict the next token given previous tokens.
## H
**Head (attention)**
: One of multiple parallel attention mechanisms in multi-head attention. Each head has its own projection matrices and can learn to attend to different types of relationships.
**Hidden state**
: An intermediate representation within a neural network, not directly observed as input or output. In RNNs, the hidden state carries information across time steps.
## I
**In-context learning**
: The ability of large language models to learn new tasks from examples provided in the prompt, without updating parameters.
**Instruction tuning**
: Fine-tuning a language model on (instruction, response) pairs to improve its ability to follow user instructions.
## K
**Key**
: In attention, vectors that "advertise" what information is available at each position. Keys are compared against queries to compute attention scores.
**KL divergence**
: Kullback-Leibler divergence. A measure of how one probability distribution differs from another: $D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. Not symmetric.
## L
**Label smoothing**
: A regularization technique that softens target labels from hard one-hot vectors (e.g., $[0, 1, 0]$) to soft targets (e.g., $[0.05, 0.9, 0.05]$). Prevents overconfidence.
**Layer normalization**
: A normalization technique that standardizes activations across the feature dimension for each position independently: $\hat{x} = \frac{x - \mu}{\sigma}$, followed by learned scale and shift.
**Learning rate**
: A hyperparameter controlling the step size in gradient descent. Too large causes instability; too small causes slow convergence.
**Linear combination**
: A sum of vectors scaled by coefficients: $c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k$.
**Linear independence**
: Vectors are linearly independent if none can be expressed as a linear combination of the others. Formally, $c_1\mathbf{v}_1 + \cdots + c_k\mathbf{v}_k = \mathbf{0}$ implies all $c_i = 0$.
**Logit**
: The raw, unnormalized output of a model before applying softmax. The name comes from the logit (log-odds) function; in binary classification, the logit is the log-odds of the positive class.
**Loss function**
: A function measuring how well model predictions match targets. Training minimizes the loss by adjusting parameters.
**LSTM**
: Long Short-Term Memory. An RNN architecture with gates (forget, input, output) that control information flow, addressing the vanishing gradient problem.
## M
**Masked language modeling (MLM)**
: A training objective where random tokens are masked and the model predicts them from bidirectional context. Used by BERT.
**Matrix multiplication**
: The operation $\mathbf{C} = \mathbf{A}\mathbf{B}$ where $C_{ij} = \sum_k A_{ik} B_{kj}$. Requires inner dimensions to match.
**Multi-head attention**
: Attention with multiple parallel heads, each projecting to a lower-dimensional subspace. Outputs are concatenated and projected back to model dimension.
## N
**Neuron**
: The basic unit of a neural network, computing $y = \sigma(\mathbf{w}^T\mathbf{x} + b)$ where $\sigma$ is an activation function.
**Norm**
: A function measuring vector "length." The Euclidean ($L^2$) norm is $\|\mathbf{v}\| = \sqrt{\sum_i v_i^2}$.
## O
**One-hot encoding**
: A representation where a categorical value becomes a vector with 1 in one position and 0s elsewhere. Sparse and high-dimensional.
## P
**Parameter**
: A learnable value in a neural network, such as weights and biases, updated during training via gradient descent.
**Perplexity**
: The exponential of average cross-entropy loss: $\text{PPL} = \exp(\mathcal{L})$. Measures how "surprised" the model is by the data. Lower is better.
**Positional encoding**
: A technique for injecting position information into transformer inputs. The original transformer uses sinusoidal encodings at different frequencies.
**Power law**
: A relationship of the form $y = ax^b$, appearing as a straight line on log-log axes. Scaling laws follow power laws.
**Projection**
: A linear transformation reducing or changing dimensionality: $\mathbf{y} = \mathbf{W}\mathbf{x}$ where $\mathbf{W}$ projects from one space to another.
## Q
**Query**
: In attention, a vector representing what information a position is looking for. Queries are compared against keys to compute attention scores.
## R
**Random variable**
: A variable representing an uncertain outcome. Written in uppercase ($X$) to distinguish from specific values ($x$).
**ReLU**
: Rectified Linear Unit. The activation function $\text{ReLU}(x) = \max(0, x)$. Simple, computationally efficient, and widely used.
**Residual connection**
: A skip connection adding a layer's input to its output: $\mathbf{y} = f(\mathbf{x}) + \mathbf{x}$. Enables training of very deep networks by providing gradient shortcuts.
**RLHF**
: Reinforcement Learning from Human Feedback. A technique for aligning language models with human preferences using a learned reward model and policy optimization.
**RNN**
: Recurrent Neural Network. A network that processes sequences by maintaining a hidden state updated at each time step.
## S
**Scaling laws**
: Empirical relationships showing that language model loss decreases as a power law with compute, data, and parameters.
**Self-attention**
: Attention where queries, keys, and values all come from the same sequence. Each position can attend to every position, including itself (subject to any masking).
**Sigmoid**
: The activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$, squashing inputs to the range $(0, 1)$.
**Softmax**
: A function converting a vector of real numbers to a probability distribution: $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$. Outputs are positive and sum to 1.
**Subword tokenization**
: Breaking text into units smaller than words but larger than characters. Balances vocabulary size with coverage. BPE is a common algorithm.
## T
**Teacher forcing**
: A training technique where the model receives true previous tokens as input, rather than its own predictions. Standard for training autoregressive models.
**Token**
: The basic unit of text processed by a model, typically a word, subword, or character depending on the tokenization scheme.
**Transformer**
: An architecture based on self-attention that processes all positions in parallel. Introduced in "Attention Is All You Need" (Vaswani et al., 2017).
## V
**Value**
: In attention, vectors containing the actual content that gets retrieved and combined according to attention weights.
**Vanishing gradient**
: A problem where gradients become exponentially small as they propagate through many layers, preventing learning in early layers.
**Variance**
: A measure of spread: $\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]$. The expected squared deviation from the mean.
**Vector space**
: A set of vectors closed under addition and scalar multiplication, satisfying certain axioms (associativity, commutativity, identity, etc.).
**Vocabulary**
: The set of all tokens a model can process, with size $V$. Each token maps to a row in the embedding matrix.
## W
**Warmup**
: A learning rate schedule that starts with a small learning rate and gradually increases it. Stabilizes early training when gradients are noisy.
**Weight**
: A learnable parameter in a neural network that scales inputs. Organized into weight matrices for efficient computation.
**Weight sharing**
: Using the same parameters across different parts of a model. RNNs share weights across time steps; transformers share weights across positions.