Glossary

A

Activation function
A nonlinear function applied element-wise to the output of a linear transformation. Common examples include ReLU, sigmoid, and tanh. Without activation functions, stacking linear layers would produce only linear functions.
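A minimal NumPy sketch, purely for illustration, of a few common activations applied element-wise:
```python
import numpy as np

def relu(x):
    # Zeroes out negative entries, element-wise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes each entry into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), np.tanh(x))
```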
Adam optimizer
An optimization algorithm that maintains exponentially decaying averages of past gradients (momentum) and past squared gradients (adaptive learning rates). Combines the benefits of momentum and RMSprop.
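A minimal sketch of a single Adam update in NumPy, using the commonly cited default hyperparameters; the variable names are illustrative, not from any particular library:
```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Exponentially decaying averages of gradients and squared gradients.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    # Bias correction compensates for initializing m and v at zero.
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3)
m = v = np.zeros(3)
theta, m, v = adam_step(theta, np.array([0.1, -0.2, 0.3]), m, v, t=1)
```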
Attention
A mechanism for computing weighted combinations of values based on the relevance between queries and keys. Allows models to focus on different parts of the input dynamically.
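A minimal NumPy sketch of scaled dot-product attention over a single sequence; the shapes are assumptions chosen for illustration:
```python
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k). Scores compare every query with every key.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted combination of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)   # (4, 8)
```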
Attention weights
The probabilities computed by applying softmax to attention scores. Each weight indicates how much a position attends to another position. Weights sum to 1 across attended positions.
Autoregressive
A generation strategy where each token is predicted based only on previously generated tokens. The model generates one token at a time, feeding each output back as input for the next prediction.

B

Backpropagation
An algorithm for computing gradients of a loss function with respect to network parameters by applying the chain rule backward through the computation graph.
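A worked scalar example, assuming a tiny two-weight network and a squared-error loss; every name here is illustrative:
```python
# Tiny network y = w2 * relu(w1 * x) with squared-error loss, scalars only.
x, target = 2.0, 1.0
w1, w2 = 0.5, -0.3

# Forward pass.
h = w1 * x               # pre-activation
a = max(0.0, h)          # ReLU
y = w2 * a
loss = 0.5 * (y - target) ** 2

# Backward pass: chain rule applied from the loss back to each weight.
dloss_dy = y - target
da_dh = 1.0 if h > 0 else 0.0
grad_w2 = dloss_dy * a
grad_w1 = dloss_dy * w2 * da_dh * x
print(grad_w1, grad_w2)
```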
Basis
A set of linearly independent vectors that span a vector space. Any vector in the space can be uniquely expressed as a linear combination of basis vectors.
Batch normalization
A technique that normalizes each feature using the mean and variance computed across the batch dimension. Contrast with layer normalization, which normalizes across the feature dimension of each example.
BERT
Bidirectional Encoder Representations from Transformers. An encoder-only transformer trained with masked language modeling, where the model predicts randomly masked tokens using bidirectional context.
Bias
A learnable constant term added to the weighted sum in a neuron or linear layer. Allows the model to shift the activation function’s input.
BPE (Byte-Pair Encoding)
A subword tokenization algorithm that starts from characters and iteratively merges the most frequent adjacent symbol pairs. Balances vocabulary size with the ability to represent any text.
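A toy sketch of the core merge loop, assuming a tiny corpus and ignoring practical details such as word-boundary markers:
```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
```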

C

Causal masking
A masking scheme that prevents positions from attending to future positions. Implemented by setting attention scores to negative infinity for future positions before softmax.
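A minimal NumPy sketch of building and applying a causal mask; the sequence length and scores are illustrative:
```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# Upper triangle (j > i) marks future positions; set them to -inf before softmax.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)  # row i has zero weight on every position j > i
```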
Chain rule
A calculus rule for computing derivatives of composite functions: \(\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\). The foundation of backpropagation.
Chinchilla scaling
The finding that compute-optimal training scales parameters and training tokens in equal proportion, roughly 20 tokens per parameter.
Cross-attention
Attention where queries come from one sequence and keys/values come from another. Used in encoder-decoder models for the decoder to attend to encoder outputs.
Cross-entropy loss
A loss function measuring the difference between predicted probabilities and true labels: \(-\sum_i y_i \log(p_i)\). The standard loss for classification tasks.
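A minimal sketch for a one-hot target, where the sum reduces to the negative log-probability of the correct class; the numbers are illustrative:
```python
import numpy as np

def cross_entropy(probs, target_index):
    # probs: predicted distribution over classes; target_index: true class.
    return -np.log(probs[target_index])

probs = np.array([0.1, 0.7, 0.2])
print(cross_entropy(probs, target_index=1))  # -log(0.7) ≈ 0.357
```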

D

Decoder
In transformers, an architecture component using causal (masked) self-attention to prevent information flow from future tokens. Decoder-only models (like GPT) use this architecture throughout.
Derivative
The instantaneous rate of change of a function. For \(f(x)\), the derivative \(\frac{df}{dx}\) measures how much \(f\) changes per unit change in \(x\).
Dimension
The number of components in a vector, or the number of rows/columns in a matrix. In transformers, \(d_{\text{model}}\) typically refers to the embedding dimension.
Dot product
The sum of element-wise products of two vectors: \(\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i\). Measures similarity when vectors are normalized.
Dropout
A regularization technique that randomly sets activations to zero during training with probability \(p\). Prevents overfitting by reducing co-adaptation.
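A minimal sketch of inverted dropout, the common variant that rescales kept activations at training time so no change is needed at inference; names and values are illustrative:
```python
import numpy as np

def dropout(x, p, training=True, seed=0):
    if not training or p == 0.0:
        return x
    # Zero out activations with probability p and scale survivors by
    # 1/(1-p) so the expected activation is unchanged.
    keep = np.random.default_rng(seed).random(x.shape) >= p
    return x * keep / (1.0 - p)

print(dropout(np.ones(8), p=0.5))
```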

E

Eigenvalue
A scalar \(\lambda\) such that \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\) for some nonzero vector \(\mathbf{v}\). Eigenvalues characterize how a matrix stretches space along certain directions.
Eigenvector
A nonzero vector \(\mathbf{v}\) such that \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\) for some scalar \(\lambda\). The matrix only scales (doesn’t rotate) eigenvectors.
Embedding
A learned mapping from discrete tokens to continuous vectors. The embedding matrix \(\mathbf{E} \in \mathbb{R}^{V \times d}\) stores one \(d\)-dimensional vector per vocabulary token.
Emergent capabilities
Abilities that appear suddenly at specific model scales, not present in smaller models. Examples include arithmetic, chain-of-thought reasoning, and in-context learning.
Encoder
In transformers, an architecture component using bidirectional self-attention where all positions can attend to all other positions. Encoder-only models (like BERT) use this architecture throughout.
Expectation
The probability-weighted average of a random variable’s values: \(\mathbb{E}[X] = \sum_x x \cdot P(X=x)\) for discrete variables.

F

Feed-forward network (FFN)
A position-wise neural network in transformer blocks, typically two linear layers with a nonlinearity: \(\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2\).
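A minimal NumPy sketch using the row-vector convention \(xW\) rather than \(Wx\); the dimensions are illustrative:
```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Applied identically and independently at every position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32   # d_ff is commonly about 4x d_model
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(4, d_model))        # 4 positions
print(ffn(x, W1, b1, W2, b2).shape)      # (4, 8)
```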
Fine-tuning
Training a pretrained model on task-specific data, typically with a smaller learning rate. Adapts general knowledge to specific applications.
Forward propagation
Computing the output of a neural network by passing inputs through each layer sequentially, applying weights, biases, and activation functions.

G

Gradient
The vector of partial derivatives of a function with respect to all its inputs: \(\nabla f = [\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}]\). Points in the direction of steepest ascent.
Gradient descent
An optimization algorithm that iteratively updates parameters in the negative gradient direction: \(\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)\).
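A minimal sketch minimizing a one-dimensional quadratic; the function and learning rate are arbitrary illustrations:
```python
# Minimize f(theta) = (theta - 3)^2 by stepping against the gradient.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta, lr = 0.0, 0.1
for _ in range(50):
    theta -= lr * grad(theta)
print(theta)  # close to the minimum at 3.0
```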
GPT
Generative Pre-trained Transformer. A decoder-only transformer trained with causal language modeling to predict the next token given previous tokens.

H

Head (attention)
One of multiple parallel attention mechanisms in multi-head attention. Each head has its own projection matrices and can learn to attend to different types of relationships.
Hidden state
An intermediate representation within a neural network, not directly observed as input or output. In RNNs, the hidden state carries information across time steps.

I

In-context learning
The ability of large language models to learn new tasks from examples provided in the prompt, without updating parameters.
Instruction tuning
Fine-tuning a language model on (instruction, response) pairs to improve its ability to follow user instructions.

K

Key
In attention, vectors that “advertise” what information is available at each position. Keys are compared against queries to compute attention scores.
KL divergence
Kullback-Leibler divergence. A measure of how one probability distribution differs from another: \(D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\). Not symmetric.
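A minimal NumPy sketch for two discrete distributions, showing the asymmetry; the distributions are illustrative:
```python
import numpy as np

def kl_divergence(p, q):
    # Assumes p and q are valid distributions with q > 0 wherever p > 0.
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))  # the two values differ
```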

L

Label smoothing
A regularization technique that softens target labels from hard one-hot vectors (e.g., \([0, 1, 0]\)) to soft targets (e.g., \([0.05, 0.9, 0.05]\)). Prevents overconfidence.
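A minimal sketch matching the example above, assuming the smoothing mass \(\epsilon\) is spread evenly over the non-target classes:
```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    # Target keeps 1 - epsilon; the rest is shared by the other classes.
    k = one_hot.shape[-1]
    return one_hot * (1 - epsilon) + (1 - one_hot) * epsilon / (k - 1)

print(smooth_labels(np.array([0.0, 1.0, 0.0])))  # [0.05, 0.9, 0.05]
```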
Layer normalization
A normalization technique that standardizes activations across the feature dimension for each position independently: \(\hat{x} = \frac{x - \mu}{\sigma}\), followed by learned scale and shift.
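A minimal NumPy sketch, with the learned scale and shift set to identity values for illustration:
```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension of each position independently.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.random.default_rng(0).normal(size=(4, 8))    # 4 positions, 8 features
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=-1), out.std(axis=-1))          # ≈ 0 and ≈ 1 per position
```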
Learning rate
A hyperparameter controlling the step size in gradient descent. Too large causes instability; too small causes slow convergence.
Linear combination
A sum of vectors scaled by coefficients: \(c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k\).
Linear independence
Vectors are linearly independent if none can be expressed as a linear combination of the others. Formally, \(c_1\mathbf{v}_1 + \cdots + c_k\mathbf{v}_k = \mathbf{0}\) implies all \(c_i = 0\).
Logit
The raw, unnormalized output of a model before applying softmax or sigmoid. For a sigmoid output, the logit is the log-odds of the probability.
Loss function
A function measuring how well model predictions match targets. Training minimizes the loss by adjusting parameters.
LSTM
Long Short-Term Memory. An RNN architecture with gates (forget, input, output) that control information flow, addressing the vanishing gradient problem.

M

Masked language modeling (MLM)
A training objective where random tokens are masked and the model predicts them from bidirectional context. Used by BERT.
Matrix multiplication
The operation \(\mathbf{C} = \mathbf{A}\mathbf{B}\) where \(C_{ij} = \sum_k A_{ik} B_{kj}\). Requires inner dimensions to match.
Multi-head attention
Attention with multiple parallel heads, each projecting to a lower-dimensional subspace. Outputs are concatenated and projected back to model dimension.
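A minimal NumPy sketch with illustrative shapes; real implementations fold the per-head projections into batched tensor operations, but the split-attend-concatenate structure is the same:
```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model); each W*: (d_model, d_model); Wo mixes the heads.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    def split(M):
        # Project, then split the last dimension: (heads, seq_len, d_head).
        return (X @ M).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax_rows(scores) @ V                        # (heads, seq_len, d_head)
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model = 16
X = rng.normal(size=(4, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=4).shape)  # (4, 16)
```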

N

Neuron
The basic unit of a neural network, computing \(y = \sigma(\mathbf{w}^T\mathbf{x} + b)\) where \(\sigma\) is an activation function.
Norm
A function measuring vector “length.” The Euclidean (\(L^2\)) norm is \(\|\mathbf{v}\| = \sqrt{\sum_i v_i^2}\).

O

One-hot encoding
A representation where a categorical value becomes a vector with 1 in one position and 0s elsewhere. Sparse and high-dimensional.
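A minimal sketch using an identity matrix as a lookup table; the sizes are illustrative:
```python
import numpy as np

vocab_size, token_id = 5, 2
one_hot = np.eye(vocab_size)[token_id]
print(one_hot)  # [0. 0. 1. 0. 0.]
```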

P

Parameter
A learnable value in a neural network, such as weights and biases, updated during training via gradient descent.
Perplexity
The exponential of average cross-entropy loss: \(\text{PPL} = \exp(\mathcal{L})\). Measures how “surprised” the model is by the data. Lower is better.
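A minimal sketch, assuming we already have the probabilities the model assigned to each correct next token:
```python
import numpy as np

# Probabilities the model assigned to each correct next token (illustrative).
token_probs = np.array([0.2, 0.5, 0.1, 0.4])
avg_cross_entropy = -np.mean(np.log(token_probs))
perplexity = np.exp(avg_cross_entropy)
print(perplexity)  # ≈ 3.98: roughly "choosing among ~4 tokens" on average
```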
Positional encoding
A technique for injecting position information into transformer inputs. The original transformer uses sinusoidal encodings at different frequencies.
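A sketch of the sinusoidal scheme, assuming an even \(d_{\text{model}}\):
```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, at geometrically
    # spaced frequencies, following the original transformer formulation.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```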
Power law
A relationship of the form \(y = ax^b\), appearing as a straight line on log-log axes. Scaling laws follow power laws.
Projection
A linear transformation reducing or changing dimensionality: \(\mathbf{y} = \mathbf{W}\mathbf{x}\) where \(\mathbf{W}\) projects from one space to another.

Q

Query
In attention, a vector representing what information a position is looking for. Queries are compared against keys to compute attention scores.

R

Random variable
A variable representing an uncertain outcome. Written in uppercase (\(X\)) to distinguish from specific values (\(x\)).
ReLU
Rectified Linear Unit. The activation function \(\text{ReLU}(x) = \max(0, x)\). Simple, computationally efficient, and widely used.
Residual connection
A skip connection adding a layer’s input to its output: \(\mathbf{y} = f(\mathbf{x}) + \mathbf{x}\). Enables training of very deep networks by providing gradient shortcuts.
RLHF
Reinforcement Learning from Human Feedback. A technique for aligning language models with human preferences using a learned reward model and policy optimization.
RNN
Recurrent Neural Network. A network that processes sequences by maintaining a hidden state updated at each time step.

S

Scaling laws
Empirical relationships showing that language model loss decreases as a power law with compute, data, and parameters.
Self-attention
Attention where queries, keys, and values all come from the same sequence. Each position attends to every other position (including itself).
Sigmoid
The activation function \(\sigma(x) = \frac{1}{1 + e^{-x}}\), squashing inputs to the range \((0, 1)\).
Softmax
A function converting a vector of real numbers to a probability distribution: \(\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\). Outputs are positive and sum to 1.
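A minimal NumPy sketch for a single vector, including the standard max-subtraction trick for numerical stability:
```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # positive entries that sum to 1
```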
Subword tokenization
Breaking text into units smaller than words but larger than characters. Balances vocabulary size with coverage. BPE is a common algorithm.

T

Teacher forcing
A training technique where the model receives true previous tokens as input, rather than its own predictions. Standard for training autoregressive models.
Token
The basic unit of text processed by a model, typically a word, subword, or character depending on the tokenization scheme.
Transformer
An architecture based on self-attention that processes all positions in parallel. Introduced in “Attention Is All You Need” (2017).

V

Value
In attention, vectors containing the actual content that gets retrieved and combined according to attention weights.
Vanishing gradient
A problem where gradients become exponentially small as they propagate through many layers, preventing learning in early layers.
Variance
A measure of spread: \(\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\). The expected squared deviation from the mean.
Vector space
A set of vectors closed under addition and scalar multiplication, satisfying certain axioms (associativity, commutativity, identity, etc.).
Vocabulary
The set of all tokens a model can process, with size \(V\). Each token maps to a row in the embedding matrix.

W

Warmup
A learning rate schedule that starts with a small learning rate and gradually increases it. Stabilizes early training when gradients are noisy.
Weight
A learnable parameter in a neural network that scales inputs. Organized into weight matrices for efficient computation.
Weight sharing
Using the same parameters across different parts of a model. RNNs share weights across time steps; transformers share weights across positions.