Glossary

A

Activation function
A nonlinear function applied element-wise to the output of a linear transformation. Common examples include ReLU, sigmoid, and tanh. Without activation functions, stacking linear layers would produce only linear functions.
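A minimal NumPy sketch, purely for illustration, of a few common activations applied element-wise:
```python
import numpy as np

def relu(x):
    # Zeroes out negative entries, element-wise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes each entry into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), np.tanh(x))
```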
Adam optimizer
An optimization algorithm that maintains exponentially decaying averages of past gradients (momentum) and past squared gradients (adaptive learning rates). Combines the benefits of momentum and RMSprop.
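A minimal sketch of a single Adam update in NumPy, using the commonly cited default hyperparameters; the variable names are illustrative, not from any particular library:
```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Exponentially decaying averages of gradients and squared gradients.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    # Bias correction compensates for initializing m and v at zero.
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3)
m = v = np.zeros(3)
theta, m, v = adam_step(theta, np.array([0.1, -0.2, 0.3]), m, v, t=1)
```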
Attention
A mechanism for computing weighted combinations of values based on the relevance between queries and keys. Allows models to focus on different parts of the input dynamically.
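A minimal NumPy sketch of scaled dot-product attention over a single sequence; the shapes are assumptions chosen for illustration:
```python
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k). Scores compare every query with every key.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted combination of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)   # (4, 8)
```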
Attention weights
The probabilities computed by applying softmax to attention scores. Each weight indicates how much a position attends to another position. Weights sum to 1 across attended positions.
Autoregressive
A generation strategy where each token is predicted based only on previously generated tokens. The model generates one token at a time, feeding each output back as input for the next prediction.

B

Backpropagation
An algorithm for computing gradients of a loss function with respect to network parameters by applying the chain rule backward through the computation graph.
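A worked scalar example, assuming a tiny two-weight network and a squared-error loss; every name here is illustrative:
```python
# Tiny network y = w2 * relu(w1 * x) with squared-error loss, scalars only.
x, target = 2.0, 1.0
w1, w2 = 0.5, -0.3

# Forward pass.
h = w1 * x               # pre-activation
a = max(0.0, h)          # ReLU
y = w2 * a
loss = 0.5 * (y - target) ** 2

# Backward pass: chain rule applied from the loss back to each weight.
dloss_dy = y - target
da_dh = 1.0 if h > 0 else 0.0
grad_w2 = dloss_dy * a
grad_w1 = dloss_dy * w2 * da_dh * x
print(grad_w1, grad_w2)
```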
Basis
A set of linearly independent vectors that span a vector space. Any vector in the space can be uniquely expressed as a linear combination of basis vectors.
Batch normalization
A technique that normalizes each feature using the mean and variance computed across the batch dimension. Contrast with layer normalization, which normalizes across the feature dimension of each example.
BERT
Bidirectional Encoder Representations from Transformers. An encoder-only transformer trained with masked language modeling, where the model predicts randomly masked tokens using bidirectional context.
Bias
A learnable constant term added to the weighted sum in a neuron or linear layer. Allows the model to shift the activation function’s input.
BPE (Byte-Pair Encoding)
A subword tokenization algorithm that starts from characters and iteratively merges the most frequent adjacent symbol pairs. Balances vocabulary size with the ability to represent any text.
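A toy sketch of the core merge loop, assuming a tiny corpus and ignoring practical details such as word-boundary markers:
```python
from collections import Counter

def bpe_merges(words, num_merges):
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
```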

C

Causal masking
A masking scheme that prevents positions from attending to future positions. Implemented by setting attention scores to negative infinity for future positions before softmax.
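A minimal NumPy sketch of building and applying a causal mask; the sequence length and scores are illustrative:
```python
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# Upper triangle (j > i) marks future positions; set them to -inf before softmax.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)  # row i has zero weight on every position j > i
```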
Chain rule
A calculus rule for computing derivatives of composite functions: \(\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\). The foundation of backpropagation.
Chinchilla scaling
The finding that compute-optimal training scales parameters and training tokens in equal proportion, roughly 20 tokens per parameter.
Cross-attention
Attention where queries come from one sequence and keys/values come from another. Used in encoder-decoder models for the decoder to attend to encoder outputs.
Cross-entropy loss
A loss function measuring the difference between predicted probabilities and true labels: \(-\sum_i y_i \log(p_i)\). The standard loss for classification tasks.
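A minimal sketch for a one-hot target, where the sum reduces to the negative log-probability of the correct class; the numbers are illustrative:
```python
import numpy as np

def cross_entropy(probs, target_index):
    # probs: predicted distribution over classes; target_index: true class.
    return -np.log(probs[target_index])

probs = np.array([0.1, 0.7, 0.2])
print(cross_entropy(probs, target_index=1))  # -log(0.7) ≈ 0.357
```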

D

Decoder
In transformers, an architecture component using causal (masked) self-attention to prevent information flow from future tokens. Decoder-only models (like GPT) use this architecture throughout.
Derivative
The instantaneous rate of change of a function. For \(f(x)\), the derivative \(\frac{df}{dx}\) measures how much \(f\) changes per unit change in \(x\).
Dimension
The number of components in a vector, or the number of rows/columns in a matrix. In transformers, \(d_{\text{model}}\) typically refers to the embedding dimension.
Dot product
The sum of element-wise products of two vectors: \(\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i\). Measures similarity when vectors are normalized.
Dropout
A regularization technique that randomly sets activations to zero during training with probability \(p\). Prevents overfitting by reducing co-adaptation.
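A minimal sketch of inverted dropout, the common variant that rescales kept activations at training time so no change is needed at inference; names and values are illustrative:
```python
import numpy as np

def dropout(x, p, training=True, seed=0):
    if not training or p == 0.0:
        return x
    # Zero out activations with probability p and scale survivors by
    # 1/(1-p) so the expected activation is unchanged.
    keep = np.random.default_rng(seed).random(x.shape) >= p
    return x * keep / (1.0 - p)

print(dropout(np.ones(8), p=0.5))
```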

E

Eigenvalue
A scalar \(\lambda\) such that \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\) for some nonzero vector \(\mathbf{v}\). Eigenvalues characterize how a matrix stretches space along certain directions.
Eigenvector
A nonzero vector \(\mathbf{v}\) such that \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\) for some scalar \(\lambda\). The matrix only scales (doesn’t rotate) eigenvectors.
Embedding
A learned mapping from discrete tokens to continuous vectors. The embedding matrix \(\mathbf{E} \in \mathbb{R}^{V \times d}\) stores one \(d\)-dimensional vector per vocabulary token.
Emergent capabilities
Abilities that appear suddenly at specific model scales, not present in smaller models. Examples include arithmetic, chain-of-thought reasoning, and in-context learning.
Encoder
In transformers, an architecture component using bidirectional self-attention where all positions can attend to all other positions. Encoder-only models (like BERT) use this architecture throughout.
Expectation
The probability-weighted average of a random variable’s values: \(\mathbb{E}[X] = \sum_x x \cdot P(X=x)\) for discrete variables.

F

Feed-forward network (FFN)
A position-wise neural network in transformer blocks, typically two linear layers with a nonlinearity: \(\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2\).
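A minimal NumPy sketch using the row-vector convention \(xW\) rather than \(Wx\); the dimensions are illustrative:
```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Applied identically and independently at every position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32   # d_ff is commonly about 4x d_model
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(4, d_model))        # 4 positions
print(ffn(x, W1, b1, W2, b2).shape)      # (4, 8)
```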
Fine-tuning
Training a pretrained model on task-specific data, typically with a smaller learning rate. Adapts general knowledge to specific applications.
Forward propagation
Computing the output of a neural network by passing inputs through each layer sequentially, applying weights, biases, and activation functions.

G

Gradient
The vector of partial derivatives of a function with respect to all its inputs: \(\nabla f = [\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}]\). Points in the direction of steepest ascent.
Gradient descent
An optimization algorithm that iteratively updates parameters in the negative gradient direction: \(\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)\).
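A minimal sketch minimizing a one-dimensional quadratic; the function and learning rate are arbitrary illustrations:
```python
# Minimize f(theta) = (theta - 3)^2 by stepping against the gradient.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta, lr = 0.0, 0.1
for _ in range(50):
    theta -= lr * grad(theta)
print(theta)  # close to the minimum at 3.0
```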
GPT
Generative Pre-trained Transformer. A decoder-only transformer trained with causal language modeling to predict the next token given previous tokens.

H

Head (attention)
One of multiple parallel attention mechanisms in multi-head attention. Each head has its own projection matrices and can learn to attend to different types of relationships.
Hidden state
An intermediate representation within a neural network, not directly observed as input or output. In RNNs, the hidden state carries information across time steps.

I

In-context learning
The ability of large language models to learn new tasks from examples provided in the prompt, without updating parameters.
Instruction tuning
Fine-tuning a language model on (instruction, response) pairs to improve its ability to follow user instructions.

K

Key
In attention, vectors that “advertise” what information is available at each position. Keys are compared against queries to compute attention scores.
KL divergence
Kullback-Leibler divergence. A measure of how one probability distribution differs from another: \(D_{KL}(P || Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}\). Not symmetric.
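A minimal NumPy sketch for two discrete distributions, showing the asymmetry; the distributions are illustrative:
```python
import numpy as np

def kl_divergence(p, q):
    # Assumes p and q are valid distributions with q > 0 wherever p > 0.
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))  # the two values differ
```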

L

Label smoothing
A regularization technique that softens target labels from hard one-hot vectors (e.g., \([0, 1, 0]\)) to soft targets (e.g., \([0.05, 0.9, 0.05]\)). Prevents overconfidence.
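A minimal sketch matching the example above, assuming the smoothing mass \(\epsilon\) is spread evenly over the non-target classes:
```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    # Target keeps 1 - epsilon; the rest is shared by the other classes.
    k = one_hot.shape[-1]
    return one_hot * (1 - epsilon) + (1 - one_hot) * epsilon / (k - 1)

print(smooth_labels(np.array([0.0, 1.0, 0.0])))  # [0.05, 0.9, 0.05]
```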
Layer normalization
A normalization technique that standardizes activations across the feature dimension for each position independently: \(\hat{x} = \frac{x - \mu}{\sigma}\), followed by learned scale and shift.
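A minimal NumPy sketch, with the learned scale and shift set to identity values for illustration:
```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension of each position independently.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.random.default_rng(0).normal(size=(4, 8))    # 4 positions, 8 features
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=-1), out.std(axis=-1))          # ≈ 0 and ≈ 1 per position
```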
Learning rate
A hyperparameter controlling the step size in gradient descent. Too large causes instability; too small causes slow convergence.
Linear combination
A sum of vectors scaled by coefficients: \(c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k\).
Linear independence
Vectors are linearly independent if none can be expressed as a linear combination of the others. Formally, \(c_1\mathbf{v}_1 + \cdots + c_k\mathbf{v}_k = \mathbf{0}\) implies all \(c_i = 0\).
Logit
The raw, unnormalized output of a model before applying softmax or sigmoid. For a sigmoid output, the logit is the log-odds of the probability.
Loss function
A function measuring how well model predictions match targets. Training minimizes the loss by adjusting parameters.
LSTM
Long Short-Term Memory. An RNN architecture with gates (forget, input, output) that control information flow, addressing the vanishing gradient problem.

M

Masked language modeling (MLM)
A training objective where random tokens are masked and the model predicts them from bidirectional context. Used by BERT.
Matrix multiplication
The operation \(\mathbf{C} = \mathbf{A}\mathbf{B}\) where \(C_{ij} = \sum_k A_{ik} B_{kj}\). Requires inner dimensions to match.
Multi-head attention
Attention with multiple parallel heads, each projecting to a lower-dimensional subspace. Outputs are concatenated and projected back to model dimension.
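A minimal NumPy sketch with illustrative shapes; real implementations fold the per-head projections into batched tensor operations, but the split-attend-concatenate structure is the same:
```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model); each W*: (d_model, d_model); Wo mixes the heads.
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    def split(M):
        # Project, then split the last dimension: (heads, seq_len, d_head).
        return (X @ M).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax_rows(scores) @ V                        # (heads, seq_len, d_head)
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model = 16
X = rng.normal(size=(4, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=4).shape)  # (4, 16)
```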

N

Neuron
The basic unit of a neural network, computing \(y = \sigma(\mathbf{w}^T\mathbf{x} + b)\) where \(\sigma\) is an activation function.
Norm
A function measuring vector “length.” The Euclidean (\(L^2\)) norm is \(\|\mathbf{v}\| = \sqrt{\sum_i v_i^2}\).

O

One-hot encoding
A representation where a categorical value becomes a vector with 1 in one position and 0s elsewhere. Sparse and high-dimensional.
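A minimal sketch using an identity matrix as a lookup table; the sizes are illustrative:
```python
import numpy as np

vocab_size, token_id = 5, 2
one_hot = np.eye(vocab_size)[token_id]
print(one_hot)  # [0. 0. 1. 0. 0.]
```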

P

Parameter
A learnable value in a neural network, such as weights and biases, updated during training via gradient descent.
Perplexity
The exponential of average cross-entropy loss: \(\text{PPL} = \exp(\mathcal{L})\). Measures how “surprised” the model is by the data. Lower is better.
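A minimal sketch, assuming we already have the probabilities the model assigned to each correct next token:
```python
import numpy as np

# Probabilities the model assigned to each correct next token (illustrative).
token_probs = np.array([0.2, 0.5, 0.1, 0.4])
avg_cross_entropy = -np.mean(np.log(token_probs))
perplexity = np.exp(avg_cross_entropy)
print(perplexity)  # ≈ 3.98: roughly "choosing among ~4 tokens" on average
```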
Positional encoding
A technique for injecting position information into transformer inputs. The original transformer uses sinusoidal encodings at different frequencies.
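A sketch of the sinusoidal scheme, assuming an even \(d_{\text{model}}\):
```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine, at geometrically
    # spaced frequencies, following the original transformer formulation.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```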
Power law
A relationship of the form \(y = ax^b\), appearing as a straight line on log-log axes. Scaling laws follow power laws.
Projection
A linear transformation reducing or changing dimensionality: \(\mathbf{y} = \mathbf{W}\mathbf{x}\) where \(\mathbf{W}\) projects from one space to another.

Q

Query
In attention, a vector representing what information a position is looking for. Queries are compared against keys to compute attention scores.

R

Random variable
A variable representing an uncertain outcome. Written in uppercase (\(X\)) to distinguish from specific values (\(x\)).
ReLU
Rectified Linear Unit. The activation function \(\text{ReLU}(x) = \max(0, x)\). Simple, computationally efficient, and widely used.
Residual connection
A skip connection adding a layer’s input to its output: \(\mathbf{y} = f(\mathbf{x}) + \mathbf{x}\). Enables training of very deep networks by providing gradient shortcuts.
RLHF
Reinforcement Learning from Human Feedback. A technique for aligning language models with human preferences using a learned reward model and policy optimization.
RNN
Recurrent Neural Network. A network that processes sequences by maintaining a hidden state updated at each time step.

S

Scaling laws
Empirical relationships showing that language model loss decreases as a power law with compute, data, and parameters.
Self-attention
Attention where queries, keys, and values all come from the same sequence. Each position attends to every other position (including itself).
Sigmoid
The activation function \(\sigma(x) = \frac{1}{1 + e^{-x}}\), squashing inputs to the range \((0, 1)\).
Softmax
A function converting a vector of real numbers to a probability distribution: \(\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\). Outputs are positive and sum to 1.
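A minimal NumPy sketch for a single vector, including the standard max-subtraction trick for numerical stability:
```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # positive entries that sum to 1
```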
Subword tokenization
Breaking text into units smaller than words but larger than characters. Balances vocabulary size with coverage. BPE is a common algorithm.

T

Teacher forcing
A training technique where the model receives true previous tokens as input, rather than its own predictions. Standard for training autoregressive models.
Token
The basic unit of text processed by a model, typically a word, subword, or character depending on the tokenization scheme.
Transformer
An architecture based on self-attention that processes all positions in parallel. Introduced in “Attention Is All You Need” (2017).

V

Value
In attention, vectors containing the actual content that gets retrieved and combined according to attention weights.
Vanishing gradient
A problem where gradients become exponentially small as they propagate through many layers, preventing learning in early layers.
Variance
A measure of spread: \(\text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\). The expected squared deviation from the mean.
Vector space
A set of vectors closed under addition and scalar multiplication, satisfying certain axioms (associativity, commutativity, identity, etc.).
Vocabulary
The set of all tokens a model can process, with size \(V\). Each token maps to a row in the embedding matrix.

W

Warmup
A learning rate schedule that starts with a small learning rate and gradually increases it. Stabilizes early training when gradients are noisy.
Weight
A learnable parameter in a neural network that scales inputs. Organized into weight matrices for efficient computation.
Weight sharing
Using the same parameters across different parts of a model. RNNs share weights across time steps; transformers share weights across positions.