7 Embeddings
After completing this chapter, you will be able to:
- Explain why one-hot encoding is insufficient for representing words
- Describe how embedding matrices map discrete tokens to continuous vectors
- Compute cosine similarity between word embeddings
- Understand the Skip-gram objective for learning embeddings from context
- Apply byte-pair encoding (BPE) for subword tokenization
Before a neural network can process text, it needs numbers. Words like “cat,” “dog,” and “philosophy” must become vectors that capture their meanings. This chapter develops the theory of embeddings: dense, learned representations that map discrete tokens into continuous vector spaces where similar concepts are nearby.
7.1 The problem with discrete tokens
Consider a vocabulary of \(V\) words. The simplest way to convert words to numbers is one-hot encoding: represent word \(i\) as a vector \(\mathbf{e}_i \in \mathbb{R}^V\) with a 1 in position \(i\) and 0s elsewhere.
For a vocabulary of 50,000 words:
\[ \text{cat} \to [0, 0, \ldots, 1, \ldots, 0, 0]^T \quad \text{(1 in position 3,142)} \]
\[ \text{dog} \to [0, 0, \ldots, 1, \ldots, 0, 0]^T \quad \text{(1 in position 7,891)} \]
This representation has serious problems:
High dimensionality. Each vector has dimension \(V\), which can be tens or hundreds of thousands. This is computationally expensive.
Sparsity. Each vector has exactly one nonzero entry. Most computation involves multiplying by zeros.
No similarity structure. The dot product of any two different one-hot vectors is zero: \(\mathbf{e}_i^T \mathbf{e}_j = 0\) for \(i \neq j\). By this metric, “cat” is equally dissimilar to “dog” and to “philosophy.” There’s no notion that some words are more related than others.
No generalization. If the network learns something about “cat,” that knowledge doesn’t transfer to “dog” because their representations share no structure. The network must learn everything about every word from scratch.
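To make the lack of similarity structure concrete, here is a minimal NumPy sketch (the vocabulary size and word positions are made up for illustration):

```python
import numpy as np

V = 10  # illustrative vocabulary size

def one_hot(index, size):
    """One-hot vector: all zeros except a 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

cat, dog, philosophy = one_hot(2, V), one_hot(5, V), one_hot(9, V)

# Distinct one-hot vectors are always orthogonal: every dot product is zero,
# so "cat" is exactly as dissimilar to "dog" as it is to "philosophy".
print(cat @ dog, cat @ philosophy)  # 0.0 0.0
```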
What we want is a representation where:
- Vectors are dense and low-dimensional (say, 256 or 512 dimensions instead of 50,000)
- Similar words have similar vectors
- Relationships between words are encoded geometrically
7.2 Word embeddings
A word embedding maps each word to a dense vector. Instead of 50,000-dimensional one-hot vectors, we use vectors with perhaps 256 or 512 dimensions that we learn from data.
We store all word embeddings in a single matrix \(\mathbf{E} \in \mathbb{R}^{V \times d}\), where \(V\) is the vocabulary size (number of words, e.g., 50,000) and \(d\) is the embedding dimension (e.g., 256). Each row of \(\mathbf{E}\) contains the embedding for one word: row \(i\) holds the embedding for word \(i\). So if “cat” has index 3142 in our vocabulary, then row 3142 of \(\mathbf{E}\) is the embedding vector for “cat”.
To look up an embedding, we retrieve row \(i\) of \(\mathbf{E}\). Mathematically, this can be written as a matrix multiplication with a one-hot vector:
\[ \mathbf{e}_i = \mathbf{E}^T \mathbf{x} \]
Here \(\mathbf{x} \in \mathbb{R}^V\) is the one-hot vector for word \(i\) (all zeros except a 1 at position \(i\)), \(\mathbf{E}^T \in \mathbb{R}^{d \times V}\) is the transposed embedding matrix, and \(\mathbf{e}_i \in \mathbb{R}^d\) is the resulting embedding vector.
Why does this work? When you multiply \(\mathbf{E}^T\) by a one-hot vector, you’re computing a weighted sum of columns of \(\mathbf{E}^T\), but all weights are zero except one. So you just select the \(i\)-th column of \(\mathbf{E}^T\), which is the \(i\)-th row of \(\mathbf{E}\). That’s exactly the embedding we want.
In practice, we skip the matrix multiplication and just look up row \(i\) directly. But the matrix formulation matters when we compute gradients during training.
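Here is a quick NumPy sketch (with a randomly initialized matrix of illustrative size) confirming that the matrix formulation and the direct row lookup give the same vector:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                   # tiny illustrative vocabulary size and embedding dimension
E = rng.normal(size=(V, d))   # stand-in for a learned embedding matrix

i = 2                         # index of some word
x = np.zeros(V)
x[i] = 1.0                    # its one-hot vector

via_matmul = E.T @ x          # the matrix formulation: E^T x
via_lookup = E[i]             # the direct row lookup used in practice

print(np.allclose(via_matmul, via_lookup))  # True
```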
Let’s see a concrete example. Suppose we have a tiny vocabulary of 5 words and embeddings of dimension 3:
| Word | Index | Embedding |
|---|---|---|
| cat | 0 | \([0.2, 0.8, -0.1]\) |
| dog | 1 | \([0.3, 0.7, -0.2]\) |
| fish | 2 | \([0.1, 0.9, 0.3]\) |
| car | 3 | \([-0.5, 0.1, 0.6]\) |
| truck | 4 | \([-0.4, 0.2, 0.5]\) |
The embedding matrix is:
\[ \mathbf{E} = \begin{bmatrix} 0.2 & 0.8 & -0.1 \\ 0.3 & 0.7 & -0.2 \\ 0.1 & 0.9 & 0.3 \\ -0.5 & 0.1 & 0.6 \\ -0.4 & 0.2 & 0.5 \end{bmatrix} \]
Notice that “cat” and “dog” have similar embeddings (both are animals), and “car” and “truck” have similar embeddings (both are vehicles). The embedding space captures semantic relationships.
7.2.1 Measuring similarity
In a good embedding space, similar words have similar vectors. But what does “similar vectors” mean? We measure it with cosine similarity:
\[ \text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^T \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} \]
Here \(\mathbf{u}\) and \(\mathbf{v}\) are two embedding vectors in \(\mathbb{R}^d\). The numerator \(\mathbf{u}^T \mathbf{v} = \sum_{i=1}^d u_i v_i\) is their dot product, and the denominator \(\|\mathbf{u}\| \|\mathbf{v}\|\) multiplies their lengths (where \(\|\mathbf{u}\| = \sqrt{\sum_{i=1}^d u_i^2}\)). Dividing by both lengths normalizes for magnitude, so we only care about direction, not how long the vectors are.
Geometrically, cosine similarity equals \(\cos\theta\) where \(\theta\) is the angle between the vectors. When \(\text{sim} = +1\), the vectors point in the same direction (angle = 0°). When \(\text{sim} = 0\), they’re perpendicular (angle = 90°). When \(\text{sim} = -1\), they point in opposite directions (angle = 180°).
For our example embeddings:
\[ \text{sim}(\text{cat}, \text{dog}) = \frac{0.2 \cdot 0.3 + 0.8 \cdot 0.7 + (-0.1)(-0.2)}{\sqrt{0.04 + 0.64 + 0.01}\sqrt{0.09 + 0.49 + 0.04}} \]
\[ = \frac{0.06 + 0.56 + 0.02}{\sqrt{0.69}\sqrt{0.62}} = \frac{0.64}{0.831 \cdot 0.787} = \frac{0.64}{0.654} \approx 0.98 \]
Very high similarity! Let’s compare “cat” and “car”:
\[ \text{sim}(\text{cat}, \text{car}) = \frac{0.2 \cdot (-0.5) + 0.8 \cdot 0.1 + (-0.1) \cdot 0.6}{\sqrt{0.69}\sqrt{0.25 + 0.01 + 0.36}} \]
\[ = \frac{-0.1 + 0.08 - 0.06}{\sqrt{0.69}\sqrt{0.62}} = \frac{-0.08}{0.654} \approx -0.12 \]
Much lower (even slightly negative). The embedding space captures that “cat” and “dog” are more similar to each other than either is to “car.”
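These computations are easy to verify in code. A minimal sketch using the toy embeddings from the table above:

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product divided by the product of vector lengths: cos(theta)."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

cat = np.array([0.2, 0.8, -0.1])
dog = np.array([0.3, 0.7, -0.2])
car = np.array([-0.5, 0.1, 0.6])

print(round(cosine_similarity(cat, dog), 2))  # 0.98
print(round(cosine_similarity(cat, car), 2))  # -0.12
```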
7.2.2 The geometry of meaning
Good embeddings have remarkable geometric properties. The most famous is analogical reasoning:
\[ \mathbf{e}_{\text{king}} - \mathbf{e}_{\text{man}} + \mathbf{e}_{\text{woman}} \approx \mathbf{e}_{\text{queen}} \]
where \(\mathbf{e}_{\text{word}}\) denotes the embedding vector for that word.
What does this equation say? Rearranging: \(\mathbf{e}_{\text{king}} - \mathbf{e}_{\text{man}} \approx \mathbf{e}_{\text{queen}} - \mathbf{e}_{\text{woman}}\). The vector difference from “man” to “king” is approximately equal to the vector difference from “woman” to “queen”. Both differences point in the “royalty” direction.
Why does this work? During training, words that appear in similar contexts get similar embeddings. “King” and “queen” appear in similar contexts (ruling, crowns, thrones), so they’re close. But “king” also appears in contexts similar to “man” (he, his, himself), while “queen” appears in contexts similar to “woman” (she, her, herself). The embedding space learns to separate these orthogonal dimensions: one for gender, one for royalty, and they combine additively.
This isn’t magic or reasoning. It’s a geometric consequence of how embeddings are trained. If the training data consistently uses masculine pronouns with “king” and feminine pronouns with “queen,” the difference vectors \(\mathbf{e}_{\text{king}} - \mathbf{e}_{\text{man}}\) and \(\mathbf{e}_{\text{queen}} - \mathbf{e}_{\text{woman}}\) will both point in the “royalty” direction.
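Given any table of word vectors, the analogy check is just vector arithmetic followed by a nearest-neighbor search. The sketch below uses invented 3-dimensional vectors, not real pretrained embeddings, purely to show the mechanics:

```python
import numpy as np

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings, invented so the "royalty" and "gender" directions are visible.
embeddings = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.7, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "car":   np.array([-0.5, 0.3, 0.4]),
}

target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# Nearest neighbor by cosine similarity, excluding the three query words themselves.
candidates = {w: v for w, v in embeddings.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine_similarity(candidates[w], target))
print(best)  # queen
```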
7.3 Learning embeddings
Where do embeddings come from? They’re learned from data. This section covers the classical approach (Word2Vec) which, while no longer state-of-the-art, provides essential intuition for understanding how neural networks learn meaningful representations.
7.3.1 The distributional hypothesis
The foundation of learned embeddings is the distributional hypothesis: words that appear in similar contexts have similar meanings. “You shall know a word by the company it keeps.”
Consider the blanks in:
- “The ___ sat on the mat.”
- “The ___ chased the mouse.”
- “I fed my ___ some treats.”
Words like “cat,” “dog,” and “hamster” could fill these blanks. They appear in similar contexts, so they should have similar embeddings. Words like “democracy” or “algorithm” wouldn’t fit these contexts, so they should have different embeddings.
7.3.2 Skip-gram: predicting context from words
Historical note: Skip-gram (2013) is not used in modern transformers. We cover it because it builds intuition for how embeddings are learned. Modern transformers learn embeddings end-to-end as part of the full model, not as a separate pre-training step.
The Skip-gram model (part of Word2Vec) learns embeddings by training a neural network to predict context words given a center word.
The setup. Given a sentence, we slide a window across it. At each position, the word in the middle is the “center word” and the surrounding words are “context words”.
For example, in “the cat sat on the mat” with window size \(c=2\):
| Center word | Context words (within 2 positions) |
|---|---|
| sat | the, cat, on, the |
The model learns to predict context words from the center word.
Interestingly, the model uses two separate embedding matrices: \(\mathbf{W} \in \mathbb{R}^{V \times d}\) for words when they’re the center, and \(\mathbf{W}' \in \mathbb{R}^{V \times d}\) for words when they’re in the context. Here \(V\) is the vocabulary size and \(d\) is the embedding dimension. We write \(\mathbf{w}_i\) for row \(i\) of \(\mathbf{W}\) and \(\mathbf{w}'_i\) for row \(i\) of \(\mathbf{W}'\).
Given a center word \(w_c\), what’s the probability that \(w_o\) appears in its context? The model computes:
\[ P(w_o | w_c) = \frac{\exp(\mathbf{w}'_o{}^T \mathbf{w}_c)}{\sum_{w=1}^{V} \exp(\mathbf{w}'_w{}^T \mathbf{w}_c)} \]
Here \(\mathbf{w}_c \in \mathbb{R}^d\) is the center word’s embedding (from \(\mathbf{W}\)) and \(\mathbf{w}'_o \in \mathbb{R}^d\) is the context word’s embedding (from \(\mathbf{W}'\)). The numerator computes the dot product \(\mathbf{w}'_o{}^T \mathbf{w}_c\). When this is high, the vectors point in similar directions, indicating a likely context word. The denominator sums over all \(V\) words in the vocabulary, turning the scores into a proper probability distribution (this is the softmax).
The training objective is to maximize the probability of the context words we actually observe in the corpus:
\[ \mathcal{L} = \sum_{(w_c, w_o) \in D} \log P(w_o | w_c) \]
where \(D\) is the set of all (center, context) pairs extracted from the training text. We take the log because it converts products to sums, which is easier to optimize with gradient descent.
Why does this learn good embeddings? To make \(P(w_o | w_c)\) large, the model needs the dot product \(\mathbf{w}'_o{}^T \mathbf{w}_c\) to be large. If “cat” often appears near “pet,” “fur,” and “purr,” then \(\mathbf{w}_{\text{cat}}\) must point in a direction that has high dot product with \(\mathbf{w}'_{\text{pet}}\), \(\mathbf{w}'_{\text{fur}}\), and \(\mathbf{w}'_{\text{purr}}\). Similarly, “dog” also appears near “pet” and “fur” (though not “purr”, maybe “bark” instead). So \(\mathbf{w}_{\text{dog}}\) must also point toward \(\mathbf{w}'_{\text{pet}}\) and \(\mathbf{w}'_{\text{fur}}\). This forces “cat” and “dog” to have similar embeddings because they’re being pulled toward the same context words.
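The Skip-gram probability is a softmax over \(V\) dot products. A small NumPy sketch with randomly initialized stand-ins for \(\mathbf{W}\) and \(\mathbf{W}'\) (sizes chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                              # tiny illustrative vocabulary and embedding size
W  = rng.normal(scale=0.1, size=(V, d))   # center-word embeddings
Wp = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings (W' in the text)

def p_context_given_center(center_idx):
    """P(. | w_c): softmax over the dot product of every context embedding with w_c."""
    scores = Wp @ W[center_idx]            # V dot products -- this is the expensive part
    scores -= scores.max()                 # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # normalize into a probability distribution

probs = p_context_given_center(center_idx=3)
print(probs.shape, probs.sum())            # (10,) 1.0 (up to floating-point rounding)
```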
7.3.3 Negative sampling
There’s a computational problem with skip-gram. The softmax denominator sums over all \(V\) words in the vocabulary: \(\sum_{w=1}^{V} \exp(\mathbf{w}'_w{}^T \mathbf{w}_c)\). With \(V = 50{,}000\) words, that’s 50,000 dot products per training example. Far too slow.
Negative sampling sidesteps this by changing the question. Instead of asking “which of 50,000 words is the context?” (a \(V\)-way classification), we ask “is this specific word a real context word, or a random fake?” (a binary classification).
The idea is simple. For each real (center, context) pair we observe in the corpus, we also grab \(k\) random words that aren’t in the context. The real context word is a positive sample; the random words are negative samples. We train the model to say “yes” to the positive and “no” to the negatives.
This transforms our objective. For center word \(w_c\), real context word \(w_o\), and \(k\) random negative words \(w_1, \ldots, w_k\), we maximize:
\[ \mathcal{L} = \log \sigma(\mathbf{w}'_o{}^T \mathbf{w}_c) + \sum_{i=1}^k \log \sigma(-\mathbf{w}'_{w_i}{}^T \mathbf{w}_c) \]
Here \(\sigma(x) = \frac{1}{1 + e^{-x}}\) is the sigmoid function, which squashes any real number to a probability between 0 and 1. The first term handles the positive sample: \(\mathbf{w}'_o{}^T \mathbf{w}_c\) is the dot product between center and context embeddings, and we want \(\sigma\) of this to be close to 1, meaning “yes, this is a real context word.” The sum handles negatives: for each fake word \(w_i\), we take the negative of the dot product before applying \(\sigma\), because we want the model to output high confidence that these are not context words.
Why does this give the same learning signal as softmax? Think about what the model must do. To make \(\sigma(\mathbf{w}'_o{}^T \mathbf{w}_c)\) large, the dot product between real context words and the center must be large, meaning they need to point in similar directions. To make \(\sigma(-\mathbf{w}'_{w_i}{}^T \mathbf{w}_c)\) large, the dot product for random words must be small or negative. The model learns to pull real context words close and push random words away. Same outcome as softmax, but we only compute \(k+1\) dot products instead of \(V\). Typically \(k\) is between 5 and 20.
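A sketch of the negative-sampling objective for a single training pair, again with made-up matrices (the values of \(V\), \(d\), and \(k\) are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(W, Wp, center, context, negatives):
    """Negative of the objective above, so minimizing the loss maximizes the objective."""
    pos_score = Wp[context] @ W[center]        # real pair: want sigmoid(score) near 1
    neg_scores = Wp[negatives] @ W[center]     # k fake pairs: want sigmoid(-score) near 1
    objective = np.log(sigmoid(pos_score)) + np.sum(np.log(sigmoid(-neg_scores)))
    return -objective

rng = np.random.default_rng(0)
V, d, k = 50, 8, 5
W  = rng.normal(scale=0.1, size=(V, d))
Wp = rng.normal(scale=0.1, size=(V, d))
negatives = rng.integers(0, V, size=k)         # k random "fake" context words
print(neg_sampling_loss(W, Wp, center=3, context=17, negatives=negatives))
# Only k + 1 = 6 dot products, instead of V = 50 for the full softmax.
```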
7.3.4 A worked example
Let’s trace through one training step to make this concrete. Our corpus is the sentence “the cat sat on the mat” and we’re using a context window of size 1 (one word on each side of the center). For illustration, we’ll use tiny 3-dimensional embeddings.
When the center word is “cat”, the context words are “the” (to the left) and “sat” (to the right). This gives us two training pairs: (cat, the) and (cat, sat). Let’s process the first one.
Suppose the embeddings are currently (after random initialization):
\[ \mathbf{w}_{\text{cat}} = \begin{bmatrix} 0.1 \\ 0.2 \\ -0.1 \end{bmatrix}, \quad \mathbf{w}'_{\text{the}} = \begin{bmatrix} 0.3 \\ 0.1 \\ 0.2 \end{bmatrix} \]
We compute the dot product to see how aligned these vectors are:
\[ \mathbf{w}'_{\text{the}}{}^T \mathbf{w}_{\text{cat}} = 0.3 \times 0.1 + 0.1 \times 0.2 + 0.2 \times (-0.1) = 0.03 + 0.02 - 0.02 = 0.03 \]
The dot product is 0.03, barely positive. Passing through the sigmoid: \(\sigma(0.03) = \frac{1}{1 + e^{-0.03}} \approx 0.507\). The model predicts a 50.7% chance that “the” is a context word of “cat”, essentially a coin flip. Since “the” actually is a context word, this is a poor prediction. Gradient descent will nudge both embeddings to increase their dot product, making them more aligned.
Now for a negative sample. We randomly pick a word that isn’t in the context of “cat”, say “algorithm”. Its embedding is \(\mathbf{w}'_{\text{algorithm}} = [0.5, -0.3, 0.1]^T\).
The dot product is:
\[ \mathbf{w}'_{\text{algorithm}}{}^T \mathbf{w}_{\text{cat}} = 0.5 \times 0.1 + (-0.3) \times 0.2 + 0.1 \times (-0.1) = 0.05 - 0.06 - 0.01 = -0.02 \]
For negative samples, we want the model to confidently say “no, this is not a context word.” We achieve this by feeding the negative of the dot product into the sigmoid: \(\sigma(-(-0.02)) = \sigma(0.02) \approx 0.505\). This is slightly above 0.5, meaning the model has a slight inclination that “algorithm” isn’t a context word. Gradient descent will push the dot product more negative, increasing this confidence.
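The two predictions in this walkthrough can be reproduced in a few lines (vectors copied from above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w_cat        = np.array([0.1, 0.2, -0.1])    # center embedding for "cat"
wp_the       = np.array([0.3, 0.1, 0.2])     # context embedding for "the" (positive sample)
wp_algorithm = np.array([0.5, -0.3, 0.1])    # context embedding for "algorithm" (negative sample)

print(wp_the @ w_cat, sigmoid(wp_the @ w_cat))                  # ~0.03, ~0.507
print(wp_algorithm @ w_cat, sigmoid(-(wp_algorithm @ w_cat)))   # ~-0.02, ~0.505
```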
After millions of such updates across a large corpus, a pattern emerges. Words that frequently appear together develop aligned embeddings (high dot products). Words that rarely co-occur develop orthogonal or opposing embeddings (low or negative dot products). And crucially, words that share similar contexts, like “cat” and “dog” which both appear near “pet”, “fur”, and “fed”, end up with similar embeddings because they’re both being pulled toward the same set of context words.
7.4 Subword tokenization
Word-level embeddings have a problem: what about words not in the vocabulary? Misspellings, rare words, technical jargon, new slang, and morphological variants (run, running, runner) all need handling. A vocabulary of 50,000 words sounds large, but it can’t cover everything.
Subword tokenization solves this by splitting words into smaller pieces that can be combined. The vocabulary contains common words whole, but rare words get broken into subword units. This way, even a word the model has never seen can be represented as a combination of familiar pieces.
7.4.1 Byte Pair Encoding (BPE)
State of the art: BPE (2015) and its variants are used in most modern large language models, including GPT-2, GPT-3, GPT-4, and LLaMA.
Byte Pair Encoding builds a vocabulary by starting small and iteratively growing it. The algorithm begins with a vocabulary of just individual characters: every letter, digit, and punctuation mark. Then it scans the training corpus, counts how often each pair of adjacent tokens appears, and merges the most frequent pair into a new token. This process repeats until the vocabulary reaches the desired size.
Let’s trace through a tiny example. Suppose our corpus is just three words: “low”, “lower”, and “lowest”. We start by representing these as character sequences:
\[ \text{low} \to \texttt{l o w} \qquad \text{lower} \to \texttt{l o w e r} \qquad \text{lowest} \to \texttt{l o w e s t} \]
The initial vocabulary is \(\{\texttt{l}, \texttt{o}, \texttt{w}, \texttt{e}, \texttt{r}, \texttt{s}, \texttt{t}\}\), just the characters that appear. Now we count adjacent pairs across the corpus. The pair \((\texttt{l}, \texttt{o})\) appears 3 times (once in each word). The pair \((\texttt{o}, \texttt{w})\) also appears 3 times. Suppose we break ties alphabetically and merge \((\texttt{l}, \texttt{o})\) first.
After the merge, our vocabulary grows to \(\{\texttt{l}, \texttt{o}, \texttt{w}, \texttt{e}, \texttt{r}, \texttt{s}, \texttt{t}, \texttt{lo}\}\), and the corpus becomes:
\[ \text{low} \to \texttt{lo w} \qquad \text{lower} \to \texttt{lo w e r} \qquad \text{lowest} \to \texttt{lo w e s t} \]
We count pairs again. Now \((\texttt{lo}, \texttt{w})\) appears 3 times, the most frequent. We merge it:
\[ \text{low} \to \texttt{low} \qquad \text{lower} \to \texttt{low e r} \qquad \text{lowest} \to \texttt{low e s t} \]
Continuing this process, the next most frequent pair is \((\texttt{low}, \texttt{e})\), which appears twice; merging it gives \(\texttt{lowe}\), and further merges eventually rebuild \(\texttt{lower}\) and \(\texttt{lowest}\) as single tokens. On a realistic corpus containing many words that end in -er and -est, those suffixes co-occur across thousands of different words, so the learned vocabulary would also include reusable subword tokens like \(\texttt{er}\) and \(\texttt{est}\) alongside whole words like \(\texttt{low}\).
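The learning loop is short enough to write out in full. This is a teaching toy on the three-word corpus, with ties broken alphabetically as above, not a production tokenizer:

```python
from collections import Counter

corpus = [list("low"), list("lower"), list("lowest")]  # each word as a list of character tokens

def count_pairs(words):
    """Count adjacent token pairs across the whole corpus."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

for step in range(4):                                   # learn a few merges
    pairs = count_pairs(corpus)
    top = max(pairs.values())
    best = min(p for p in pairs if pairs[p] == top)     # alphabetical tie-break
    corpus = merge_pair(corpus, best)
    print(step + 1, best, corpus)
# Step 1 merges ('l', 'o') and step 2 merges ('lo', 'w'), matching the trace above;
# step 3 merges ('low', 'e'), the most frequent remaining pair.
```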
At inference time, we tokenize new text using the learned merges. A common word like “lowest” might become a single token \([\texttt{lowest}]\). A rare word like “lowest-ever” might become \([\texttt{lowest}, \texttt{-}, \texttt{ever}]\). And a completely novel word like “transformerize” might become \([\texttt{transform}, \texttt{er}, \texttt{ize}]\). Each piece is familiar even though the whole word is new.
7.4.2 Properties of subword tokenization
Subword tokenization has several appealing properties. First, it handles unknown words gracefully. Even a word the model has never encountered can be broken into subword units that the model has seen. The word “transformerize” doesn’t need to be in the vocabulary; its pieces “transform”, “er”, and “ize” are enough.
Second, related words share subword units, which means they share part of their representation. The words “playing”, “played”, and “player” might all contain the token “play”. This gives the model a head start on understanding new morphological variants. If it knows what “play” means, it has a foundation for understanding “playable” even without seeing that exact word in training.
Third, the vocabulary size stays manageable. Common words get single tokens, which is efficient. Rare words get split into multiple tokens, which takes more computation but ensures everything is representable. A vocabulary of 50,000 subword tokens can effectively cover far more than 50,000 words.
Modern transformers use variants of this idea. WordPiece (used in BERT) is similar to BPE but uses a slightly different scoring function for merges. SentencePiece (used in T5 and LLaMA) operates directly on raw text without pre-tokenization, making it language-agnostic. The exact algorithms differ, but the principle is the same: learn a vocabulary of subword units that balances coverage and efficiency.
7.4.3 Token embeddings in transformers
In a transformer, each subword token has a learned embedding stored in the embedding matrix \(\mathbf{E} \in \mathbb{R}^{V \times d}\), where \(V\) is the vocabulary size (number of unique tokens) and \(d\) is the model dimension (embedding size). The total number of parameters in the embedding layer is \(V \times d\).
How big is this in practice? Let’s compute for real models:
| Model | \(V\) (vocab) | \(d\) (dimension) | Embedding parameters |
|---|---|---|---|
| GPT-2 | 50,257 | 768 | 38.6 million |
| GPT-3 | 50,257 | 12,288 | 617 million |
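The counts in the table are just \(V \times d\); a quick check in code (vocabulary sizes and dimensions as published for these models):

```python
for name, V, d in [("GPT-2", 50_257, 768), ("GPT-3", 50_257, 12_288)]:
    print(f"{name}: {V * d / 1e6:.1f} million embedding parameters")
# GPT-2: 38.6 million
# GPT-3: 617.6 million
```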
GPT-3’s embedding matrix alone has more parameters than entire earlier models.
How are they trained? The embedding matrix \(\mathbf{E}\) is initialized randomly and trained jointly with the rest of the model via backpropagation. The training signal comes from the language modeling objective: predict the next token. Embeddings that help make good predictions survive; others get updated.
7.5 Properties of transformer embeddings
State of the art: The approach described here, learned embeddings combined with contextual attention layers, is how all modern large language models work (GPT-4, Claude, LLaMA, etc.).
Embeddings in transformers have interesting properties that emerge from training.
7.5.1 Contextual vs. static embeddings
Word2Vec embeddings are static: “bank” has one embedding regardless of context. But “bank” means different things in “river bank” and “bank account.”
Transformer embeddings start as static (the lookup from \(\mathbf{E}\)), but then get transformed by the attention layers into contextual embeddings. After passing through transformer layers, the representation of “bank” depends on the surrounding words. In “river bank,” it might be close to “shore” and “water.” In “bank account,” it might be close to “money” and “finance.”
The initial embedding \(\mathbf{E}\) provides a starting point. The attention layers modify it based on context. This is one of the key innovations of transformers over earlier approaches.
7.5.2 Embedding space geometry
Studies of transformer embedding spaces reveal structure:
Linear subspaces for concepts. Directions in embedding space often correspond to interpretable concepts. There might be a “gender direction,” a “tense direction,” a “formality direction.”
Clustering by meaning. Words with similar meanings cluster together. Synonyms are close. Categories form regions.
Analogies still work. The Word2Vec-style analogies often work in transformer embeddings too, though the relationship is more complex because embeddings become contextual after attention.
Anisotropy. Transformer embeddings often occupy a narrow cone rather than filling the space uniformly. This means cosine similarities tend to be high even for unrelated words. Various normalization techniques address this.
7.6 From tokens to sequences
So far we’ve embedded single tokens. But transformers process sequences. Here’s how we go from a sentence to a matrix.
First, we tokenize the text into token indices. For example, “The cat sat” might become [464, 3797, 3332], where each number is an index in the vocabulary. Next, we look up each token’s embedding from the embedding matrix \(\mathbf{E}\). Finally, we stack these embeddings into a matrix:
\[ \mathbf{X} = \begin{bmatrix} \mathbf{e}_{464} \\ \mathbf{e}_{3797} \\ \mathbf{e}_{3332} \end{bmatrix} \in \mathbb{R}^{T \times d} \]
Here \(T = 3\) is the sequence length (number of tokens), \(d\) is the embedding dimension (e.g., 768), and row \(t\) of \(\mathbf{X}\) is the embedding for token \(t\). For “The cat sat” with \(d = 768\), we get a \(3 \times 768\) matrix. Each row is a 768-dimensional vector representing one token.
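In code, going from token indices to the input matrix is a single indexing operation. The sketch below uses a randomly initialized matrix of GPT-2-like size and the illustrative token indices from above:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50_257, 768                        # GPT-2-sized vocabulary and model dimension
E = rng.normal(scale=0.02, size=(V, d))   # stand-in for the learned embedding matrix

token_ids = [464, 3797, 3332]             # "The cat sat" as token indices (illustrative)
X = E[token_ids]                          # stack the corresponding embedding rows

print(X.shape)  # (3, 768): one row per token, ready for the attention layers
```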
This matrix \(\mathbf{X}\) is the input to the transformer. The attention layers will transform it, letting tokens “see” each other and build context-aware representations. But that’s for later chapters. The embedding layer’s job is done: convert token indices to vectors.
7.7 Summary
We’ve seen that:
- Discrete tokens need to become vectors for neural network processing. One-hot encodings are sparse, high-dimensional, and lack similarity structure.
- Word embeddings map tokens to dense vectors where similar words are nearby. The embedding matrix \(\mathbf{E}\) is learned from data.
- Embeddings are learned by predicting context words from a center word (Skip-gram) or a center word from its surrounding context (CBOW), based on the distributional hypothesis.
- Subword tokenization (BPE, WordPiece) handles rare and unknown words by decomposing them into learned subword units.
- Transformer embeddings start static but become contextual after passing through attention layers.
The embedding layer is where text first touches the transformer. It converts a sequence of tokens into a sequence of vectors that the attention mechanism can then relate and transform. In the next chapter, we’ll see how attention works.