11  Positional encoding

Learning objectives

After completing this chapter, you will be able to:

  • Explain why attention is permutation-invariant and why this is problematic
  • Describe the sinusoidal positional encoding scheme
  • Compute positional encodings for any position and dimension
  • Understand how different frequencies encode position at different scales
  • Derive why positional encodings allow learning relative positions

We’ve developed embeddings that map words to dense vectors, and attention mechanisms that let positions exchange information based on relevance. But there’s a fundamental problem lurking in the mathematics: attention has no notion of position. The self-attention operation we derived is completely permutation-invariant. If we shuffle the input sequence, the attention weights change (because different positions now contain different content), but the mechanism treats position 1 the same as position 100. There’s nothing in the math that says “this token came first” or “these tokens are adjacent.”

Why is this a problem? Consider “The dog bit the man” versus “The man bit the dog.” Same words, completely different meanings. Order matters in language. Recurrent networks handled this naturally because they processed sequences step by step. The hidden state at time \(t\) inherently knows it came after time \(t-1\). But transformers process all positions in parallel, gaining speed at the cost of losing positional information. This chapter develops positional encoding, the mechanism that injects order into the permutation-invariant attention operation.

11.1 The position problem

Let’s make the problem concrete. When we compute self-attention, we project inputs to queries, keys, and values, then compute attention scores via dot products:

\[ \alpha_{ij} = \frac{\exp(\mathbf{q}_i^T \mathbf{k}_j / \sqrt{d})}{\sum_k \exp(\mathbf{q}_i^T \mathbf{k}_k / \sqrt{d})} \]

The score between position \(i\) and position \(j\) depends only on the content at those positions (via \(\mathbf{q}_i\) and \(\mathbf{k}_j\)), not on the fact that position \(i\) comes before or after position \(j\). If we swap the content of positions 2 and 5, the attention pattern changes, but not because of position. The mechanism has no way to know that position 2 is near position 1 while position 5 is far away.
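To make the permutation-invariance tangible, here is a minimal NumPy sketch of single-head self-attention with no positional information. The helper names (`softmax`, `self_attention`) and the random weights are ours, purely for illustration: shuffling the input rows simply shuffles the output rows, and nothing in the computation records where each token sat.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

L, d = 6, 8                                   # 6 tokens, 8-dimensional embeddings
X = rng.normal(size=(L, d))                   # token embeddings, no positions added
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)

perm = rng.permutation(L)                     # shuffle the sequence
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# The output rows are identical, just reordered: attention never noticed
# that the tokens were shuffled.
print(np.allclose(out_shuffled, out[perm]))   # True
```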

We could add position as a feature: append the position index to each embedding. Suppose we have a 512-dimensional embedding for the word “cat” at position 3. We could create a 513-dimensional vector by appending the number 3: \([\text{cat embedding}, 3]\). But this has serious issues.

First, positions would be unbounded integers (1, 2, 3, …, 1000, …). A model trained on sequences of length 100 would see position indices from 1 to 100. At test time, if we feed it a sequence of length 200, it encounters position indices 101-200 that it has never seen during training. How should it handle position 150? It has no training data for that value. The model can’t generalize to longer sequences.

Second, the relative scale matters. Consider the distance between consecutive positions. Positions 1 and 2 differ by 1. Positions 100 and 101 also differ by 1. But if we use raw integers, the relative difference is very different: going from 1 to 2 is a 100% increase, while going from 100 to 101 is only a 1% increase. Should early positions be treated as more “spread out” than later positions? That seems arbitrary. We want a representation where the difference between consecutive positions is consistent regardless of absolute position.

Third, position indices have unbounded magnitude. The first position might be represented as 1, but the 10,000th position is represented as 10,000. When these values interact with learned weights in attention, the very large values at late positions could dominate the dot products, creating numerical instability. We need positions to have bounded, comparable magnitudes.

11.2 Adding positional information

The transformer uses positional encoding: we add a position-dependent vector to each token’s embedding. If \(\mathbf{e}_t\) is the embedding for token \(t\) and \(\mathbf{p}_t\) is the positional encoding for position \(t\), the input to the transformer is:

\[ \mathbf{x}_t = \mathbf{e}_t + \mathbf{p}_t \]

The positional encoding \(\mathbf{p}_t \in \mathbb{R}^d\) must have the same dimension \(d\) as the embeddings so we can add them. We’re not concatenating (which would increase dimension), we’re adding. This means position information gets mixed with content information in the same vector space. Each dimension of the resulting vector \(\mathbf{x}_t\) contains both “what is this token?” and “where is this token?” information.

Why add rather than concatenate? Adding keeps the dimension constant, which simplifies the architecture. More importantly, it forces the model to learn how to disentangle position and content. The embedding space becomes richer: similar words at similar positions will have similar vectors, but the same word at different positions will differ slightly. The attention mechanism can learn to use or ignore positional information as needed for different tasks.
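As a quick sanity check on shapes, the following NumPy sketch (with random values standing in for real embeddings and encodings) contrasts adding, which preserves the model dimension, with concatenating, which would grow it:

```python
import numpy as np

batch, L, d = 2, 10, 512
tok = np.random.randn(batch, L, d)     # token embeddings e_t (random stand-ins)
pos = np.random.randn(L, d)            # positional encodings p_t (any scheme)

added = tok + pos                      # broadcasts over the batch: still (2, 10, 512)
concat = np.concatenate([tok, np.broadcast_to(pos, tok.shape)], axis=-1)

print(added.shape)    # (2, 10, 512)  -- dimension unchanged
print(concat.shape)   # (2, 10, 1024) -- concatenation would double it
```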

The key question is: how do we construct \(\mathbf{p}_t\)? We need a function that maps each position \(t\) to a \(d\)-dimensional vector, satisfying several desiderata. The positions should have bounded values (to avoid numerical issues), nearby positions should have similar encodings (to support learning that adjacent words are related), and the encoding should generalize to unseen positions (to handle sequences longer than those seen during training).

11.3 Sinusoidal positional encoding

Let’s start with what’s actually in a positional encoding vector, then understand why. Suppose we have \(d = 8\) dimensions (in practice, transformers use 512 or 768, but 8 is easier to visualize). Here are the positional encoding vectors for the first few positions:

Position 0: \[ \mathbf{p}_0 = [0.00, 1.00, 0.00, 1.00, 0.00, 1.00, 0.00, 1.00] \]

Position 1: \[ \mathbf{p}_1 = [0.84, 0.54, 0.10, 0.99, 0.01, 1.00, 0.00, 1.00] \]

Position 2: \[ \mathbf{p}_2 = [0.91, -0.42, 0.20, 0.98, 0.02, 1.00, 0.00, 1.00] \]

Position 3: \[ \mathbf{p}_3 = [0.14, -0.99, 0.30, 0.95, 0.03, 1.00, 0.00, 1.00] \]

Notice the pattern. The first two dimensions change rapidly from position to position. The middle dimensions change more slowly. The last two dimensions barely change at all. Each pair of dimensions captures position information at a different timescale.

Let’s see how this is constructed. The formula is:

\[ p_{t,2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad p_{t,2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right) \]

Here \(p_{t,j}\) is the \(j\)-th dimension of the positional encoding at position \(t\), and \(i\) indexes the pair of dimensions (the frequency band). For \(d = 8\), the formula tells us:

  • Dimensions 0 and 1 form a pair using frequency \(\omega_0 = 1/10000^{0/8} = 1\)
  • Dimensions 2 and 3 form a pair using frequency \(\omega_1 = 1/10000^{2/8} = 0.1\)
  • Dimensions 4 and 5 form a pair using frequency \(\omega_2 = 1/10000^{4/8} = 0.01\)
  • Dimensions 6 and 7 form a pair using frequency \(\omega_3 = 1/10000^{6/8} = 0.001\)

Each pair uses sine for the even dimension and cosine for the odd dimension, at their shared frequency.
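A direct way to internalize the formula is to implement it. Here is a minimal NumPy sketch (the function name `sinusoidal_encoding` is ours, not from any particular library) that builds the full encoding matrix; its first rows reproduce \(\mathbf{p}_0\) through \(\mathbf{p}_3\) above, up to rounding.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d, base=10000.0):
    """Return the (num_positions, d) matrix of sinusoidal positional encodings."""
    t = np.arange(num_positions)[:, None]          # positions, shape (T, 1)
    i = np.arange(d // 2)[None, :]                 # frequency-band index, shape (1, d/2)
    freqs = 1.0 / base ** (2 * i / d)              # omega_i
    angles = t * freqs                             # shape (T, d/2)
    P = np.empty((num_positions, d))
    P[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    P[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return P

print(np.round(sinusoidal_encoding(4, 8), 2))      # rows p_0 ... p_3
```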

11.3.1 How different frequencies encode position

The key insight is that different frequencies capture position at different scales. Imagine plotting each frequency band across positions: high-frequency bands oscillate rapidly, completing many cycles, while low-frequency bands change very slowly. Together, they create a unique “fingerprint” for each position.

High frequency (dimensions 0-1, \(\omega_0 = 1\)): Completes one full cycle every \(2\pi \approx 6\) positions. Changes rapidly from position to position. These dimensions can distinguish “position 5” from “position 6” but cycle back to similar values every 6 positions. Good for fine-grained, local position information.

Medium frequency (dimensions 2-3, \(\omega_1 = 0.1\)): Completes one full cycle every \(2\pi/0.1 \approx 63\) positions. Changes more slowly. These dimensions are similar for nearby positions (5 and 6 have almost the same value) but different for distant positions (5 and 50 differ significantly). Good for medium-scale position information.

Low frequency (dimensions 4-5, \(\omega_2 = 0.01\)): Completes one full cycle every \(2\pi/0.01 \approx 628\) positions. Changes very slowly. Position 5 and position 50 have nearly identical values in these dimensions. But position 5 and position 500 differ. Good for coarse position information, distinguishing “early in sequence” from “late in sequence.”

Very low frequency (dimensions 6-7, \(\omega_3 = 0.001\)): Barely changes across typical sequences (one cycle every ~6280 positions). Acts like a constant “bias” that varies only across very long sequences.

By combining all frequency bands, each position gets a unique \(d\)-dimensional “fingerprint.” Position 5 might share some values with position 6 (in the low-frequency dimensions), but they differ in high-frequency dimensions. Position 5 and position 500 differ in both medium and low-frequency dimensions.
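We can quantify the "fingerprint" intuition by measuring how similar encodings are as positions move apart. The sketch below rebuilds the same encoding (here with \(d = 64\) for a richer fingerprint) and compares position 5 against progressively more distant positions; the exact numbers depend on \(d\), but the similarity generally falls off with distance.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d, base=10000.0):
    t = np.arange(num_positions)[:, None]
    freqs = 1.0 / base ** (2 * np.arange(d // 2) / d)
    P = np.empty((num_positions, d))
    P[:, 0::2], P[:, 1::2] = np.sin(t * freqs), np.cos(t * freqs)
    return P

P = sinusoidal_encoding(1000, 64)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

for other in (6, 10, 50, 500):
    print(f"similarity(p_5, p_{other}) = {cosine(P[5], P[other]):.3f}")
# Similarity is highest for the immediate neighbor and falls off as the
# positions move apart: smooth locally, distinctive globally.
```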

11.3.2 Visualizing the full encoding matrix

Let’s see what the complete positional encoding looks like for many positions and dimensions. Each cell shows the value \(p_{t,j}\) at position \(t\) and dimension \(j\):

Figure 11.1: Heatmap of positional encodings. Rows are positions (0-63), columns are dimensions (0-15). Bright values are near +1, dark values are near -1. Left columns (low dimension indices, high frequency) show rapid vertical stripes indicating fast oscillation across positions. Right columns (high dimension indices, low frequency) show slow gradual changes. Each position (each row) has a unique pattern across dimensions.

Look at the pattern. The leftmost dimensions (0-1, high frequency) create tight vertical stripes, changing rapidly from position to position. The rightmost dimensions (14-15, low frequency) change very slowly, appearing almost constant across nearby positions. Each row (each position) has a unique pattern of bright and dark cells across the 16 dimensions.
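A heatmap like Figure 11.1 is straightforward to reproduce. Here is a short matplotlib sketch, assuming 64 positions and 16 dimensions as in the figure:

```python
import numpy as np
import matplotlib.pyplot as plt

T, d = 64, 16                                   # positions 0-63, dimensions 0-15
t = np.arange(T)[:, None]
freqs = 1.0 / 10000 ** (2 * np.arange(d // 2) / d)
P = np.empty((T, d))
P[:, 0::2] = np.sin(t * freqs)
P[:, 1::2] = np.cos(t * freqs)

plt.imshow(P, aspect="auto", cmap="viridis")    # rows: positions, columns: dimensions
plt.xlabel("dimension")
plt.ylabel("position")
plt.colorbar(label="encoding value")
plt.title("Sinusoidal positional encodings")
plt.show()
```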

11.3.3 Why both sine and cosine?

Looking at our example vectors again:

\[ \mathbf{p}_0 = [0.00, 1.00, 0.00, 1.00, \ldots] \]

\[ \mathbf{p}_1 = [0.84, 0.54, 0.10, 0.99, \ldots] \]

Notice dimensions 0 and 1: at position 0, we have \((\sin, \cos) = (0.00, 1.00)\). At position 1, we have \((0.84, 0.54)\). Why do we need both? Why not just use \([0.00, 0.00, \ldots]\) and \([0.84, 0.10, \ldots]\) (only sines)?

The problem is ambiguity. Consider \(\sin(\theta)\). If we know \(\sin(\theta) = 0.5\), we can’t determine \(\theta\) uniquely. It could be \(30°\) or it could be \(150°\). Two different positions would have identical sine values, making them indistinguishable.

But if we know both \(\sin(\theta) = 0.5\) AND \(\cos(\theta) = 0.866\), we can uniquely determine \(\theta = 30°\). The pair \((\sin, \cos)\) pins down the position on the unit circle. The two functions are 90° out of phase, providing complementary information.

Geometrically, as position increases, the pair \((\sin(\omega t), \cos(\omega t))\) traces a circular path. Each position gets a unique point on the circle (until wrapping around after one full cycle). With multiple frequency bands, we get multiple circles rotating at different speeds, creating a rich, unique encoding for each position.
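A tiny numerical check makes the point. In the sketch below (the band frequency \(\omega = 0.1\) is just an example), sine alone cannot distinguish two angles, while the \((\sin, \cos)\) pair recovers the position via `atan2`, at least until that band wraps around:

```python
import math

# Sine alone is ambiguous: two different angles can share the same sine value.
print(math.sin(math.radians(30)), math.sin(math.radians(150)))   # both 0.5

# The (sin, cos) pair resolves the ambiguity: atan2 recovers the angle.
omega = 0.1                                    # one example frequency band
for t in (3, 25):                              # positions before this band wraps around
    s, c = math.sin(omega * t), math.cos(omega * t)
    print(t, round(math.atan2(s, c) / omega, 3))   # recovers t while omega * t < pi
```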

11.3.4 Why this encoding works

Consider the encoding for position 5, \(\mathbf{p}_5 = [-0.96, 0.28, 0.48, 0.88, 0.05, 0.999, 0.005, 1.000]\), which we will compute step by step in Section 11.3.6. Why is this better than simpler alternatives like \(\mathbf{p}_5 = [5, 5, 5, \ldots]\)?

Bounded values. All values stay between -1 and 1, regardless of position. Position 1 and position 10,000 both have values in \([-1, 1]\). If we used \(\mathbf{p}_t = [t, t, t, \ldots]\), position 10,000 would have huge values that could destabilize training (gradients would explode, attention scores would saturate).

Unique fingerprints. Each position gets a unique pattern across dimensions. Even though dimension 7 is nearly 1.000 for all positions (low frequency barely varies), the combination of all 8 dimensions uniquely identifies each position. The heatmap we saw earlier shows this: each row (each position) has a distinct pattern of light and dark cells.

Smoothness. Adjacent positions are similar. We saw that position 5 and 6 differ mainly in the high-frequency dimensions (0-1) but are nearly identical in low-frequency dimensions (4-7). This helps the model generalize: what it learns about position 5 can transfer to nearby position 6. With random initialization, positions 5 and 6 might be arbitrarily far apart in embedding space.

Relative position encoding. This is the most important property. The sinusoidal structure means the model can learn to detect relative positions. The encoding for position \(t+k\) is mathematically related to the encoding for position \(t\) via rotation matrices. Without diving into the full derivation, the key is: if the model wants to implement “look 3 tokens back,” it can learn a single transformation that works everywhere, rather than learning separate patterns for “look back from position 10” versus “look back from position 100.”
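We can verify this property numerically for a single frequency band. In the sketch below, the \(2 \times 2\) matrix `R` depends only on the offset \(k\), yet it maps the encoding at position \(t\) to the encoding at position \(t + k\) no matter what \(t\) is:

```python
import numpy as np

def band(t, omega):
    """The (sin, cos) pair of one frequency band at position t."""
    return np.array([np.sin(omega * t), np.cos(omega * t)])

omega, k = 0.1, 3

# A rotation matrix that depends only on the offset k, not on the position t.
R = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

for t in (7, 57, 307):
    print(np.allclose(R @ band(t, omega), band(t + k, omega)))   # True every time
```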

11.3.5 Why these specific frequencies?

The constant 10000 in \(\omega_i = 1/10000^{2i/d}\) might seem arbitrary. Why not 100 or 100000?

The choice of base determines the range of wavelengths, in particular how slowly the slowest frequency varies. With 10000 and \(d = 512\):

  • Fastest frequency: wavelength \(2\pi \approx 6\) positions
  • Slowest frequency: wavelength \(\approx 2\pi \cdot 10000 \approx 62{,}800\) positions

This range covers typical sequence lengths (512-2048 tokens) while ensuring the slowest frequency provides a stable “regional” signal that barely varies across the sequence.

If we used 100 instead, the slowest frequency would have wavelength ≈ 628 positions. For a sequence of length 2048, even the slowest frequency would complete 3 full cycles, losing its role as a stable coarse indicator. If we used 100000, the slowest frequency would barely move even across sequences of 10,000 tokens, wasting representational capacity.

The value 10000 is a practical compromise for sequences up to several thousand tokens.
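A quick calculation shows how the base stretches or compresses this range. The sketch below computes the fastest and slowest wavelengths exactly; the exact slowest band comes out slightly below the \(2\pi \cdot 10000 \approx 62{,}800\) approximation quoted above.

```python
import math

def wavelength_range(base, d):
    """Wavelengths (in positions) of the fastest and slowest frequency bands."""
    fastest = 2 * math.pi                            # omega_0 = 1
    slowest = 2 * math.pi * base ** ((d - 2) / d)    # smallest omega, roughly 2*pi*base
    return fastest, slowest

for base in (100, 10000, 100000):
    fast, slow = wavelength_range(base, 512)
    print(f"base {base:>6}: fastest ~ {fast:.1f} positions, slowest ~ {slow:,.0f} positions")
```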

11.3.6 Computing the encoding step-by-step

Let’s work through the formula with \(d = 8\) and compute the encoding for position \(t = 5\).

Step 1: Compute frequencies

For \(d = 8\), we have 4 frequency bands (\(d/2 = 4\) pairs of dimensions):

\[ \omega_0 = \frac{1}{10000^{0/8}} = 1.0, \quad \omega_1 = \frac{1}{10000^{2/8}} = 0.1 \]

\[ \omega_2 = \frac{1}{10000^{4/8}} = 0.01, \quad \omega_3 = \frac{1}{10000^{6/8}} = 0.001 \]

Step 2: Apply formula for each dimension

At position \(t = 5\):

| Dim | Formula | Computation | Value |
|-----|---------|-------------|-------|
| 0 | \(\sin(\omega_0 \cdot t)\) | \(\sin(1.0 \cdot 5) = \sin(5)\) | -0.96 |
| 1 | \(\cos(\omega_0 \cdot t)\) | \(\cos(1.0 \cdot 5) = \cos(5)\) | 0.28 |
| 2 | \(\sin(\omega_1 \cdot t)\) | \(\sin(0.1 \cdot 5) = \sin(0.5)\) | 0.48 |
| 3 | \(\cos(\omega_1 \cdot t)\) | \(\cos(0.1 \cdot 5) = \cos(0.5)\) | 0.88 |
| 4 | \(\sin(\omega_2 \cdot t)\) | \(\sin(0.01 \cdot 5) = \sin(0.05)\) | 0.05 |
| 5 | \(\cos(\omega_2 \cdot t)\) | \(\cos(0.01 \cdot 5) = \cos(0.05)\) | 0.999 |
| 6 | \(\sin(\omega_3 \cdot t)\) | \(\sin(0.001 \cdot 5) = \sin(0.005)\) | 0.005 |
| 7 | \(\cos(\omega_3 \cdot t)\) | \(\cos(0.001 \cdot 5) = \cos(0.005)\) | 1.000 |

Step 3: Assemble the vector

\[ \mathbf{p}_5 = [-0.96, 0.28, 0.48, 0.88, 0.05, 0.999, 0.005, 1.000] \]

This is the 8-dimensional positional encoding for position 5. Notice how:

  • Dimensions 0-1 (high frequency) have values far from their starting point \((0, 1)\)
  • Dimensions 2-3 (medium frequency) have changed moderately
  • Dimensions 4-5 (low frequency) have barely changed
  • Dimensions 6-7 (very low frequency) are almost identical to position 0
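The whole step-by-step computation fits in a few lines of Python. Running this sketch, which loops over the \(d/2\) frequency bands, reproduces the values in the table above:

```python
import math

d, t = 8, 5
p = []
for i in range(d // 2):
    omega = 1.0 / 10000 ** (2 * i / d)   # frequency of band i
    p.append(math.sin(omega * t))        # even dimension 2i
    p.append(math.cos(omega * t))        # odd dimension 2i + 1

print([round(v, 3) for v in p])
# [-0.959, 0.284, 0.479, 0.878, 0.05, 0.999, 0.005, 1.0]
```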

Now let’s compare positions 5 and 6 to see how the encoding distinguishes adjacent positions:

| Dim | Pos 5 | Pos 6 | Difference |
|-----|-------|-------|------------|
| 0 | -0.96 | -0.28 | 0.68 (large) |
| 1 | 0.28 | 0.96 | 0.68 (large) |
| 2 | 0.48 | 0.56 | 0.08 (small) |
| 3 | 0.88 | 0.83 | 0.05 (small) |
| 4-7 | ≈ same | ≈ same | ≈ 0 (tiny) |

The high-frequency dimensions (0-1) change significantly between adjacent positions, allowing the model to distinguish “position 5” from “position 6.” The low-frequency dimensions stay nearly constant, providing a stable “regional” signal that groups nearby positions together.

This multi-scale representation is the key insight. High-frequency dimensions distinguish adjacent positions. Low-frequency dimensions distinguish distant regions. Together, they provide both fine-grained and coarse-grained positional information, creating a unique fingerprint for each position.

11.4 Learned positional embeddings

Modern transformers often use learned positional embeddings instead of fixed sinusoidal encodings. We treat position encodings like word embeddings and learn them during training. We create a learnable matrix \(\mathbf{P} \in \mathbb{R}^{L_{\max} \times d}\) where \(L_{\max}\) is the maximum sequence length we plan to handle. Row \(t\) of \(\mathbf{P}\) is the encoding for position \(t\). These encodings are initialized randomly and updated during training via backpropagation.
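A minimal PyTorch sketch of this idea (the class and variable names are ours, for illustration) wraps an `nn.Embedding` lookup table and adds the looked-up rows to the token embeddings:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Minimal sketch: a learnable lookup table of positional vectors."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # one learnable row per position

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)   # broadcast over batch

x = torch.randn(2, 10, 512)                  # a batch of token embeddings
layer = LearnedPositionalEmbedding(max_len=512, d_model=512)
print(layer(x).shape)                        # torch.Size([2, 10, 512])
```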

The advantage is simplicity. We don’t need to commit to a particular functional form (sinusoids). The model learns whatever positional representation is most useful for the task. Empirically, learned positional embeddings often work as well as or better than sinusoidal encodings.

The disadvantage is generalization to longer sequences. If we train with \(L_{\max} = 512\), we have learned encodings for positions 0 through 511. At test time, if we encounter a sequence of length 1000, we have no learned encoding for positions 512-999. We could extrapolate by reusing the learned encodings (wrapping around, or extending the pattern), but there’s no guarantee this works well. Sinusoidal encodings generalize naturally to any length because they’re defined by a function, not a lookup table.

In practice, this limitation is often acceptable. Most applications have a known maximum sequence length, and we train on sequences up to that length. For applications requiring variable or very long sequences, researchers use alternative approaches such as relative positional encodings (which encode the distance between positions rather than absolute position) or rotary positional embeddings (which rotate queries and keys by position-dependent angles so that attention scores depend on relative offsets).

11.5 What if we skip positional encoding?

What happens if we skip positional encoding entirely? The transformer can still process the sequence, but it loses all sense of order. It becomes a bag-of-words model, treating “dog bites man” and “man bites dog” identically. The attention mechanism can still find relevant words (if you’re processing “dog” you might attend to “man” and “bites”), but it can’t distinguish “the word before this one” from “the word after this one.”

For some tasks, this might be tolerable. Consider sentiment classification: determining whether a movie review is positive or negative. Word order matters (“not good” vs “good”), but for many reviews, the overall sentiment is clear from the words present regardless of order. A bag-of-words model can achieve reasonable accuracy. But for most language tasks, order is crucial. In machine translation, word order determines both meaning and grammaticality: “Le chat noir” is correct French for “the black cat,” while “Le noir chat” rearranges the same words into something ungrammatical. In question answering, “Who did Alice meet?” and “Who met Alice?” are different questions. Without positional encoding, the transformer can’t distinguish these.

We can verify this experimentally. If we train a transformer without positional encoding on a task that requires understanding word order (like parsing or translation), performance collapses. The model learns something, but far less than with positional encoding. This confirms that the permutation-invariance of bare attention is a bug, not a feature, for sequential data.

Conversely, positional encoding is only necessary because we chose attention as our mixing mechanism. If we used a different architecture (like a CNN with positional filters, or an RNN with sequential processing), position would be implicit. We use positional encoding specifically to fix the position-blindness of attention, gaining the benefits of parallel processing and long-range dependencies while recovering the sequential structure that language requires.

We’ve now added position to our tokens. The input to the transformer is embeddings plus positional encoding, giving each position a unique combination of content and location information. The next step is to process these position-aware embeddings through the transformer’s core architecture: stacked blocks of attention and feed-forward networks that refine the representations layer by layer.