6  Sequence modeling

Learning objectives

After completing this chapter, you will be able to:

  • Explain why feedforward networks struggle with sequential data
  • Describe how recurrent neural networks process sequences step by step
  • Identify the vanishing and exploding gradient problems in RNNs
  • Understand how LSTM and GRU architectures address gradient flow issues
  • Recognize the limitations that motivated the transformer architecture

Language, music, stock prices, weather patterns: the world is full of sequential data where order matters. The sentence “the cat sat on the mat” means something different from “the mat sat on the cat.” Standard feedforward networks treat inputs as fixed-size vectors with no notion of order. This chapter explores how to model sequences, the challenges that arise, and why these challenges ultimately motivated the transformer architecture.

6.1 The problem of sequential data

A sequence is an ordered list of elements: \(\mathbf{x} = (x_1, x_2, \ldots, x_T)\) where \(T\) is the sequence length. In natural language processing, each \(x_t\) might be a word (or subword token). In time series, each \(x_t\) might be a measurement at time \(t\).

Sequential data has several properties that make it challenging:

Variable length. Sentences can have 5 words or 50 words. A model should handle both without being redesigned.

Order dependence. The meaning depends on the order. “Dog bites man” and “man bites dog” contain the same words but mean different things.

Long-range dependencies. Elements far apart can be related. In “The cat, which had been sleeping peacefully on the warm windowsill all afternoon, suddenly woke up,” the verb “woke” must agree with “cat,” not the nearby “windowsill” or “afternoon.”

Context sensitivity. The same word means different things in different contexts. “Bank” means something different in “river bank” versus “bank account.”

Let’s try to apply a feedforward network to sequences and see what goes wrong.

6.2 Why feedforward networks fail

Suppose we want to classify sentences as positive or negative sentiment. A feedforward network needs a fixed-size input vector. How do we represent a sentence?

Approach 1: Fixed-length input. Pad short sentences with zeros and truncate long ones to some maximum length \(T_{\max}\). Represent each word as a one-hot vector of vocabulary size \(V\). The input becomes a \(T_{\max} \times V\) matrix, which we flatten to a vector of dimension \(T_{\max} \cdot V\).

Problems: For vocabulary \(V = 10{,}000\) and maximum length \(T_{\max} = 100\), the input has dimension 1 million. Most entries are zero (sparse). The network must learn separate weights for “good” in position 1, “good” in position 2, etc. It can’t generalize that “good” means the same thing regardless of position.

Approach 2: Bag of words. Sum or average the word vectors, ignoring order. The input is a \(V\)-dimensional vector counting word occurrences.

Problems: “The food was good, not bad” and “The food was bad, not good” have identical bag-of-words representations but opposite meanings. Order information is completely lost.
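To make this concrete, here is a small Python sketch (the whitespace-and-punctuation tokenization and the `bag_of_words` helper are illustrative choices, not a standard recipe) showing that the two sentences map to exactly the same representation:

```python
import re
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Lowercase, strip punctuation, and count words, discarding all order information."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

a = bag_of_words("The food was good, not bad")
b = bag_of_words("The food was bad, not good")
print(a == b)  # True: identical features, opposite meanings
```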

Approach 3: n-grams. Include pairs or triples of consecutive words as features.

Problems: The number of possible n-grams explodes combinatorially. Bigrams give \(V^2\) features, trigrams give \(V^3\). Most are never seen in training. Long-range dependencies spanning more than \(n\) words are still missed.

The fundamental issue is that feedforward networks process inputs as unstructured vectors. They have no built-in notion of sequence, position, or the idea that the same pattern (like the word “good”) might appear at different positions.

6.3 Recurrent neural networks

Recurrent neural networks (RNNs) address this by processing sequences one element at a time while maintaining a hidden state that summarizes what has been seen so far.

At each timestep \(t\), the RNN:

  1. Takes the current input \(x_t\) and the previous hidden state \(\mathbf{h}_{t-1}\)
  2. Computes a new hidden state \(\mathbf{h}_t\)
  3. Optionally produces an output \(\mathbf{y}_t\)

The core equation is:

\[ \mathbf{h}_t = \sigma(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}) \]

where \(\mathbf{W}_h \in \mathbb{R}^{d \times d}\) is the hidden-to-hidden weight matrix, \(\mathbf{W}_x \in \mathbb{R}^{d \times n}\) is the input-to-hidden weight matrix, \(\mathbf{b} \in \mathbb{R}^d\) is the bias, and \(\sigma\) is typically tanh. The hidden dimension \(d\) is a hyperparameter.

The key insight: the same weights are used at every timestep. The network learns to process one element and update its state, then applies this same computation repeatedly. This is called weight sharing or parameter tying.

Let’s trace through a concrete example. Suppose we have a tiny RNN with hidden dimension \(d = 2\), processing a sequence of 3 scalar inputs: \(x_1 = 1, x_2 = -1, x_3 = 2\). Let:

\[ \mathbf{W}_h = \begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.6 \end{bmatrix}, \quad \mathbf{W}_x = \begin{bmatrix} 0.3 \\ 0.4 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]

Initialize \(\mathbf{h}_0 = [0, 0]^T\) (no prior context).

Timestep 1 (\(x_1 = 1\)):

\[ \mathbf{h}_1 = \tanh\left(\begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.6 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0.3 \\ 0.4 \end{bmatrix} \cdot 1\right) = \tanh\begin{bmatrix} 0.3 \\ 0.4 \end{bmatrix} = \begin{bmatrix} 0.291 \\ 0.380 \end{bmatrix} \]

Timestep 2 (\(x_2 = -1\)):

\[ \mathbf{h}_2 = \tanh\left(\begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.6 \end{bmatrix}\begin{bmatrix} 0.291 \\ 0.380 \end{bmatrix} + \begin{bmatrix} 0.3 \\ 0.4 \end{bmatrix} \cdot (-1)\right) \]

\[ = \tanh\left(\begin{bmatrix} 0.146 + 0.038 \\ 0.058 + 0.228 \end{bmatrix} + \begin{bmatrix} -0.3 \\ -0.4 \end{bmatrix}\right) = \tanh\begin{bmatrix} -0.116 \\ -0.114 \end{bmatrix} = \begin{bmatrix} -0.115 \\ -0.114 \end{bmatrix} \]

Timestep 3 (\(x_3 = 2\)):

\[ \mathbf{h}_3 = \tanh\left(\begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.6 \end{bmatrix}\begin{bmatrix} -0.115 \\ -0.114 \end{bmatrix} + \begin{bmatrix} 0.3 \\ 0.4 \end{bmatrix} \cdot 2\right) \]

\[ = \tanh\left(\begin{bmatrix} -0.058 - 0.011 \\ -0.023 - 0.068 \end{bmatrix} + \begin{bmatrix} 0.6 \\ 0.8 \end{bmatrix}\right) = \tanh\begin{bmatrix} 0.531 \\ 0.709 \end{bmatrix} = \begin{bmatrix} 0.486 \\ 0.610 \end{bmatrix} \]

The final hidden state \(\mathbf{h}_3 = [0.486, 0.610]^T\) encodes information about the entire sequence. Notice how each hidden state depends on all previous inputs through the chain of computations.
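The same trace can be reproduced in a few lines of NumPy. This is a minimal sketch of the recurrence, not a full RNN implementation, and the `rnn_step` helper is just a name chosen for illustration. Because the code keeps full precision while the hand trace rounds intermediate values, the third decimal at step 2 differs slightly.

```python
import numpy as np

# Weights from the worked example (the input is a scalar, so W_x is a length-2 vector).
W_h = np.array([[0.5, 0.1],
                [0.2, 0.6]])
W_x = np.array([0.3, 0.4])
b = np.zeros(2)

def rnn_step(h_prev, x_t):
    """One RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x * x_t + b)

h = np.zeros(2)            # h_0: no prior context
for x_t in [1.0, -1.0, 2.0]:
    h = rnn_step(h, x_t)   # the same weights are reused at every timestep
    print(np.round(h, 3))
# [0.291 0.38 ]
# [-0.116 -0.113]   (the hand trace rounds intermediates, hence -0.115/-0.114 above)
# [0.486 0.61 ]
```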

6.3.1 Unrolling the RNN

We can visualize an RNN by “unrolling” it through time:

\[ \mathbf{h}_0 \xrightarrow{x_1} \mathbf{h}_1 \xrightarrow{x_2} \mathbf{h}_2 \xrightarrow{x_3} \cdots \xrightarrow{x_T} \mathbf{h}_T \]

Each arrow represents the same computation with the same weights. The unrolled network looks like a very deep feedforward network, but with tied weights across layers (timesteps).

6.3.2 Training RNNs: backpropagation through time

To train an RNN, we use a variant of backpropagation called backpropagation through time (BPTT). We unroll the network, compute the loss (e.g., at the final timestep or at each timestep), and backpropagate through the unrolled graph.

The gradient with respect to a parameter like \(\mathbf{W}_h\) accumulates contributions from each timestep:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}_h} = \sum_{t=1}^T \frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_h} \]

But here’s where trouble begins. To compute \(\frac{\partial \mathcal{L}}{\partial \mathbf{h}_1}\) when the loss depends on \(\mathbf{h}_T\), we must trace the chain of dependencies:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{h}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_T} \cdot \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_{T-1}} \cdot \frac{\partial \mathbf{h}_{T-1}}{\partial \mathbf{h}_{T-2}} \cdots \frac{\partial \mathbf{h}_2}{\partial \mathbf{h}_1} \]

This is a product of \(T-1\) Jacobian matrices. Let’s examine what these Jacobians look like.

6.4 The vanishing and exploding gradient problem

From the RNN equation \(\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b})\), the Jacobian of \(\mathbf{h}_t\) with respect to \(\mathbf{h}_{t-1}\) is:

\[ \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \text{diag}\left(\frac{d}{dz}\tanh(\mathbf{z}_t)\right) \cdot \mathbf{W}_h \]

where \(\mathbf{z}_t = \mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}\) and \(\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)\). The diagonal matrix \(\text{diag}\left(\frac{d}{dz}\tanh(\mathbf{z}_t)\right)\) has entries in \((0, 1]\) since \(\frac{d}{dz}\tanh \leq 1\).

The gradient from timestep \(T\) back to timestep \(1\) involves:

\[ \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} = \prod_{t=2}^T \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \prod_{t=2}^T \text{diag}\left(\frac{d}{dz}\tanh(\mathbf{z}_t)\right) \cdot \mathbf{W}_h \]

This is a product of \(T-1\) matrices. What happens to this product as \(T\) grows?
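Before analyzing this product mathematically, we can simply measure it on the toy RNN from Section 6.3. The sketch below (random inputs with an arbitrary seed; purely illustrative) accumulates the per-step Jacobians and prints the norm of the result:

```python
import numpy as np

# Toy RNN weights from Section 6.3.
W_h = np.array([[0.5, 0.1],
                [0.2, 0.6]])
W_x = np.array([0.3, 0.4])

def grad_h1(xs):
    """Accumulate d h_T / d h_1 = prod_{t=2..T} diag(1 - tanh^2(z_t)) @ W_h."""
    h = np.tanh(W_x * xs[0])               # h_1 (with h_0 = 0 and b = 0)
    J = np.eye(2)
    for x_t in xs[1:]:
        z = W_h @ h + W_x * x_t
        J = np.diag(1.0 - np.tanh(z) ** 2) @ W_h @ J
        h = np.tanh(z)
    return J

rng = np.random.default_rng(0)
for T in (5, 20, 50):
    J = grad_h1(rng.standard_normal(T))
    print(f"T = {T:3d}   ||dh_T/dh_1|| = {np.linalg.norm(J):.2e}")
```

For this particular \(\mathbf{W}_h\), whose eigenvalues are 0.7 and 0.4, the norm shrinks geometrically with \(T\). The next subsection explains why.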

6.4.1 The problem mathematically

Consider the simplified case where all the \(\text{diag}\left(\frac{d}{dz}\tanh(\mathbf{z}_t)\right)\) matrices are roughly the identity (i.e., \(\frac{d}{dz}\tanh\) values are close to 1). Then:

\[ \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_1} \approx \mathbf{W}_h^{T-1} \]

Let \(\mathbf{W}_h\) have eigenvalue decomposition \(\mathbf{W}_h = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{-1}\) where \(\boldsymbol{\Lambda}\) is diagonal with eigenvalues \(\lambda_1, \ldots, \lambda_d\). Then:

\[ \mathbf{W}_h^{T-1} = \mathbf{V} \boldsymbol{\Lambda}^{T-1} \mathbf{V}^{-1} \]

The eigenvalues get raised to the power \(T-1\):

  • If \(|\lambda_i| < 1\): \(\lambda_i^{T-1} \to 0\) exponentially fast. Vanishing gradient.
  • If \(|\lambda_i| > 1\): \(\lambda_i^{T-1} \to \infty\) exponentially fast. Exploding gradient.
  • If \(|\lambda_i| = 1\): \(\lambda_i^{T-1}\) stays bounded. Stable.

For a typical random matrix, some eigenvalues will have magnitude greater than 1 and some less than 1. Over long sequences, the gradient either explodes or vanishes depending on which eigenvalues dominate.

6.4.2 Concrete example

Let’s see this with numbers. Suppose \(\mathbf{W}_h = \begin{bmatrix} 0.9 & 0 \\ 0 & 1.1 \end{bmatrix}\) (diagonal for simplicity, eigenvalues 0.9 and 1.1). After \(T = 50\) timesteps:

\[ \mathbf{W}_h^{49} = \begin{bmatrix} 0.9^{49} & 0 \\ 0 & 1.1^{49} \end{bmatrix} = \begin{bmatrix} 0.0057 & 0 \\ 0 & 106.7 \end{bmatrix} \]

The first component shrinks to nearly zero (vanishing). The second component grows past 100 (exploding). After 100 timesteps, \(\mathbf{W}_h^{99}\) has diagonal entries of roughly \(3.0 \times 10^{-5}\) and \(12{,}500\). The imbalance is extreme.
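These values are easy to reproduce with a couple of lines of NumPy (a throwaway sketch; the printed values are rounded):

```python
import numpy as np

W_h = np.diag([0.9, 1.1])
for steps in (50, 100):
    P = np.linalg.matrix_power(W_h, steps - 1)
    print(f"after {steps} timesteps: diag = {P[0, 0]:.2e}, {P[1, 1]:.2e}")
# after 50 timesteps: diag = 5.73e-03, 1.07e+02
# after 100 timesteps: diag = 2.95e-05, 1.25e+04
```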

In practice, the \(\frac{d}{dz}\tanh\) factors make vanishing more common than exploding, since \(\frac{d}{dz}\tanh \leq 1\) uniformly shrinks gradients. But either way, learning long-range dependencies becomes very difficult.

6.4.3 Why this matters

When gradients vanish, the network can’t learn long-range dependencies. If the loss depends on the relationship between \(x_1\) and \(x_T\) (like subject-verb agreement in a long sentence), the gradient signal from \(\mathbf{h}_T\) is essentially zero by the time it reaches \(\mathbf{h}_1\). The network has no way to learn that early inputs matter.

When gradients explode, training becomes unstable. Parameters get enormous updates and diverge. A common fix is gradient clipping: if \(\|\nabla_{\boldsymbol{\theta}} \mathcal{L}\| > \tau\) for some threshold \(\tau\), rescale the gradient to have norm \(\tau\). This prevents explosion but doesn’t help with vanishing.
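As a concrete illustration, here is a minimal sketch of clipping by global norm in NumPy (in practice you would use your framework's built-in clipping utility rather than rolling your own):

```python
import numpy as np

def clip_by_global_norm(grads, tau):
    """Rescale a list of gradient arrays so their combined L2 norm is at most tau."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > tau:
        grads = [g * (tau / total_norm) for g in grads]
    return grads

# Example: an "exploded" gradient of norm 50 gets rescaled to norm 5.
clipped = clip_by_global_norm([np.array([30.0, 40.0])], tau=5.0)
print(np.linalg.norm(clipped[0]))  # ≈ 5.0
```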

6.5 LSTM: a solution to vanishing gradients

The Long Short-Term Memory (LSTM) architecture addresses vanishing gradients by introducing a cell state \(\mathbf{c}_t\) that acts as a “memory highway.” Information can flow through the cell state with minimal transformation, allowing gradients to propagate over long distances.

An LSTM has three gates that control information flow:

Forget gate: Decides what to remove from the cell state. \[ \mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \]

Input gate: Decides what new information to add. \[ \mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \]

Output gate: Decides what to output from the cell state. \[ \mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \]

Here \([\mathbf{h}_{t-1}, \mathbf{x}_t]\) denotes concatenation and \(\sigma\) is the sigmoid function (outputting values in \((0, 1)\) that act as “soft switches”).

The cell state update is: \[ \tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \] \[ \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \]

The hidden state is: \[ \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \]

The key equation is \(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t\). When the forget gate \(\mathbf{f}_t\) is close to 1 and the input gate \(\mathbf{i}_t\) is close to 0, the cell state is simply copied: \(\mathbf{c}_t \approx \mathbf{c}_{t-1}\). This additive update (rather than multiplicative) creates a gradient highway. The gradient \(\frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_1}\) can be close to 1 even for large \(T\), as long as the forget gates stay open.
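The gate equations translate almost line for line into code. Below is a minimal NumPy sketch of a single LSTM step; the parameter layout (a dict of four weight matrices and four bias vectors) and the tiny random-weight usage example are illustrative choices, not a standard API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t, params):
    """One LSTM update following the gate equations above.

    params maps names to arrays: W_f, W_i, W_o, W_c have shape (d, d + n); biases have length d.
    """
    hx = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ hx + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ hx + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ hx + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ hx + params["b_c"])   # candidate values
    c = f * c_prev + i * c_tilde                            # additive cell-state update
    h = o * np.tanh(c)
    return h, c

# Tiny usage example: d = 3 hidden units, n = 2 inputs, small random weights.
rng = np.random.default_rng(0)
d, n = 3, 2
params = {k: 0.1 * rng.standard_normal((d, d + n)) for k in ("W_f", "W_i", "W_o", "W_c")}
params.update({k: np.zeros(d) for k in ("b_f", "b_i", "b_o", "b_c")})
h, c = lstm_step(np.zeros(d), np.zeros(d), rng.standard_normal(n), params)
print(np.round(h, 3), np.round(c, 3))
```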

6.5.1 The gradient flow in LSTM

Let’s trace the gradient through the cell state, considering only the direct path from \(\mathbf{c}_{t-1}\) to \(\mathbf{c}_t\) (treating the gate values and the candidate \(\tilde{\mathbf{c}}_t\) as constants). Along this path:

\[ \frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t) \]

The Jacobian is just a diagonal matrix of forget gate values. If \(\mathbf{f}_t \approx \mathbf{1}\) (forget gate fully open), then:

\[ \frac{\partial \mathbf{c}_T}{\partial \mathbf{c}_1} = \prod_{t=2}^T \text{diag}(\mathbf{f}_t) \approx \mathbf{I} \]

Compare this to the vanilla RNN where we had products of \(\mathbf{W}_h\) matrices. The LSTM’s additive cell state update transforms the gradient from a product of matrices (which can vanish or explode) to a product of diagonal matrices whose entries are forget-gate values in \((0, 1)\), values the network can learn to push close to 1.

This is not a complete solution. If the forget gates are consistently below 1, gradients still eventually vanish. But LSTMs can learn to keep forget gates open for important information, allowing much longer effective memory than vanilla RNNs.
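A quick back-of-the-envelope check makes this concrete: if every forget gate held a constant value \(f\), the gradient surviving through the cell state after 100 steps would be roughly \(f^{100}\) (ignoring the other paths through the network):

```python
# Gradient surviving through the cell state after T = 100 steps,
# assuming every forget gate holds a constant value f (other paths ignored).
for f in (0.99, 0.9, 0.6):
    print(f"f = {f:.2f}:  f**100 = {f ** 100:.2e}")
# f = 0.99:  f**100 = 3.66e-01
# f = 0.90:  f**100 = 2.66e-05
# f = 0.60:  f**100 = 6.53e-23
```

A forget gate near 1 preserves most of the signal over 100 steps; even a modestly smaller gate wipes it out.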

6.6 The limitations of recurrence

LSTMs and the related GRU (Gated Recurrent Unit) architecture significantly improved sequence modeling. They enabled breakthroughs in machine translation, speech recognition, and text generation. But they have fundamental limitations:

Sequential computation. An RNN must process tokens one at a time because \(\mathbf{h}_t\) depends on \(\mathbf{h}_{t-1}\). For a sequence of length \(T\), we need \(T\) sequential steps. This can’t be parallelized, making training slow on modern GPUs that excel at parallel computation.

Long-range dependencies remain hard. Even LSTMs struggle with very long sequences. The gradient must still flow through many timesteps, and information must be compressed into a fixed-size hidden state. Empirically, LSTMs work well for dependencies spanning tens or low hundreds of tokens, but struggle beyond that.

The bottleneck problem. All information about the past must be compressed into the hidden state vector \(\mathbf{h}_t\). For tasks like machine translation, the encoder must compress an entire source sentence into a single vector before the decoder can start. Important details get lost.

Consider translating a long English sentence to French. An RNN encoder produces a single vector representing the whole sentence. The decoder must reconstruct the French sentence from this compressed representation. Early words in the source get processed first, potentially getting “overwritten” by later words in the hidden state.

These limitations motivated the search for architectures that could:

  1. Process sequences in parallel rather than sequentially
  2. Directly connect any position to any other position, regardless of distance
  3. Avoid compressing everything into a fixed bottleneck

The solution was attention: a mechanism that lets the model dynamically focus on different parts of the input. Attention was first used as an add-on to RNNs, dramatically improving machine translation. The transformer architecture then showed that attention alone, without any recurrence, could achieve state-of-the-art results while being much faster to train.

In the next part of this book, we’ll build up to the transformer by first understanding embeddings (how we represent discrete tokens as vectors) and then developing the attention mechanism in detail.

6.7 Summary

We’ve seen that:

  • Sequential data has properties (variable length, order dependence, long-range dependencies) that standard feedforward networks can’t handle well.
  • RNNs process sequences by maintaining a hidden state that updates at each timestep, with shared weights across time.
  • Training RNNs involves backpropagation through time, which leads to products of many matrices.
  • These products cause gradients to vanish or explode, making long-range dependencies hard to learn.
  • LSTMs mitigate vanishing gradients with additive cell state updates and gating mechanisms.
  • But recurrence has fundamental limitations: strictly sequential computation, continued difficulty with very long-range dependencies, and information bottlenecks.

The transformer architecture, which we’ll build toward over the next several chapters, addresses all these limitations by replacing recurrence with attention.