Vectors of meaning

A mathematical journey into transformers

Author: Valentin Radu

Published: February 5, 2026

Preface

This book is a mathematical journey into the architecture that powers modern language models. We start with the basics: What is a vector? What does a derivative measure? These simple questions become the foundation for increasingly sophisticated machinery. By the final chapters, we derive the complete transformer equations with explicit forward and backward propagation, every gradient computed, every dimension specified. The pace accelerates: early chapters move slowly through the fundamentals, while later chapters synthesize everything into a unified mathematical framework.

What this book is

Vectors of meaning takes a math-first approach. Every concept is derived rigorously, with explicit dimensions, indices, and gradient computations. We show every step: no operation is left as an exercise, no mechanism assumed familiar. By the final chapter, a reader will have the complete mathematical specification to implement a transformer from scratch.

The book progresses in three parts:

Part I: Foundations establishes the mathematical prerequisites. We cover linear algebra (vectors, matrices, eigenvalues, and the geometric intuition behind them), calculus (derivatives, the chain rule, and backpropagation), probability (distributions, expectations, and information theory), notation conventions, neural network basics (neurons, layers, loss functions, gradient descent), and sequence modeling challenges (why recurrent networks struggle and what transformers solve).

Part II: Building blocks develops the components that make transformers work. We explore embeddings (how discrete tokens become continuous vectors), the attention mechanism (weighted combinations based on relevance), self-attention (tokens attending to each other), multi-head attention (parallel attention with different perspectives), and positional encoding (injecting sequence order into permutation-invariant attention).

Part III: The transformer architecture brings everything together. We derive the complete encoder-decoder transformer with full forward and backward propagation, examine training objectives (language modeling, masked language modeling, RLHF), and explore scaling laws (the mathematical relationships between compute, data, parameters, and performance).
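As a small preview of where Parts II and III are headed, the scaled dot-product attention at the core of the architecture can be written compactly (the symbols \(n\), \(d_k\), and \(d_v\) below follow common usage and are placeholders here; the book's own conventions are fixed in Chapter 4, and the full derivation with gradients comes later):

\[
\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V},
\qquad \mathbf{Q}, \mathbf{K} \in \mathbb{R}^{n \times d_k},\quad \mathbf{V} \in \mathbb{R}^{n \times d_v},
\]

where \(n\) is the sequence length and \(d_k\), \(d_v\) are the key and value dimensions.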

Who this book is for

This book is for readers who want to understand transformers through their actual mathematics. It’s for students and researchers building rigorous foundations, engineers moving beyond API calls to genuine understanding, and anyone curious about how language models work at a fundamental level.

If you’ve read papers and found yourself lost in notation, or watched explanations that glossed over the details, this book provides the complete formal treatment.

Prerequisites

We assume familiarity with:

  • Linear algebra: vectors, matrices, matrix multiplication
  • Calculus: derivatives, partial derivatives, the chain rule
  • Probability: random variables, distributions, expectations

We review these topics in the foundations section, but they shouldn’t be entirely new. Some comfort with mathematical notation helps, though we define everything we use.

How to read this book

Each chapter builds on previous ones. The foundations establish tools we use throughout; the building blocks are assembled into the complete architecture. Reading linearly works best, though readers comfortable with the prerequisites might skim Part I.

Mathematical derivations include explicit dimensions and worked examples. We prioritize clarity over brevity: if a step isn’t obvious, we show it. Code appears sparingly and only where it genuinely clarifies a concept that mathematics alone cannot.

The book uses a consistent notation system detailed in Chapter 4. Bold uppercase (\(\mathbf{W}\)) denotes matrices, bold lowercase (\(\mathbf{x}\)) denotes vectors, and regular font denotes scalars. Derivatives use Leibniz notation (\(\frac{df}{dx}\)) throughout.
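As a brief illustration of these conventions (the specific symbols and dimensions here are illustrative, not definitions from later chapters), a single linear map and its derivative would appear as:

\[
\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b},
\qquad \mathbf{W} \in \mathbb{R}^{m \times n},\quad \mathbf{x} \in \mathbb{R}^{n},\quad \mathbf{b}, \mathbf{y} \in \mathbb{R}^{m},
\qquad \frac{d\mathbf{y}}{d\mathbf{x}} = \mathbf{W}.
\]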


We begin with the mathematics that underlies everything: linear algebra.

Citation

If you use this book in your work, please cite it as:

Radu, V. (2024). Vectors of meaning: A mathematical journey into transformers. https://vom.radval.me. DOI: 10.5281/zenodo.18490032

@book{radu2024vectors,
  title     = {Vectors of meaning: A mathematical journey into transformers},
  author    = {Radu, Valentin},
  year      = {2024},
  url       = {https://vom.radval.me},
  doi       = {10.5281/zenodo.18490032}
}