14 Scaling laws
After completing this chapter, you will be able to:
- State the power law relationships between loss and compute, data, and parameters
- Apply Chinchilla-optimal ratios to determine model size given compute budget
- Explain emergent capabilities and why they appear at specific scales
- Identify practical limits to scaling: data constraints, compute costs, diminishing returns
- Use scaling laws to predict model performance and plan training runs
One of the most important discoveries in deep learning is that language model performance follows predictable mathematical relationships as we increase compute, data, and parameters (Kaplan et al. 2020). These scaling laws guide how to allocate resources and predict capabilities of future models.
14.1 The empirical observation
When we train language models across many orders of magnitude in size, a striking pattern emerges: loss decreases as a power law with each resource. This was not predicted from theory. Researchers discovered it by training hundreds of models at different scales and plotting the results.
14.1.1 How the laws were discovered
In January 2020, researchers at OpenAI published “Scaling Laws for Neural Language Models” (Kaplan et al. 2020). They trained over 400 transformer language models ranging from 768 to 1.5 billion non-embedding parameters on datasets of 22 million to 23 billion tokens, varying model width, depth, batch size, and learning rate systematically.
When they plotted test loss against model size on logarithmic axes, the points fell on straight lines. A straight line on log-log axes indicates a power law relationship: \(L = aN^{-\alpha}\), or equivalently \(\log L = \log a - \alpha \log N\). The slope gives the exponent \(\alpha\).
This was surprising. There was no theoretical reason to expect such clean relationships across the entire range of model sizes studied. Yet the pattern held consistently.
14.1.2 Loss versus parameters
For a model with \(N\) parameters trained on sufficient data:
\[ L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \]
The constants come from fitting curves to empirical data:
\(N_c \approx 8.8 \times 10^{13}\): A fitted scale constant with no direct physical interpretation. At \(N = N_c\) the formula would predict a loss of exactly 1 nat, and the value itself (roughly 88 trillion parameters) is far beyond any model trained to date. It is simply a constant that makes the formula match observations.
\(\alpha_N \approx 0.076\): This exponent determines how fast loss decreases with scale. The value 0.076 means doubling parameters reduces loss by a factor of \(2^{0.076} \approx 1.054\), roughly 5% improvement. This small exponent explains why we need enormous scale increases for meaningful gains.
These specific numbers came from fitting to OpenAI’s training runs on their specific data (WebText, a curated web scrape). Different data, different tokenization, or different architectures would yield different constants. The power law form appears universal; the specific coefficients are not.
14.1.3 Loss versus data
For a model trained on \(D\) tokens with sufficient parameters:
\[ L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \]
\(D_c \approx 5.4 \times 10^{13}\): Another fitted constant with no direct physical interpretation. Like \(N_c\), it simply sets the scale of the law: at \(D = D_c\) the formula would predict a loss of 1 nat.
\(\alpha_D \approx 0.095\): Slightly larger than \(\alpha_N\), meaning data scales somewhat more efficiently than parameters. Doubling data gives roughly 7% loss reduction.
The methodology was the same: train models with fixed size on varying amounts of data, plot loss versus data tokens on log-log axes, fit a line.
14.1.4 Loss versus compute
For optimal allocation of a compute budget \(C\):
\[ L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} \]
\(C_c \approx 3.1 \times 10^8\): Another fitted scale constant with no direct physical interpretation. Its numerical value depends entirely on the units chosen to measure compute, so it should not be read as a meaningful compute threshold.
\(\alpha_C \approx 0.050\): The smallest exponent, because compute compounds both parameters and data. Each 10x in compute gives about 12% loss reduction.
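As a sanity check, here is a minimal Python sketch that evaluates the parameter and data laws using the fitted constants quoted above. The constants are specific to the Kaplan et al. setup, so treat the outputs as illustrative rather than as predictions for other datasets or tokenizers; the function names are ours.

```python
# Evaluate the fitted parameter and data scaling laws quoted above.
# Constants are specific to Kaplan et al. (2020); outputs are illustrative.

N_C, ALPHA_N = 8.8e13, 0.076   # parameter law: L(N) = (N_C / N) ** ALPHA_N
D_C, ALPHA_D = 5.4e13, 0.095   # data law:      L(D) = (D_C / D) ** ALPHA_D

def loss_from_params(n_params: float) -> float:
    """Predicted loss (nats/token) for n_params parameters and ample data."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_tokens(n_tokens: float) -> float:
    """Predicted loss (nats/token) for n_tokens of data and ample parameters."""
    return (D_C / n_tokens) ** ALPHA_D

for n in [1e9, 2e9, 4e9]:
    print(f"N={n:.0e}: L={loss_from_params(n):.3f}")

# Each doubling of N multiplies the loss by 2**-0.076 ~= 0.949 (about 5% lower);
# each doubling of D multiplies it by 2**-0.095 ~= 0.936.
print("per-doubling factors:", 2 ** -ALPHA_N, 2 ** -ALPHA_D)
```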
14.1.5 Why these specific numbers?
The honest answer: we do not know why these exponents take these values. The power law form suggests deep regularities in how neural networks learn from data, but no first-principles theory predicts \(\alpha_N = 0.076\) rather than 0.1 or 0.05.
Several observations constrain the values:
Exponents well below 1: If \(\alpha\) were close to 1, each doubling of a resource would roughly halve the loss. Instead we observe exponents near 0.05-0.1, which means strongly diminishing returns.
Exponents greater than 0: If \(\alpha \leq 0\), scaling would not help. Empirically, bigger models perform better.
Similar magnitudes: All three exponents are between 0.05 and 0.1. This may reflect that the fundamental bottleneck is similar regardless of which resource we scale.
The universality across architectures is remarkable. GPT-style decoders, BERT-style encoders, and various modifications all show power law scaling with similar exponents. This suggests the laws capture something about learning from text, not about specific architectural choices.
14.1.6 Chinchilla revisions
In 2022, DeepMind’s Chinchilla paper (Hoffmann et al. 2022) revisited these measurements with more careful experiments. They found slightly different exponents and, more importantly, different optimal tradeoffs between parameters and data. The original OpenAI work suggested making models as large as possible; Chinchilla showed that balanced scaling of parameters and data is more efficient.
This illustrates an important point: the specific constants are empirical measurements subject to revision. The qualitative finding (power law scaling with diminishing returns) is robust; the exact numbers depend on experimental details.
14.2 The unified scaling law
These individual relationships combine into a single formula:
\[ L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D} \]
This captures how loss depends jointly on model size and data. The formula reveals diminishing returns: as you scale one resource, the benefit decreases unless you also scale the other.
14.3 Compute-optimal training
Given a fixed compute budget, how should we split it between model size and training tokens?
14.3.1 The Chinchilla finding
Early scaling work suggested making models as large as possible for a given compute budget. But Hoffmann et al. (Hoffmann et al. 2022) showed this is suboptimal. They found that parameters and training tokens should scale equally:
\[ N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5} \]
For a compute budget \(C\), the optimal parameter count and token count both grow as \(\sqrt{C}\): a 4x larger budget means a 2x larger model trained on 2x more tokens.
14.3.2 The practical rule
A useful approximation: train on roughly 20 tokens per parameter. A 7 billion parameter model should train on about 140 billion tokens. A 70 billion parameter model should train on about 1.4 trillion tokens.
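A small sketch of this rule of thumb in code, using the common \(C \approx 6ND\) FLOP approximation discussed later in this chapter (Section 14.9.4); the function names are ours and the outputs are coarse planning figures, not guarantees.

```python
# Rule-of-thumb helper for the "roughly 20 tokens per parameter" recipe.
# The FLOP estimate uses the common C ~= 6 * N * D approximation.

def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Training tokens suggested by the compute-optimal rule of thumb."""
    return tokens_per_param * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

for n in [7e9, 70e9]:
    d = chinchilla_tokens(n)
    print(f"{n:.0e} params -> {d:.1e} tokens, ~{train_flops(n, d):.1e} FLOPs")
# 7e9 params  -> ~1.4e11 tokens, ~5.9e21 FLOPs
# 7e10 params -> ~1.4e12 tokens, ~5.9e23 FLOPs
```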
This overturned previous practice. GPT-3 (Brown et al. 2020) trained 175B parameters on only 300B tokens, far below the compute-optimal ratio. Chinchilla (70B parameters, 1.4T tokens) outperformed the 280B-parameter Gopher trained with a comparable compute budget, using 4x fewer parameters and making inference much cheaper.
14.3.3 Why this matters
Inference cost scales with parameter count. A model trained compute-optimally has fewer parameters for the same capability, reducing deployment costs. This shifts the economics: spend more on training (one-time cost) to save on inference (ongoing cost).
14.4 What drives scaling
14.4.1 The loss decomposition
Test loss can be decomposed into irreducible entropy and reducible error:
\[ L = L_\infty + L_{reducible} \]
The irreducible entropy \(L_\infty\) represents fundamental uncertainty in the data. Consider the sentence “I flipped a coin and got ___”. No model, however large, can predict whether the next word is “heads” or “tails” better than chance. The outcome is genuinely random given the context. Natural language contains many such unpredictable elements: which synonym an author chose, random numbers in text, names of people in stories.
Estimates suggest \(L_\infty \approx 1.5\) nats for natural language, corresponding to a perplexity of about 4.5. This means even a perfect model would be “choosing between” about 4-5 equally likely options on average. The reducible error \(L_{reducible}\) is everything above this floor, and it decreases with scale following the power laws.
14.4.2 What is a power law?
Before asking why scaling follows power laws, we should understand what a power law actually is and how it differs from other relationships.
A power law relates two quantities where one is proportional to the other raised to some power:
\[ y = a \cdot x^b \]
Here \(a\) is a constant multiplier and \(b\) is the exponent. In scaling laws, we typically write this as \(L = (N_c/N)^\alpha\), which is equivalent with \(a = N_c^\alpha\) and \(b = -\alpha\).
Contrast with linear relationships: In a linear relationship \(y = ax\), the increase in \(y\) is proportional to the increase in \(x\): going from \(x = 10\) to \(x = 20\) adds exactly as much to \(y\) as going from \(x = 100\) to \(x = 110\), even though the first change is a doubling and the second is only a 10% increase.
Contrast with exponential relationships: In an exponential \(y = a \cdot b^x\), each unit increase in \(x\) multiplies \(y\) by a fixed factor. Exponentials grow (or decay) extremely fast. Moore’s law (transistors doubling every 18 months) is exponential in time.
Power laws are in between: In a power law \(y = ax^b\) with \(0 < b < 1\), doubling \(x\) multiplies \(y\) by \(2^b\), a fixed factor less than 2. This is “sublinear” growth: you get improvement, but with diminishing returns. Each doubling helps less in absolute terms.
A concrete example: Suppose loss scales as \(L = 20 \cdot N^{-0.1}\).
| Parameters \(N\) | Loss \(L\) | Improvement from previous |
|---|---|---|
| 1 million | 5.01 | - |
| 10 million | 3.98 | 21% |
| 100 million | 3.16 | 21% |
| 1 billion | 2.51 | 21% |
| 10 billion | 2.00 | 21% |
Each 10x increase in parameters gives the same percentage improvement (about 21%), but the absolute improvement shrinks: from 5.01 to 3.98 is a drop of 1.03, but from 2.51 to 2.00 is only 0.51. This is the “diminishing returns” character of power laws.
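The numbers in the table are easy to check; the short Python sketch below evaluates \(L = 20 \cdot N^{-0.1}\) at each scale and prints the constant relative improvement alongside the shrinking absolute one.

```python
# Reproduce the worked example: L = 20 * N ** -0.1.
# Each 10x in N multiplies the loss by 10 ** -0.1 ~= 0.794, a constant ~21%
# relative improvement, while the absolute improvement keeps shrinking.

a, b = 20.0, -0.1
prev = None
for n in [1e6, 1e7, 1e8, 1e9, 1e10]:
    loss = a * n ** b
    rel = "-" if prev is None else f"{100 * (1 - loss / prev):.0f}%"
    print(f"N={n:>8.0e}  L={loss:.2f}  improvement={rel}")
    prev = loss
```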
The log-log signature: Power laws have a distinctive visual signature. If you plot \(y\) vs \(x\) on regular axes, you see a curve that drops steeply at first then flattens (for negative exponents like in scaling laws). But if you plot \(\log y\) vs \(\log x\), you get a straight line:
\[ \log y = \log a + b \log x \]
This is why researchers always plot scaling results on log-log axes (Figure 14.1). A straight line confirms power law behavior; the slope gives the exponent directly.
Why are power laws special? Power laws are “scale-free.” The relationship looks the same whether you are at small scale or large scale. If you zoom in on any part of a log-log plot of a power law, it looks identical to any other part. This self-similarity suggests the underlying process has no characteristic scale.
Compare this to a relationship with a characteristic scale, like \(y = e^{-x/x_0}\). The behavior changes fundamentally around \(x = x_0\). Power laws have no such transition point (until other effects intervene, like the irreducible loss floor).
14.4.3 Why power laws in neural networks?
Power laws appear throughout nature: earthquake magnitudes, city populations, word frequencies, species abundances. When we see a power law, it usually indicates some underlying scale-free process. But why would neural network training produce power laws?
Several theories have been proposed. None is fully satisfactory, but together they illuminate different aspects of the phenomenon.
The manifold hypothesis
Natural language does not fill the entire space of possible token sequences. Most random sequences are gibberish. Meaningful text lies on a lower-dimensional structure within the high-dimensional space of all possible sequences.
Imagine the space of all 1000-token sequences. With vocabulary 50,000, this space has \(50000^{1000}\) points. But coherent English text occupies a tiny fraction of this space. The “manifold hypothesis” says this fraction forms a smooth, lower-dimensional surface.
A neural network approximates this manifold. A small network can only capture a crude approximation, like fitting a plane to a curved surface. As we add parameters, the network can represent finer details: curves, bumps, wrinkles. The error in approximating a smooth manifold typically decreases as a power of the approximation capacity.
Why power law specifically? If the manifold has intrinsic dimension \(d\) and we approximate it with a model of capacity \(N\), approximation theory suggests error scales as \(N^{-\alpha}\) where \(\alpha\) depends on the smoothness of the manifold and the dimension. For sufficiently smooth manifolds, this gives power law scaling.
The theory has limitations. We do not know the actual dimension or smoothness of the “language manifold.” The theory predicts that \(\alpha\) should depend on these properties, but empirically the exponent is remarkably consistent across different data distributions. This suggests something more universal is at play.
The random feature perspective
Another view focuses on what happens inside the network. A randomly initialized neural network already computes a large set of random features (nonlinear combinations of inputs). Training selects which features to use for prediction.
Consider a network with \(N\) parameters computing \(\sim N\) random features. Some features are useful for predicting text; most are not. As \(N\) increases, we get more features, and by chance, some of the new features are useful. The probability of finding a useful feature among random ones often follows power law statistics.
More precisely: suppose the “usefulness” of random features follows a heavy-tailed distribution, where a few features are very useful and most are nearly useless. This is plausible because useful features (like “detects question syntax” or “tracks subject-verb agreement”) are specific, while random features are generic. The number of useful features you find among \(N\) random ones scales as \(N^\alpha\) for some \(\alpha < 1\).
This explains why the exponent is less than 1: doubling parameters does not double the number of useful features, because useful features are rare. It also suggests the exponent should be universal, depending on the statistics of useful features rather than the specific task.
The loss landscape perspective
Training a neural network means minimizing a loss function over a high-dimensional parameter space. The geometry of this “loss landscape” affects what solutions we find.
Small networks have rugged loss landscapes with many local minima separated by high barriers. The optimizer gets stuck in mediocre solutions. Large networks have smoother landscapes where minima are connected by low-loss paths. The optimizer can find better solutions.
Why does landscape smoothness improve with scale? One argument: in high dimensions, most directions are “neutral” (neither uphill nor downhill). Saddle points are more common than local minima. A large network has so many parameters that it can almost always find a direction to escape bad regions. The loss landscape becomes a gentle slope toward good solutions rather than a maze of traps.
This connects to the observation that larger models are easier to train, not harder. Learning rate and other hyperparameters transfer across scales. If the landscape became more complex with scale, we would expect training to become more difficult, but the opposite occurs.
The statistical mechanics perspective
Physicists have studied power laws for over a century in the context of phase transitions and critical phenomena. Systems at criticality (the boundary between two phases, like water at the boiling point) exhibit power law correlations.
Some researchers propose that neural networks during training operate near a critical point. The network balances between underfitting (too simple, high bias) and overfitting (too complex, high variance). At this boundary, power law scaling emerges naturally.
The analogy goes further. In statistical mechanics, power laws arise when the system has no characteristic scale. A network near the interpolation threshold (just enough capacity to fit the training data) might similarly be “scale-free,” with features at all sizes contributing to the prediction.
This theory makes a specific prediction: the scaling exponent should be related to “critical exponents” that characterize the universality class of the learning process. Different architectures might fall into different universality classes, but within a class, the exponent should be fixed. This matches the observation that transformers of different sizes and configurations show similar exponents.
The data structure hypothesis
Perhaps the power law comes not from the model but from the data. Natural language has power law statistics at multiple levels: word frequencies follow Zipf’s law, phrase frequencies decay as power laws, topic distributions are heavy-tailed.
If learning proceeds by capturing patterns from most common to least common, and pattern frequencies follow a power law, then the rate of improvement might inherit this power law structure. A model of capacity \(N\) can capture patterns down to some frequency threshold. As \(N\) increases, the threshold drops, capturing rarer patterns. The reduction in loss depends on how much probability mass lies in the newly captured patterns.
Under Zipfian statistics, this gives power law scaling with an exponent determined by the Zipf exponent. English has Zipf exponent approximately 1, which would predict a specific scaling exponent. The observed exponents are in the right ballpark, though the detailed predictions do not match perfectly.
Synthesis: no single explanation
Each theory captures something real:
- The manifold hypothesis explains why larger models generalize better
- The random feature view explains why useful capacity grows sublinearly
- The loss landscape theory explains why larger models are easier to train
- Statistical mechanics provides a framework for universality
- Data structure explains why the exponent might be consistent across tasks
The remarkable fact is that all these mechanisms, arising from different aspects of learning, conspire to produce nearly the same power law exponents. This suggests an underlying unity we do not yet understand.
The exponents \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\) are not predicted by any theory. They are measured. A complete theory of deep learning would derive these numbers from first principles. We are far from that today.
14.4.4 The irreducible loss
The scaling laws predict that loss approaches zero as resources approach infinity. This cannot be literally true. The irreducible loss \(L_\infty\) sets a floor.
We can write the full scaling law as:
\[ L(N, D) = L_\infty + \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} \]
For current models, \(L_\infty\) is small compared to the reducible terms, so we often ignore it. But as models approach human-level performance, the irreducible term will dominate. Further scaling will yield diminishing returns as we bump against fundamental unpredictability in text.
Estimating \(L_\infty\) is difficult because it requires extrapolating current trends far beyond observed scales. Different estimation methods give values between 1.0 and 1.8 nats. This uncertainty matters for predicting ultimate model capabilities.
14.5 Emergent capabilities
Beyond smooth loss improvements, scaling produces qualitative jumps in capability (Wei et al. 2022). These “emergent” abilities appear suddenly at specific scales, surprising researchers who expected gradual improvement.
14.5.1 Phase transitions
Some abilities appear suddenly at specific scales:
Multi-digit arithmetic: Models below 10 billion parameters fail at three-digit addition, performing near random chance. Above this threshold, accuracy jumps to over 80%. There is no gradual improvement; the capability switches on.
Chain-of-thought reasoning: When prompted to “think step by step,” small models ignore the instruction or produce incoherent steps. Large models (roughly 100B+ parameters) suddenly use the steps productively, improving accuracy on math and logic problems by 20-50 percentage points.
In-context learning: The ability to learn new tasks from a few examples in the prompt. Small models treat examples as context but do not generalize. Large models extract the pattern and apply it to new instances.
Word unscrambling: Given “elppa” and asked for the unscrambled word, small models fail entirely. Large models succeed, suggesting they can mentally manipulate token sequences.
Logical fallacy detection: Identifying invalid arguments requires understanding both the logical structure and the content. This emerges around 50-100B parameters.
The pattern is consistent: performance is flat (near random) below a threshold, then jumps sharply. The transition occurs over less than one order of magnitude in model size, sometimes just a 2-3x increase.
14.5.2 Why emergence happens
Several hypotheses explain sudden capability emergence:
Circuit formation: Perhaps a capability requires multiple components working together. A model might separately learn “parse numbers,” “apply addition algorithm,” and “format output.” Only when all components are present does the capability work. Below the threshold, one component is missing. Above it, the circuit is complete.
Representation phase transitions: The internal representations might undergo qualitative changes at scale. Small models represent words; larger models might represent abstract concepts, relationships, or reasoning patterns. When the representation crosses a complexity threshold, new capabilities become possible.
Task decomposition: Complex tasks decompose into subtasks. A model might need to solve subtasks reliably before the full task works. If subtask accuracy is 60%, and the full task requires three subtasks, success rate is \(0.6^3 = 22\%\). Improving subtask accuracy to 90% gives \(0.9^3 = 73\%\). Small improvements in components yield large improvements in composite tasks.
Measurement artifacts: A controversial view holds that emergence is partly an illusion of how we measure. If we use exact-match accuracy (right/wrong) as our metric, a task that requires every step of a long chain to be correct will show near-zero accuracy until per-step accuracy is high, then jump sharply. Smoother metrics like log-probability might show gradual improvement where accuracy shows a jump.
Research suggests both perspectives have merit. Some emergence is real (qualitative changes in what models can do), some is measurement artifact (smooth underlying improvement appearing discontinuous due to threshold metrics).
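To make the measurement-artifact argument concrete, here is a minimal sketch. The per-step accuracies, the notion of "scale," and the number of steps are made-up illustrative values, not measurements; the point is only that a smooth per-step curve can look like a sudden jump under an exact-match metric.

```python
# A minimal sketch of the measurement-artifact argument: assume per-step
# success probability improves smoothly with scale, and a task counts as
# solved only if all k steps are correct. The smooth underlying curve then
# looks like a sudden jump under exact-match accuracy. The per-step values
# below are hypothetical, not fitted to any real model.

k = 8  # number of steps the task requires (hypothetical)
for scale, per_step in [(1e8, 0.60), (1e9, 0.75), (1e10, 0.90), (1e11, 0.97)]:
    exact_match = per_step ** k
    print(f"scale={scale:.0e}  per-step={per_step:.2f}  exact-match={exact_match:.3f}")

# Per-step accuracy rises gradually (0.60 -> 0.97), but exact-match accuracy
# goes from ~0.02 to ~0.78, which reads as an abrupt emergence.
```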
14.5.3 Predicting emergence
We cannot reliably predict when capabilities will emerge. This is a major challenge for AI safety and planning.
Loss decreases smoothly and predictably. We can forecast the loss of a 10x larger model with reasonable accuracy. But we cannot forecast what new capabilities that loss improvement will unlock. A model with 5% lower loss might have no new abilities, or it might suddenly solve a class of problems it previously failed on completely.
Several approaches attempt to predict emergence:
Extrapolating capability curves: If we measure a capability at multiple scales, we might extrapolate when it will reach useful levels. But this requires the capability to show some signal below the emergence threshold, which is often not the case.
Proxy tasks: Sometimes simpler versions of a task show gradual improvement. If “two-digit addition” improves gradually, we might predict when “five-digit addition” will emerge. But the relationship between proxy and target tasks is often unclear.
Theoretical analysis: Understanding why a capability requires scale might predict when it emerges. If we knew that chain-of-thought requires a certain minimum context window or representation capacity, we could predict the threshold. But we rarely have such understanding.
Empirical discovery: In practice, we discover emergent capabilities by training large models and testing them. This is expensive and reactive. We often do not know what a model can do until we try.
The unpredictability of emergence is one reason AI development is hard to forecast. Scaling laws tell us loss will decrease, but not what that means for real-world capabilities.
14.6 Limits to scaling
Scaling laws suggest that we could keep improving models forever by adding more compute, data, and parameters. In practice, we face hard limits on each resource.
14.6.1 Data constraints
The Chinchilla-optimal recipe requires roughly 20 tokens of training data per parameter. A 1 trillion parameter model needs 20 trillion tokens. Where does this data come from?
The internet: Common Crawl, a snapshot of the public web, contains roughly 100 trillion tokens. But most of this is low quality: spam, duplicate content, machine-generated text, navigation menus, cookie notices. After filtering for quality, perhaps 10-20 trillion tokens of useful text remain.
Books: Project Gutenberg, library digitization efforts, and commercial ebooks provide perhaps 100 billion tokens of high-quality, edited prose. This is a small fraction of web data but higher quality per token.
Code: GitHub and other repositories contain trillions of tokens of code. Code is highly structured and teaches models about logic, syntax, and precise instruction-following. Most frontier models train on significant code despite targeting natural language.
Scientific literature: Academic papers, patents, and technical documents provide specialized knowledge. PubMed alone contains billions of tokens of biomedical text.
Curated datasets: Wikipedia, StackOverflow, Reddit (with filtering), and other curated sources provide moderate amounts of high-quality text.
The total pool of quality text is finite. Current estimates suggest 5-15 trillion tokens of “good” training data exist in digitized form. Models are already training on significant fractions of this. Llama 2 trained on 2 trillion tokens; GPT-4 likely used more.
Strategies for data scarcity:
Synthetic data: Use existing models to generate training data for new models. This works but risks “model collapse” if the synthetic data distribution drifts from natural text. Careful filtering is required.
Multi-epoch training: Train on the same data multiple times. Diminishing returns set in after 2-4 epochs; the model memorizes rather than generalizes. But some repetition is better than no data.
Multi-modal data: Images, video, and audio contain information that text alone does not. A model that can learn from video (roughly 100x more data than text) escapes the text data limit. This requires architectural changes.
Active data collection: Pay humans to write text specifically for training. Expensive, but produces high-quality, targeted data. Used for instruction tuning and RLHF, but too expensive for pretraining at scale.
14.6.2 Compute constraints
Training frontier models requires extraordinary computational resources.
Current scale: GPT-4 reportedly required 25,000 A100 GPUs training for approximately 3 months. At cloud rental prices of $2 per GPU-hour, the compute cost alone exceeds $100 million. Actual costs are higher due to engineering, failed runs, and infrastructure.
Hardware availability: The world produces perhaps 500,000 high-end AI GPUs per year. A single frontier training run might consume 5-10% of annual production. Supply chains, chip manufacturing capacity, and geopolitical factors limit GPU availability.
Energy consumption: A large training run consumes tens of megawatts continuously for months. Data centers require this power plus cooling. Locating facilities near cheap, abundant power is a real constraint. A 1 gigawatt data center (plausible for frontier AI by 2030) would consume as much power as a small city.
Memory and communication: Models with trillions of parameters do not fit in a single GPU’s memory (currently 80GB for high-end chips). They must be split across thousands of GPUs with high-bandwidth interconnects. The communication overhead becomes a bottleneck. Training efficiency (actual FLOPS achieved vs. theoretical peak) drops as models span more devices.
Physical limits: Moore’s law has slowed. Transistor density improvements that once doubled every 18 months now take 3+ years. New architectures (wafer-scale chips, optical interconnects, analog computing) might help, but fundamental physical limits loom within a few decades.
14.6.3 Economic constraints
Even if data and compute were physically available, the economics become challenging.
Training costs: Current frontier models cost $100M-$1B to train. If scaling laws hold, a model 100x better would require roughly $10B-$100B in compute costs. This approaches the R&D budgets of entire nations.
Diminishing returns: The power law exponent \(\alpha_C \approx 0.05\) means 10x more compute yields only 12% lower loss. To halve the loss requires \(2^{1/0.05} \approx 10^6\) times more compute. At some point, the marginal improvement per dollar becomes negligible compared to other research directions.
Inference economics: Larger models cost more to run. A model with 10x more parameters costs roughly 10x more per query. If the capability improvement is only 12%, the cost-effectiveness of serving the model decreases. Training a compute-optimal (smaller) model often dominates training the largest possible model.
Opportunity cost: Resources spent scaling one model are not spent on architectural innovations, dataset improvements, or new training methods. The optimal allocation between scaling and research is unclear, but pure scaling is unlikely to be optimal.
14.6.4 The data wall
Many researchers believe data is the binding constraint. We can build bigger GPUs, but we cannot create more Shakespeare or more Wikipedia. Synthetic data helps but has limits. This “data wall” may force a shift from scaling to other approaches: better architectures, improved training efficiency, or learning from less data (sample efficiency).
Whether the data wall is hard (fundamentally insurmountable) or soft (surmountable with synthetic data and multi-modal learning) is actively debated. The answer will determine the trajectory of AI development.
14.7 Practical implications
Scaling laws are not just academic curiosities. They have transformed how practitioners, researchers, and policymakers approach language models.
14.7.1 For practitioners
When designing a model for a specific task, scaling laws provide a framework for resource allocation.
Estimating compute budgets: Start with the target capability. If existing models at scale X achieve 70% accuracy on your task, and the task-specific scaling exponent suggests 10x compute yields 10% improvement, you can estimate the compute needed for 90% accuracy. This is imprecise but better than guessing.
Choosing model size: Given a compute budget \(C\) (in FLOPs), the Chinchilla-optimal model has approximately:
\[ N_{opt} \approx \sqrt{\frac{C}{120}} \approx 2.9 \times 10^{9} \times \left(\frac{C}{10^{21}}\right)^{0.5} \text{ parameters} \]
\[ D_{opt} \approx 20 \times N_{opt} \text{ tokens} \]
For example, with \(C \approx 6 \times 10^{21}\) FLOPs (a modest budget by frontier standards), the optimal model has about 7 billion parameters trained on 140 billion tokens.
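A minimal sketch of this sizing rule, combining \(D \approx 20N\) with \(C \approx 6ND\) to get \(N_{opt} \approx \sqrt{C/120}\). The function names are ours, and the outputs are coarse rules of thumb that should be refined with your own scaling fits.

```python
import math

# Sketch of the sizing rule above: combine D ~= 20 * N with C ~= 6 * N * D
# to get N_opt ~= sqrt(C / 120).

def compute_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Return (n_params, n_tokens) that roughly exhaust c_flops under C = 6*N*D."""
    n = math.sqrt(c_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

for c in [1e21, 6e21, 6e23]:
    n, d = compute_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n:.1e} params, ~{d:.1e} tokens")
# 6e21 FLOPs -> ~7e9 params, ~1.4e11 tokens (the example above)
```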
Data-limited scenarios: If you have less data than \(20N\) tokens, you are data-limited. Options include:
- Use a smaller model that matches your data
- Accept some compute inefficiency and train the larger model anyway (sometimes worthwhile if inference cost matters)
- Augment data through synthetic generation or multi-epoch training
Inference optimization: Remember that training cost is paid once; inference cost is paid per query. If your application will serve millions of queries, a smaller model trained slightly suboptimally for compute efficiency might be cheaper overall than the compute-optimal model.
Fine-tuning considerations: Scaling laws apply to pretraining. Fine-tuning on task-specific data has different dynamics. A model pretrained at scale retains its capabilities when fine-tuned with much less data. You do not need to repeat the scaling analysis for fine-tuning; use the largest pretrained model you can afford to run.
14.7.2 For researchers
Scaling laws enable efficient research by allowing extrapolation from small experiments.
Predicting performance: Train models at 1%, 3%, and 10% of your target scale. Plot loss vs. compute on log-log axes. If the points form a line, extrapolate to predict the full-scale result. This lets you estimate whether a research direction is promising before committing full resources.
Comparing methods: If method A achieves loss \(L_A\) at compute \(C\), and method B achieves \(L_B\) at the same compute, you can estimate the “effective compute multiplier” of method B: how much compute would method A need to match method B’s loss? This normalizes for the quality vs. efficiency tradeoff.
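A minimal sketch of this comparison, assuming both methods follow a compute power law with the same exponent (here the \(\alpha_C \approx 0.05\) quoted earlier); the loss values in the example are hypothetical.

```python
# Sketch of the "effective compute multiplier" comparison described above.
# Assumes both methods follow L ~ C ** -alpha with the same exponent.

def effective_compute_multiplier(loss_a: float, loss_b: float, alpha: float = 0.05) -> float:
    """How much extra compute method A would need to match method B's loss."""
    return (loss_a / loss_b) ** (1.0 / alpha)

# Method B reaches loss 2.90 where method A reaches 3.00 at the same compute:
print(effective_compute_multiplier(3.00, 2.90))   # ~2x effective compute
```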
Hyperparameter transfer: Scaling laws suggest that optimal hyperparameters (learning rate, batch size) transfer across scales with predictable adjustments. Learning rate typically scales as \(N^{-0.5}\); batch size scales as \(N^{0.5}\). This reduces the hyperparameter search space at large scale.
Identifying scaling bottlenecks: If your model scales worse than expected (exponent smaller than baseline), something is limiting scaling. This might be data quality, architectural bottlenecks, or optimization issues. Scaling experiments can diagnose problems.
Efficient ablations: To test whether a modification helps, compare scaling curves rather than single points. A modification that helps at small scale but hurts the scaling exponent will eventually underperform the baseline. Conversely, a modification that looks neutral at small scale but improves the exponent is valuable.
14.7.3 For society and policy
Scaling laws have implications beyond technical AI development.
Capability forecasting: If scaling laws hold, we can estimate when models might achieve specific capabilities. If current frontier models are at \(10^{25}\) FLOPs and a capability is expected to emerge at \(10^{27}\) FLOPs, we can estimate the timeline based on hardware and investment trends. This informs safety research priorities.
Compute governance: Compute is measurable and controllable in ways that algorithms and data are not. Understanding how compute translates to capability informs policies about compute access, export controls, and international agreements.
Economic projections: The cost of AI capabilities is predictable from scaling laws. If capability X requires \(10^{26}\) FLOPs, and compute costs fall 30% per year, we can project when X becomes economically viable for various applications.
Safety implications: Emergent capabilities mean that a model slightly larger than the current frontier might have qualitatively new abilities. This argues for careful, incremental scaling with evaluation at each step, rather than racing to the largest possible model.
Resource allocation: Understanding diminishing returns helps allocate AI investment. If the next 10x in compute yields only modest improvements, perhaps resources are better spent on data quality, alignment research, or deployment infrastructure.
14.8 Beyond loss
Loss is a convenient metric because it is smooth, continuous, and easy to measure. But we ultimately care about task performance: can the model answer questions, write code, follow instructions? The relationship between loss and task performance is complex.
14.8.1 Task performance scaling
Different tasks scale differently with model size and loss:
Knowledge-intensive tasks (trivia, factual questions) scale well with model size. Larger models memorize more facts from training data. The scaling exponent for knowledge retrieval is relatively high.
Reasoning tasks (math problems, logic puzzles) scale steeply but with higher variance. Small models fail completely; large models show rapid improvement. The emergence phenomenon is strongest for reasoning.
Pattern matching tasks (sentiment classification, language identification) scale slowly because small models already perform well. The task saturates before scaling laws matter much.
Generation quality (coherent writing, appropriate tone) scales steadily. Larger models produce more fluent, coherent, and contextually appropriate text. Human evaluations correlate with log model size.
Instruction following scales with both pretraining and instruction tuning. Larger pretrained models learn to follow instructions more easily during fine-tuning.
The relationship between loss and task accuracy is often sigmoidal. At high loss, accuracy is near chance. As loss decreases, accuracy improves slowly, then rapidly, then saturates. The “knee” of the sigmoid varies by task. This means:
- A 10% loss improvement might yield 1% accuracy improvement on an easy task (already saturated)
- The same loss improvement might yield 20% accuracy improvement on a hard task (in the steep region)
- Or 0% improvement on an impossibly hard task (below the threshold)
Predicting which tasks benefit from scaling requires understanding where each task sits on its sigmoid curve.
14.8.2 Efficiency innovations
Scaling laws describe a particular architecture (transformers) trained in a particular way (standard pretraining). Innovations can shift the curves, achieving better loss for the same compute.
Architectural improvements: Flash attention computes attention in an IO-aware way, reducing memory usage and giving 2-4x faster attention in wall-clock time without changing the result. Mixture-of-experts models activate only a fraction of parameters per token, achieving better loss per training FLOP (though serving all the experts keeps memory requirements high). Sparse attention patterns, linear attention, and state-space models each offer different tradeoffs.
Each innovation can be characterized by its “effective compute multiplier”: how much baseline compute would achieve the same loss? Flash attention might provide a 2x multiplier; mixture-of-experts might provide 4x for training (less for inference).
Training improvements: Better optimizers (AdamW, Lion), learning rate schedules (cosine decay, warmup), and regularization techniques (dropout, weight decay) improve training efficiency. These compound: a 10% improvement from the optimizer and 10% from the schedule yields 21% overall.
Data improvements: Filtering training data for quality, deduplicating, and balancing domains improves loss more than raw data quantity. A dataset that is 10x smaller but 10x higher quality might train a better model. This shifts the data scaling curve, achieving lower loss per token.
Quantization and distillation: Running models at lower precision (8-bit, 4-bit) or distilling large models into smaller ones does not improve the scaling laws themselves, but it shifts the Pareto frontier of capability versus inference cost. A distilled 7B model might match the original 30B model on many tasks at roughly 4x lower inference cost.
Important caveat: Innovations rarely change the scaling exponents. They shift the curves vertically (better loss at all scales) but the slope remains similar. A 2x efficiency improvement saves one “doubling” worth of compute but does not change how many doublings are needed for a given improvement. This is why scaling laws remain relevant even as techniques improve.
14.8.3 Multi-modal scaling
Scaling laws extend to multi-modal models (text + images, text + code, etc.), but with modifications:
- Different modalities may have different exponents
- Cross-modal tasks (image captioning, visual question answering) may scale differently than single-modal tasks
- The optimal modality mix depends on the target task
Early results suggest that multi-modal training does not violate scaling laws but adds complexity to the resource allocation problem. How much image data versus text data? How to weight losses across modalities? These questions do not yet have definitive answers.
14.9 Mathematical framework
Understanding the mathematics behind scaling laws helps us apply them correctly and understand their limitations.
14.9.1 The power law form
A power law has the form \(y = ax^b\) where \(a\) and \(b\) are constants. Power laws are “scale-invariant”: if you zoom in or out on a log-log plot, the relationship looks identical. This makes them natural for phenomena spanning many orders of magnitude.
Taking logarithms linearizes the relationship:
\[ \log L = \log a - \alpha \log N \]
On log-log axes, this is a straight line with slope \(-\alpha\) and intercept \(\log a\). The negative slope reflects that loss decreases as resources increase.
Why plot on log-log axes?: If the relationship is truly a power law, log-log plotting reveals it immediately as a straight line. If the relationship is exponential, polynomial, or some other form, it will curve on log-log axes. Log-log plots also make it easy to visualize many orders of magnitude on a single graph.
Reading the exponent: The slope of the log-log line gives the exponent directly. A slope of \(-0.076\) means doubling \(N\) multiplies \(L\) by \(2^{-0.076} = 0.949\), a 5.1% reduction. A slope of \(-0.1\) would give \(2^{-0.1} = 0.933\), a 6.7% reduction. Small changes in exponent matter when compounded over many doublings.
14.9.2 The unified scaling law
When both parameters and data vary, the loss follows:
\[ L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D} \]
This formula captures the interaction between resources. Let us unpack it.
When \(N\) is very large (effectively infinite), the first term vanishes:
\[ L(N \to \infty, D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \]
We recover the data-only scaling law. Similarly, when \(D\) is very large:
\[ L(N, D \to \infty) = \left(\frac{N_c}{N}\right)^{\alpha_N} \]
We recover the parameter-only scaling law.
When both are finite, neither resource alone limits performance. The formula interpolates smoothly between regimes.
The structure \([A + B]^{\alpha_D}\) with \(A = (N_c/N)^{\alpha_N/\alpha_D}\) and \(B = D_c/D\) has a specific meaning: it is as if we are adding two “effective data deficits.” Insufficient parameters act like insufficient data, with a conversion factor \(\alpha_N/\alpha_D \approx 0.8\) between them.
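A short sketch that evaluates this joint form with the constants quoted earlier in the chapter and checks that the two single-resource laws are recovered in the appropriate limits. The constants belong to one specific experimental setup, so the absolute numbers are illustrative only.

```python
# Evaluate the unified Kaplan-style law L(N, D) with the constants quoted
# earlier in the chapter. Purely illustrative.

N_C, ALPHA_N = 8.8e13, 0.076
D_C, ALPHA_D = 5.4e13, 0.095

def unified_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = [(N_C/N)**(alpha_N/alpha_D) + D_C/D] ** alpha_D"""
    term_n = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    term_d = D_C / n_tokens
    return (term_n + term_d) ** ALPHA_D

# When one resource is effectively unlimited, we recover the single-variable laws:
print(unified_loss(1e9, 1e18))   # ~ (N_C / 1e9) ** ALPHA_N   (parameter-limited)
print(unified_loss(1e15, 1e11))  # ~ (D_C / 1e11) ** ALPHA_D  (data-limited)
print(unified_loss(1e9, 1e11))   # both finite: worse than either limit alone
```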
14.9.3 Fitting scaling laws
To fit scaling laws empirically:
Step 1: Train models at multiple scales. Vary \(N\) (or \(D\), or both) across at least 2-3 orders of magnitude. For example, train models with 10M, 30M, 100M, 300M, 1B, and 3B parameters. More points give more reliable fits.
Step 2: Measure test loss. Use a held-out test set that the model never saw during training. The test set should be large enough that measurement noise is small. Compute loss in nats (natural log) or bits (log base 2) consistently.
Step 3: Fit in log space. Transform to \(\log L\) vs. \(\log N\). Fit a line using ordinary least squares:
\[ \log L_i = \beta_0 + \beta_1 \log N_i + \epsilon_i \]
The fitted slope \(\hat{\beta}_1 = -\hat{\alpha}\) gives the exponent. The intercept \(\hat{\beta}_0 = \log \hat{a}\) gives the prefactor.
Step 4: Validate. Hold out one or two data points from the fit. Predict their loss from the fitted law. If predictions are accurate (within a few percent), the law is reliable. If predictions are far off, the law may not hold, or more data points are needed.
Step 5: Estimate uncertainty. Bootstrap or compute standard errors on the fitted parameters. The uncertainty in \(\alpha\) translates to uncertainty in extrapolated loss. Small errors in \(\alpha\) compound when extrapolating many orders of magnitude.
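A minimal sketch of this recipe, using synthetic losses generated from a known power law plus noise so the recovered exponent can be checked against ground truth; in practice the losses would come from your own training runs.

```python
import numpy as np

# Minimal sketch of the fitting recipe above, using synthetic measurements.

rng = np.random.default_rng(0)
n_params = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9])
true_alpha, true_a = 0.076, 20.0
loss = true_a * n_params ** -true_alpha * np.exp(rng.normal(0, 0.01, n_params.size))

# Step 3: ordinary least squares in log space, log L = b0 + b1 * log N.
slope, intercept = np.polyfit(np.log(n_params), np.log(loss), deg=1)
print("fitted alpha:", -slope)            # should be close to 0.076
print("fitted prefactor:", np.exp(intercept))

# Step 4 (variant): predict at a scale outside the fitted range and compare.
print("predicted loss at N=1e10:", np.exp(intercept) * 1e10 ** slope)
```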
Common pitfalls:
Insufficient scale range: Fitting over less than two orders of magnitude often gives unreliable exponents. The curvature from irreducible loss or saturation can masquerade as different slopes.
Underfitting at small scale: Very small models may not train to convergence or may have different optimization dynamics. Exclude the smallest models if they deviate systematically.
Overfitting at large scale: If your largest model trains on so much data that it starts memorizing, the test loss may be artificially low. Ensure sufficient test data.
14.9.4 Computing optimal allocations
Given a compute budget \(C\), we want to find the optimal \(N\) and \(D\). Training a model with \(N\) parameters on \(D\) tokens requires approximately \(C = 6ND\) FLOPs, a commonly used approximation: roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass.
The constraint is \(D = C / (6N)\). Substituting into the loss formula:
\[ L(N) = L\left(N, \frac{C}{6N}\right) \]
Minimizing over \(N\) gives the optimal model size for budget \(C\). Taking derivatives and solving (algebra omitted) yields:
\[ N_{opt} \propto C^{a}, \quad D_{opt} \propto C^{1-a} \]
where \(a\) depends on the exponents \(\alpha_N\) and \(\alpha_D\). The original OpenAI paper found \(a \approx 0.73\) (favor larger models). Chinchilla found \(a \approx 0.5\) (balance parameters and data). The difference comes from different experimental setups and fitting procedures.
The practical rule “\(D \approx 20N\)” corresponds to \(a = 0.5\) (Chinchilla) with the \(C = 6ND\) compute formula.
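The optimization can also be done numerically: fix \(C\), impose \(D = C/(6N)\), and sweep \(N\) under the unified law. The sketch below uses the Kaplan-style constants quoted earlier, so it recovers a more parameter-heavy split than 20 tokens per parameter, which illustrates the point that the recommended ratio depends on the fitted constants. The grid bounds are arbitrary choices for the sketch.

```python
import numpy as np

# Sketch of the allocation calculation: fix a compute budget C, impose
# D = C / (6N), and sweep N to find the loss-minimizing split under the
# unified law with the Kaplan-style constants quoted earlier.

N_C, ALPHA_N = 8.8e13, 0.076
D_C, ALPHA_D = 5.4e13, 0.095

def unified_loss(n, d):
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

def best_split(c_flops):
    n_grid = np.logspace(7, 13, 2000)      # candidate model sizes
    d_grid = c_flops / (6.0 * n_grid)       # tokens affordable at each size
    losses = unified_loss(n_grid, d_grid)
    i = int(np.argmin(losses))
    return n_grid[i], d_grid[i], losses[i]

for c in [1e21, 1e23]:
    n, d, l = best_split(c)
    print(f"C={c:.0e}: N~{n:.1e}, D~{d:.1e}, D/N~{d/n:.0f}, L~{l:.2f}")
```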
14.9.5 Extrapolation risks
Scaling laws are empirical fits, not physical laws. Extrapolating beyond observed scales carries risks:
Phase transitions: The scaling exponent might change at some scale. Perhaps models above 10 trillion parameters enter a new regime with different dynamics. We would not know until we train such models.
Optimization breakdown: Very large models might be harder to optimize, with instabilities, gradient issues, or sensitivity to hyperparameters that do not appear at smaller scales.
Data distribution shift: If training data quality or distribution changes at scale (e.g., running out of high-quality data and using more synthetic data), the scaling law from earlier data might not apply.
Irreducible loss dominance: As models improve, the irreducible loss \(L_\infty\) becomes a larger fraction of total loss. Near this floor, the power law bends and eventually flattens.
Unknown unknowns: There might be phenomena we have not anticipated that change the picture at larger scales.
How far can we safely extrapolate? A common heuristic: extrapolate at most 1 order of magnitude beyond your largest training run. Extrapolating 2+ orders of magnitude is speculation, not prediction.
14.9.6 Confidence intervals
When presenting scaling law predictions, include uncertainty. If your fit gives \(\alpha = 0.08 \pm 0.01\), propagate this to predictions:
At \(N = 10^{12}\) parameters:
\[ L = (N_c / N)^\alpha = (8.8 \times 10^{13} / 10^{12})^{0.08} = 88^{0.08} \approx 1.43 \]
With \(\alpha = 0.07\): \(L \approx 1.37\). With \(\alpha = 0.09\): \(L \approx 1.50\).
The uncertainty grows with extrapolation distance. Here a \(\pm 0.01\) uncertainty in the exponent already produces roughly a \(\pm 5\%\) band around the predicted loss; extrapolating further from the fitted range, or accounting for uncertainty in the prefactor as well, widens the band considerably.
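A few lines reproduce this propagation; the constants and the \(\pm 0.01\) band are the ones used in the example above.

```python
# Propagate exponent uncertainty through the extrapolation above.
# Reproduces the worked numbers: alpha = 0.08 +/- 0.01 at N = 1e12.

N_C = 8.8e13
N = 1e12
for alpha in (0.07, 0.08, 0.09):
    print(f"alpha={alpha:.2f}: predicted loss = {(N_C / N) ** alpha:.2f}")
# 1.37 / 1.43 / 1.50 -- roughly a +/-5% band around the central estimate
```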
14.10 Summary
Scaling laws reveal that language model performance improves predictably with compute, data, and parameters. The key insights:
- Power laws: Loss decreases as power laws with each resource
- Compute-optimal training: Balance parameters and data (roughly 20 tokens per parameter)
- Emergent capabilities: Qualitative abilities appear at specific scales
- Limits: Data scarcity, compute costs, and diminishing returns constrain scaling
These relationships have transformed how we build language models, shifting focus from architectural innovation to efficient scaling. Understanding them is essential for anyone working with modern transformers.