Scaling Laws & Compound Optimization

Prerequisites: Part 8: Calculus (optimization, partial derivatives, Lagrange multipliers). This note provides canonical derivations for scaling discussions in TensorFlow: EfficientNet and AI in the Wild: LLMs.

Power-Law Scaling

Empirically, the test loss of a neural network follows a power law in each of three variables, provided the other two are not the bottleneck:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N$ = parameters, $D$ = training tokens, $C$ = compute (FLOPs), and $\alpha_N, \alpha_D, \alpha_C$ are scaling exponents. On a log-log plot, these appear as straight lines with slope $-\alpha$.
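The straight-line claim can be checked numerically. The sketch below generates noise-free losses from a single power law and recovers the exponent with a least-squares line fit in log-log space. The constants `X_C` and `ALPHA` and the helper names are hypothetical, chosen only for illustration; real measurements would be noisy and would need a more careful fit.

```python
import math

def power_law_loss(x, x_c, alpha):
    """Loss predicted by a pure power law L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical constants, not fitted values from any paper.
X_C = 1.0e13
ALPHA = 0.08

# Synthetic scales from 1e6 to 1e11, evenly spaced on a log axis.
scales = [10 ** (6 + 0.25 * i) for i in range(21)]
losses = [power_law_loss(x, X_C, ALPHA) for x in scales]

# log10 L = alpha * log10(x_c) - alpha * log10(x), so the slope is -alpha.
slope, intercept = fit_line(
    [math.log10(x) for x in scales],
    [math.log10(l) for l in losses],
)
fitted_alpha = -slope
fitted_x_c = 10 ** (intercept / fitted_alpha)

print(f"fitted alpha = {fitted_alpha:.4f}")
print(f"fitted x_c   = {fitted_x_c:.3e}")
```

Because the intercept equals $\alpha \log_{10} x_c$, the same fit also recovers the critical scale.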

The combined scaling law (Kaplan et al.) applies when both $N$ and $D$ limit performance:

$$\boxed{L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}}$$

The two terms add inside the bracket, so the scarcer resource dominates the loss. As $D \to \infty$ the expression reduces to $L(N)$, and as $N \to \infty$ it reduces to $L(D)$.
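The limiting behavior of the combined law can be checked numerically. This is a minimal sketch: the exponents are the Kaplan values quoted in this note, while `N_C` and `D_C` are placeholder scales chosen for illustration, not the fitted constants from the paper.

```python
import math

ALPHA_N = 0.076
ALPHA_D = 0.095
# Placeholder critical scales (assumptions, not fitted values).
N_C = 1.0e14
D_C = 1.0e14

def loss_params_only(n):
    """Single-variable law L(N) = (N_c / N) ** alpha_N."""
    return (N_C / n) ** ALPHA_N

def loss_data_only(d):
    """Single-variable law L(D) = (D_c / D) ** alpha_D."""
    return (D_C / d) ** ALPHA_D

def loss_combined(n, d):
    """Combined law L(N, D) = [(N_c / N) ** (alpha_N / alpha_D) + D_c / D] ** alpha_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

n, d = 1.0e9, 1.0e10
print("combined loss:          ", loss_combined(n, d))
print("parameter-limited bound:", loss_params_only(n))
print("data-limited bound:     ", loss_data_only(d))

# With effectively unlimited data, only the parameter term remains.
print("D -> infinity:", loss_combined(n, 1.0e30), "vs", loss_params_only(n))
# With an effectively unlimited model, only the data term remains.
print("N -> infinity:", loss_combined(1.0e40, d), "vs", loss_data_only(d))
```

The combined loss is never below either single-variable law, since both bracketed terms are positive.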

Kaplan Scaling Laws (2020)

The original OpenAI scaling laws (Kaplan et al., 2020) found for language models:

Variable	Exponent	Interpretation
Parameters $N$	$\alpha_N \approx 0.076$	10× more parameters → loss multiplied by $10^{-0.076} \approx 0.84$ (about 16% lower)
Dataset $D$	$\alpha_D \approx 0.095$	10× more data → loss multiplied by $10^{-0.095} \approx 0.80$ (about 20% lower)
Compute $C$	$\alpha_C \approx 0.050$	10× more compute → loss multiplied by $10^{-0.050} \approx 0.89$ (about 11% lower)

Key Kaplan conclusion: For a fixed compute budget, you should train a large model on a relatively small amount of data. Kaplan et al. also found that data only needs to grow sub-linearly with parameters, $D \propto N^{0.74}$, to avoid an overfitting penalty.

Chinchilla Optimal Training (2022)

Hoffmann et al. (2022) revised Kaplan's analysis and found that models should be trained on proportionally more data. The Chinchilla scaling law:

$$\boxed{L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}}$$

with fitted parameters $\alpha \approx 0.34$, $\beta \approx 0.28$, $E \approx 1.69$ (irreducible entropy), $A \approx 406.4$, $B \approx 410.7$.
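To see how the three terms contribute, the parametric form can be evaluated directly. The sketch below uses the fitted constants quoted above; the example model (1B parameters, 20B tokens) is hypothetical and chosen only to illustrate the breakdown.

```python
# Fitted constants as quoted in this note.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def loss_terms(n_params, n_tokens):
    """Return the irreducible, model-size, and data-size terms separately."""
    return E, A / n_params ** ALPHA, B / n_tokens ** BETA

# Hypothetical example model.
n_params, n_tokens = 1.0e9, 2.0e10
irreducible, model_term, data_term = loss_terms(n_params, n_tokens)

print(f"irreducible term: {irreducible:.3f}")
print(f"model-size term:  {model_term:.3f}")
print(f"data-size term:   {data_term:.3f}")
print(f"total loss:       {chinchilla_loss(n_params, n_tokens):.3f}")
```

As both $N$ and $D$ grow, the last two terms vanish and the loss approaches $E$.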

Compute-optimal allocation: Training costs roughly $C \approx 6ND$ FLOPs (about 6 FLOPs per parameter per token across the forward and backward passes). Minimize $L(N,D)$ subject to $6ND = C$ by substituting $D = C/(6N)$ and setting the derivative with respect to $N$ to zero (Lagrange multipliers give the same result):

$$\frac{d}{dN} L\!\left(N, \frac{C}{6N}\right) = 0 \implies N_{\text{opt}} \propto C^{a}, \quad D_{\text{opt}} \propto C^{b}$$

where $a = \frac{\beta}{\alpha + \beta} \approx 0.45$ and $b = \frac{\alpha}{\alpha + \beta} \approx 0.55$.

Chinchilla rule of thumb: Train with $D \approx 20N$ tokens. A 70B parameter model needs ~1.4T tokens. Kaplan-style scaling would have suggested ~300B tokens for the same model, under-training it by about 4.7×. This explains why Chinchilla (70B, 1.4T tokens) outperformed Gopher (280B, 300B tokens) at a similar compute budget.
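The arithmetic behind the rule of thumb is short enough to check directly. This sketch uses the model sizes and token counts quoted above together with the $C \approx 6ND$ approximation; the function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio D / N

def rule_of_thumb_tokens(n_params):
    """Training tokens suggested by D ~ 20 N."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute, C ~ 6 N D."""
    return 6 * n_params * n_tokens

chinchilla_params = 70_000_000_000
gopher_params = 280_000_000_000
gopher_tokens = 300_000_000_000

chinchilla_tokens = rule_of_thumb_tokens(chinchilla_params)
undertraining = chinchilla_tokens / gopher_tokens

chinchilla_flops = training_flops(chinchilla_params, chinchilla_tokens)
gopher_flops = training_flops(gopher_params, gopher_tokens)

print(f"tokens for a 70B model: {chinchilla_tokens:.2e}")
print(f"ratio to 300B tokens:   {undertraining:.2f}x")
print(f"Chinchilla compute:     {chinchilla_flops:.2e} FLOPs")
print(f"Gopher compute:         {gopher_flops:.2e} FLOPs")
```

Under this approximation the two training runs land within about 20% of each other in compute, which is why the comparison isolates the effect of the parameter/data split.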

EfficientNet Compound Scaling

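For CNNs, Tan & Le (2019) proposed scaling three dimensions simultaneously (here $\alpha$, $\beta$, $\gamma$ are per-dimension scaling bases, unrelated to the Chinchilla exponents above):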

Depth $d = \alpha^\phi$ (number of layers)
Width $w = \beta^\phi$ (channels per layer)
Resolution $r = \gamma^\phi$ (input image size)

Subject to the compute constraint:

$$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$

The exponents of 2 on $\beta$ and $\gamma$ arise because convolution FLOPs scale as $\text{depth} \times \text{width}^2 \times \text{resolution}^2$. The compound coefficient $\phi$ controls how much compute to add: total FLOPs grow as $(\alpha \cdot \beta^2 \cdot \gamma^2)^\phi \approx 2^\phi$, so each unit increase in $\phi$ approximately doubles FLOPs.

Derivation of constraint: For a convolution layer with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, kernel $k \times k$, and spatial dimension $H \times W$:

$$\text{FLOPs} \propto k^2 \cdot C_{\text{in}} \cdot C_{\text{out}} \cdot H \cdot W$$

Scaling width by $w$ multiplies both $C_{\text{in}}$ and $C_{\text{out}}$, giving $w^2$. Scaling resolution by $r$ multiplies both $H$ and $W$, giving $r^2$. Scaling depth by $d$ multiplies the number of such layers by $d$. Total: $\text{FLOPs} \propto d \cdot w^2 \cdot r^2$.
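The $d \cdot w^2 \cdot r^2$ relationship can be confirmed on a toy network. The sketch below assumes a stack of identical stride-1 convolutions with constant channel count and spatial size, which is a simplification of any real architecture; the function names and baseline sizes are illustrative.

```python
def conv_flops(k, c_in, c_out, h, w):
    """Multiply-accumulate count for one k x k convolution (stride 1, output size h x w)."""
    return k * k * c_in * c_out * h * w

def network_flops(num_layers, channels, resolution, k=3):
    """Toy network: num_layers identical conv layers at fixed channels and resolution."""
    return num_layers * conv_flops(k, channels, channels, resolution, resolution)

# Hypothetical baseline sizes.
base = network_flops(num_layers=10, channels=32, resolution=64)

# Double one dimension at a time, then all three together.
deeper = network_flops(num_layers=20, channels=32, resolution=64)
wider = network_flops(num_layers=10, channels=64, resolution=64)
sharper = network_flops(num_layers=10, channels=32, resolution=128)
all_three = network_flops(num_layers=20, channels=64, resolution=128)

print("depth x2      ->", deeper // base, "x FLOPs")
print("width x2      ->", wider // base, "x FLOPs")
print("resolution x2 ->", sharper // base, "x FLOPs")
print("all three x2  ->", all_three // base, "x FLOPs")
```

Doubling depth doubles the cost, while doubling width or resolution quadruples it, so doubling all three multiplies FLOPs by $2 \cdot 2^2 \cdot 2^2 = 32$.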

A small grid search on the EfficientNet-B0 baseline, with $\phi$ fixed at 1, found $\alpha = 1.2, \beta = 1.1, \gamma = 1.15$. With these constants fixed, increasing $\phi$ produces the larger B1–B7 models.
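With the three constants fixed, the multipliers for any $\phi$ follow directly. This is a minimal sketch of the scaling rule only: it returns relative multipliers over the baseline and does not reproduce the rounding of layer counts, channel counts, or image sizes used in the released models.

```python
import math

# Constants quoted above for the EfficientNet-B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers relative to the baseline."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

# FLOPs multiplier contributed by each unit of phi; the constraint asks for roughly 2.
per_step = ALPHA * BETA ** 2 * GAMMA ** 2
print(f"alpha * beta^2 * gamma^2 = {per_step:.4f}")

for phi in range(4):
    depth, width, resolution, flops = compound_scale(phi)
    print(
        f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
        f"resolution x{resolution:.2f}, FLOPs x{flops:.2f}"
    )
```

The product comes out slightly below 2, which is why the constraint is stated as an approximation.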

Compute-Optimal Allocation

The general question: given a fixed FLOPs budget $C$, how should we split between $N$ (model size) and $D$ (data)?

Lagrangian formulation:

$$\min_{N,D} L(N,D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta} \quad \text{s.t.} \quad 6ND = C$$

Lagrangian: $\mathcal{L} = E + AN^{-\alpha} + BD^{-\beta} + \lambda(6ND - C)$

Setting $\partial \mathcal{L}/\partial N = 0$ and $\partial \mathcal{L}/\partial D = 0$:

$$-\alpha A N^{-\alpha-1} + 6\lambda D = 0 \implies \lambda = \frac{\alpha A}{6 D N^{\alpha+1}}$$

$$-\beta B D^{-\beta-1} + 6\lambda N = 0 \implies \lambda = \frac{\beta B}{6 N D^{\beta+1}}$$

Equating the two expressions for $\lambda$:

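$$\frac{\alpha A}{D N^{\alpha+1}} = \frac{\beta B}{N D^{\beta+1}} \implies \alpha A N^{-\alpha} = \beta B D^{-\beta} \implies \frac{N^\alpha}{D^\beta} = \frac{\alpha A}{\beta B}$$

Substituting $D = C/(6N)$ from the constraint gives $N^{\alpha+\beta} = \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta}$, so the optimal split is a power of $C$:

$$N_{\text{opt}} = G \left(\frac{C}{6}\right)^{\beta/(\alpha+\beta)}, \quad D_{\text{opt}} = G^{-1} \left(\frac{C}{6}\right)^{\alpha/(\alpha+\beta)}, \quad G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}$$

That is, $N_{\text{opt}} \propto C^{\beta/(\alpha+\beta)}$ and $D_{\text{opt}} \propto C^{\alpha/(\alpha+\beta)}$.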
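The closed-form optimum can be sanity-checked against a brute-force search along the constraint. The sketch below solves the stationarity condition $\alpha A N^{-\alpha} = \beta B D^{-\beta}$ together with $6ND = C$, using the Chinchilla constants quoted earlier. The compute budget and the search grid are arbitrary choices for illustration.

```python
import math

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n, d):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n ** ALPHA + B / d ** BETA

def optimal_split(compute):
    """Closed-form minimizer of loss(n, d) subject to 6 * n * d = compute."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (compute / 6.0) ** (BETA / (ALPHA + BETA))
    d_opt = (compute / 6.0) ** (ALPHA / (ALPHA + BETA)) / g
    return n_opt, d_opt

def brute_force_split(compute, steps=20001):
    """Scan model sizes on a log grid (1e6 to 1e14) and keep the lowest-loss point."""
    best_n, best_loss = None, math.inf
    for i in range(steps):
        n = 10 ** (6 + 8 * i / (steps - 1))
        d = compute / (6.0 * n)  # spend the whole budget
        current = loss(n, d)
        if current < best_loss:
            best_n, best_loss = n, current
    return best_n, compute / (6.0 * best_n)

budget = 1.0e21  # arbitrary example budget in FLOPs
n_closed, d_closed = optimal_split(budget)
n_grid, d_grid = brute_force_split(budget)

print(f"closed form: N = {n_closed:.3e}, D = {d_closed:.3e}")
print(f"grid search: N = {n_grid:.3e}, D = {d_grid:.3e}")
```

The two answers should agree to within the grid spacing. Multiplying the budget by 10 multiplies the closed-form $N_{\text{opt}}$ by $10^{\beta/(\alpha+\beta)}$, independent of $A$ and $B$.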

Exercise

Predict Optimal Model Size

Using Chinchilla parameters ($\alpha = 0.34$, $\beta = 0.28$), compute: for a 10× increase in compute budget, by what factor should model size and data each grow? Verify that $N_{\text{opt}}$ grows by $10^{0.45} \approx 2.8\times$ and $D_{\text{opt}}$ grows by $10^{0.55} \approx 3.5\times$.
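One way to check your answer is to evaluate the exponents numerically. The sketch below uses only the two exponents given in the exercise; the variable names are illustrative.

```python
ALPHA, BETA = 0.34, 0.28

# Exponents of the compute-optimal split: N_opt ~ C**a, D_opt ~ C**b.
a = BETA / (ALPHA + BETA)
b = ALPHA / (ALPHA + BETA)

# Growth factors for a 10x larger compute budget.
growth_n = 10 ** a
growth_d = 10 ** b

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"model size grows by {growth_n:.2f}x")
print(f"data grows by       {growth_d:.2f}x")
print(f"product of factors: {growth_n * growth_d:.2f}x")
```

Since $a + b = 1$, the two growth factors always multiply to the compute increase, consistent with $C = 6ND$.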
