Power-Law Scaling
Empirically, neural network test loss follows a power law in each of three variables, provided the other two are not the bottleneck:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
where $N$ is the number of parameters, $D$ the number of training tokens, $C$ the training compute (FLOPs), $N_c, D_c, C_c$ are fitted constants, and $\alpha_N, \alpha_D, \alpha_C$ are the scaling exponents. On a log-log plot, each law is a straight line with slope $-\alpha$.
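The straight-line behavior on a log-log plot suggests a simple way to estimate an exponent: fit a line to $(\log x, \log L)$ and read off the slope. The sketch below does this with ordinary least squares on synthetic, noise-free data; the function name `fit_power_law` and the values `TRUE_ALPHA` and `TRUE_XC` are illustrative assumptions, not measurements.

```python
import math

def fit_power_law(xs, losses):
    """Least-squares line through (log x, log L); returns (alpha, x_c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(l) for l in losses]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    alpha = -slope  # the power-law exponent is minus the log-log slope
    # log L = alpha*log(x_c) - alpha*log(x), so x_c = exp(intercept / alpha)
    x_c = math.exp(intercept / alpha)
    return alpha, x_c

# Synthetic data generated from a known power law (illustrative values only)
TRUE_ALPHA, TRUE_XC = 0.08, 1e13
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [(TRUE_XC / n) ** TRUE_ALPHA for n in sizes]

alpha_hat, xc_hat = fit_power_law(sizes, losses)
print(f"recovered alpha = {alpha_hat:.4f}, x_c = {xc_hat:.3e}")
```

With real training runs the points are noisy and the fit is only meaningful over the range where the power law actually holds.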
When both $N$ and $D$ limit performance, Kaplan et al. combine the two laws into a single expression:
$$\boxed{L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}}$$
This form reduces to $L(N)$ as $D \to \infty$ and to $L(D)$ as $N \to \infty$. The two contributions add inside the bracket, so whichever resource is scarcer dominates the loss, and increasing the other one yields diminishing returns.
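A quick numerical check of the limiting behavior of the combined law may help. This is a minimal sketch: the exponents are the values Kaplan et al. report, while `N_C` and `D_C` are placeholder constants chosen only for illustration.

```python
import math

# Exponents as reported by Kaplan et al.
ALPHA_N, ALPHA_D = 0.076, 0.095
# Hypothetical constants, for illustration only (not fitted values)
N_C, D_C = 1e13, 1e13

def kaplan_loss(n, d):
    """Combined law L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

n, d = 1e9, 1e10
data_unlimited = kaplan_loss(n, math.inf)    # reduces to (N_c/N)^alpha_N
params_unlimited = kaplan_loss(math.inf, d)  # reduces to (D_c/D)^alpha_D
joint = kaplan_loss(n, d)                    # both resources are finite

print(data_unlimited, (N_C / n) ** ALPHA_N)
print(params_unlimited, (D_C / d) ** ALPHA_D)
print(joint)
```

The joint loss is larger than either single-variable limit, since both terms inside the bracket are positive.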
Kaplan Scaling Laws (2020)
The original OpenAI scaling-law study (Kaplan et al., 2020) reported the following exponents for language models:
| Variable | Exponent | Interpretation |
|---|---|---|
| Parameters $N$ | $\alpha_N \approx 0.076$ | 10× more parameters → loss × $10^{-0.076} \approx 0.84$ (about 16% lower) |
| Dataset $D$ | $\alpha_D \approx 0.095$ | 10× more data → loss × $10^{-0.095} \approx 0.80$ (about 20% lower) |
| Compute $C$ | $\alpha_C \approx 0.050$ | 10× more compute → loss × $10^{-0.050} \approx 0.89$ (about 11% lower) |
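Because each law is a power law, an exponent translates into a multiplicative change in loss rather than a fixed offset: scaling a resource by a factor $s$ multiplies the loss by $s^{-\alpha}$. The sketch below evaluates this for the three exponents in the table; the helper name `loss_factor` is an illustrative choice.

```python
# Exponents from the table above
exponents = {"parameters": 0.076, "data": 0.095, "compute": 0.050}

def loss_factor(alpha, scale=10.0):
    """Multiplicative change in loss when the resource grows by `scale`."""
    return scale ** (-alpha)

for name, alpha in exponents.items():
    factor = loss_factor(alpha)
    reduction = (1.0 - factor) * 100.0
    print(f"10x more {name}: loss x {factor:.3f} ({reduction:.1f}% lower)")
```

This assumes the single-variable law applies, meaning the other two resources are not the bottleneck.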
Key Kaplan conclusion: For a fixed compute budget, you should train a large model on a relatively small amount of data. Specifically, Kaplan recommended $D \propto N^{0.74}$ (data grows sub-linearly with parameters).
Chinchilla Optimal Training (2022)
Hoffmann et al. (2022) revised Kaplan's analysis and found that models should be trained on proportionally more data. The Chinchilla scaling law:
$$\boxed{L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}}$$
with fitted parameters $\alpha \approx 0.34$, $\beta \approx 0.28$, $E \approx 1.69$ (irreducible entropy), $A \approx 406.4$, $B \approx 410.7$.
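To get a feel for the three terms, the parametric loss can be evaluated directly. The following is a minimal sketch using the fitted constants quoted above; the function names and the example model size and token count are illustrative assumptions.

```python
# Fitted constants quoted above (Hoffmann et al., 2022)
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss_terms(n_params, n_tokens):
    """Split the predicted loss into its three additive components."""
    return {
        "irreducible": E,
        "model_size": A / n_params ** ALPHA,
        "data": B / n_tokens ** BETA,
    }

def chinchilla_loss(n_params, n_tokens):
    """Predicted loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return sum(loss_terms(n_params, n_tokens).values())

# Illustrative configuration: 1e9 parameters, 2e10 tokens
terms = loss_terms(1e9, 2e10)
for name, value in terms.items():
    print(f"{name:>12}: {value:.4f}")
print(f"{'total':>12}: {chinchilla_loss(1e9, 2e10):.4f}")
```

Comparing the `model_size` and `data` terms shows which resource currently contributes more reducible loss.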
Compute-optimal allocation: Given a training compute budget $C \approx 6ND$ (about 6 FLOPs per parameter per token, covering the forward and backward pass), minimize $L(N,D)$ subject to $6ND = C$. Substituting $D = C/(6N)$ and setting the derivative with respect to $N$ to zero:
$$\frac{d}{dN} L\!\left(N, \frac{C}{6N}\right) = 0 \implies N_{\text{opt}} \propto C^{a}, \quad D_{\text{opt}} \propto C^{b}$$
where $a = \frac{\beta}{\alpha + \beta} \approx 0.45$ and $b = \frac{\alpha}{\alpha + \beta} \approx 0.55$.
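The exponent $a$ can be sanity-checked numerically without any calculus: for each budget, sweep $N$ along the constraint $D = C/(6N)$, keep the value with the lowest loss, and measure how that optimum moves between two budgets. This is a brute-force sketch; the grid range, the two budgets, and the function names are assumptions made for illustration.

```python
import math

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss_on_budget(n_params, compute):
    """Loss when all of `compute` is spent, so D = C / (6 N)."""
    n_tokens = compute / (6.0 * n_params)
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def best_n(compute, steps=20001, log_lo=6.0, log_hi=14.0):
    """Grid search over N on a log-spaced grid from 1e6 to 1e14."""
    best_loss, best_params = float("inf"), None
    for i in range(steps):
        n_params = 10.0 ** (log_lo + (log_hi - log_lo) * i / (steps - 1))
        current = loss_on_budget(n_params, compute)
        if current < best_loss:
            best_loss, best_params = current, n_params
    return best_params

c1, c2 = 1e21, 1e23  # two illustrative compute budgets
slope = math.log(best_n(c2) / best_n(c1)) / math.log(c2 / c1)
print(f"empirical exponent a = {slope:.3f}")
print(f"beta / (alpha + beta) = {BETA / (ALPHA + BETA):.3f}")
```

The two printed numbers should agree up to the resolution of the grid.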
EfficientNet Compound Scaling
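For CNNs, Tan & Le (2019) proposed scaling three dimensions simultaneously with a single compound coefficient $\phi$ (here $\alpha, \beta, \gamma$ are per-dimension base multipliers, unrelated to the Chinchilla exponents above):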
- Depth $d = \alpha^\phi$ (number of layers)
- Width $w = \beta^\phi$ (channels per layer)
- Resolution $r = \gamma^\phi$ (input image size)
Subject to the compute constraint:
$$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$
The exponents of 2 on $\beta$ and $\gamma$ arise because convolution FLOPs scale as $O(\text{depth} \times \text{width}^2 \times \text{resolution}^2)$. The compound coefficient $\phi$ controls how much compute to add: total FLOPs grow by roughly $(\alpha \beta^2 \gamma^2)^\phi \approx 2^\phi$, so each unit increase in $\phi$ approximately doubles FLOPs.
Derivation of constraint: For a convolution layer with $C_{\text{in}}$ input channels, $C_{\text{out}}$ output channels, kernel $k \times k$, and spatial dimension $H \times W$:
$$\text{FLOPs} \propto k^2 \cdot C_{\text{in}} \cdot C_{\text{out}} \cdot H \cdot W$$
Scaling width by $w$ multiplies both $C_{\text{in}}$ and $C_{\text{out}}$, giving a factor of $w^2$. Scaling resolution by $r$ multiplies $H$ and $W$ each by $r$, giving $r^2$. Scaling depth by $d$ multiplies the number of such layers by $d$. In total, $\text{FLOPs} \propto d \cdot w^2 \cdot r^2$.
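The $d \cdot w^2 \cdot r^2$ rule can be checked by counting multiply-accumulates for a toy network. The sketch below assumes a stack of identical stride-1 convolutions whose output has the same spatial size as the input; the layer counts, channel widths, and resolutions are made-up values.

```python
def conv_flops(k, c_in, c_out, h, w):
    """Multiply-accumulate count for one k x k convolution, up to a constant."""
    return k * k * c_in * c_out * h * w

def network_flops(n_layers, k, channels, h, w):
    """Toy network: n_layers identical convolutions with equal in/out channels."""
    return n_layers * conv_flops(k, channels, channels, h, w)

# Baseline: 10 layers, 64 channels, 56 x 56 feature maps
base = network_flops(10, 3, 64, 56, 56)
# Scale depth, width, and resolution by 2 each (d = w = r = 2)
scaled = network_flops(20, 3, 128, 112, 112)

ratio = scaled / base
print(ratio)  # 32.0, which equals d * w^2 * r^2 = 2 * 4 * 4
```

Real networks mix layer types and strides, so the rule holds only approximately there.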
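A grid search on the EfficientNet-B0 baseline at $\phi = 1$ found $\alpha = 1.2, \beta = 1.1, \gamma = 1.15$, which gives $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 1.92 \approx 2$. With these values fixed, $\phi$ was then scaled from 1 to 7 to obtain B1–B7.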
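The effect of the compound coefficient can be tabulated directly from these three base multipliers. This is a minimal sketch of the scaling rule only; the function name `compound_scale` is an illustrative choice, and the multipliers are relative to the baseline network rather than actual layer counts or image sizes.

```python
# Base multipliers quoted above
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution, FLOPs) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

for phi in range(0, 4):
    depth, width, resolution, flops = compound_scale(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, "
          f"resolution x{resolution:.2f}, FLOPs x{flops:.2f}")
```

Each step in `phi` multiplies FLOPs by the same factor $\alpha \beta^2 \gamma^2$, which is why the growth is geometric in $\phi$.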
Compute-Optimal Allocation
The general question: given a fixed FLOPs budget $C$, how should we split between $N$ (model size) and $D$ (data)?
Lagrangian formulation:
$$\min_{N,D} L(N,D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta} \quad \text{s.t.} \quad 6ND = C$$
Lagrangian: $\mathcal{L} = E + AN^{-\alpha} + BD^{-\beta} + \lambda(6ND - C)$
Setting $\partial \mathcal{L}/\partial N = 0$ and $\partial \mathcal{L}/\partial D = 0$:
$$-\alpha A N^{-\alpha-1} + 6\lambda D = 0 \implies \lambda = \frac{\alpha A}{6 D N^{\alpha+1}}$$
$$-\beta B D^{-\beta-1} + 6\lambda N = 0 \implies \lambda = \frac{\beta B}{6 N D^{\beta+1}}$$
Equating the two expressions for $\lambda$:
$$\frac{\alpha A}{D N^{\alpha+1}} = \frac{\beta B}{N D^{\beta+1}} \implies \frac{\alpha A}{N^\alpha} = \frac{\beta B}{D^\beta}$$
Substituting $D = C/(6N)$ gives $N^{\alpha+\beta} = \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta}$, so the optimal split is a pair of power laws in $C$:
$$N_{\text{opt}} \propto C^{\beta/(\alpha+\beta)}, \quad D_{\text{opt}} \propto C^{\alpha/(\alpha+\beta)}$$
Predict Optimal Model Size
Using Chinchilla parameters ($\alpha = 0.34$, $\beta = 0.28$), compute: for a 10× increase in compute budget, by what factor should model size and data each grow? Verify that $N_{\text{opt}}$ grows by $10^{0.45} \approx 2.8\times$ and $D_{\text{opt}}$ grows by $10^{0.55} \approx 3.5\times$.
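One way to check the answer is to evaluate the closed-form optimum at two budgets a factor of 10 apart. The sketch below solves the stationarity condition $\alpha A N^{-\alpha} = \beta B D^{-\beta}$ together with $6ND = C$; the function name `optimal_split` and the example budgets are assumptions, and the absolute sizes it returns are sensitive to the fitted constants, so only the growth factors should be read as robust.

```python
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def optimal_split(compute):
    """Closed-form minimizer of E + A/N^alpha + B/D^beta subject to 6ND = C."""
    # N^(alpha+beta) = (alpha*A / (beta*B)) * (C/6)^beta
    prefactor = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = prefactor * (compute / 6.0) ** (BETA / (ALPHA + BETA))
    d_opt = compute / (6.0 * n_opt)
    return n_opt, d_opt

n1, d1 = optimal_split(1e21)  # illustrative budget
n2, d2 = optimal_split(1e22)  # 10x larger budget

n_growth, d_growth = n2 / n1, d2 / d1
print(f"model size grows by {n_growth:.2f}x")
print(f"data grows by       {d_growth:.2f}x")
print(f"product             {n_growth * d_growth:.2f}x")
```

The product of the two growth factors equals the compute multiplier, because $C = 6ND$ holds at both budgets.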