Foundation Models & Scaling Laws — A History of AI (14/15)

Beat 1 · Concrete

Same question, three sizes

One prompt. The small model fails, the medium fumbles, the large one nails it.

Beat 2 · Abstract

The scaling law

Loss versus compute on log–log axes: a straight descending line. Gains are predictable.
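A straight line on log-log axes is a power law, L(C) = a · C^(−α). As a minimal sketch, the snippet below generates losses from made-up constants, recovers them with a least-squares line through (log C, log L), and then extrapolates. The values of `a` and `alpha` and the function names are illustrative assumptions, not figures from any paper.

```python
import math

def power_law_loss(compute, a, alpha):
    """Loss under an assumed power law: L(C) = a * C**(-alpha)."""
    return a * compute ** (-alpha)

def fit_power_law(computes, losses):
    """Least-squares line through (log C, log L); returns (a, alpha)."""
    xs = [math.log(c) for c in computes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # On log-log axes the slope is -alpha and the intercept is log(a).
    return math.exp(intercept), -slope

# Hypothetical constants, chosen only for illustration.
TRUE_A, TRUE_ALPHA = 10.0, 0.05

computes = [10.0 ** k for k in range(0, 9)]  # 1 ... 1e8, arbitrary units
losses = [power_law_loss(c, TRUE_A, TRUE_ALPHA) for c in computes]

a_hat, alpha_hat = fit_power_law(computes, losses)

# "Gains are predictable": extrapolate the fitted line to unseen compute.
predicted = power_law_loss(1e10, a_hat, alpha_hat)
```

Real measurements are noisy and the line eventually bends, so a fit like this is only trustworthy near the range it was fitted on.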

Beat 3 · Interactive

Slide the scale

Drag the dial: loss slides down the line, and capabilities switch on as thresholds are crossed.
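One way to picture what the dial does, as a rough sketch: loss follows a smooth power law in scale, while each capability is a simple on/off check against a loss threshold. Every number and name here (`A`, `ALPHA`, `THRESHOLDS`, `dial_state`) is hypothetical and not taken from the page's actual widget.

```python
# Hypothetical power-law constants for the demo curve.
A, ALPHA = 4.0, 0.1

# Hypothetical capabilities: each unlocks once loss drops to its threshold.
THRESHOLDS = [
    ("capability A", 3.0),
    ("capability B", 2.0),
    ("capability C", 1.5),
]

def dial_state(scale):
    """Return (loss, unlocked capability names) for a given scale."""
    loss = A * scale ** (-ALPHA)  # smooth, predictable decline
    unlocked = [name for name, max_loss in THRESHOLDS if loss <= max_loss]
    return loss, unlocked

for scale in (1, 1e2, 1e4, 1e5):
    loss, unlocked = dial_state(scale)
    print(f"scale={scale:g}  loss={loss:.2f}  unlocked={len(unlocked)}")
```

The loss curve itself never jumps. The abrupt "switch on" comes entirely from comparing a smooth quantity against a fixed cutoff.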


Footnotes & further reading

2020 · 2022

Scaling laws

Kaplan et al. (2020) fit loss as a power law in compute, data, and parameters. Chinchilla (Hoffmann et al., 2022) later rebalanced the recipe: for a fixed compute budget, train a smaller model on far more tokens.
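The rebalancing can be sketched as back-of-the-envelope arithmetic. This assumes two commonly quoted approximations rather than exact results: training compute C ≈ 6 · N · D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. The function name `compute_optimal` is illustrative.

```python
import math

# Assumed rules of thumb, not exact results:
FLOPS_PER_PARAM_TOKEN = 6  # C ≈ 6 * N * D
TOKENS_PER_PARAM = 20      # D ≈ 20 * N at the compute-optimal point

def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) under the assumptions above."""
    # Substituting D = 20 * N into C = 6 * N * D gives C = 120 * N**2.
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

n_params, n_tokens = compute_optimal(5.88e23)
print(f"{n_params:.2e} parameters, {n_tokens:.2e} tokens")
```

With a budget of about 5.88e23 FLOPs this gives roughly 7e10 parameters and 1.4e12 tokens, close to the 70B-parameter, 1.4T-token configuration reported for Chinchilla.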

2020

GPT-3, few-shot

175B parameters. With no fine-tuning, it picked up tasks from a handful of examples in the prompt: in-context learning, an ability that grew markedly stronger with scale.
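"A handful of examples in the prompt" looks roughly like the sketch below: worked examples are concatenated ahead of the new query, and the model is left to continue the pattern. The `Q:`/`A:` layout and the helper name `few_shot_prompt` are assumptions for illustration, not a format prescribed by any particular model.

```python
def few_shot_prompt(examples, query, instruction=""):
    """Build a few-shot prompt: optional instruction, worked examples, then the query."""
    blocks = [instruction] if instruction else []
    for question, answer in examples:
        blocks.append(f"Q: {question}\nA: {answer}")
    # The final answer is left blank for the model to complete.
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    examples=[("sea otter", "loutre de mer"), ("cheese", "fromage")],
    query="peppermint",
    instruction="Translate English to French.",
)
print(prompt)
```

No weights change here: the "learning" happens entirely in how the model conditions on the text it is given.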

2021 · 2022

Emergence & “foundation models”

Stanford coined “foundation models” for one pretrained base adapted to many tasks; researchers catalogued “emergent abilities” that appear abruptly past a scale threshold.