Technology Level: Practitioner

Backpropagation: How AI Models Actually Learn

Backpropagation is the algorithm that trains every modern neural network. Explained here: the chain-rule math, the 5 steps of the training loop, and why vanishing and exploding gradients used to break training.

Lukas Hoffmann · Updated May 23, 2026

Backpropagation visualized — the error signal travels backwards through a neural network and updates the weights in every layer.

Optimizer comparison at a glance

Backpropagation computes the gradients — the optimizer decides how they turn into a weight update. Which variant fits depends on model type and training budget. The five most-used optimizers since the 1950s, sorted by year of publication:

Optimizer	Year	Speed	Stability	Recommendation in 2026
SGD (classical)	1951	slow	high	Textbook examples, very simple models
SGD + Momentum	1964	medium	high	CNNs in computer vision (ResNet, EfficientNet)
Adam	2014	fast	medium	LLM research, fast prototyping
AdamW	2017	fast	high	Industry-standard for transformers (GPT, Claude, Llama)
Lion	2023	very fast	medium	Large models, memory-constrained training

Important: Adam is not universally superior. For computer vision tasks, SGD with Momentum often delivers better validation accuracy. For transformers, AdamW has been the practical default since the AdamW paper (Loshchilov & Hutter, 2017). Lion (Google, 2023) uses less memory than Adam, but is more sensitive to learning-rate choice.

What is backpropagation in one sentence?

Backpropagation is the algorithm that, for every weight in a neural network, computes how much it contributed to the total error — and thereby provides the foundation for learning. The name reveals the direction: back = backwards, propagation = spread. While the forward pass pushes data from left to right through the network, the error signal in the backward pass travels from right to left.

Mathematically, backpropagation is an efficient application of the chain rule to a very long composition of functions. Every layer of a network is a function — multiply by weights, add bias, then activation. A network with 100 layers is therefore a 100-fold nested function. The chain rule says: to know how the first weight influences the final error, multiply the derivatives of every function in between. Backpropagation does this systematically, in a single backward pass, without recomputing any derivative twice.

Historically, the algorithm was discovered independently several times. Seppo Linnainmaa described reverse-mode autodiff in his Finnish master’s thesis in 1970; Paul Werbos formalized the idea for neural networks in 1974. The breakthrough came in 1986 with a Nature paper by Rumelhart, Hinton and Williams, which showed that multi-layer networks can learn useful internal representations this way. Even so, it took almost three more decades before enough data and compute were available for the modern deep-learning boom.

How does backpropagation work step by step?

The backpropagation loop consists of five repeated phases: forward pass, compute loss, backward pass, weight update, repeat. On a modern A100 GPU, each iteration takes around 10 milliseconds for a typical image CNN — roughly 100 training steps per second. If you understand this loop, you understand the skeleton of every deep-learning training run.

Phase 1 — Forward pass

A mini-batch (typically 32 to 256 examples) flows through the network. Layer by layer: matrix-multiply the inputs by the weights, add the bias, apply an activation function (usually ReLU). At the output, a prediction emerges — for image classification, a probability distribution over the classes; for a language model, a distribution over the next token. The activations of every intermediate layer are cached; the backward pass will need them in a moment.
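
As a rough illustration of the forward pass described above, here is a minimal sketch in plain Python for a single example. The network shape, the weight values and the helper names (`dense`, `forward`, `params`) are assumptions made up for this example. For simplicity the sketch applies ReLU after the last layer too, which a real classifier would normally not do.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def dense(x, W, b):
    """Affine layer: W @ x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, params):
    """Run the input through every layer and cache what the backward pass needs."""
    cache = []
    a = x
    for W, b in params:
        z = dense(a, W, b)      # pre-activation
        cache.append((a, z))    # keep the layer input and the pre-activation
        a = relu(z)             # the activation becomes the next layer's input
    return a, cache

# Hypothetical two-layer network: 2 inputs -> 2 hidden units -> 1 output
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[2.0, 1.0]], [0.5]),
]
output, cache = forward([3.0, 1.0], params)
```

The `cache` list is the part that matters for training: it holds one entry per layer, and the backward pass reads those entries instead of recomputing them.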

Phase 2 — Compute the loss

The loss function compares prediction and ground truth and returns a single number: small for a good prediction, large for one that is way off. Classification typically uses Cross-Entropy, regression uses Mean Squared Error. Language models usually apply Cross-Entropy to the next token; diffusion models typically use an L2 loss on the noise.
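
To make the two standard losses concrete, the following sketch computes them by hand in plain Python. The function names and the example logits are illustrative assumptions; frameworks ship their own, numerically hardened versions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability the model assigns to the correct class."""
    return -math.log(softmax(logits)[target_index])

def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Same hypothetical logits, scored against two different ground-truth classes
confident_right = cross_entropy([4.0, 0.0, 0.0], target_index=0)  # small loss
confident_wrong = cross_entropy([4.0, 0.0, 0.0], target_index=1)  # large loss
mse = mean_squared_error([2.5, 0.0], [3.0, -1.0])
```

The same prediction yields a loss near zero when it is right and a loss above 4 when it is confidently wrong, which is exactly the "small means good, large means way off" behavior.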

Phase 3 — Backward pass (the actual backpropagation)

Now the name-giving step: starting from the loss, the chain rule is applied backwards. Frameworks like PyTorch, JAX, and TensorFlow handle this with automatic differentiation — you don’t write the derivatives yourself. While the forward pass records the function calls on a computation graph, the backward pass walks that graph in reverse. At every node, the local derivative is computed and multiplied with the incoming gradient. Result: one dedicated gradient for each of the millions or billions of weights.
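
The graph walk can be sketched in a few dozen lines of plain Python. This is a toy scalar autodiff written for illustration, not how PyTorch, JAX or TensorFlow are implemented; the `Value` class and the example numbers are assumptions.

```python
class Value:
    """A scalar that remembers how it was computed (one node in the graph)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        """Walk the graph in reverse topological order and apply the chain rule."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(loss)/d(loss)
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                # local derivative times incoming gradient
                parent.grad += local * node.grad

# loss = relu(w * x + b) with hypothetical values
w, x, b = Value(2.0), Value(3.0), Value(-1.0)
loss = (w * x + b).relu()
loss.backward()
```

After `loss.backward()`, `w.grad` holds the value of `x`, `x.grad` holds the value of `w`, and `b.grad` is 1, which matches the derivatives you would compute by hand.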

Phase 4 — Optimizer update

With the gradients in hand, the optimizer takes over. Classical SGD: w = w − η · ∂L/∂w — shift the weight a small distance (learning rate η) in the opposite direction of the gradient. Modern variants like Adam or AdamW additionally maintain running averages and variances per weight — they adaptively adjust the effective learning rate for each individual parameter.
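
The classical SGD rule can be tried on a toy problem. In this sketch the quadratic loss, its hand-written gradient and the learning rate of 0.1 are assumptions chosen so that the result is easy to check.

```python
def sgd_step(weights, grads, lr):
    """Plain SGD: move each weight against its gradient, scaled by the learning rate."""
    return [w - lr * g for w, g in zip(weights, grads)]

def loss_fn(weights):
    """Toy loss with its minimum at w = (3, -2)."""
    return (weights[0] - 3.0) ** 2 + (weights[1] + 2.0) ** 2

def grad_fn(weights):
    """Analytic gradient of the toy loss above."""
    return [2.0 * (weights[0] - 3.0), 2.0 * (weights[1] + 2.0)]

weights = [0.0, 0.0]
history = [loss_fn(weights)]
for _ in range(50):
    weights = sgd_step(weights, grad_fn(weights), lr=0.1)
    history.append(loss_fn(weights))
```

With this learning rate the distance to the minimum shrinks by a constant factor per step, so `history` decreases monotonically and `weights` ends up close to (3, -2).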

Phase 5 — Repeat

After the update, the next mini-batch begins. One complete pass through all training data is called an epoch. A small model on MNIST needs 5 to 20 epochs; an ImageNet CNN, 90 to 300; a large language model often sees its entire corpus only once. Training ends when validation loss converges or the compute budget runs out.
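
Put together, the five phases fit into a short loop. The sketch below trains a one-weight linear model in plain Python with hand-written derivatives, so no framework is involved. The synthetic data rule y = 2x + 1, the batch size and the learning rate are assumptions for illustration.

```python
import random

random.seed(0)

# Synthetic data from the hypothetical rule y = 2x + 1
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-20, 21)]
w, b = 0.0, 0.0
lr, batch_size, epochs = 0.1, 8, 50
epoch_losses = []

for epoch in range(epochs):
    random.shuffle(data)
    total = 0.0
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Phase 1: forward pass
        preds = [w * x + b for x, _ in batch]
        # Phase 2: mean squared error over the mini-batch
        errors = [p - y for p, (_, y) in zip(preds, batch)]
        loss = sum(e * e for e in errors) / len(batch)
        # Phase 3: backward pass (derivatives of the loss, written out by hand)
        grad_w = sum(2.0 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = sum(2.0 * e for e in errors) / len(batch)
        # Phase 4: SGD update
        w -= lr * grad_w
        b -= lr * grad_b
        total += loss * len(batch)
    # Phase 5: repeat; one pass over all data is one epoch
    epoch_losses.append(total / len(data))
```

After training, `w` and `b` should sit close to 2 and 1, and `epoch_losses` shows the average loss per epoch falling toward zero.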

The underlying mechanism has been known since the 1980s. What has changed is the scale: mini-batches instead of single examples, GPUs instead of CPUs, mixed-precision training, gradient accumulation, distributed training across thousands of cards. A deeper treatment of the learning loop with loss functions and visualization lives in the Deep Learning hub.

What math is actually behind it? The chain rule without textbook pain

Backpropagation is the chain rule of calculus, applied to a very long composition of functions. If you understand the chain rule with a single example of two nested functions, you already get the mathematical core — repeating it a hundred times in a deep network does not change that.

Picture two simple functions: g(x) = x + 1 and f(y) = y². Composed, that gives f(g(x)) = (x + 1)². We want to know: how does the output change when we nudge x a tiny bit? The chain rule says: derivative of the outer function times the derivative of the inner.

f'(g(x)) = 2 · (x + 1)        // derivative of the outer function
g'(x)    = 1                  // derivative of the inner function
d/dx f(g(x)) = 2 · (x + 1) · 1

Translated to plain English: the influence of x on the final value is the product of two local sensitivities — how strongly does g react to x, and how strongly does f react to g. In a neural network with 100 layers, this exact chain shows up — just 100-fold: the influence of a weight in layer 1 on the loss at the end is the product of the local derivatives of every layer in between.
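
A quick way to convince yourself is to compare the chain-rule result with a numerical estimate. The sketch below does that for the two functions from the example; the test point x = 3 and the step size `eps` are arbitrary choices.

```python
def g(x):
    return x + 1.0

def f(y):
    return y ** 2

def analytic_derivative(x):
    """Chain rule: f'(g(x)) * g'(x) = 2 * (x + 1) * 1."""
    return 2.0 * g(x) * 1.0

def numeric_derivative(x, eps=1e-6):
    """Central finite difference of the composed function f(g(x))."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x = 3.0
analytic = analytic_derivative(x)   # 2 * (3 + 1) = 8
numeric = numeric_derivative(x)     # agrees up to rounding error
```

Finite differences like this are also a common sanity check for hand-written gradients, although they are far too slow to train a real network.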

The decisive trick of backpropagation: naively, you would have to recompute every chain for every weight, which is unaffordable with billions of weights. Backprop computes the derivatives once, from back to front, and distributes them along the way. This reuse of intermediate results is what computer scientists call dynamic programming. It keeps the cost of the backward pass within a small constant factor of the forward pass (commonly estimated at about twice), a dramatic efficiency gain over the naive approach.
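
The reuse can be seen in a stripped-down chain of scalar layers. In this sketch each layer is `tanh(w * h)`, a simplification chosen so the derivatives stay readable; the weights and the input 0.7 are hypothetical. A single backward sweep produces all gradients, while the finite-difference check has to rerun the whole chain for every weight.

```python
import math

def forward(x, weights):
    """h_k = tanh(w_k * h_{k-1}); returns every activation, input included."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def backward(weights, activations):
    """One sweep from output to input yields the gradient of every weight."""
    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(output)
    for k in reversed(range(len(weights))):
        h_in, h_out = activations[k], activations[k + 1]
        d_pre = upstream * (1.0 - h_out ** 2)  # derivative through tanh
        grads[k] = d_pre * h_in                # gradient for this layer's weight
        upstream = d_pre * weights[k]          # reused by all earlier layers
    return grads

weights = [0.5, -1.2, 0.8, 1.5]   # hypothetical values
acts = forward(0.7, weights)
grads = backward(weights, acts)

def numeric_grad(k, eps=1e-6):
    """Finite-difference check for weight k (recomputes the whole chain twice)."""
    up = weights[:k] + [weights[k] + eps] + weights[k + 1:]
    down = weights[:k] + [weights[k] - eps] + weights[k + 1:]
    return (forward(0.7, up)[-1] - forward(0.7, down)[-1]) / (2.0 * eps)
```

The variable `upstream` is the shared intermediate result: it is computed once per layer and handed to every layer before it.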

If you want to go deeper: the best visual tutorial is 3Blue1Brown — Backpropagation calculus, and the original paper is Rumelhart/Hinton/Williams (1986). For day-to-day code, you don’t need to implement any of this yourself — PyTorch, TensorFlow and JAX handle the derivatives automatically.

What is the difference between gradient descent and backpropagation?

Backpropagation computes the gradients. Gradient descent uses them to update the weights. The two terms are often used interchangeably but describe different phases of the same learning loop.

More precisely:

Backpropagation is a derivative engine. Input: a loss value plus the cached activations from the forward pass. Output: one gradient per weight. Backprop doesn’t tell you how to change the weight — only in which direction and with what magnitude the loss reacts.
Gradient descent is an update rule. Input: a gradient and a learning rate. Output: a shifted weight. The simplest form: w_new = w_old − η · g, where η is the learning rate and g is the gradient.

The distinction becomes clear from the alternatives: there are methods that replace backpropagation (Evolutionary Strategies, Forward-Forward, Direct Feedback Alignment), yet they still feed some estimate of the gradient into a descent-style update. And there are optimizers that replace plain gradient descent (Adam, AdamW, RMSprop, Lion), yet they still need backprop to supply the gradients. The two building blocks are independent of each other.

Rule of thumb in code: in PyTorch, loss.backward() runs backpropagation; optimizer.step() performs the gradient-descent step. Two function calls, two phases, two concepts. JAX expresses the same split functionally — jax.grad(loss_fn) returns a gradient function; you apply it inside a manual update. A comparison with the broader ML-optimization picture lives in the beginner hub Machine Learning.

What are vanishing and exploding gradients?

Vanishing gradient means: during the backward pass, gradients shrink so much that the early layers practically stop learning. Exploding gradient means the opposite: gradients grow so large that training becomes unstable or crashes with NaN. Both problems stem from the multiplicative nature of the chain rule, and both have had well-established fixes since around 2015.

Vanishing gradient — why it happens

At every step, the backward pass multiplies a local gradient by the incoming one. If the local derivative is smaller than 1 (Sigmoid, for example, caps at 0.25), the gradient shrinks layer by layer. After 20 layers, 0.25^20 ≈ 9 · 10⁻¹³ — effectively zero. The early layers receive no learning signal anymore. Deep networks were practically untrainable in the 1990s for precisely this reason.
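
The arithmetic is easy to reproduce. This sketch looks only at the activation derivatives and ignores the weight factors that also enter the product, which is a simplifying assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

best_case = sigmoid_derivative(0.0)   # the maximum of the derivative, 0.25
after_20_layers = best_case ** 20     # about 9.1e-13
relu_after_20_layers = 1.0 ** 20      # ReLU passes a derivative of 1 for positive inputs
```

Even in the best case for Sigmoid, twenty layers scale the gradient down by roughly twelve orders of magnitude.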

Exploding gradient — why it happens

The mirror image: if local derivatives are greater than 1 and start compounding, the weight update grows so large that the model tips out of its useful parameter range. Symptom: the loss jumps up or turns NaN. RNNs and very deep networks without normalization are especially vulnerable.

Fixes that work in 2026

Four innovations have largely defused the problem:

ReLU instead of Sigmoid (Nair & Hinton, 2010; Glorot et al., 2011). ReLU has a derivative of 1 for positive inputs, so the activation itself no longer shrinks the gradient during the backward pass.
Residual connections (ResNet, He et al., 2015). Skip connections give the gradient a bypass path that avoids the multiplicative chain. They are what made networks with more than 50 layers practical.
Layer Normalization (Ba et al., 2016). Normalizes activations inside each layer. Indispensable in transformers — most LLMs use LayerNorm or RMSNorm.
Gradient clipping. When the gradient norm exceeds a threshold, the whole gradient is scaled down so that its norm equals the threshold. Standard trick in RNN training and large language model training: one line of code that prevents exploding-gradient crashes.
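
Of the four fixes, gradient clipping is the easiest to show in isolation. The sketch below clips by global L2 norm in plain Python; the function name and the threshold of 1.0 are assumptions. In PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`, called between `loss.backward()` and `optimizer.step()`.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Hypothetical exploding gradient with norm 50, and a well-behaved one with norm 0.5
clipped, norm_before = clip_by_global_norm([30.0, -40.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, -0.4], max_norm=1.0)
```

Scaling the whole vector keeps the gradient direction intact and limits only the step length, which is why this variant is usually preferred over clamping each component separately.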

On top of that, good initialization schemes (He initialization for ReLU, Glorot for Sigmoid/Tanh) and mixed-precision training with loss scaling — which catches numerical underflows in the FP16 range — round out the toolkit.
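
The two initialization schemes differ only in the variance they target. The sketch below uses the common textbook forms (He: Gaussian with variance 2 / fan_in; Glorot: uniform with variance 2 / (fan_in + fan_out)); the layer size of 256 and the helper names are assumptions.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """He initialization: zero-mean Gaussian with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot initialization: uniform in [-limit, limit], variance 2 / (fan_in + fan_out)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

def variance(matrix):
    """Empirical variance over all entries of a weight matrix."""
    flat = [w for row in matrix for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)

rng = random.Random(0)
he_var = variance(he_normal(256, 256, rng))            # close to 2 / 256
glorot_var = variance(glorot_uniform(256, 256, rng))   # close to 2 / 512
```

He initialization uses twice the variance of Glorot for a square layer, which compensates for ReLU zeroing out roughly half of the activations.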

Which optimizers replace classical SGD in 2026?

Pure stochastic gradient descent is rarely used in production anymore. Modern optimizers add momentum, a per-parameter adaptive learning rate, or both, and usually converge faster. Which optimizer is the best choice in 2026 depends on model type: AdamW for transformers, SGD + Momentum for many CV tasks, Lion for very large models.

SGD with Momentum (Polyak, 1964 / Nesterov, 1983)

Classical SGD plus a moving average of past gradients — like a ball rolling down a slope and keeping its momentum. Excels on clean loss landscapes; in computer vision, often generalizes better than adaptive methods. ResNet, EfficientNet and many classical image CNNs are still trained with SGD + Momentum today.
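
The rolling-ball picture corresponds to one extra state variable per weight. This sketch uses the form v = beta * v + g, which is one common convention (some texts scale the gradient by 1 - beta instead); the hyperparameters and the constant gradient are assumptions.

```python
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """SGD with heavy-ball momentum for a single weight."""
    velocity = beta * velocity + grad   # accumulate past gradients
    w = w - lr * velocity               # step along the accumulated direction
    return w, velocity

# With a constant gradient of 1.0 the step size grows toward lr / (1 - beta)
w, v = 0.0, 0.0
steps = []
for _ in range(100):
    w_new, v = momentum_step(w, v, grad=1.0)
    steps.append(w - w_new)
    w = w_new
```

The first step equals the plain SGD step of 0.1; after many steps in the same direction the step approaches 1.0, ten times larger, which is the "keeping its momentum" effect.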

Adam (Kingma & Ba, 2014)

Adam (Adaptive Moment Estimation) combines momentum (first moment) with an adaptive per-parameter learning rate (second moment). It converges fast on most architectures and is fairly robust to hyperparameter choice. Weakness: L2 regularization added to the gradient gets rescaled by the adaptive learning rate, so weight decay behaves inconsistently across parameters. That is the gap AdamW closes.
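
For a single weight, the Adam update from Kingma & Ba can be written out as follows. The default hyperparameters are the ones commonly quoted from the paper; the two example gradients are assumptions picked to show the normalization effect.

```python
import math

def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment: running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment: running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Two weights whose gradients differ by a factor of 10,000
w_small, _, _ = adam_step(0.0, 0.0, 0.0, grad=0.01, t=1)
w_large, _, _ = adam_step(0.0, 0.0, 0.0, grad=100.0, t=1)
```

Both weights move by almost exactly `lr` on the first step, regardless of gradient scale. That normalization is what "adaptive per-parameter learning rate" means in practice.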

AdamW (Loshchilov & Hutter, 2017)

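AdamW decouples weight decay from the gradient update. Sounds small, isn’t: for transformer training with heavy regularization, Adam with plain L2 regularization was unreliable, and AdamW became the de facto industry standard. Published recipes such as Llama name AdamW explicitly, and GPT-3, GPT-4 and Claude are generally understood to use Adam variants with decoupled weight decay, although not every training recipe is public.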
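
The difference between the two variants is one line. The sketch below puts Adam with L2 regularization next to AdamW for a single weight; the function names and all numeric values are assumptions for illustration.

```python
import math

def adam_l2_step(w, m, v, grad, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: the decay term is folded into the gradient."""
    grad = grad + wd * w
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat, v_hat = m / (1.0 - beta1 ** t), v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, m, v, grad, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: the decay acts on the weight directly and bypasses the adaptive scaling."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat, v_hat = m / (1.0 - beta1 ** t), v / (1.0 - beta2 ** t)
    w = w - lr * wd * w                           # decoupled weight decay
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Same weight, same large gradient, same settings (all values hypothetical)
w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, grad=100.0, t=1, lr=0.01, wd=0.1)
w_decoupled, _, _ = adamw_step(1.0, 0.0, 0.0, grad=100.0, t=1, lr=0.01, wd=0.1)
```

In the L2 version the decay term is swallowed by the normalization, so the weight moves by `lr` and nothing more. In the decoupled version the weight additionally shrinks by `lr * wd * w`, independent of how large the gradient is.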

Lion (Chen et al., 2023)

Lion (EvoLved Sign Momentum) was discovered at Google in 2023 through program search. Instead of using the gradient magnitude, Lion takes only the sign of a blend of the current gradient and a momentum term, so every weight moves by the same step size. Upside: noticeably less memory than Adam, because it keeps one state buffer per weight instead of two, with similar or better convergence on large models. Downside: more sensitive to learning rate and batch size.
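
A single-weight sketch of the Lion update, following the form published by Chen et al. (2023) as far as this example goes. The default values for `lr`, `beta1` and `beta2` are commonly cited settings and should be treated as assumptions here.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(w, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update for a single weight."""
    blended = beta1 * m + (1.0 - beta1) * grad   # interpolate momentum and gradient
    w = w - lr * (sign(blended) + wd * w)        # only the sign sets the direction
    m = beta2 * m + (1.0 - beta2) * grad         # the single state buffer per weight
    return w, m

# Two hypothetical gradients that differ by six orders of magnitude
w_a, m_a = lion_step(0.0, 0.0, grad=0.001)
w_b, m_b = lion_step(0.0, 0.0, grad=1000.0)
```

Both weights move by exactly `lr`, because the sign discards the magnitude. This is also why Lion is usually run with a smaller learning rate than Adam.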

Which one to pick?

Computer vision (CNN, classical image tasks): SGD + Momentum + cosine schedule.
Language and multimodal (transformers): AdamW with warmup + cosine decay.
Very large models with memory pressure: Lion or 8-bit Adam.
Prototyping and research: Adam as a safe default.
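
The "warmup + cosine decay" schedule mentioned for transformers can be sketched as a plain function of the step count. The peak learning rate of 3e-4, the 100 warmup steps and the 1000 total steps are assumptions; real runs tune all three.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000) for s in range(1000)]
```

The learning rate climbs linearly during the first 100 steps, peaks at `peak_lr`, and then follows half a cosine wave down toward zero. Warmup keeps the early, noisy Adam statistics from producing oversized steps.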

An overview of the architectures these optimizers train lives in the pillar Neural Networks. For applications to generative models, see Generative AI and Transformer.

What does backpropagation look like in PyTorch? (5 lines)

In modern frameworks, the entire backpropagation loop boils down to five lines. PyTorch handles autodifferentiation automatically as long as tensors are marked with requires_grad=True — which is the default for nn.Module parameters. Here is the standard training loop:

for inputs, targets in dataloader:
  optimizer.zero_grad()              # clear old gradients
  outputs = model(inputs)            # forward pass
  loss = criterion(outputs, targets) # compute loss
  loss.backward()                    # backprop: fills .grad on every parameter
  optimizer.step()                   # gradient-descent step

Line by line, what actually happens:

optimizer.zero_grad() — PyTorch accumulates gradients by default. Before every new batch, the old .grad buffers must be reset to zero, otherwise multiple backward passes will sum on top of each other.
model(inputs) — the forward pass. Under the hood, PyTorch builds the computation graph that autograd will need.
criterion(outputs, targets) — the loss function (e.g. nn.CrossEntropyLoss for classification or nn.MSELoss for regression). The result is a single scalar.
loss.backward() — this is where the actual backpropagation happens. PyTorch walks the computation graph in reverse and writes a gradient into each parameter’s .grad attribute.
optimizer.step() — the optimizer (e.g. torch.optim.AdamW(model.parameters(), lr=3e-4)) reads the .grad values and updates the weights.

In TensorFlow/Keras the code looks slightly different (tf.GradientTape context), but the logic is identical. JAX takes the functional route — jax.grad(loss_fn)(params, batch) returns a gradient pytree, and you apply the update yourself or via Optax — but the end result is the same. A thorough walkthrough lives in the PyTorch tutorials; if you want to test this hands-on, a tool from the AI Code Assistants category can walk you through writing your first training script.

Practical tip: when a model stubbornly refuses to learn, check these five lines first. Missing zero_grad() → gradients accumulate. Missing loss.backward() → no learning at all. Missing optimizer.step() → gradients computed, weights never updated.

Which problems remain unsolved in 2026?

Despite forty years of optimization, backpropagation still has hard limits: high memory cost, sequential dependencies, biological implausibility, and poor scaling in very deep or dynamic models. Research on alternatives is active in 2026, but none has displaced standard backprop in production yet.

Memory

Backpropagation must store every intermediate activation from the forward pass for the backward pass to use. In a 100-billion-parameter model with long sequences, that quickly runs to hundreds of gigabytes, which is one reason LLM training has to be distributed across many GPUs. Gradient checkpointing softens this (store only selected activations, recompute the rest) at the cost of extra compute.
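
The checkpointing trade-off can be sketched on a toy chain of scalar `tanh(w * h)` layers. This is an illustration of the idea only, not how any framework implements it; the layer form, the helper names and the simple counter for stored activations are assumptions.

```python
import math

def layer(w, h):
    return math.tanh(w * h)

def backward_segment(weights, acts, upstream):
    """Backward through consecutive layers whose activations are available."""
    grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        d_pre = upstream * (1.0 - acts[k + 1] ** 2)
        grads[k] = d_pre * acts[k]
        upstream = d_pre * weights[k]
    return grads, upstream

def backward_full(x, weights):
    """Baseline: keep every activation, then sweep backwards once."""
    acts = [x]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    return backward_segment(weights, acts, 1.0)[0], len(acts)

def backward_checkpointed(x, weights, every):
    """Keep only every `every`-th activation; recompute each segment when needed."""
    checkpoints = {0: x}
    h = x
    for k, w in enumerate(weights, start=1):
        h = layer(w, h)
        if k % every == 0 and k < len(weights):
            checkpoints[k] = h
    grads, upstream = [0.0] * len(weights), 1.0
    peak_stored = len(checkpoints)
    end = len(weights)
    for start in sorted(checkpoints, reverse=True):
        seg_w = weights[start:end]
        acts = [checkpoints[start]]
        for w in seg_w:                       # the extra forward compute
            acts.append(layer(w, acts[-1]))
        peak_stored = max(peak_stored, len(checkpoints) + len(acts) - 1)
        grads[start:end], upstream = backward_segment(seg_w, acts, upstream)
        end = start
    return grads, peak_stored

weights = [0.5, -1.2, 0.8, 1.5, 0.9, -0.7, 1.1, 0.6, -1.3, 0.4, 1.2, -0.8]  # hypothetical
full_grads, full_stored = backward_full(0.7, weights)
ckpt_grads, ckpt_stored = backward_checkpointed(0.7, weights, every=4)
```

Both variants return the same gradients. The checkpointed one holds fewer activations at its peak and pays for that with a second forward computation of each segment.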

Sequential dependency

Layer n has to wait for the backward signal from layer n+1 — that makes parallelization across layers difficult. Pipeline parallelism helps but is complicated to implement and produces “bubbles” (idle time on individual GPUs).

Biological implausibility

The biological brain almost certainly does not use exact backprop. Neuroscientists and AI researchers discuss alternatives like Predictive Coding, Direct Feedback Alignment, or Forward-Forward (Hinton, 2022). These are more biologically plausible and potentially more parallelizable — but on large models, they have not matched the efficiency of standard backprop so far.

Adversarial examples and robustness

Models trained via backprop on cross-entropy are vulnerable to inputs that are minimally perturbed — visually identical to humans, but classified wrong. This is not a backprop bug but a consequence of the training objective — and practical robustness remains an open research topic. More on this in the pillar AI risks.

Learning without labels

Standard backprop needs a training signal — usually labeled data or a differentiable reward function. Self-supervised learning (next-token prediction, contrastive losses) solves this in many domains, but the unsolved scaling problem for agentic learning in complex environments remains — a theme that shapes the discussion about the next leap in AI. A broader framing is in the hub The future of AI.

Frequently asked questions about backpropagation

Who invented backpropagation?

The mathematical roots go back to Linnainmaa (1970) and Werbos (1974). The breakthrough came with the 1986 Nature paper by Rumelhart, Hinton and Williams, which established backpropagation as the standard way to train multi-layer networks. Hinton received the 2024 Nobel Prize in Physics for this and related work.

Is backpropagation the same as gradient descent?

No. Backpropagation computes the gradients — gradient descent uses them for the weight update. Backprop is the derivative engine; gradient descent (or Adam, AdamW, Lion) is the learning step. In PyTorch that maps to the two calls loss.backward() and optimizer.step().

What is the chain rule, in plain English?

The chain rule says: the derivative of f(g(x)) is f’(g(x)) · g’(x) — the product of local sensitivities. A neural network with 100 layers is just a 100-fold nested function; backprop applies the rule backwards, from outside in.

What does vanishing gradient mean?

During the backward pass, gradients shrink so much that the early layers stop learning. Cause: multiplying many numbers smaller than 1 (Sigmoid derivatives, for example). Fixes that have been standard since around 2015: ReLU instead of Sigmoid, residual connections (ResNet), Layer Normalization, and good weight initialization.

What’s the difference between Adam and SGD?

SGD uses the same learning rate for every weight. Adam (Kingma & Ba, 2014) stores a running average and variance of past gradients per weight — and adapts the learning rate. For transformers, AdamW is the default; for many CV tasks, SGD with Momentum still generalizes better.

What is an epoch in training?

An epoch is one full pass through the training dataset, split into mini-batches. Small MNIST models need 5–20 epochs; ImageNet CNNs need 90–300. Very large language models often see their corpus only once; roughly one epoch is standard at that scale.

What is backprop through time (BPTT)?

BPTT is backpropagation for recurrent neural networks. The RNN is unrolled over time — 50 tokens become a 50-layer net through which the error travels backwards. Compute-heavy and vulnerable to vanishing gradients — one reason transformers displaced RNNs from 2017 onward.

Do I need backpropagation for a pretrained model?

For pure inference, no — only the forward pass runs. The moment fine-tuning, LoRA or RLHF come into play, backprop is active again. Even prompt-tuning and prefix-tuning train embeddings via backpropagation, even when the underlying model stays frozen.

Can models learn without backpropagation?

Yes. Evolutionary Strategies, Direct Feedback Alignment (Nøkland, 2016) and Forward-Forward (Hinton, 2022) skip the global backward pass. They are biologically more plausible and easier to parallelize, but as of 2026 they do not match the efficiency of standard backprop. In production: effectively 100 percent backpropagation.

Why do I need a GPU for backpropagation?

Both the forward and backward pass consist almost entirely of matrix multiplications. GPUs have thousands of parallel compute units and can be roughly 50–100× faster than CPUs on this kind of workload. Small networks run on CPU; from a few hundred thousand parameters upward, serious training without GPU or TPU becomes impractical.

Deepen your knowledge

Backpropagation is the training skeleton — but only one building block in the deep-learning stack. Further reading:

Strengthen the foundations

Neural Networks — the model architecture that becomes trainable through backpropagation in the first place. · ~10 min.
Deep Learning — the broader context: architectures, data requirements, GPU economics. · ~12 min.
Machine Learning — the umbrella term, from supervised learning to workflow structure. · ~12 min.

Architectures that backprop trains

Transformer architecture — GPT, Claude and Gemini are trained with AdamW + backprop. · ~10 min.
Generative AI — diffusion models and language models use the same learning loop. · ~9 min.

Practical tooling

AI Code Assistants — if you want to write the PyTorch loop yourself, this category page has tool recommendations for the implementation. · Tool hub.
