Fundamentals Level: Practitioner

Deep Learning & Neural Networks Explained

Deep learning and neural networks explained simply: network structure, training via backpropagation, the main architectures (CNN, RNN, Transformer), the math you actually need, and frameworks, with two interactive demos.

toolwiki – Editorial · Updated April 24, 2026
1 · What is a neuron?

From biological inspiration to the artificial perceptron — input × weight, bias, activation.

2 · How does training work?

Forward pass, loss, backpropagation, gradient descent — the learning loop in 5 phases.

3 · What is it used for?

CNN, RNN, Transformer — which architecture fits images, text, audio, and time series.

What is deep learning? The simple definition

Deep learning is machine learning with deep neural networks — models composed of many stacked layers of artificial neurons. The core difference from classical ML: a deep learning network learns its own features. Instead of a human deciding which properties of an image matter (edges, color histograms, textures), the network finds these representations automatically in its early layers — and builds more complex concepts on top of them in later layers.

The three terms are nested: Artificial intelligence ⊃ Machine learning ⊃ Deep learning. Every deep learning system is a machine learning system. Every machine learning system is AI. But not every AI uses ML, and not every ML model is deep. A decision tree is ML but not deep learning. A rule-based chatbot is AI but neither ML nor deep learning.

The adjective “deep” refers specifically to the number of layers between input and output. A shallow network has one or two hidden layers; a network is usually called deep from about three hidden layers, and production models have dozens to hundreds. This depth lets the model form hierarchical abstractions: in an image CNN, early layers detect edges, middle layers detect shapes and object parts, and later layers detect whole objects. This mechanism is what makes deep learning so strong on complex data.

Where this fits. If you’re not yet comfortable with machine learning basics, work through the beginner hub Machine Learning first — this hub builds on it. The logical continuation of deep learning is Generative AI, where exactly these architectures are used to generate new content.

What are neural networks? The brain analogy — and where it ends

A neural network is a mathematical model made of many simple computational units connected together — loosely inspired by the biological brain. The analogy is useful for memorizing the principle, but shouldn’t be stretched too far.

The biological neuron. Dendrites receive electrical signals from other neurons. In the cell body, those signals are summed. If the sum crosses a threshold, the neuron fires an action potential down its axon. At the axon’s end, synapses pass the signal to the next neurons. The human brain has about 86 billion neurons and 100 trillion synapses.

The artificial neuron (perceptron). It receives numbers as input, multiplies each input by a weight, adds the products plus a bias term, and passes the sum through an activation function. That’s it. The weights roughly correspond to synapses, the activation roughly to “firing.” But: a biological neuron fires as a real-time spike pattern, uses chemical neurotransmitters, and has complex feedback dynamics — an artificial neuron is just a summing gate with a nonlinear function after it.

A brief historical arc.

  • 1943 — Warren McCulloch and Walter Pitts describe the first mathematical neuron model.
  • 1958 — Frank Rosenblatt builds the perceptron, the first trainable network. It could learn linearly separable patterns — but not the infamous XOR problem.
  • 1969 — Minsky and Papert prove the perceptron’s limits. The first AI winter follows.
  • 1986 — Rumelhart, Hinton, and Williams popularize backpropagation: deeper nets become trainable. The euphoria is limited — compute and data still missing.
  • 2006 — Geoffrey Hinton’s “deep belief networks” spark the deep learning renaissance.
  • 2012 — The CNN AlexNet wins ImageNet by a wide margin. Deep learning goes mainstream.
  • 2017 — Google researchers publish “Attention is all you need” and launch the transformer era.
  • 2022–2026 — Generative AI (GPT, Claude, Midjourney, Stable Diffusion) becomes mass technology.

Three factors made the renaissance possible: more data (internet, smartphones), more compute (GPUs, later TPUs), and better algorithms (backprop variants, ReLU, dropout, Adam). Without this trio, the 1958 perceptron would never have become GPT-4.

Structure of a neural network: the building blocks

Every neural network is built from three parts: neurons, layers, and weighted connections between them. If you understand these three, you understand the structure of every modern model — from a small MLP to GPT-4.

The neuron: input × weight + bias → activation

A single artificial neuron performs one three-step computation:

  1. Multiply each input by its weight. The weight expresses how important the input is to this neuron.
  2. Sum all weighted inputs — plus a bias term. The bias shifts the decision threshold.
  3. Pass the sum through an activation function. This nonlinear function produces the neuron’s output.

Formally: y = f(w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b). The activation function f is the key — without it, the whole network, no matter how many layers, would collapse into a single linear function. That would be too weak for complex patterns.
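The three steps fit in a few lines of code. This is a minimal sketch in plain Python; the function names and the example numbers are illustrative assumptions, not part of any framework.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute y = f(w1*x1 + ... + wn*xn + b) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Steps 1 and 2: weighted sum of the inputs plus the bias.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 3: nonlinear activation.
    return activation(z)

# Illustrative values (assumptions, not taken from a trained model).
y = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(y)  # z = 0.5 - 0.5 + 0.0 = 0.0, and sigmoid(0.0) is 0.5
```

Swapping the `activation` argument is all it takes to turn this into a ReLU or Tanh neuron.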

The layers: input, hidden, output

Neurons are organized into layers. Each layer receives the previous layer’s outputs as inputs and passes its own outputs to the next.

  • Input layer. Takes the raw data — for a 28×28 image that’s 784 input neurons; for tabular data, one per feature.
  • Hidden layer(s). The layers in between. This is where the real representation work happens. A deep network has dozens of them.
  • Output layer. Produces the final result — for classification, one neuron per class with Softmax; for regression, a single neuron with no activation.

In a feed-forward network, data flows strictly left to right. In a recurrent network, there are feedback loops. In a transformer, neurons in a layer look at all positions of the previous layer simultaneously — more on that below.
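For the feed-forward case, the layer-by-layer flow could be sketched as follows. The 2-3-1 network and its hand-picked weights are assumptions chosen so the arithmetic is easy to follow; a real network would learn them.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` belongs to one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, layers):
    """Feed x through the layers in order; each output becomes the next input."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

# Hypothetical 2-3-1 network: (weights, biases, activation) per layer.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),  # hidden
    ([[1.0, 1.0, 1.0]], [-1.0], sigmoid),                              # output
]

hidden = dense([2.0, 1.0], *layers[0])
output = forward([2.0, 1.0], layers)
print(hidden)  # [1.0, 1.5, 0.0]: the third neuron is switched off by ReLU
print(output)  # one value between 0 and 1
```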

Weights and biases: the parameters that get learned

The weights between layers and the biases in each neuron are the parameters of the network. They’re learned from data during training. A small network has thousands of parameters; a large language model has hundreds of billions. More parameters = more expressive model — but also more data and compute needed for training.
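To get a feel for how quickly parameters add up, here is a small counting sketch for fully connected networks. The layer sizes are a hypothetical example built on the 784-input image case mentioned earlier.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the neurons per layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron
    return total

# Hypothetical MLP: 784 inputs, hidden layers of 128 and 64, 10 classes.
print(count_parameters([784, 128, 64, 10]))  # 109386
```

Almost all of these parameters sit in the first layer (784 × 128 weights), which is one reason image models prefer convolutions over fully connected layers.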

Activation functions compared

Function | Range | Strength | Weakness | Typical use
ReLU (max(0, x)) | 0 to ∞ | Extremely fast, avoids vanishing gradients | Can “die” (permanently 0) | Default for hidden layers
Sigmoid (1/(1+e⁻ˣ)) | 0 to 1 | Nice probability interpretation | Vanishing gradients in deep nets | Binary output
Tanh ((eˣ−e⁻ˣ)/(eˣ+e⁻ˣ)) | −1 to 1 | Zero-centered, stronger than Sigmoid | Still has vanishing gradients | Older RNNs
Softmax | Probabilities summing to 1 | Multi-class classification | Only at the output | Output layer for multi-class
GELU / SiLU | Smooth, ReLU-like | State of the art in transformers | Slightly more expensive | GPT, BERT, Claude

ReLU is the default for hidden layers in most modern networks. It’s as simple as a light switch (if negative: 0, else: unchanged) and in practice matches or beats many more complex alternatives; transformers are the main exception and typically use GELU or SiLU. Only at the output layer do specialized functions appear: Softmax for multi-class, Sigmoid for binary, linear for regression.
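The functions from the comparison are short enough to write out. This is a plain-Python sketch for intuition only; frameworks ship optimized versions, and the GELU shown here is the exact form based on the error function.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def softmax(values):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU (also called swish): x times sigmoid(x)."""
    return x * sigmoid(x)

for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), round(sigmoid(x), 3), round(tanh(x), 3),
          round(gelu(x), 3), round(silu(x), 3))
print(softmax([1.0, 2.0, 3.0]))
```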

Worked example: coffee or tea?

A tiny network predicts whether you drink coffee (1) or tea (0) in the morning. Two inputs: x₁ = hours of sleep, x₂ = outdoor temperature in °C. A single neuron, Sigmoid activation, weights w₁ = −0.5, w₂ = −0.1, b = 4.

Case A: You slept 5 hours, it’s 10 °C outside. z = −0.5·5 + (−0.1)·10 + 4 = −2.5 − 1 + 4 = 0.5 → Sigmoid(0.5) ≈ 0.62 → coffee (> 0.5).

Case B: 9 hours of sleep, 25 °C. z = −0.5·9 + (−0.1)·25 + 4 = −4.5 − 2.5 + 4 = −3.0 → Sigmoid(−3) ≈ 0.05 → tea.

That’s one neuron, two inputs, three learned numbers (two weights and one bias). Modern networks do this billions of times in parallel, but the core operation stays the same.
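Both cases can be checked in a few lines. The weights and bias are the ones from the example; the function and variable names are made up for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights and bias from the worked example.
W_SLEEP, W_TEMP, BIAS = -0.5, -0.1, 4.0

def coffee_probability(sleep_hours, temp_c):
    z = W_SLEEP * sleep_hours + W_TEMP * temp_c + BIAS
    return sigmoid(z)

for label, sleep, temp in [("A", 5, 10), ("B", 9, 25)]:
    p = coffee_probability(sleep, temp)
    print(label, round(p, 2), "coffee" if p > 0.5 else "tea")
# A 0.62 coffee
# B 0.05 tea
```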

Neural network — live visualizer

Move the sliders to change the network's depth, width, and activation function. Watch how data flows through the layers and how the decision boundary changes. Hit Train to see gradient descent in action.

What am I seeing?

This is a feed-forward neural network with 2 inputs, a stack of hidden layers, and 1 output. Each line is a weight; thicker = stronger. Each circle is a neuron that sums its inputs, applies the chosen activation, and forwards the result. The right panel shows the decision boundary for a simple classification task — change the architecture and watch it update. "Train" runs a few steps of gradient descent to fit a toy dataset: weights shift, the boundary moves, and the loss drops.

How does a neural network learn? Training in 5 phases

Training is the process of adjusting weights and biases until the network’s predictions match the labels as closely as possible. The cycle repeats hundreds to millions of times.

Phase 1 — Forward propagation

Training examples flow through the network, layer by layer, until a prediction emerges at the output. For an image classifier: pixels in, hidden layers, then a probability distribution over the classes. Initially, weights are random — the first prediction is correspondingly bad.

Phase 2 — Compute the error (loss function)

The loss function compares prediction and ground truth and returns a number. Small = good prediction. Large = way off. The two classics:

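  • Mean Squared Error (MSE), for regression: the average of (prediction − truth)² over all examples.
  • Cross-Entropy, for classification: the negative logarithm of the probability the model assigned to the true class. The penalty grows without bound as that probability approaches 0, so confident wrong predictions are punished hardest.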
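Both losses are one-liners. The sketch below uses made-up predictions to show how each one reacts; the cross-entropy here is the single-example form, the negative log of the probability given to the true class.

```python
import math

def mse(predictions, targets):
    """Mean squared error over a list of regression predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probabilities[true_class])

# Made-up numbers for illustration.
print(mse([2.0, 4.0], [1.0, 2.0]))  # (1 + 4) / 2 = 2.5

confident_right = cross_entropy([0.90, 0.05, 0.05], true_class=0)
unsure = cross_entropy([0.40, 0.30, 0.30], true_class=0)
confident_wrong = cross_entropy([0.01, 0.98, 0.01], true_class=0)
print(confident_right, unsure, confident_wrong)  # small, medium, large
```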

Phase 3 — Backpropagation

Backpropagation is the trick for computing each weight’s share of the total loss. Core idea: the chain rule from school, applied backwards through the network. Every weight gets a gradient: a number that says how strongly, and in which direction, the loss reacts when that weight changes.

You don’t need to implement backpropagation yourself: every modern framework (PyTorch, TensorFlow, JAX) does it via automatic differentiation. But you should understand the principle: backwards through the network, chain rule, every parameter gets its gradient.
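To see the chain rule at work without a framework, here is a hand-derived sketch for the smallest possible case: one sigmoid neuron with a squared-error loss. The setup and the numbers are assumptions for illustration, and the result is compared against a numerical estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, target):
    y = sigmoid(w * x + b)
    return (y - target) ** 2

def gradients(w, b, x, target):
    """Return (dLoss/dw, dLoss/db) via the chain rule."""
    # Forward pass, keeping the intermediate values.
    z = w * x + b
    y = sigmoid(z)
    # Backward pass: multiply the local derivatives, output to input.
    dloss_dy = 2.0 * (y - target)   # derivative of (y - target)^2
    dy_dz = y * (1.0 - y)           # derivative of the sigmoid
    dloss_dz = dloss_dy * dy_dz
    return dloss_dz * x, dloss_dz   # dz/dw = x, dz/db = 1

def numerical_gradient(f, value, eps=1e-6):
    """Central finite difference, used only to check the result."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)

w, b, x, target = 0.8, -0.2, 1.5, 1.0
grad_w, grad_b = gradients(w, b, x, target)
check_w = numerical_gradient(lambda v: loss(v, b, x, target), w)
check_b = numerical_gradient(lambda v: loss(w, v, x, target), b)
print(grad_w, check_w)
print(grad_b, check_b)
```

Both gradients come out negative here: the prediction is below the target, so increasing the weight or the bias would lower the loss.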

Phase 4 — Gradient descent (adjust weights)

With gradients in hand, each weight is shifted a small step in the direction that reduces the loss. The step size is controlled by the learning rate — a critical hyperparameter.

Blindfolded-in-a-valley analogy. Imagine standing on a hill, blindfolded. You want the lowest point. You feel the slope with your foot and take one step downhill. Then another. And another. Eventually you’re at the bottom. That’s gradient descent — in a high-dimensional parameter space instead of on a hill.

Modern variants: SGD (Stochastic Gradient Descent, uses a small random batch of the data per step), Adam (adapts the step size per parameter automatically; today’s default), AdamW (Adam with decoupled weight decay).
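The basic update rule, new weight = old weight − learning rate × gradient, can be tried on a toy problem. The sketch below minimizes the made-up function f(w) = (w − 3)², whose minimum sits at w = 3, and shows why the learning rate is called critical.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

good = gradient_descent(grad, start=0.0, learning_rate=0.1, steps=100)
too_big = gradient_descent(grad, start=0.0, learning_rate=1.1, steps=100)
print(good)     # very close to 3.0
print(too_big)  # far away from 3.0: every step overshoots the valley
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; with 1.1 it grows by a factor of 1.2, so the run diverges.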

Phase 5 — Repeat (epochs)

An epoch is one pass through the entire training dataset. The data is split into batches (typically 32, 64, or 256 examples per step). After every batch, the weights are updated. After many epochs, the network converges — the loss stops dropping significantly, and the model is done training.
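Put together, the five phases form a short loop. This sketch trains a single sigmoid neuron with mini-batches on a toy dataset; the data, hyperparameters, and function names are all assumptions for illustration, and a real project would use a framework instead.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=200, batch_size=4, learning_rate=0.5, seed=0):
    """Fit one sigmoid neuron (w, b) with mini-batch gradient descent."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):                    # one epoch = one full pass
        rng.shuffle(data)                      # new batch composition each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, target in batch:
                y = sigmoid(w * x + b)         # phase 1: forward pass
                error = y - target             # phases 2-3: dLoss/dz for cross-entropy
                grad_w += error * x
                grad_b += error
            # Phase 4: one update per batch, using the average gradient.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b

# Toy dataset (assumption): the label is 1 when x is positive.
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 0),
        (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 1)]
w, b = train(data)
accuracy = sum((sigmoid(w * x + b) > 0.5) == bool(t) for x, t in data) / len(data)
print(w, b, accuracy)
```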

The key types of neural networks

There is no universal network. For images you use CNNs, for text Transformers, for sequences LSTMs or Transformers, for anomalies autoencoders. Here are the main architectures at a glance — each with a typical use case and a real app you know.

Feed-forward network (MLP)

The most classical network: input layer, one or more hidden layers, output layer. All connections go strictly forward. Often enough for tabular data and simple classification or regression tasks. Production examples: credit scoring, customer churn, basic forecasting. But: for tabular data, gradient boosting (XGBoost, LightGBM) is usually better and faster.

Convolutional Neural Network (CNN)

CNNs were developed by Yann LeCun in 1989 to read ZIP codes on US mail (LeNet). Their superpower: convolutional layers, which slide a small filter over the image and detect local patterns. Early layers find edges, middle ones shapes, later ones object parts. The breakthrough came in 2012 with AlexNet on ImageNet. Today: face recognition (Face ID on iPhone), medical imaging, self-driving cars (object detection in camera frames).
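What “sliding a small filter over the image” means can be shown on a toy example. The 4×4 image and the hand-written edge filter below are assumptions; in a real CNN the filter values are learned during training.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    This is cross-correlation, which is what deep learning
    frameworks call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter with the patch under it and sum up.
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output

# Toy image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]
# Hand-written filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is zero over the flat regions and lights up only in the middle column, exactly where the edge is.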

Variants worth knowing: VGG, ResNet (skip connections fight vanishing gradients), EfficientNet (optimized for accuracy per parameter).

Recurrent Neural Network (RNN)

RNNs have a feedback loop: the output of one step feeds into the next as input. This lets them process sequences — text, speech, time series. Problem: with long sequences, they lose context (vanishing gradients).
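The feedback loop, and the fading context, show up even in a one-number RNN. This is a deliberately tiny sketch with hand-picked scalar weights (assumptions); real RNNs use vectors and learned weight matrices.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a scalar RNN: new state from the input and the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=1.0, w_h=0.5, b=0.0):
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)  # feedback: h feeds into the next step
        states.append(h)
    return states

# A single signal at the first step, then silence.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
print([round(h, 3) for h in states])
```

The state shrinks at every step, so after a handful of steps almost nothing of the first input is left. That is the context loss described above, seen from the forward direction.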

LSTM & GRU

Long Short-Term Memory (Hochreiter & Schmidhuber 1997) and Gated Recurrent Unit are RNN upgrades with gates that decide what to remember and what to forget. LSTMs dominated NLP from roughly 2014 to 2018 — until transformers arrived. Still used today for time-series forecasting (finance, energy), speech recognition (pre-Whisper), and simple sequence tasks.
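How gates protect a memory can be sketched with a scalar LSTM cell. The equations follow the standard LSTM formulation; the parameter values are hand-picked assumptions (a forget gate that stays almost fully open, an input gate that only opens for a strong input), not learned ones.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre("forget"))        # how much old memory to keep
    write = sigmoid(pre("input"))          # how much new content to store
    expose = sigmoid(pre("output"))        # how much memory to reveal
    candidate = math.tanh(pre("candidate"))
    c = forget * c_prev + write * candidate
    h = expose * math.tanh(c)
    return h, c

# Hand-picked parameters (assumptions for this sketch).
params = {
    "forget": (0.0, 0.0, 6.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 6.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
cells = []
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
    cells.append(c)
print([round(v, 3) for v in cells])
```

The cell state written at the first step survives the four silent steps almost unchanged, because the forget gate stays near 1 and the input gate stays near 0.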

Transformer

The dominant architecture since 2017. Drops recurrence entirely and uses self-attention instead. Foundation for GPT-4, Claude, BERT, T5, and almost every modern language or vision model. It gets its own section further below.

Autoencoder

An autoencoder tries to reconstruct its input at the output as faithfully as possible — with a bottleneck in between that forces the signal to be compressed. Used for dimensionality reduction, denoising, anomaly detection (whatever can’t be reconstructed is flagged). Variational autoencoders (VAEs) can also generate new samples.

GAN (Generative Adversarial Network)

Introduced in 2014 by Ian Goodfellow. Two networks play against each other: a generator creates fake data, a discriminator tries to tell real from fake. Until about 2022 the standard for image generation (StyleGAN for faces). Since 2023 largely displaced by diffusion models (Stable Diffusion, DALL·E).

Architecture | Typical problem | Example app | Beginner difficulty
MLP | Tabular data, classification | Credit scoring | ★★☆☆☆
CNN | Images, video | Face ID, medical imaging | ★★★☆☆
RNN | Sequences, text (historical) | Early translators | ★★★☆☆
LSTM / GRU | Time series, audio | Early Siri | ★★★☆☆
Transformer | Text, multimodal | GPT-4, Claude | ★★★★☆
Autoencoder | Compression, anomalies | Fraud detection | ★★★☆☆
GAN | Classic image generation | StyleGAN faces | ★★★★☆
Diffusion model | Modern image generation | Stable Diffusion, Midjourney | ★★★★★

Which neural network architecture fits your problem?

Answer 5 quick questions. The tool recommends a starting architecture with reasoning and two alternatives — so you go into your first deep learning project with a plan.

Transformer: the revolution that made ChatGPT possible

Transformers are today’s dominant neural architecture — introduced in the 2017 paper “Attention is all you need” by Vaswani et al. (Google). They’re the foundation of practically every modern language model: GPT-4, Claude, Gemini, BERT, T5, Llama. And increasingly of vision and multimodal models as well (Vision Transformer, CLIP, Stable Diffusion 3).

Self-attention explained simply

The core idea: every word looks at every other word and weights how relevant each one is to its own meaning. In the sentence “The bank by the river was modern,” the model must decide whether “bank” goes more with “river” (a riverbank) or with “modern” (a financial institution). For each word, self-attention computes a weighted average over all words in the sequence, so context from the entire sequence can flow into that word’s representation.

Technically, attention is built from three vectors computed per token: Query, Key, Value (stacked across all tokens into the matrices Q, K, and V). The query asks “who’s relevant to me?”, the keys respond, and the values supply the information. This mechanism runs several times in parallel (multi-head attention) and is stacked across many layers.
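A single attention head fits into a short function. The sketch below implements scaled dot-product attention, softmax(QKᵀ / √d) · V, in plain Python. The query, key, and value vectors are made up; in a real transformer they come from learned projections of the token embeddings.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (one per token)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How well does this token's query match every key?
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors for three tokens (made up for illustration).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Multi-head attention runs several such heads side by side, each with its own projections, and concatenates their outputs.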

Why transformers beat RNNs

  • Parallelizable. RNNs are sequential (step n depends on n−1). Transformers compute all positions of a sequence in parallel — a perfect fit for GPUs.
  • Better long-range dependencies. Self-attention links every token directly to every other token. RNNs must pass information over many steps, which often fails.
  • Scalable. More data + more parameters + more compute = predictably better models. This scaling property is why GPT-4 works.

Encoder, decoder, encoder-decoder

Transformers come in three flavors:

  • Encoder-only (BERT, RoBERTa) — optimized for understanding, e.g. text classification, named entity recognition, embeddings.
  • Decoder-only (GPT, Claude, Llama) — optimized for generation. Writes one token at a time, based on the prior context.
  • Encoder-decoder (T5, original Transformer) — encoder processes the input, decoder generates the output. Classic for translation and summarization.
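The “one token at a time” behavior of decoder-only models boils down to a simple loop. In this sketch a hypothetical lookup table stands in for the trained model and only looks at the last token; a real decoder scores the next token based on the whole prior context.

```python
# Hypothetical stand-in for a trained model: scores for the next token,
# looked up by the last token only. All values are made up.
NEXT_TOKEN_SCORES = {
    "<start>": {"deep": 0.9, "learning": 0.1},
    "deep": {"learning": 0.8, "networks": 0.2},
    "learning": {"works": 0.7, "<end>": 0.3},
    "works": {"<end>": 1.0},
    "networks": {"<end>": 1.0},
}

def generate(max_tokens=10):
    """Greedy decoding: always append the highest-scoring next token."""
    context = ["<start>"]
    for _ in range(max_tokens):
        scores = NEXT_TOKEN_SCORES[context[-1]]
        next_token = max(scores, key=scores.get)
        if next_token == "<end>":
            break
        context.append(next_token)  # the output becomes part of the input
    return context[1:]

print(generate())  # ['deep', 'learning', 'works']
```

Production systems usually sample from the scores instead of always taking the maximum, which is why the same prompt can produce different answers.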

For the deep dive, see the spoke Transformer architecture. For applying these models to text, image, and audio generation: Generative AI.

When deep learning — and when classical machine learning?

Deep learning isn’t automatically better. The choice depends on dataset size, data type, explainability requirements, and available resources.

Deep learning is the right choice when:

  • You have lots of data (from ~100,000 examples, for language models billions)
  • Your data is unstructured — images, audio, video, raw text
  • Complex patterns exist that manual feature engineering can’t capture
  • GPU compute is available (local or cloud)
  • Explainability is secondary

Classical ML is better when:

  • You have little data (under 10,000 rows)
  • Your data is tabular
  • Explainability is mandatory (credit, medicine, law, insurance)
  • Resources are limited (no GPU budget, edge deployment)
  • You need fast iteration

For the full foundation of classical ML — algorithms, workflow, overfitting — see the beginner hub Machine Learning.

Practice: 8 deep learning applications you use every day

1. Face ID on smartphones

CNN

A deep CNN encodes your face as a vector and compares it at every unlock.

2. Google Translate / DeepL

Transformer (encoder-decoder)

A model trained on billions of parallel sentence pairs that takes context across whole paragraphs into account.

3. Speech recognition (Siri, Alexa, Whisper)

Transformer / formerly RNN

Audio spectrograms turn into text — today via Whisper or similar transformer-based models.

4. Netflix recommendations

Deep learning + classical ML

Hybrid models combining deep embedding networks with collaborative filtering.

5. Mobile autocorrect

Transformer (small, on-device)

Compressed language models run directly on your phone — no cloud required.

6. Medical image analysis

CNN

Tumor detection in CT and MRI reaches specialist level on specific tasks.

7. Self-driving cars

CNN + RNN / Transformer

Camera, radar, and lidar data are fused — decisions happen in real time.

8. ChatGPT & Claude

Decoder-only Transformer

Language models with hundreds of billions of parameters — trained on a large slice of the public internet.

Do I need math for deep learning? An honest answer

One of the most common beginner questions. The answer depends on what you actually want to do.

Use it (follow PyTorch tutorials, apply Hugging Face models). High-school math is enough. You need Python basics, basic linear algebra (“matrix times vector”), and a feel for what a gradient is. Time to your first working network: 1–2 months at 5 hours per week.

Understand it (why backpropagation works, why ReLU beats Sigmoid). Linear algebra (matrices, eigenvectors), calculus (chain rule, partial derivatives), basic probability. Advanced high-school to first-year university level, all reachable via MOOCs. Time: 3–6 months.

Develop your own architectures (not just use them). Numerical optimization, probability theory, a bit of information theory, deeper linear algebra. University level. Time: 12–24 months.

Research (new architectures or optimizers). Full math or CS degree. Master’s or PhD. Time: 5+ years.

Deep learning frameworks at a glance

Framework choice matters less than it did five years ago — PyTorch and TensorFlow can both do everything. Quick overview:

  • PyTorch (Meta). Currently dominant in research and increasingly in production. Pythonic, flexible, excellent for debugging. The vast majority of new papers and open-source models use PyTorch.
  • TensorFlow (Google). Historically the leader, today strongest in production deployments (TFLite for mobile, TensorFlow.js for the browser, TFX for pipelines). The high-level Keras API is beginner-friendly.
  • JAX (Google). High-performance, functional style, popular in research at Google DeepMind and for scientific computing.
  • Hugging Face Transformers. Not a low-level library but an ecosystem with tens of thousands of pretrained models. The standard toolkit for LLMs, vision transformers, and audio.
  • Keras. A high-level API that, since Keras 3, runs on TensorFlow, JAX, or PyTorch as its backend. Perfect for beginners: model.fit() and off you go.

My recommendation for beginners in 2026: PyTorch + Hugging Face. PyTorch for custom architectures and debugging, Hugging Face for anything involving pretrained models. If you want the gentlest possible start, begin with Keras — and switch to PyTorch later when you need more control.

Explore further: your path through deep learning

This hub is your starting point. Depending on your interest, the spokes below go deeper:

Understanding architectures

  • Transformer architecture — the dominant architecture behind GPT, Claude, and others. · ~10 min.
  • Diffusion models — how Midjourney and Stable Diffusion create images from noise. · ~7 min.
  • Generative AI — applying deep networks to content generation. · ~9 min.

Frequently asked questions

What's the difference between machine learning and deep learning?

Deep learning is a subfield of machine learning that uses deep neural networks — models with many stacked layers. Classical ML (decision trees, SVM, linear regression) uses hand-crafted features and few parameters. Deep learning learns its features directly from raw data, but it needs massive datasets and GPU compute. For tabular data, classical ML is often better — for images, audio, and text, deep learning is the standard.

Why is it called 'deep' learning?

The 'deep' refers to the number of layers in the network. A simple neural net has 1–2 hidden layers between input and output; a deep net has dozens or hundreds. The extra depth lets the model build hierarchical representations — early layers detect edges, middle ones detect shapes, later ones detect objects. You typically call a network 'deep' from about three hidden layers onward.

How many layers does a deep network need?

No fixed minimum. Rule of thumb: from 3 hidden layers onward, a net is considered deep. Production image CNNs like ResNet have 50 to 152 layers. Large language models stack dozens to around a hundred transformer blocks (GPT-3 has 96); the exact depth of GPT-4 or Claude has not been published. For your own projects: start small (2–5 hidden layers) and only go deeper if validation accuracy keeps improving. Deeper is not automatically better: very deep networks become hard to train.

Why does deep learning need so much data?

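Deep networks have millions to billions of parameters. With too few examples, a model this flexible memorizes the training set instead of learning the underlying pattern (overfitting). There is no reliable examples-per-parameter rule: modern networks are often trained with fewer examples than parameters and lean on regularization and data augmentation, but more clean data almost always helps. Transfer learning (fine-tuning a pretrained model) drastically reduces the requirement: a few hundred images plus a ResNet backbone often give strong results.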

Why GPUs instead of regular CPUs?

Training neural networks is essentially giant matrix multiplications. GPUs have thousands of simple cores that handle matrices in parallel — CPUs have only a few strong cores. An image CNN trains 20–100× faster on a modern GPU than on CPU. For very large models, specialized chips come into play: NVIDIA H100, Google TPU, AMD Instinct. Without a GPU, serious deep learning is practically impossible.

What is backpropagation explained simply?

Backpropagation is how a network learns. Core idea: after a prediction, the error is propagated backwards through the network — each layer gets its share of the blame. Those shares (gradients) are then used to shift every weight via gradient descent in the direction that reduces the loss. Mathematically it's the chain rule from school, applied to many nested functions. Without backpropagation, no practical deep learning.

What does an activation function do?

It adds non-linearity to the network. Without it, even a 100-layer net would collapse into a single linear function, too weak for complex patterns. The key ones: ReLU (max(0,x), fast, the default for hidden layers), Sigmoid (0 to 1, for binary outputs), Tanh (−1 to 1, zero-centered), Softmax (multi-class probabilities at the output). ReLU and its smooth relatives GELU and SiLU cover the hidden layers of virtually every modern network.
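The collapse is easy to verify by hand. In this sketch two layers without activation (biases omitted for brevity, all numbers made up) give exactly the same result as one layer with the multiplied weight matrix; inserting a ReLU in between breaks that equivalence.

```python
def matvec(m, v):
    """Matrix times vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix times matrix."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

W1 = [[1.0, 2.0], [0.0, -1.0]]  # first layer
W2 = [[3.0, 1.0], [1.0, 1.0]]   # second layer
x = [1.0, 1.0]

# Two linear layers in a row ...
two_layers = matvec(W2, matvec(W1, x))
# ... equal one linear layer with the combined matrix W2 * W1.
one_layer = matvec(matmul(W2, W1), x)
# With a ReLU in between, no single matrix reproduces the result.
with_relu = matvec(W2, [max(0.0, h) for h in matvec(W1, x)])

print(two_layers, one_layer, with_relu)  # [8.0, 2.0] [8.0, 2.0] [9.0, 3.0]
```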

Is a neural network like a brain?

Only very loosely. The original inspiration comes from the biological neuron (dendrite, axon, synapse), and the terms 'neuron' and 'activation' survived from that era. But: a biological neuron fires as a spike, uses chemical neurotransmitters, and has complex feedback loops — an artificial neuron is just a weighted sum with an activation function. The brain has 86 billion neurons; a large LLM maybe a trillion parameters — but these are fundamentally different computational units.

Can a neural network be creative?

Creative in the human sense — no. What modern generative networks (GPT, Midjourney, Stable Diffusion) do is recombination: they have learned patterns from massive datasets and can produce new combinations that didn't appear that way in training. It often feels creative but is interpolation in a high-dimensional space. True creativity — deliberate, intentional deviation — remains human.

What is overfitting in deep learning?

Overfitting means the network memorizes training data instead of the underlying pattern. Symptom: training accuracy keeps climbing (say, to 99%) while validation accuracy stalls or drops. Remedies: more data, data augmentation, dropout (randomly disabling neurons during training), L2 regularization, early stopping (halt training when validation gets worse), smaller model. Overfitting is enemy #1 in deep learning, especially with limited data.
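Of these remedies, early stopping is the easiest to sketch. The function below scans a validation-loss curve and reports which epoch’s weights to keep; the `patience` parameter and the loss values are assumptions for illustration.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch whose weights should be kept.

    Training stops once the validation loss has not improved
    for `patience` epochs in a row.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop
    return best_epoch

# Made-up validation curve: improves, then overfitting sets in.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.30]
print(early_stopping_epoch(losses))  # 3
```

Training stops after epoch 5 and never reaches the late dip to 0.30. That is the trade-off behind `patience`: a larger value wastes more compute but is less likely to stop too early.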

Which programming language for deep learning?

Python, almost without exception. PyTorch, TensorFlow, JAX, Hugging Face — every major framework is Python-centric. Under the hood the critical operations run in C++ and CUDA, but as a user you write Python. Alternatives: Julia has an active ML community, R focuses on statistics, but both are niche. For getting started: PyTorch (flexible, research) or Keras/TensorFlow (beginner-friendly, production).

How long does training a model take?

It depends heavily on model size and dataset. An MNIST classifier on a single GPU: a few minutes. An ImageNet CNN from scratch: several days on multiple GPUs. A large language model like GPT-4: months on thousands of GPUs, with training costs reportedly above 100 million dollars. Fine-tuning a pretrained model is much faster: hours to days on a single GPU.

How much electricity does deep learning use?

Training large models is energy-intensive. GPT-3 reportedly consumed about 1,287 MWh during training, roughly the annual usage of 400 German households. Inference is much cheaper per query, but at billions of ChatGPT requests it adds up. AI data centers became a notable global electricity factor in 2025. Smaller models (DistilBERT), parameter-efficient fine-tuning (LoRA), and more efficient architectures (Mixture of Experts) substantially reduce consumption.
