Neural Networks Explained
How neural networks learn — from perceptron to transformer. Architecture, training, activation functions, and why the same family of models powers vision, speech, and language alike.

Architectures at a glance
To navigate the landscape of neural networks, five architectures get you most of the way. Each is optimized for a particular data type and had its breakthrough at a specific moment in history:
| Architecture | Data type | Classic example | Breakthrough year |
|---|---|---|---|
| Perceptron | tabular, linearly separable | digit classification | 1958 (Rosenblatt) |
| MLP (Multi-Layer Perceptron) | tabular, general purpose | customer churn, credit scoring | from 1986 (backprop) |
| CNN (Convolutional) | images, video | AlexNet on ImageNet | 2012 |
| RNN / LSTM | sequences, audio, time series | speech recognition pre-Whisper | 1997 (LSTM paper) |
| Transformer | text, multimodal | GPT-4, Claude, Gemini | 2017 (“Attention Is All You Need”) |
The sections below walk through each of these — from the single computational unit, through training, to the question of which architecture fits which task.
What is a neural network? The structure in one sentence
A neural network is a mathematical model made of many simple computational units, arranged in layers and communicating through learned weights. The idea is loosely inspired by neurobiology; the mechanism itself is pure math.
The central building block is the neuron (in its earliest form, the perceptron). It does three things:
- It receives several numbers as input — say, the pixel values of an image or the token vectors of a sentence.
- It multiplies each input by a weight (w) and adds the products plus a bias term (b).
- It passes the result through an activation function that decides how strongly the neuron “fires”.
Formally: y = f(w₁·x₁ + w₂·x₂ + … + wₙ·xₙ + b). Thousands of these units are stacked into layers. An input layer receives the raw data, one or more hidden layers process it step by step, and an output layer produces the final result — for an image classifier, a probability distribution over the classes; for a language model, the next token.
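The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the input values, weights, and bias are made up, and the sigmoid is just one possible choice for the activation function f.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute y = f(w1*x1 + ... + wn*xn + b) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative values only (assumptions, not taken from any real model).
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1

y = neuron(x, w, b)
print(round(y, 4))
```

A layer is many such neurons sharing the same inputs, which is why frameworks express the whole layer as one matrix multiplication.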
The adjective “artificial” in “artificial neural network” marks exactly this difference from biology. A biological neuron fires in real time as a spike pattern, uses chemical neurotransmitters, and is embedded in a hyper-complex feedback mesh. An artificial neuron is a summing gate with a non-linear function after it — nothing more.
Where this fits. If you’re just entering the field, What is AI? is the better starting point. If you want classical machine learning in your head before diving into the neural world, read Machine Learning. This article is the direct bridge to Deep Learning — the deep neural networks that have dominated the discipline since 2012.
How does a neural network learn from data?
Learning, for a neural network, means adjusting the weights and biases so that predictions match the correct answers in the training data as closely as possible. The process runs in five recurring steps.
Step 1 — Forward pass. A training example — say, an image — flows layer by layer through the network. Initially all weights are random (typically via Glorot or He initialization), and the first prediction is correspondingly poor. That’s expected.
Step 2 — Compute the loss. A loss function compares the prediction to the actual label and returns a number: small = good prediction, large = far off. For classification you use Cross-Entropy; for regression, Mean Squared Error.
Step 3 — Backpropagation. The loss is propagated backwards through the network. Using the chain rule, the framework computes a gradient for every weight — a number that gives both direction and magnitude of the required adjustment. PyTorch, TensorFlow, and JAX handle this automatically via autodifferentiation. You don’t have to implement backprop yourself, but you should understand it.
Step 4 — Gradient descent. With gradients in hand, each weight is shifted a small step in the direction that reduces the loss. The step size is set by the learning rate — the single most important hyperparameter. Pure gradient descent is rarely used today; Adam and AdamW are the modern standard because they adapt the learning rate per parameter automatically.
Step 5 — Repeat for many epochs. One full pass through the training data is called an epoch. The data is split into batches (typically 32, 64, or 256 examples), and after each batch the weights are updated. After many epochs, the loss converges — the network is done training.
This loop runs millions of times when large models are trained. The math behind it has been known for decades; what’s new is the scale — the sheer amount of data, parameters, and parallel compute. A deeper treatment with loss curves and visualizations lives in the Deep Learning hub.
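The five steps can be sketched for the smallest possible case: one linear neuron fitted with mini-batch gradient descent and Mean Squared Error. The target y = 2x + 1, the learning rate, and the batch size are assumptions chosen for illustration; a real project would use a framework's autodifferentiation and an optimizer such as Adam rather than hand-written gradients.

```python
import random

random.seed(0)  # make the run repeatable

# Toy training data for the hypothetical target y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-20, 21)]

w = random.uniform(-1.0, 1.0)  # random starting weight
b = 0.0
learning_rate = 0.1
batch_size = 8


def mse(w, b, samples):
    """Mean Squared Error of the neuron y = w*x + b on the given samples."""
    return sum((w * x + b - y) ** 2 for x, y in samples) / len(samples)


initial_loss = mse(w, b, data)

for epoch in range(200):                     # Step 5: repeat for many epochs
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Steps 1-3: forward pass, loss, and gradients via the chain rule.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        # Step 4: move each parameter a small step against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mse(w, b, data)
print(round(w, 3), round(b, 3))
```

Because the toy data contains no noise, the loss shrinks toward zero and the parameters approach the values used to generate the data.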
What does an activation function do, and which ones matter?
An activation function is the non-linear function at the end of every neuron. Without it, even the deepest network would mathematically collapse into a single linear function — too weak for complex patterns. It is the reason stacking layers adds expressive power at all.
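The collapse is easy to verify numerically. The sketch below uses two arbitrary 2×2 weight matrices (illustrative values, biases left out for brevity) and shows that two stacked linear layers give the same result as one merged layer, while inserting a ReLU between them breaks that equivalence.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def relu(vector):
    """Apply max(0, v) to every element."""
    return [max(0.0, v) for v in vector]


# Arbitrary illustrative weights and input.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

two_linear_layers = matvec(W2, matvec(W1, x))
one_merged_layer = matvec(matmul(W2, W1), x)   # same function, one layer
with_relu = matvec(W2, relu(matvec(W1, x)))    # no single matrix reproduces this in general

print(two_linear_layers, one_merged_layer, with_relu)
```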
Four functions are worth knowing:
- ReLU: max(0, x). If the input is negative, it outputs 0; otherwise the input unchanged. Extremely cheap to compute, and no “vanishing gradient” for positive values. ReLU is today’s default for hidden layers in nearly every modern network.
- Sigmoid: 1 / (1 + e⁻ˣ). Squashes the input into the range (0, 1). Nice probability interpretation, which is why it’s common at the output layer for binary classification. Problematic inside deep networks because of vanishing gradients.
- Tanh: (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ). Squashes into the range (−1, 1) and is zero-centered. Historically common in RNNs, rare today.
- Softmax: used at the output layer for multi-class classification. Converts raw outputs into probabilities that sum to 1.
Modern transformer models like GPT, Claude, and BERT use ReLU variants — GELU and SiLU — that are slightly smoother and empirically deliver marginally better results. They’re more expensive to compute, which is irrelevant at today’s GPU budgets.
Rule of thumb: ReLU for hidden layers, Softmax at the output for multi-class, Sigmoid at the output for binary. Pick these three defaults and you’re right in the large majority of cases.
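For reference, the functions named in this section fit in a few lines each. This is a plain-Python sketch using the textbook definitions (GELU in its exact erf form, SiLU as x times sigmoid); frameworks ship optimized, numerically hardened versions, and the simple sigmoid here is only intended for moderate input values.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    # Fine for moderate inputs; very large negative x would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def softmax(values):
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gelu(x):
    # Exact form: x times the standard normal cumulative distribution.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    return x * sigmoid(x)


probabilities = softmax([1.0, 2.0, 3.0])
print(relu(-2.0), sigmoid(0.0), tanh(0.0), sum(probabilities))
```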
Which architectures shape 2026? (Perceptron → MLP → CNN → RNN → Transformer)
There is no single “neural network”. Which architecture fits depends on the data type. The family tree in five steps:
Perceptron (1958)
Frank Rosenblatt’s original perceptron was a single neuron with a step function. It could learn linearly separable patterns — data that you can split into two classes with a single straight line on a sheet of paper. Its limit was documented in 1969 by Marvin Minsky and Seymour Papert (no XOR), which contributed to the first AI winter.
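Both the ability and the limit can be reproduced with the classic perceptron learning rule. The sketch below is a simplified illustration (two inputs, step activation, zero-initialized weights, learning rate 1, all chosen for this example): it learns the linearly separable AND function but cannot reach full accuracy on XOR.

```python
def step(z):
    """Step activation: fire (1) only if the weighted sum is positive."""
    return 1 if z > 0 else 0


def predict(inputs, weights, bias):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, epochs=50, learning_rate=1.0):
    """Perceptron rule: nudge weights by the error on each example."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(inputs, weights, bias)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def accuracy(samples, weights, bias):
    hits = sum(predict(inputs, weights, bias) == target for inputs, target in samples)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, *train_perceptron(AND))
xor_accuracy = accuracy(XOR, *train_perceptron(XOR))
print(and_accuracy, xor_accuracy)
```

No number of extra epochs fixes the XOR case, because no single straight line separates its two classes.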
MLP — Multi-Layer Perceptron (from 1986)
Stack several layers and train them with backpropagation, and you get an MLP. Given enough hidden units, it can approximate any continuous function on a bounded input range to arbitrary accuracy (the Universal Approximation Theorem). In practice today, MLPs are used for tabular data: customer churn, credit scoring, simple classifications. Gradient-boosting methods like XGBoost and LightGBM often beat MLPs on tabular data, so always benchmark.
CNN — Convolutional Neural Network (breakthrough in 2012)
CNNs were developed in 1989 by Yann LeCun for handwritten digit recognition on US ZIP codes (LeNet). Their trick: convolutional layers that slide a small filter across the image and detect local patterns — edges in the early layers, shapes in the middle, whole object parts in late layers. The mainstream breakthrough came in 2012 when AlexNet won the ImageNet contest by a wide margin. Today CNNs power Face ID, medical imaging, and self-driving cars.
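The sliding-filter idea can be shown on a tiny example. The sketch below is a plain-Python illustration with a hand-picked 2×2 filter and a 4×4 toy image (both assumptions for this example); in a real CNN the filter values are learned during training, and the operation is implemented as optimized tensor code.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image without padding (stride 1).

    This is cross-correlation, which is what deep learning frameworks
    conventionally call convolution.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# Toy image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]

# Hand-picked filter that responds to a dark-to-bright vertical edge.
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is non-zero only in the column where the brightness changes, which is the sense in which an early layer "detects an edge".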
RNN / LSTM (1997 / dominance 2014–2018)
Recurrent Neural Networks have a feedback loop — the output at step n becomes input at step n+1. This lets them process sequences: text, audio, time series. Their weak spot: long sequences cause context loss (vanishing gradients). LSTM (Hochreiter & Schmidhuber 1997) and GRU are upgrades with gates that decide what to remember and what to forget. LSTMs dominated language processing from roughly 2014 to 2018 — until transformers pushed them aside. They still appear in time-series forecasting (finance, energy) and lightweight on-device models.
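A minimal sketch of the feedback loop, assuming a single scalar hidden state and hand-picked weights (w_x, w_h, and b are illustrative values, not learned ones). It also hints at the context-loss problem: an input seen early in the sequence leaves a weaker and weaker trace in the state.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: mix the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.8, w_h=0.5, b=0.0):
    """Feed a sequence through the cell and record the state after each step."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


early_signal = run_rnn([1.0, 0.0, 0.0])   # the 1.0 arrives first
late_signal = run_rnn([0.0, 0.0, 1.0])    # the 1.0 arrives last

print(early_signal)
print(late_signal)
```

The same three input values in a different order produce a different final state, and the trace of the early input shrinks with every step. LSTM and GRU gates exist to control that fading.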
Transformer (2017 — today’s dominant architecture)
The architecture behind practically everything you perceive as “AI” in 2026. In 2017, Google researchers introduced, in “Attention Is All You Need”, a model that drops recurrence entirely. The trick is self-attention: every token looks at every other token and weighs how relevant each is to its own meaning. In the sentence “The bank by the river was modern,” attention decides whether “bank” leans toward “river” (a riverbank) or toward “modern” (a financial institution).
Transformers are the foundation of GPT-4, Claude, Gemini, Llama, BERT, T5, Stable Diffusion 3, and almost every other large model. They are excellent at parallelization (unlike RNNs) and scale continuously with more data + more parameters + more compute. This scaling property is the reason GPT-4 with an estimated 1.8 trillion parameters works (industry estimate, unconfirmed by OpenAI). A deeper dive lives in the Transformer pillar.
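The self-attention step described above can be sketched as scaled dot-product attention in plain Python. This is a deliberately simplified illustration: the three 2-dimensional token vectors are made up, and the tokens serve directly as queries, keys, and values, whereas a real transformer first applies learned projection matrices and runs several attention heads in parallel.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(q.k / sqrt(d_k)) weighted sum of values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # how much this token attends to each token
        output = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up token vectors; identity projections are an assumption for brevity.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Every row of weights is computed independently of the others, which is why attention parallelizes well where an RNN has to proceed step by step.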
What are neural networks actually used for?
If your phone looked at you five times yesterday, you interacted with a neural network five times. The most important applications you already use:
- Face ID on smartphones. A CNN encodes your face as a high-dimensional vector and compares it against the stored template on every unlock.
- Speech recognition in Siri, Alexa, Whisper. Audio spectrograms are converted to text by a transformer-based model.
- Translation (DeepL, Google Translate). Encoder-decoder transformers trained on billions of parallel sentence pairs.
- Recommendation systems (Netflix, Spotify, YouTube). Hybrids of classical collaborative filtering and embedding networks.
- Medical image analysis. CNN models reach specialist level on specific tasks (tumor classification in CT/MRI scans).
- Autonomous driving. Fusion of camera, radar, and lidar data, processed by multi-task networks in real time.
- Generative AI. Models that produce text (ChatGPT, Claude), images (Midjourney, Stable Diffusion), voices (ElevenLabs), and video (Sora, Runway).
A useful rule of thumb: as soon as data is unstructured — pixels, audio samples, tokens — a neural network is almost always the right choice in 2026. For structured tabular data, classical machine learning often still wins.
How does a neural network differ from classical machine learning?
Classical machine learning needs hand-crafted features. A neural network learns the features itself from raw data. That’s the core difference — and the reason deep learning beat classical methods on image, audio, and text.
Take image classification. In the classical ML workflow, a human extracts features from each image: color histograms, edge orientations (SIFT, HOG), texture measures. Only these preprocessed numbers then go into a classifier like Support Vector Machine or Random Forest. It works, but it’s bound to the quality of the features — and that depends on the domain knowledge of the human who defines them.
A CNN, by contrast, takes the raw pixels directly. Early convolutional layers learn edges, middle layers learn shapes, late layers learn object parts. This hierarchical representation is not prescribed — it emerges during training. At AlexNet in 2012, the gap to classical competitors on ImageNet was so wide that the entire field switched to deep learning within a few years.
But: deep learning is not automatically better. For tabular data under 10,000 rows, gradient-boosted trees such as XGBoost match or beat neural networks in most Kaggle competitions. When explainability is mandatory (credit, medicine, law), decision trees or logistic regression are often the right call. And when compute is scarce, classical ML on a CPU often delivers more than a trained giant network that needs a GPU.
| Criterion | Classical ML | Neural network |
|---|---|---|
| Data type | tabular | unstructured (image, audio, text) |
| Dataset size | hundreds to hundreds of thousands | from ~10,000; for LLMs, billions |
| Feature engineering | manual, by humans | automatic, by the network |
| Explainability | usually high (decision trees) | low (“black box”) |
| Compute | CPU is enough | GPU or TPU |
| Iteration time | minutes | hours to days |
Why does training need so much compute?
Training a neural network is, at heart, a very long sequence of matrix multiplications. That’s exactly what GPUs are built for. A modern GPU has thousands of parallel cores; a CPU only a few strong ones. On an image classification task, a GPU typically trains 20 to 100 times faster than a CPU, depending on the model and batch size.
Three reference points to anchor the scale:
- A small MLP on the MNIST digit dataset (a few thousand to roughly a hundred thousand parameters): a few minutes on a modern GPU, doable on a CPU too.
- An image CNN from scratch (ResNet-50 on ImageNet, 1.2M images, 25M parameters): days on a single H100.
- GPT-4 (estimated ~1.8T parameters): months on thousands of GPUs, with training costs in the tens to hundreds of millions of dollars range (industry estimates).
Energy use becomes relevant at this scale. GPT-3 reportedly consumed about 1,287 MWh during training — the annual usage of roughly 400 German households. Inference is much cheaper per query but adds up fast at billions of requests per day.
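To make parameter counts like the ones quoted above less abstract, the sketch below counts the weights and biases of a fully connected network. The layer sizes are hypothetical examples; 784 is the number of pixels in a 28×28 MNIST image and 10 the number of digit classes.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights plus biases of a fully connected network.

    Each pair of adjacent layers contributes n_in * n_out weights
    and n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical MNIST classifiers: without and with one hidden layer of 128 units.
single_layer = mlp_parameter_count([784, 10])
one_hidden_layer = mlp_parameter_count([784, 128, 10])

print(single_layer, one_hidden_layer)
```

Every one of these parameters receives a gradient and an update on every batch, which is where the matrix-multiplication workload comes from.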
Three levers cut the compute bill:
- Transfer learning. Instead of training from scratch, load a pretrained model (e.g. Llama 3 or a ResNet backbone) and fine-tune it with a few hundred examples on your task — hours instead of weeks.
- More efficient architectures. Mixture of Experts (used in Mistral’s Mixtral models, and likely in GPT-4) activates only a subset of parameters per input. Sparse models skip unnecessary computations.
- Specialized hardware. Google TPU, AMD Instinct, NVIDIA H100/B200 are matrix-multiplication accelerators that deliver substantially more per watt than general-purpose chips.
How much math do I actually need?
The honest answer depends on what you want to do — use, understand, build, or research.
Use pretrained models (ChatGPT, Claude, Hugging Face Inference API). No math. You need the ability to write clear prompts — see Prompt Engineering. Time investment: hours.
Train your own networks (PyTorch tutorials, fine-tuning, simple architectures). High-school math is enough: vectors, matrix-by-vector multiplication, an intuitive feel for derivatives and gradients. Plus Python basics. Time investment: 4–8 weeks at 5 hours per week.
Understand why it works (why backpropagation, why ReLU beats Sigmoid, why Adam). Linear algebra (matrices, eigenvectors), calculus (chain rule, partial derivatives), basic probability. All high-school or early-undergraduate level, all reachable via MOOCs. Time: 3–6 months.
Design your own architectures (not just apply them). Numerical optimization, probability theory, some information theory, deeper linear algebra. Undergraduate-level math or CS. Time: 12–24 months.
Do research (invent new architectures or methods). Full degree plus master’s or PhD. Time: 5+ years.
Three resources we recommend in practice:
- 3Blue1Brown — Neural Networks (YouTube, free). Four videos and you’ll grasp backpropagation intuitively.
- Fast.ai — Practical Deep Learning. Top-down, hands-on. You build a model in the first hour; the theory follows.
- Andrew Ng’s Deep Learning Specialization (Coursera). The academic gold standard — five courses, thorough.
What are the most common training mistakes?
Anyone training a network for the first time will fall into three or four of these traps. The good news: all of them have well-known counter-measures.
1. Wrong learning rate. Too large: the loss explodes or oscillates. Too small: training drags on forever and the loss barely moves. Starting value for Adam: 3e-4 to 1e-3. If the loss stagnates after the first few epochs, halve it; if it explodes, quarter it.
2. Overfitting. The network memorizes the training data. Symptom: training accuracy keeps climbing toward 99% while validation accuracy stalls or drops. Counter-measures: more data, data augmentation (random image transforms), dropout (randomly disable neurons), L2 regularization, early stopping (halt training as soon as validation gets worse), smaller model.
3. Underfitting. The network is too small or trained too briefly. Symptom: training loss also stays high. Counter-measures: more layers, more neurons per layer, longer training, a more suitable architecture (CNN for images, transformer for text).
4. Data leakage. Information from the validation or test set bleeds into training. Result: seemingly great numbers that collapse in production. Classic sources: time series where train and test data are not strictly time-separated; duplicates between train and test; features that indirectly contain the label.
5. Imbalanced classes. 99% of the examples belong to class A, 1% to B. A model that always says “A” has 99% accuracy but is useless. Counter-measures: class weights in the loss, oversampling the minority class, or evaluating with F1 score instead of accuracy.
6. Wrong loss function. Cross-Entropy for multi-class classification, MSE for regression — those are the two defaults. Training classification with MSE gives weak gradients and a sluggish network.
7. Validation too late. You train for four hours, only to discover the validation pipeline was broken. Counter-measure: within the first ten minutes of a training run, sanity-check the validation loop once.
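The imbalanced-classes trap from point 5 is worth seeing in numbers. The sketch below builds a made-up dataset of 990 examples of class A and 10 of class B, scores a "model" that always answers A, and compares accuracy with the F1 score for the minority class. The helper functions are written out by hand for illustration; in practice a metrics library would normally provide them.

```python
# Made-up labels: 99% class A, 1% class B.
labels = ["A"] * 990 + ["B"] * 10
predictions = ["A"] * 1000  # a useless model that always says "A"


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


model_accuracy = accuracy(labels, predictions)
minority_f1 = f1_score(labels, predictions, positive="B")

print(model_accuracy, minority_f1)
```

Accuracy looks excellent while the F1 score for class B is zero, which is the signal that the model never finds the cases that matter.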
For a deeper treatment with loss-curve diagnosis, see the Deep Learning pillar.
Go deeper
This hub gives you the vocabulary for neural networks. Three paths lead further:
Frame the fundamentals
- What is AI? — the umbrella concept neural networks sit under. For beginners with no prior knowledge.
- Machine Learning — the broader category. Classical ML with decision trees, SVMs, and logistic regression before neural networks enter the picture.
Go into architectures
- Deep Learning — deep neural networks in practice: architecture choice, frameworks, training loops.
- Transformer — self-attention, multi-head, encoder vs. decoder. Today’s standard architecture in detail.
- Generative AI — what happens when deep networks stop classifying and start producing: text, images, voices.
Try it in practice
- ChatGPT — the best-known transformer-based language model. A direct application of everything above.
Frequently asked questions
What is the difference between a neural network and deep learning?
A neural network is the model architecture — neurons in layers connected by weights. Deep learning is the discipline of working with particularly deep neural networks (roughly three or more hidden layers). Every deep learning model is a neural network, but not every neural network is deep. Frank Rosenblatt's 1958 perceptron had a single layer — a neural network, but not deep learning.
Is a neural network like the human brain?
Only very loosely. The original inspiration came from the biological neuron — dendrites, cell body, axon — and terms like 'neuron' and 'activation' survived from that era. But the human brain has roughly 86 billion neurons with complex biochemical dynamics. An artificial neuron is just a weighted sum followed by an activation function. The architectures solve similar problems; the underlying blueprint is fundamentally different.
What is a perceptron and why did it matter in 1958?
The perceptron, built in 1958 by Frank Rosenblatt at the Cornell Aeronautical Laboratory, was the first trainable neural network. It could classify simple images — like handwritten letters — and adjust its weights from examples. Before that, AI systems were rule-based. The perceptron showed for the first time that a machine could learn from data. Its limit: only linearly separable patterns. Multi-layer perceptrons later removed that ceiling.
What do CNN, RNN, and Transformer stand for?
CNN stands for 'Convolutional Neural Network' — the architecture for images and video, popular since AlexNet in 2012. RNN means 'Recurrent Neural Network' — a feedback-loop architecture for sequences like text or audio, dominant from roughly 2014 to 2018. Transformer is the architecture from the 2017 Google paper 'Attention Is All You Need' — the foundation of GPT, Claude, Gemini, and nearly every modern language or multimodal model.
Do I need math to use neural networks?
To use pretrained models (ChatGPT, Claude, Hugging Face), a clear prompt is enough and no math is required. To train your own models with PyTorch or TensorFlow, high-school math plus Python basics will do. To understand why backpropagation works, you need linear algebra and calculus at high-school to early-undergraduate level, all reachable via MOOCs. Designing new architectures or doing research requires undergraduate math (numerical optimization, probability theory).
How much data does a neural network need to train?
A classical rule of thumb asks for roughly 10 clean training examples per parameter, but it is a loose guide that modern networks routinely break. A small tabular network (1,000 parameters) can get by on a few hundred rows. An image CNN on ImageNet trains on 1.2 million images. GPT-4 was reportedly trained on a large slice of the public internet, an estimated 13 trillion tokens (industry estimate, unconfirmed by OpenAI). Transfer learning slashes the requirement: with a few hundred examples and a pretrained backbone, you often reach strong results.
What is backpropagation explained simply?
Backpropagation is the algorithm that lets a network learn from its mistakes. After a prediction, the total error is propagated backwards through the network — each weight gets a gradient, a number that says which direction it must shift. Mathematically it's the chain rule from school applied to nested functions. The algorithm was popularized in 1986 by Rumelhart, Hinton, and Williams — without backpropagation, practical deep learning would not exist.
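The chain-rule claim can be checked on a single sigmoid neuron with a squared-error loss. The sketch below multiplies the three local derivatives by hand and compares the result with a numerical estimate from finite differences; the weight, bias, input, and target are illustrative values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, target):
    """Squared error of one sigmoid neuron on one example."""
    return (sigmoid(w * x + b) - target) ** 2


def backprop_gradient(w, b, x, target):
    """d(loss)/dw as a product of local derivatives (chain rule)."""
    z = w * x + b
    y = sigmoid(z)
    dloss_dy = 2.0 * (y - target)   # derivative of the squared error
    dy_dz = y * (1.0 - y)           # derivative of the sigmoid
    dz_dw = x                       # derivative of the weighted sum
    return dloss_dy * dy_dz * dz_dw


def numerical_gradient(w, b, x, target, eps=1e-6):
    """Central finite difference: nudge w slightly and watch the loss."""
    return (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2.0 * eps)


# Illustrative values only.
w, b, x, target = 0.5, -0.2, 1.5, 1.0

analytic = backprop_gradient(w, b, x, target)
numeric = numerical_gradient(w, b, x, target)
print(analytic, numeric)
```

The gradient is negative here, so gradient descent would increase w, pushing the prediction toward the target of 1.0.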
Which programming language is used for neural networks?
Python, almost without exception. PyTorch (Meta), TensorFlow (Google), and JAX (Google) — the three dominant frameworks — are all Python-centric. The heavy lifting runs in C++ and CUDA under the hood, but as a user you write Python. Hugging Face Transformers adds tens of thousands of pretrained models as a Python library. Julia has a small ML niche and R is common for statistics workflows, but neither plays a meaningful role in modern neural networks.
Are neural networks a 'black box'?
Largely yes, and this is an active research field called interpretability. The forward pass is straightforward math, but explaining why a 175-billion-parameter network produced a specific output is hard. Tools like saliency maps, SHAP values, and attention visualizations help. For high-stakes domains (medicine, credit, justice), simpler models like decision trees or logistic regression are often preferred, and in some regulated settings effectively required, precisely because they are transparent end-to-end.