Technology Level: Practitioner

The Transformer Architecture: How Modern AI Works

The Transformer architecture is the technical foundation of ChatGPT, Claude, Gemini and virtually every modern language model, as well as many image models. A complete explanation of self-attention, multi-head attention, positional encoding, encoder/decoder structures, and the evolution from “Attention Is All You Need” (2017) to mixture-of-experts and reasoning models in 2026.

toolwiki – Editorial · Updated April 25, 2026


Why Transformers changed everything

Until 2017, recurrent architectures (RNN, LSTM, GRU) dominated sequence modeling. They have a structural disadvantage: information flows sequentially through time steps, long-range dependencies fade, and training is hard to parallelize — every token waits on the previous one. That fundamentally limited both model size and training speed.

“Attention Is All You Need” (Vaswani et al., 2017) removed this bottleneck. Eight researchers at Google Brain and Google Research showed that an architecture using only attention plus feed-forward layers, with no recurrence at all, solves sequence-to-sequence tasks better, and that its training parallelizes across all positions on GPUs. Within two years the architecture had taken over: BERT (2018, encoder-only), GPT-2 (2019, decoder-only), T5 (2019, encoder-decoder). Within five years, recurrent networks for language processing were practically obsolete.

Today, nine years after the paper, Transformers are the dominant architecture in language, vision, audio, code, protein folding (AlphaFold 2/3) and robotics. Practically every frontier model in 2026 (GPT-4o, o3, Claude 3.5/4.6, Gemini 2.5, Llama 3.x, Mistral, DeepSeek V3) is a Transformer variant. Anyone wanting to understand how modern AI works cannot get around this architecture.

Self-attention: the core mechanism

Self-attention is the operation that lets every token „look at” every other token in the sequence. Conceptually:

For each token, three vectors are computed via three linear projections of the token representation:

Query (Q): „what am I looking for?”
Key (K): „what do I offer?”
Value (V): „what information do I carry, if I am relevant?”

Attention between two tokens results from the dot product of the query (from the “asking” token) and the key (from the “addressed” token), divided by the square root of the key dimension d_k and normalized via softmax over all addressed tokens. The resulting weight determines how strongly the value of the addressed token contributes to the new representation of the asking token. The formula: Attention(Q, K, V) = softmax(QK^T / √d_k) · V.
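The formula can be traced by hand on a tiny input. The following is a minimal sketch in plain Python (no learned projections, hand-picked toy vectors); the function names `softmax` and `attention` are chosen here for illustration and do not come from any particular library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # New representation: weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens, d_k = d_v = 2 (values picked by hand for illustration)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and the first query puts more weight on keys 0 and 2 (which point in its direction) than on key 1. In a real model Q, K and V come from learned linear projections of the token representations rather than being set by hand.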

Practically: when the model processes „The dog, who had been waiting for three hours, jumped at the postman” and arrives at the token „jumped”, self-attention can reach back directly to „dog” — without passing information through every intermediate token. That is the central advantage over RNNs: constant path length between arbitrary tokens.

Multi-head attention: many parallel perspectives

A single attention operation is not enough in practice. Different linguistic relationships (subject-verb agreement, pronoun reference, semantic similarity, syntactic hierarchy) call for different “angles.” Multi-head attention therefore runs several attention operations in parallel (typically 8 to 32, often more in the largest models), each with its own Q/K/V projections.

Empirical studies (e.g. Voita et al. 2019) show that heads do specialize on different phenomena during training — some track syntactic dependencies, others semantic similarity, still others discourse structure. That is one reason Transformers are so versatile: a single model carries multiple parallel representations.

The outputs of the heads are concatenated and merged through one final linear projection. After that — per layer — come layer normalization, a two-stage feed-forward block (with non-linear activation like GELU or SwiGLU), another layer norm, and residual connections. A typical 2026 frontier model stacks 60 to 120 such blocks.
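The project, attend, concatenate and project-again sequence can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions: the weights are random rather than trained, the sizes (4 tokens, model width 8, 2 heads) are arbitrary, and all helper names are hypothetical. Layer norm, the feed-forward block and the residual connections are left out.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = len(K[0])
    scores = matmul(Q, [list(c) for c in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-token outputs of all heads along the feature axis
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # Final linear projection merges the heads
    return matmul(concat, W_o)

rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = random_matrix(n_tokens, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(d_model, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
```

The output `Y` has the same shape as the input `X` (4 tokens of width 8), which is what allows blocks to be stacked. Production implementations compute all heads in one batched tensor operation instead of a Python loop.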

Positional encoding: recovering order

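Self-attention by itself is permutation-equivariant: shuffling the input tokens only shuffles the outputs in the same way, so the operation carries no information about token order. Without position information, “dog bites man” and “man bites dog” would look like the same set of tokens to pure attention. To restore order, each position is assigned a vector that is added to the token representation.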

The original paper used sinusoidal positional encodings — fixed, non-learned functions of varying frequencies. Later models adopted learned position embeddings (BERT, GPT-2). In 2026, two newer variants dominate: RoPE (Rotary Position Embedding) rotates Q and K vectors based on position; ALiBi (Attention with Linear Biases) adds a linear penalty into attention scores for distant positions. Both scale much better to long contexts than sinusoidal encodings — a key building block of the 1–2 million-token context windows in Gemini Pro and long-context Claude.
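The rotation idea behind RoPE can be made concrete with a small sketch. It assumes the commonly used frequency schedule with base 10000 and rotates consecutive pairs of vector components; the function name `rope` and the toy vectors are illustrative only.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair i//2 uses frequency base^(-i/d): early pairs rotate fast, later pairs slowly
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # toy query vector
k = [1.1, 0.4, -0.6, 0.9]   # toy key vector

# Same offset (3 positions apart) at two different absolute positions
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

`score_near` and `score_far` agree up to floating-point error: after rotation, the query-key dot product depends only on the distance between the two positions, not on where they sit in the sequence. That relative-position property is one reason RoPE is a common choice for long-context models.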

Tokenization: from text to sequence

Before a Transformer can compute self-attention, text must be split into tokens — discrete units the model accepts as input. Pure character tokenization would be wasteful (too many steps per word); pure word tokenization would explode the vocabulary on rare words or typos.

The 2026 standard is subword schemes: Byte-Pair Encoding (BPE; GPT-2/3/4), WordPiece (BERT) and SentencePiece (T5, Llama). They keep frequent words as single tokens and split rare words into subword pieces. Languages other than English tend to fragment more strongly because English training material dominates, so the same statement needs more tokens in German, French or Hindi than in English. That has direct consequences for cost and context-window utilization.
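How BPE arrives at its subword vocabulary can be shown with a toy training loop: count adjacent symbol pairs, merge the most frequent pair, repeat. The sketch below is a simplified character-level version with a made-up four-word corpus; real tokenizers work on bytes, handle whitespace and pre-tokenization, and run tens of thousands of merges.

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    counts = pair_counts(words)
    best = max(counts, key=counts.get)  # ties resolved by first occurrence
    merges.append(best)
    words = merge_pair(words, best)
```

After three rounds the learned merges are `('e', 's')`, `('es', 't')` and `('l', 'o')`: the frequent suffix "est" has become a single symbol, while rarer character sequences stay split.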

Each token is then translated into a high-dimensional vector via an embedding table (typically 768 to 12,288 dimensions). These vectors are learned parameters — after training they carry semantically rich information: similar tokens lie close together in vector space, and simple vector arithmetic (the famous „king − man + woman ≈ queen”) becomes visible.

How a token flows through the model

A concrete walkthrough makes the architecture tangible. Input: the sentence „The capital of France is”. Goal: predict the next token.

Step 1 — tokenization: the sentence is split into tokens, e.g. [The, capital, of, France, is]. Each token gets an ID from the vocabulary.

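Step 2 — embedding: each token ID is translated into a vector. Position information is injected as well: sinusoidal or learned position vectors are added to the embeddings at this point, whereas RoPE and ALiBi are applied later, inside each attention layer, to the queries and keys or to the attention scores.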

Step 3 — Transformer blocks: the sequence of 5 vectors flows through 60 to 120 Transformer blocks. In each block: multi-head self-attention computes how strongly each token should attend to every other; a feed-forward block transforms each position individually; layer norm and residual connections keep gradients stable. The representations grow richer and more context-specific with every block.

Step 4 — output projection: the representation of the last token (in decoder-only models) is mapped through a linear projection to the vocabulary size — for example, 128,000 values in Llama 3. Softmax turns these values into a probability distribution over all possible next tokens.

Step 5 — sampling: the next token is drawn from this distribution — either deterministically (greedy: highest probability) or stochastically (temperature, top-k, top-p / nucleus sampling). The chosen token is appended to the sequence; steps 1 through 5 repeat for the next position. That is autoregressive generation.
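Steps 4 and 5 can be sketched for a tiny vocabulary as follows. The five-word vocabulary and the logits are made up for illustration, and `sample_next` is a hypothetical helper rather than a library function; it covers greedy decoding, temperature and top-p, but not top-k.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized nucleus

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

vocab = ["Paris", "Lyon", "the", "a", "France"]
logits = [6.0, 2.5, 1.0, 0.5, 0.2]  # made-up scores for "The capital of France is"
greedy = vocab[sample_next(logits, temperature=0)]
rng = random.Random(0)
samples = [vocab[sample_next(logits, temperature=0.8, top_p=0.9, rng=rng)]
           for _ in range(5)]
```

With these logits, "Paris" alone already exceeds the 0.9 nucleus threshold, so both greedy decoding and top-p sampling return it every time. Flatter distributions leave several candidates in the nucleus, and that is where temperature and top-p start to change the output.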

For „The capital of France is” the model will pick „Paris” with high probability — that association exists billions of times in training material, and the final layers have aggregated enough information across token relationships to identify „Paris” as the most plausible continuation.

Three architecture families

From the original Transformer, three families have evolved with different structures and strengths.

Encoder-only (BERT family)

Reads the full input bidirectionally: every token sees both left and right context. The training objective is usually masked language modeling (MLM): random tokens are masked and the model reconstructs them from context. Encoder-only models produce strong contextual embeddings, ideal for classification, named-entity recognition, retrieval (search, RAG embeddings) and similarity tasks. Representatives: BERT, RoBERTa, DeBERTa, plus most 2026 embedding models (text-embedding-3-large, BGE, E5, Cohere Embed v3 are encoder variants).

Decoder-only (GPT family)

Generates token by token autoregressively: every token sees only the preceding tokens, not future ones (causal masking). The training objective is next-token prediction. This family dominates the generative 2026 models: GPT-4o, o3, Claude 3.5/4.6, Llama 3.x, Mistral, DeepSeek, Qwen and Gemma are all decoder-only. Strengths: scales excellently, allows very direct generation, aligns well with instruction tuning and RLHF.
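Causal masking can be illustrated with a few lines of plain Python. The sketch applies the softmax only over the positions a token is allowed to see; real implementations usually get the same effect by adding minus infinity to the masked scores before a full-row softmax. The function name is illustrative.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # current and earlier tokens only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # uniform toy scores for 4 tokens
weights = causal_softmax(scores)
```

With uniform scores, the first token attends only to itself (weight 1.0), the second splits its attention 0.5/0.5 over the first two positions, and the last spreads 0.25 over all four. The upper triangle of the weight matrix is zero, so no token sees the future.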

Encoder-decoder (T5 family)

Combines both: an encoder stack produces representations of the input; a decoder stack generates the output and accesses the encoder representations via cross-attention. Classical for sequence-to-sequence tasks (translation, summarization). Representatives: T5, BART, mT5, Flan-T5. In 2026 this family is less visible than the decoder-only world, because many classical encoder-decoder tasks are now solved with large decoder-only models that learn the same behavior in-context.

Important 2026 evolutions

The 2017 base architecture is intact — but central modules have been swapped or extended in recent years.

Mixture-of-Experts (MoE). Instead of a monolithic feed-forward layer, many expert layers exist, of which only a few (top-2 to top-8) are activated per token, controlled by a learned routing network. Advantage: models with hundreds of billions of parameters become possible without each token activating all of them. Mixtral 8x7B (Mistral AI), DeepSeek V3 and Gemini Pro use MoE in production. Inference efficiency improves significantly; training becomes more complex (routing stability, expert balancing).
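Top-k routing can be sketched as follows. The experts here are hypothetical stand-ins (each one just scales its input), and the router logits are given by hand; in a real MoE layer the experts are feed-forward networks and the logits come from a learned linear layer applied to the token representation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k):
    """Return (expert_id, gate_weight) for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    # Gate weights are renormalized over the selected experts only
    gates = softmax([router_logits[e] for e in top])
    return list(zip(top, gates))

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    out = [0.0] * len(x)
    for expert_id, gate in route(router_logits, k):
        y = experts[expert_id](x)  # only the chosen experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Hypothetical experts: expert s multiplies its input by s
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
x = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]  # made-up router scores for this token
chosen = route(router_logits, k=2)
y = moe_layer(x, router_logits, experts, k=2)
```

For this token the router picks experts 1 and 3 with equal gate weights, so the output is the average of their results, `[3.0, -3.0]`. Experts 0 and 2 are never called, which is where the inference savings come from.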

Long-context optimizations. Naive self-attention scales quadratically with sequence length — at 1 million tokens, prohibitive. Three levers have prevailed: Flash-Attention (Dao et al. 2022, 2023) is a GPU-efficient re-implementation that runs the same math 2–4× faster. Sliding-window attention (Mistral) restricts each token to a neighborhood. Sparse attention patterns (Longformer, BigBird) combine global and local attention. Together with RoPE/ALiBi they have made 200k–2M-token context windows economically viable.
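The quadratic growth, and what a sliding window does to it, can be checked with simple counting. The sketch below counts query-key score computations for causal attention; the window size of 4,096 is an assumption for illustration and not taken from any specific model.

```python
def full_attention_pairs(n):
    """Query-key scores in causal full attention: token i sees i + 1 positions."""
    return n * (n + 1) // 2

def sliding_window_pairs(n, window):
    """Causal attention where each token sees itself and at most window - 1 predecessors."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1_000, 10_000, 100_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=4_096))
```

Going from 10,000 to 100,000 tokens multiplies the full-attention count by roughly 100, while the windowed count grows only about linearly once the sequence is longer than the window. FlashAttention does not change these counts; it computes the same scores with far fewer slow memory transfers on the GPU.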

Reasoning models. OpenAI o3, Claude Extended Thinking and Gemini Deep-Think integrate chain-of-thought directly into model behavior, via reinforcement learning on solution paths instead of only final answers. The model “thinks” longer internally (more tokens) before producing the final answer. Architecturally these models remain Transformers; the difference is in the training and inference pipeline. Practically this yields measurably better results on math, code and multi-step logic, at the cost of higher latency and higher token spend.

State-space models as alternative. Outside the Transformer family, Mamba and Mamba-2 (Gu & Dao 2023; Dao & Gu 2024) and the RNN-style RWKV have gained attention. They scale linearly in sequence length and are attractive for very long contexts (bio sequences, audio). In 2026 Transformers still dominate the mainstream, but state-space hybrids are visible in research and individual production models.

Vision and multimodal Transformers

The architecture is not limited to text. Vision Transformer (ViT) — Dosovitskiy et al. 2020 — cuts an image into patches (e.g. 16×16 pixels), treats each patch as a token, and runs the sequence through a standard Transformer encoder. Result: with sufficient training data, ViT beats classical CNNs on ImageNet and many other benchmarks. Today vision Transformers are standard in image classification, segmentation and detection.
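The patch-to-token step can be sketched in a few lines. The example uses a toy 8×8 single-channel image with 4×4 patches so the result is easy to inspect; `patchify` is an illustrative helper, and a real ViT additionally applies a learned linear projection and position embeddings to each flattened patch.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch row by row into a single vector
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

image = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8x8 image
tokens = patchify(image, patch=4)

# Sequence length for a 224x224 image with 16x16 patches
vit_sequence_length = (224 // 16) ** 2
```

The toy image becomes 4 tokens of 16 values each. By the same arithmetic, a 224×224 image with 16×16 patches yields a sequence of 196 patch tokens, short enough for standard self-attention.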

Multimodal models integrate text, image and audio tokens into a shared Transformer stack. GPT-4o, Gemini 2.5 and Claude 3.5 process mixed-modal input natively — an image is translated by a vision encoder into tokens that then flow through the language model alongside text tokens. This enables image captioning, OCR, chart analysis, code-from-sketch, audio responses — all from a single model architecture.

In robotics (RT-2 by DeepMind, OpenVLA, π0 by Physical Intelligence) and biology (AlphaFold 2 and 3, ESM-2), Transformers are likewise the dominant architecture — evidence that the concept of token sequences plus attention transfers successfully to almost every structured-data domain.

Related topics

Generative AI places Transformers in the broader context — where LLMs come from, what tokens, embeddings and sampling are. Machine Learning explains the learning machinery (backpropagation, loss functions) that makes Transformers trainable in the first place. Deep Learning shows the historical arc — from perceptrons through CNN/RNN to the Transformer era. On the practice side: Prompt Engineering actively uses knowledge about self-attention and context windows (XML tags, long-context structuring); RAG builds on encoder embeddings (a Transformer variant) and reaches into decoder generation. Diffusion Models are the most important alternative architecture in image generation — partly with Transformer-based backbones. The Future of AI sketches where the architecture goes next.

Application context:

Software Engineering and IT: code models (Codex, Copilot, Claude Sonnet) are Transformers with code-specific tokenization and training data.
E-Commerce and Retail: vision-Transformer-based product-image analysis and multimodal search are a production standard in 2026.
Healthcare and Medicine: vision Transformers for radiology images, protein Transformers (ESM, AlphaFold) for drug discovery — different applications of the same architecture.

Closing note

In 2026 Transformers are no longer experimental; they are almost a synonym for “modern AI.” Their core idea, self-attention instead of recurrence, has proven so robust over nine years that language, vision, audio, code, biology and robotics have converged on it. The architecture will keep evolving (MoE, long-context, reasoning, state-space hybrids), but the basic principle remains: tokens become vectors, attention mediates relationships between them, many such blocks stacked become the model. Anyone who has internalized that can read practically any current model description.

Frequently asked questions

What is the Transformer architecture in one sentence?

The Transformer is a neural-network architecture introduced in 2017 that uses self-attention instead of recurrent connections; it processes all positions of a sequence in parallel during training and therefore scales efficiently on GPUs. Practically every modern language model (ChatGPT, Claude, Gemini, Llama, Mistral) and many image and audio models build on it.

Who invented the Transformer?

Eight researchers at Google Brain and Google Research published the paper “Attention Is All You Need” in 2017 (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin). Originally aimed at machine translation, the architecture has since spread across virtually every ML domain.

What is self-attention concretely?

Self-attention lets every token in a sequence interact directly with every other token — for each position, the model computes a weighted sum over all other positions, based on learned query, key and value vectors. A word at the end of a sentence can therefore reference a subject at the beginning without losing the path through intermediate words — the central advantage over RNNs.

What is the difference between encoder-only, decoder-only and encoder-decoder?

Encoder-only models like BERT read the full input bidirectionally and produce embeddings — ideal for classification and retrieval. Decoder-only models like GPT, Claude and Llama generate token by token autoregressively — ideal for text generation. Encoder-decoder models like T5 or the original Transformer combine both for classical sequence-to-sequence tasks (translation, summarization).

Why does a Transformer need positional encoding?

Self-attention itself is permutation-equivariant: it carries no information about token order. To distinguish “dog bites man” from “man bites dog”, each position is given a unique vector (positional encoding) added to the token representation. Modern variants like RoPE (Rotary Position Embedding) and ALiBi, which inject position inside the attention computation instead, have largely replaced the original sinusoidal encoding by 2026 because they scale much better with long context.

What is multi-head attention?

Instead of one attention operation, several run in parallel — typically 8, 16 or 32 heads. Each head learns a different aspect of token relationships (syntactic, semantic, coreferential, etc.). The outputs are concatenated and projected. Multi-head is one reason Transformers are so versatile — different heads specialize in different linguistic phenomena.

What is Mixture-of-Experts (MoE)?

MoE is a dominant 2026 extension: instead of a monolithic feed-forward layer, there are many experts of which only a few (top-2 to top-8) are activated per token. This allows models with hundreds of billions of parameters that activate only a fraction per token, giving similar quality at much lower inference cost. Mixtral 8x7B, DeepSeek V3 and Gemini Pro use MoE in production.

What changes with reasoning models like OpenAI o3?

Reasoning models integrate chain-of-thought directly into model training — via reinforcement learning on solution paths, not just final answers. The model thinks longer internally before answering, with measurably better results in mathematics, code and multi-step logic. Architecturally they are still Transformers; the difference lies in training and inference pipeline (more tokens for internal reasoning steps).

How does the Transformer scale with sequence length?

Naive self-attention has quadratic complexity in sequence length — doubling tokens quadruples compute. Dominant 2026 optimizations: Flash-Attention (more efficient GPU implementation), sliding-window attention (Mistral), sparse attention patterns, and for very long contexts specialized architectures like Mamba/state-space models. Together these have made 1–2 million-token context windows (Gemini, Claude long-context) economically viable.

Do Transformers also work for images and audio?

Yes — Vision Transformer (ViT, Dosovitskiy et al. 2020) splits images into patches and treats them as tokens; audio Transformers work on spectrogram patches or direct waveform tokens. Multimodal models (GPT-4o, Gemini, Claude 3.5) integrate text, image and audio tokens in a shared Transformer stack. The architecture has proven cross-domain — a major reason for its dominance.
