Decision Trees Explained: When Classical ML Beats Deep Learning
Decision trees, random forests, and gradient boosting (XGBoost, LightGBM, CatBoost) aren't outdated in 2026 — for tabular data they're often the better choice. Structure, trade-offs, production use.

1 · How trees work
Gini, entropy, splits, and why a single tree overfits so easily.
2 · The ensemble family
Random forest, XGBoost, LightGBM, CatBoost — which tool when.
3 · Practice
A dozen lines of Python, feature importance, common mistakes, and the hyperparameters that matter.
What is a decision tree in one sentence?
A decision tree is a model that makes a prediction through a chain of binary if-then questions — from the root node down to the leaf that returns the answer. Every internal node tests a single feature against a threshold (“income > 30,000?”), every branch represents one outcome of that test, and every leaf returns a class label or numeric value. That makes the tree one of the few models whose entire decision path you can sketch on a single slide.
That transparency anchored the algorithm in regulated industries from the 1980s onward. Banks use trees for credit scoring, insurers for tariff classification, pharma for diagnostic workflows. If you need to explain to a regulator why a loan was denied, you can point to a tree sketch; with a neural network the same defense gets noticeably harder.
The historical family tree is short: CART (Classification and Regression Trees, Breiman et al., 1984), ID3 and C4.5 (Quinlan, 1986/1993) are the classics. scikit-learn implements a CART variant. But the real renaissance didn’t come from the single tree — it came from the ensembles that became standard from the mid-2000s. More on that below.
The algorithm family at a glance
The tree world splits into five production-grade variants. Knowing these covers 95% of real use cases — from a single tree for teaching material to the Kaggle grandmaster’s full boosting stack.
| Algorithm | Training time (100k × 50) | Interpretability | Typical accuracy | When to use |
|---|---|---|---|---|
| Single Decision Tree | < 1 s (CPU) | ★★★★★ — path is drawable | ★★☆☆☆ | Teaching, very small datasets, regulated explainability |
| Random Forest | 5–30 s (CPU, n_jobs=-1) | ★★★☆☆ — feature importance, no single path | ★★★★☆ | Robust default for tabular data, low hyperparameter tuning needed |
| XGBoost | 10–30 s (CPU) | ★★★☆☆ — gain/weight/cover, SHAP popular | ★★★★★ | Production default for Kaggle and business, small to mid datasets |
| LightGBM | 3–10 s (CPU), often 2–10× faster than XGBoost | ★★★☆☆ — same as XGBoost | ★★★★★ | Very large datasets (>1M rows), speed-critical pipelines |
| CatBoost | 15–40 s (CPU) | ★★★★☆ — best default explainability of the boosting family | ★★★★★ | Many categorical features without one-hot encoding |
The accuracy differences between the three boosting libraries are small in practice — on most datasets, hyperparameter tuning and feature engineering matter more than the XGBoost vs LightGBM choice. The structural question matters more: do you actually need a boosting model, or does a random forest suffice?
How is a decision tree built? (Gini, entropy)
Building a tree follows a greedy algorithm: at every node, the procedure picks the feature and threshold that splits the training data into two groups as cleanly as possible. “Clean” is measured via an impurity metric — typically the Gini index or entropy.
Gini index measures the probability that a randomly drawn element would be misclassified if labeled according to the class distribution in the node. A node with only one class has Gini = 0 (perfectly pure), a node with a 50/50 split in a two-class problem has Gini = 0.5 (maximum impurity). scikit-learn uses Gini as the default — it’s computationally cheaper than entropy and produces practically identical trees.
Entropy comes from information theory and measures the average number of bits needed to encode the class of an element. Mathematically slightly more expensive (logarithm instead of squaring), but better theoretically grounded. To try it: pass criterion='entropy' in scikit-learn. In 99% of cases, the accuracy difference is negligible.
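Both metrics fit in a few lines. The following is a minimal sketch using only the standard library; the function names and the toy label lists are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes present."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())

# Hypothetical nodes: one pure, one split 50/50 between two classes
pure = ["approved"] * 10
mixed = ["approved"] * 5 + ["denied"] * 5

print(gini(pure), entropy(pure))    # a pure node scores zero on both metrics
print(gini(mixed), entropy(mixed))  # a 50/50 node hits the two-class maximum
```

Both functions rank these two nodes the same way, which is why swapping the criterion rarely changes the resulting tree much.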
The algorithm runs like this: for every feature, try every candidate split point, compute the information gain (impurity before the split minus the weighted impurity after it), and pick the best. Then recurse on the two resulting subsets until a stopping criterion fires: max_depth is reached, a split would leave fewer than min_samples_leaf rows in a leaf, or no split improves impurity. That last condition is why a single tree overfits so readily: without constraints, the recursion keeps splitting until it has reconstructed the training set point by point.
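One step of that greedy search can be sketched as follows. This is a simplified illustration, not scikit-learn's optimized implementation; the function names and the toy loan data are assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature and every midpoint between sorted values; keep the highest gain."""
    parent = gini(labels)
    n = len(rows)
    best = None  # (gain, feature_index, threshold)
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            gain = parent - weighted  # impurity before minus weighted impurity after
            if best is None or gain > best[0]:
                best = (gain, f, threshold)
    return best

# Hypothetical toy data: (income in $1,000, age) -> loan decision
rows = [(20, 25), (28, 47), (33, 31), (38, 52), (45, 29), (60, 40)]
labels = ["denied", "denied", "denied", "approved", "approved", "approved"]

gain, feature, threshold = best_split(rows, labels)
print(f"split on feature {feature} at {threshold}, gain {gain:.3f}")
```

A full tree builder would call `best_split` again on each of the two subsets and stop once a depth or leaf-size limit is hit.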
Concrete example. You want to classify loan applications as “approved” / “denied”. The first split might ask “income > $35,000?”. In the yes branch, 80% are approved, 20% denied — Gini = 0.32. In the no branch: 30% approved, 70% denied — Gini = 0.42. Weighted, this gives a lower impurity than any other possible split — so the feature is chosen. Then the yes branch keeps splitting (“credit score > 720?”) until every leaf is clearly dominated.
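The arithmetic behind those numbers can be checked directly. The branch sizes below (600 applicants in the yes branch, 400 in the no branch) are an assumption for illustration; the text above only gives the class shares.

```python
def gini_from_shares(shares):
    """Gini impurity from a list of class proportions."""
    return 1.0 - sum(p * p for p in shares)

n_yes, n_no = 600, 400  # assumed branch sizes, not given in the example
n = n_yes + n_no

gini_yes = gini_from_shares([0.8, 0.2])  # 1 - (0.64 + 0.04)
gini_no = gini_from_shares([0.3, 0.7])   # 1 - (0.09 + 0.49)

# Weighted impurity after the split
weighted = n_yes / n * gini_yes + n_no / n * gini_no

# Impurity before the split, from the overall approval share
p_approved = (0.8 * n_yes + 0.3 * n_no) / n
gini_parent = gini_from_shares([p_approved, 1 - p_approved])

gain = gini_parent - weighted
print(round(gini_yes, 2), round(gini_no, 2), round(weighted, 2), round(gain, 2))
```

Under these assumed sizes the parent node has Gini 0.48 and the split lowers the weighted impurity to 0.36. The training procedure compares that gain against every other candidate split.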
Pros and cons compared to deep learning
Trees and neural networks aren’t rivals — they solve different data shapes. Knowing both tools lets you pick the right one for every task. Here’s the honest inventory, without hype for either side.
Strengths of the tree family:
- Tabular default. On data with clear columns (banking, CRM, telco, IoT sensors, clinical studies), gradient-boosted trees are the first choice in most published benchmarks. Studies like Shwartz-Ziv & Armon (2022), “Tabular Data: Deep Learning is Not All You Need”, and Grinsztajn et al. (NeurIPS 2022) show this systematically.
- No GPU required. A normal laptop is enough. Training a transformer without a GPU is impractical, whereas XGBoost trains on 100k rows in roughly 10 seconds on a CPU.
- Robust to scaling. Trees are invariant to monotonic transformations of the features, so no StandardScaler pipeline is needed.
- Robust to outliers. A single extreme value doesn’t shift the optimum the way it does in linear or neural models.
- Built-in feature importance. You see out of the box which features mattered — a central advantage in regulated domains.
- Few examples needed. Random forests deliver usable models from ~1,000 rows. Deep learning often loses to a good tree model below 100,000 rows.
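The scaling and outlier claims can be illustrated with a single split. The sketch below is a simplified stand-in for a real tree, with made-up income values: it finds the best one-threshold split on the raw feature and on its logarithm and compares which rows end up on the left side.

```python
from collections import Counter
from math import log

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_partition(values, labels):
    """Return the set of row indices sent left by the best single-threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    best_score, best_left = None, None
    for cut in range(1, n):
        if values[order[cut - 1]] == values[order[cut]]:
            continue  # no threshold can separate equal values
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / n
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

# Hypothetical incomes, including one extreme outlier
income = [18_000, 22_000, 31_000, 36_000, 52_000, 250_000]
labels = ["denied", "denied", "denied", "approved", "approved", "approved"]

raw = best_partition(income, labels)
logged = best_partition([log(v) for v in income], labels)
print(raw == logged)  # the same rows go left under either scaling
```

Because the split depends only on the ordering of the values, any monotonic transformation leaves the partition of the training rows unchanged, and the outlier simply lands in the right-hand group.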
Strengths of deep learning:
- Unstructured data. Images, audio, video, raw text — here neural networks build representations on their own that no manual feature engineering could ever match. A decision tree with pixels as columns is hopeless.
- End-to-end learning. Instead of spending 80% of project time on feature engineering, deep nets learn directly from raw inputs. With transformers that also holds for text.
- Very large datasets. From ~10M rows onward or with complex feature interactions, deep nets can match or pull ahead.
- Multimodal tasks. Image + text + tabular jointly — trees can’t do that in a single model.
| Criterion | Tree family | Deep learning |
|---|---|---|
| Data shape | Tabular | Images, text, audio, multimodal |
| Data volume | from ~1,000 rows | from ~100,000 examples |
| Hardware | CPU is enough | GPU/TPU practically mandatory |
| Training time | Seconds to minutes | Hours to weeks |
| Interpretability | Direct for a single tree; feature importance and SHAP for ensembles | Only via SHAP, LIME, attention maps |
| Hyperparameter tuning | Minimal | Extensive, sensitive |
| Energy cost | Low | High |
What is random forest and why does it work so well?
Random forest is an ensemble of typically 100–500 decision trees trained independently on random samples of the training data and features; the final prediction comes from voting or averaging. Leo Breiman introduced the algorithm in its current form in 2001. It remains one of the most robust out-of-the-box algorithms for structured data.
The underlying principle is bagging (bootstrap aggregating). From N training rows, N rows are drawn with replacement — each tree sees roughly 63% of the original rows, some multiple times, others not at all. Additionally, each tree may consider only a random subset of features at each split (default: √n_features for classification). This double randomness decouples the trees — and that’s the trick: 500 slightly different, individually mediocre trees average into a very stable overall model.
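The 63% figure follows from sampling with replacement: the chance that a given row is never drawn in N draws approaches 1/e, so about 1 − 1/e ≈ 63.2% of rows appear at least once. A quick simulation, with an arbitrary row count and seed chosen for illustration:

```python
import random

random.seed(42)
n_rows = 10_000

# Draw N row indices out of N with replacement, as one bootstrap sample does
sample = [random.randrange(n_rows) for _ in range(n_rows)]

unique_share = len(set(sample)) / n_rows
print(f"share of distinct rows in the bootstrap sample: {unique_share:.3f}")
```

The rows a tree never sees are its out-of-bag rows, which random forest implementations can use as a built-in validation set.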
Why it works: a single tree has high variance, meaning small changes in the training set produce completely different trees. Bagging reduces exactly that variance while leaving bias almost unchanged. The effect follows from averaging weakly correlated estimators and is well confirmed empirically.
Typical random forest hyperparameters:
- n_estimators (number of trees): 100–500. More rarely hurts, but costs memory.
- max_depth (max depth per tree): 10–20, or None for unbounded.
- min_samples_leaf (min leaf size): 1–20. Higher values = stronger regularization.
- max_features (features per split): 'sqrt' is the default for classification, 'log2' an alternative.
- n_jobs=-1 (parallelism): uses all CPU cores.
Random forest is the ideal start in practice: little tuning, hard to break, decent accuracy on almost any tabular dataset. If you want more, move to gradient boosting.
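To see the variance argument in practice, one option is to compare a single unbounded tree with a forest under cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for a real table; the dataset shape and the specific hyperparameter values are assumptions picked from the ranges above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a tabular dataset (assumption: 2,000 rows, 20 numeric features)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8, random_state=42)

tree = DecisionTreeClassifier(random_state=42)  # unbounded single tree
forest = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_depth=15,           # cap on depth per tree
    min_samples_leaf=2,     # mild regularization
    max_features="sqrt",    # random feature subset at each split
    n_jobs=-1,              # use all CPU cores
    random_state=42,
)

# 5-fold cross-validated accuracy for both models
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)
print(f"single tree: {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"forest:      {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```

On data like this the forest usually scores higher and varies less from fold to fold, though the exact numbers depend on the dataset.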
Where do XGBoost and LightGBM stand in 2026?
Gradient boosting builds trees sequentially: each new tree learns to correct the errors of its predecessors. Instead of 500 independent estimators like in a random forest, an additive series emerges — individual trees are small and shallow (typically max_depth=6) and the strength comes from the sheer number plus specialization on residual errors.
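The residual-fitting loop is short enough to write out by hand. The following is a bare-bones sketch for squared-error regression on one feature, using depth-1 trees (stumps); it omits the regularization and split optimizations that XGBoost or LightGBM add, and all names and the toy data are illustrative.

```python
def fit_stump(xs, residuals):
    """Depth-1 regression tree: pick the threshold that minimizes squared error."""
    best = None
    ordered = sorted(set(xs))
    for lo, hi in zip(ordered, ordered[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Additive model: start from the mean, then add shrunken stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the model so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy target: y = x^2 on a small grid
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

small = boost(xs, ys, n_trees=5)
large = boost(xs, ys, n_trees=100)
print(mse(small, xs, ys) > mse(large, xs, ys))  # more trees, lower training error
```

The training error keeps falling as trees are added, which is exactly why boosting libraries pair a small learning rate with early stopping on a validation set.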
XGBoost (Tianqi Chen & Carlos Guestrin; library released 2014, paper at KDD 2016) brought several innovations that industrialized gradient boosting: a regularized objective (L1 + L2), efficient approximate and later histogram-based split finding, parallel training on CPU cores, and sparsity-aware splits for missing values. Since 2014, XGBoost has dominated Kaggle competitions on structured data, and it is the production default in banks, e-commerce, and telco.
LightGBM (Microsoft Research, 2017) optimized XGBoost in two directions: leaf-wise growth (always expand the leaf with the largest loss gain instead of growing layer-by-layer) and histogram-based feature binning. Result: 2–10× faster on large datasets, significantly less memory. For data above 1M rows, LightGBM is often the pragmatic choice. Caveat: on small datasets (< 10,000 rows), leaf-wise growth can overfit — XGBoost stays more robust there.
CatBoost (Yandex, 2017) is the third production variant — and a genuinely strong one. Specialty: native handling of categorical features without upstream one-hot encoding, plus reduced overfitting risk thanks to ordered boosting. For datasets with many categorical columns — ad data with user IDs, device classes, countries — CatBoost often delivers the best out-of-the-box accuracy.
| Comparison axis | XGBoost 2.0+ | LightGBM 4+ | CatBoost 1.2+ |
|---|---|---|---|
| Released | 2014 | 2017 | 2017 |
| Growth strategy | level-wise | leaf-wise | symmetric (oblivious trees) |
| Speed on 1M rows | Baseline | 2–10× faster | ~1.5× faster than XGBoost |
| Categorical features | One-hot, or native via enable_categorical=True | Native (categorical_feature) | Native (target encoding built in) |
| GPU support | Yes (device='cuda') | Yes (device='gpu') | Yes |
| Default performance | Top on small + mid data | Top on large data | Top with many categoricals |
| Community size | Largest | Very large | Mid |
On Kaggle, the top of the leaderboard for tabular competitions since about 2017 has been almost universally an ensemble of XGBoost + LightGBM + CatBoost, often blended. If you need a production tabular classifier in a real business project, you’ll meet these three libraries one way or another.
When should I use trees instead of neural networks?
The deciding question isn’t “better or worse” — it’s “does the tool match the data format”. A pragmatic heuristic:
Use trees if:
- Your data is tabular and below ~10M rows.
- You need explainability — credit, insurance, medicine, justice, HR.
- You want fast iteration (training in seconds, not days).
- You have no GPU infrastructure or don’t want to build one.
- You need a strong baseline before investing in deep learning.
- You have little data (1,000–50,000 rows).
Use deep learning if:
- Your data is unstructured (images, audio, video, raw text).
- You have very large datasets (100,000+ examples for images, millions for text).
- You target multimodal tasks (image + text + tabular).
- State-of-the-art performance is business-critical and you have GPU + training budget.
- You can use transfer learning on pretrained models (Hugging Face, OpenCV Zoo).
For the full comparison of classical ML vs. deep learning, see the sections in the parent pillar Machine Learning and in the Deep Learning hub.
What does the code look like? (Python + scikit-learn, 12 lines)
A complete random-forest training pipeline in twelve lines of Python. Works with any CSV that has a target column and numeric features. scikit-learn 1.4+ has been standard for years, and the core estimator API hasn’t changed meaningfully since 2017.
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df = pd.read_csv("data.csv")
X, y = df.drop("target", axis=1), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=500, max_depth=10, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
print(sorted(zip(model.feature_importances_, X.columns), reverse=True)[:5])
```
What happens line by line:
1–4. Imports: pandas for loading, RandomForestClassifier as the model, train_test_split for splitting, accuracy_score for evaluation.
5–7. Load CSV, separate features (X) and target (y), split into train and test. stratify=y keeps class balance, random_state=42 makes results reproducible.
8. Instantiate the model: 500 trees, max depth 10, use all CPU cores.
9. Train: on 100k rows × 50 features that’s under a minute on a normal laptop.
10–12. Predict on the test set, compute accuracy, print top-5 features by importance.
For XGBoost instead of random forest: replace lines 2 and 8 with:
```python
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05, n_jobs=-1, random_state=42)
```
Accuracy is often 1–3 percentage points better than random forest, with comparable or shorter training time. Note that XGBClassifier expects class labels encoded as integers from 0 to n_classes − 1. For many categorical features, reach for CatBoost, whose API is almost identical. Code assistants like GitHub Copilot or Cursor generate scikit-learn boilerplate in seconds, which is especially helpful for hyperparameter tuning loops.
What about tabular deep learning? (TabNet, FT-Transformer)
A fair question — and one researchers have asked seriously since 2019. Several architectures aim to bring deep learning to tabular data:
- TabNet (Google, 2019) uses sequential attention to pick which features matter at each decision step. Native interpretability, decent performance on some benchmarks.
- FT-Transformer (Yandex, 2021) treats every feature as a token and runs a standard transformer encoder over them. Conceptually clean, often competitive with gradient boosting.
- NODE (Neural Oblivious Decision Ensembles) and DeepGBM explicitly try to combine tree-like splits with backpropagation.
But the empirical verdict as of 2026 — most recently confirmed by Grinsztajn et al. (NeurIPS 2022) and Gorishniy et al. (2023) — is consistent: gradient-boosted trees still win the majority of tabular benchmarks. Deep tabular models genuinely help in multimodal pipelines (table + text + image as joint input) and in massive-scale recommender systems at FAANG-scale companies. For the everyday tabular task at a normal company, they’re an interesting research direction — not a production default.
The takeaway: don’t reach for TabNet because deep learning sounds modern. Reach for it when you have a concrete reason — multimodal fusion, pretraining transfer, or an unusual data structure that resists tree splits.
How do I interpret feature importance?
Feature importance is a score per input column expressing how much that feature contributed, on average, to separating the classes. scikit-learn returns it out of the box via model.feature_importances_ — a NumPy array aligned with your columns.
How the score is computed: at every tree split, the algorithm records how much that split reduced the impurity (Gini or entropy), weighted by the share of samples reaching the node. This “Mean Decrease in Impurity” is averaged across all trees and normalized so the scores sum to 1. High values = important feature. But: the score is biased toward high-cardinality features. Columns with many unique values (e.g. continuous IDs) offer many candidate split points and pick up high scores even when they contribute nothing meaningful. Drop ID columns before training.
Three robust alternatives for production interpretation:
- Permutation importance (sklearn.inspection.permutation_importance): shuffles one column at random and measures the resulting accuracy drop. More robust than impurity-based scores when features have many unique values. Costs compute, worth it.
- SHAP (SHapley Additive exPlanations, Lundberg & Lee 2017): attributes each individual prediction to the features by averaging their contributions over feature combinations, in a mathematically fair way. Gold standard in regulated industries. Library: shap.
- XGBoost’s own metrics (gain, weight, cover): gain is usually the most informative, since it reports how much loss improvement a feature delivered on average.
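The gap between impurity-based and permutation importance is easy to reproduce. The sketch below trains a random forest on synthetic data and appends a pure-noise column that mimics a row ID; the dataset, the column names, and the `row_id` feature are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic table: 3 informative features plus 2 noise features
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Append a high-cardinality column with no signal (a shuffled row ID)
rng = np.random.default_rng(0)
X = np.column_stack([X, rng.permutation(len(X))])
names = ["f0", "f1", "f2", "f3", "f4", "row_id"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
model.fit(X_train, y_train)

# Impurity-based scores come from training; permutation scores from held-out data
mdi = model.feature_importances_
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for name, m, p in zip(names, mdi, perm.importances_mean):
    print(f"{name:>7}  MDI={m:.3f}  permutation={p:+.3f}")
```

The `row_id` column typically receives a visibly non-zero impurity score, while its permutation importance on the test set stays close to zero. Computing permutation importance on held-out data is a deliberate choice here, since it measures what the model actually relies on when generalizing.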
In regulated industries, feature importance often isn’t a nice-to-have but a requirement. US financial regulators (OCC, CFPB) require explainable credit decisions; the EU AI Act imposes transparency obligations on high-risk systems. This is where trees shine: SHAP values can be produced per individual case, for example “this denial was 40% driven by credit score, 25% by employment status, 15% by residence duration.” With a deep neural network the same answer is technically possible but more expensive and less stable.
What are the most common training mistakes?
Distilled from hundreds of code reviews: the five traps that beginners and pros produce alike.
1. Data leakage. You accidentally use features that “give away” the label, e.g. a field that’s only populated after the target decision is made. Symptom: validation accuracy near 99%, then production accuracy collapses. Remedies: split training and test data strictly by timestamp, sanity-check every feature, walk the inputs through with a domain expert.
2. Wrong handling of test data. You fit your scaler or imputer on the full dataset and split only afterwards. Result: information from the test data has leaked into training. Correct: split first, then fit only on training data and transform test data with the already-fitted object. Wrapping the steps in sklearn.pipeline.Pipeline enforces this discipline.
3. Ignoring class imbalance. In fraud detection or rare diseases, the label ratio is quickly 1:99. A model that always says “no fraud” hits 99% accuracy — and is useless. Remedies: set class_weight='balanced', use ROC-AUC or F1 instead of accuracy, or reach for specialized tools like imbalanced-learn (SMOTE etc.).
4. Overfitting via an unbounded tree. Default settings without max_depth and min_samples_leaf quickly produce perfect training accuracy and bad generalization on a single tree. Rule of thumb for random forest: max_depth=10–20, min_samples_leaf=1–5. For XGBoost: max_depth=4–8, learning_rate=0.01–0.1, early_stopping_rounds=50.
5. Hyperparameters without cross-validation. A single train/test split can land lucky by chance. For clean hyperparameter search, use GridSearchCV or RandomizedSearchCV with 5-fold cross-validation, or better, optuna for Bayesian optimization. On very large datasets, a single hold-out set can suffice, but then combine it with early stopping.
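Several of these remedies combine naturally in one setup. The sketch below is one possible arrangement, not a prescribed recipe: preprocessing lives inside a Pipeline so it is fitted on training folds only, the forest is depth-limited and class-weighted, and the model is scored with 5-fold cross-validation on more than accuracy. The synthetic data (about 5% positives, 5% missing values) is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic imbalanced data with some missing values
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline([
    # The imputer is refitted inside every training fold, so no test rows leak in
    ("impute", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(
        n_estimators=200, max_depth=10, min_samples_leaf=5,
        class_weight="balanced", n_jobs=-1, random_state=0)),
])

# Stratified folds keep the rare class present in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for metric in ("accuracy", "f1", "roc_auc"):
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring=metric)
    print(f"{metric:>8}: {scores.mean():.3f}")
```

On imbalanced data like this, accuracy tends to look flattering while F1 tells a more sober story, which is the reason to report both.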
Production use cases: where trees decide every day
A few concrete examples to ground the abstraction:
- Credit scoring (banks). Credit score, income, employment duration, prior credits — XGBoost models decide millions of loan applications daily. Explainability is required by US, UK, and EU regulators, so no deep-learning black box.
- Churn prediction (telcos, SaaS). Who cancels next quarter? Random forests with behavioral features (login frequency, support tickets, contract age) deliver useful probabilities — letting retention teams plan their outreach.
- Recommender pre-filtering (e-commerce). Before a complex embedding recommender ranks 50,000 products, a LightGBM filters down to the top 500 for the user. 100× speedup at negligible accuracy loss.
- Predictive maintenance (Industry 4.0). Sensor data from machine tools, pumps, turbines. Trees detect drift patterns before a part fails — Siemens, Bosch, and SAP ship this in industrial pipelines.
- Insurance pricing. Insurance tariffs have been classified via trees for decades — today boosting models are added, often with a GLM ensemble for the explainability layer.
- Fraud detection (payments). Every card transaction is scored in milliseconds — XGBoost models flag anomalies against the cardholder’s normal behavior.
- Customer lifetime value (CLV). Subscription businesses use gradient boosting to predict 12- and 24-month revenue per customer — feeding paid-acquisition bidding strategies.
In all of these, deep learning would be technically possible but practically uneconomic or regulatorily problematic. The tree stack remains the pragmatic default — and will still be in 2030, as long as business data continues to live in rows and columns.
Deepen your knowledge: your path through classical ML
This hub sits between fundamentals and depth. Where to go next:
Strengthen the foundations
- Machine Learning — the beginner pillar that places trees in the broader ML context. · ~12 min.
- What is AI? — the frame ML sits in. · ~10 min.
Complementary techniques
- Deep Learning & Neural Networks — when deep nets beat trees. · ~12 min.
- Neural Networks Explained — the building blocks of the counterpart. · ~9 min.
- Transformer Architecture — the dominant deep-learning architecture in 2026. · ~10 min.
Practice and tools
- Code Assistants Overview — Cursor, GitHub Copilot & co. generate scikit-learn boilerplate in seconds.
- Prompt Engineering — using LLMs as pair programmers for ML workflows. · ~6 min.
- RAG — Retrieval Augmented Generation — connecting LLMs to your own tabular data. · ~8 min.
Frequently asked questions
Are decision trees still relevant in 2026?
Very. On tabular data — the bulk of all business problems — tree-based methods (random forest, XGBoost, LightGBM, CatBoost) have won the majority of Kaggle competitions for a decade. Banks use them for credit scoring, telcos for churn prediction, e-commerce for recommender pre-filtering. Deep learning transformed image, speech, and text — but for rows with clean columns, the tree stack remains the pragmatic default.
What's the difference between a decision tree and a random forest?
A single decision tree is one model — interpretable but prone to overfitting. Random forest trains 100–500 trees in parallel, each on a bootstrap sample of the data and a random subset of features (bagging). Final prediction comes from majority voting (classification) or averaging (regression). That dramatically reduces variance: where one tree wobbles, 500 trees average out.
What is gradient boosting?
Gradient boosting builds trees sequentially rather than in parallel. Each new tree learns to correct the errors of the previous trees. Because every tree targets the remaining error, boosting reduces bias as well as variance, and it is usually more accurate in practice than random forest. The production implementations are XGBoost (2014), LightGBM (2017), and CatBoost (2017). All three are highly optimized C++ libraries with Python APIs. Standard setup: 500–2,000 trees, max_depth between 4 and 10.
What does overfitting mean for trees?
A single tree, given enough depth, can perfectly classify every training point — and then fail on new data. Symptom: 100% training accuracy, test accuracy collapses. Remedies: cap max_depth (typically 6–10), raise min_samples_leaf (e.g. 20), or switch directly to random forest or gradient boosting. Both ensemble methods are systematically more resistant to overfitting than a single tree.
Do I need a GPU for XGBoost?
No, almost never. XGBoost is heavily CPU-optimized: 100k samples × 50 features train in ~10 seconds on a normal laptop. GPU support exists (`device='cuda'` in XGBoost 2.0+, `tree_method='gpu_hist'` in older releases), but only delivers measurable speedup beyond ~1M rows or on very wide datasets. For 95% of tabular problems, CPU training finishes faster than the time it takes to provision a GPU.
Which libraries should I use? (scikit-learn, XGBoost, LightGBM)
scikit-learn 1.4+ ships DecisionTreeClassifier and RandomForestClassifier, the pragmatic entry point and perfect for first models and teaching. XGBoost 2.0+ and LightGBM 4+ are the production gradient-boosting libraries; CatBoost 1.2+ comes from the Yandex stack and shines on categorical features without one-hot encoding. All three boosting libraries have Python, R, and Spark APIs. In 90% of cases, XGBoost is a safe default.
What's the difference between XGBoost and LightGBM?
Both implement gradient boosting on trees — accuracy on most datasets is practically identical. The difference: LightGBM grows trees leaf-wise (always expand the leaf with the largest loss gain), XGBoost grows level-wise (layer by layer). LightGBM is therefore often 2–10× faster on large datasets and uses less memory. XGBoost is considered more robust on small datasets under 10,000 rows — there LightGBM can overfit.
When are neural networks better than trees?
On unstructured data: images, audio, raw text, video. A CNN finds pixel patterns no decision tree could ever model. A transformer captures sentence context that no feature-engineering pipeline can compress into columns. Even on very large tabular data (>10M rows) with complex feature interactions, deep nets can catch up, but they remain more expensive to operate.
Can I train trees on images or text?
Not directly. Trees need tabular input: each row a data point, each column a feature. For images you'd have to turn pixels into columns — that only works for tiny, uniform images (MNIST, 784 columns) and is still weaker than a CNN. For text, TF-IDF vectors or pretrained embeddings (Sentence-BERT) help — you then feed 384- or 768-column tables into XGBoost. A common hybrid in production.
What is feature importance?
Feature importance is a score per input column that expresses how much a feature contributed to predictions. In scikit-learn you read it via `model.feature_importances_`; XGBoost additionally offers `gain`, `weight`, and `cover`. Advantage: out-of-the-box interpretability that deep learning has to retrofit via external tools like SHAP or LIME. In regulated industries (banking, insurance) that's often a deciding criterion for model choice.
Tabular deep learning vs. trees — what's the verdict?
TabNet (Google, 2019), FT-Transformer (Yandex, 2021) and successors have closed the gap on some benchmarks. But landmark studies like Shwartz-Ziv & Armon (2022) and Grinsztajn et al. (NeurIPS 2022) consistently show: gradient-boosted trees still win the majority of tabular benchmarks as of 2026. Where deep tabular models help is multimodal pipelines (table + text + image). For pure tabular tasks, trees remain the rational baseline.