
7.2: Classification Trees


    Classification trees are a type of decision tree algorithm used when the response variable is categorical—such as yes/no outcomes, product types, or risk levels. These trees help classify observations into discrete groups based on predictor variables. Each internal node asks a yes-or-no question about the data, and each leaf node assigns a classification based on the path followed. They are highly intuitive and useful when communicating results to non-technical stakeholders.

    Key Concepts in Classification Trees

    Classification trees work by creating rules that partition data into subsets, with each split trying to group similar outcomes together. This structure forms a tree-like diagram with root, internal, and leaf nodes.

    What They Do:

    • Ask a sequence of yes/no questions to split the dataset.
    • Each path leads to a leaf node where a final prediction (classification) is made.
    • The tree prioritizes splits that lead to the purest groupings—where most observations belong to the same category.

    Common Use Cases:

    • Predicting customer churn (will they leave or stay?)
    • Credit risk classification (good/bad customer)
    • Loan approval decisions
    • Employee attrition predictions
    • Diagnosing illnesses based on symptoms

    Splitting Criteria for Classification

Each internal node in a classification tree is determined by finding the split that best separates the classes at that point in the data. Two criteria are commonly used, and a short computational sketch follows the list.

    1. Gini Index
      1. Measures impurity or disorder.
  2. Lower Gini values mean purer splits.
      3. Used by the CART (Classification and Regression Trees) algorithm.
    2. Entropy and Information Gain
      1. Entropy quantifies uncertainty or disorder.
      2. Information Gain measures how much entropy is reduced by a split.
      3. The best split maximizes Information Gain.
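Both criteria can be computed directly from a node's class counts. Below is a minimal Python sketch of Gini impurity, entropy, and information gain; the example class counts are made up for illustration.

    # Minimal sketch: impurity measures computed from a node's class counts,
    # e.g. [responders, non-responders]. The example numbers are hypothetical.
    import math

    def gini(counts):
        # Gini impurity: 1 minus the sum of squared class proportions (0 = pure).
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def entropy(counts):
        # Entropy in bits: -sum(p * log2(p)) over classes with p > 0.
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent, children):
        # Entropy of the parent node minus the weighted entropy of its children.
        n = sum(parent)
        weighted = sum(sum(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted

    # A 50/50 parent split into two much purer children.
    print(gini([50, 50]))                                  # 0.5, the maximum for two classes
    print(information_gain([50, 50], [[45, 5], [5, 45]]))  # about 0.53 bits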

    Business Example: Predicting Customer Response from a Marketing Campaign

    A retail company aims to boost the effectiveness of its promotional email campaigns by pinpointing which customers are most likely to respond. Using past engagement and purchase history, the model segments customers into distinct groups based on email click activity, age, and prior purchases. This allows the marketing team to target those with the highest probability of responding, allocate promotional resources more efficiently, and design tailored offers that resonate with each segment, ultimately driving higher conversion rates and improving overall campaign return on investment.
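To make the setup concrete, here is a minimal Python sketch of fitting a CART-style classification tree with scikit-learn. The column names (email_clicks, age, prior_purchases) and the tiny dataset are hypothetical, chosen only to mirror the predictors described above; this is not the data behind the tree discussed next.

    # Hypothetical sketch: fitting a classification tree to campaign-style data.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Each row: email clicks, age, prior purchases, and whether the customer responded.
    data = pd.DataFrame({
        "email_clicks":    [5, 1, 0, 7, 2, 3, 6, 0, 1, 4],
        "age":             [34, 52, 29, 41, 47, 38, 55, 23, 61, 30],
        "prior_purchases": [2, 0, 0, 3, 1, 1, 4, 0, 2, 1],
        "responded":       [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    })

    X = data[["email_clicks", "age", "prior_purchases"]]
    y = data["responded"]

    # criterion="gini" is the CART default; max_depth keeps the tree readable.
    model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
    model.fit(X, y)

    # Print the learned splits as text rules (root, internal, and leaf nodes).
    print(export_text(model, feature_names=list(X.columns)))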

    Classification Tree

A classification tree for the marketing campaign example appears here; its structure and interpretation are described below.

    Understanding the Tree Structure

    Root node
    The very first node at the top of the tree.
    Contains all records in the dataset.
    Shows the overall distribution between the two classes (in this case, 0 = non-responder, 1 = responder).
    The percentages next to 0 and 1 represent the share of each class in that node.
    The percentage at the bottom (for example, 100%) shows the proportion of the total dataset in that node.

    Terminal nodes (leaves)
    Nodes at the bottom of the tree with no further splits.
    Represent the final segments after applying all decision rules from top to bottom.
    Each terminal node has its own class distribution (0 vs 1) and a share of the total dataset.
    Used to identify which segments are high-value (above-average response) versus low-value (below-average response).

    Difference between 0 and 1
    0 = Non-responder (did not take the desired action).
    1 = Responder (took the desired action).
    The proportion of 1s in the node is the response rate for that segment.

    Bottom percentage in each node
    Indicates the percentage of the entire dataset that falls into that node.
    For example, if a terminal node says n=90; 18%, then 18% of all records are in that segment.

    Analysis of the Tree

    • Node 1 (Root): Overall response rate is 49%. This serves as the baseline for comparison.
• The first split is on email clicks < 4, likely chosen because email engagement is a strong predictor of response: customers with more clicks tend to show higher interest and thus higher response rates.
    • Node 3: Customers with 4 or more clicks have an 85% response rate, which is far above the root rate of 49%. This is the best performing segment and should be prioritized for future campaigns.
    • Node 2: Customers with fewer than 4 clicks are split by age >= 45. Age may influence response likelihood due to differences in engagement or purchasing habits.
    • Node 4: Customers aged 45 or older with fewer than 4 clicks have a 17% response rate, well below the root rate. This is a weak segment for response.
    • Node 5: Customers under 45 with fewer than 4 clicks are split by previous purchases >= 1. Past purchase history is a strong behavioral predictor.
    • Node 6: Customers under 45 with prior purchases have a 45% response rate, slightly below the root rate of 49%. This is a moderate segment worth targeting with tailored offers.
    • Node 7: Customers under 45 with no prior purchases have a 90% response rate, which is significantly above the overall rate. Despite no prior purchase history, this segment responds strongly, making it an attractive target for campaigns.

    Hypothetical Target Strategies

Based on the hypothetical classification tree presented above, the following are the optimal targeting strategies (written out as decision rules in the sketch after this list).

    • Node 3: Customers with 4 or more email clicks — highly engaged with an 85% response rate.
    • Node 7: Younger customers with no prior purchases — strong responders at 90%, ideal for first-purchase offers.
    • Node 6: Younger customers with prior purchases — moderate responders at 45%, worth re-engaging with tailored promotions.
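These segments can be read straight off the tree as nested yes/no rules. The sketch below writes them out as a plain Python function; the thresholds (4 clicks, age 45, one prior purchase) and response rates are the ones from the hypothetical tree in this section, not estimates from real data.

    # Hypothetical sketch: the example tree's decision rules as a scoring function.
    def segment(email_clicks, age, prior_purchases):
        # Return the terminal node label and its response rate from the example tree.
        if email_clicks >= 4:
            return "Node 3: highly engaged", 0.85
        if age >= 45:
            return "Node 4: older, low engagement", 0.17
        if prior_purchases >= 1:
            return "Node 6: younger, prior buyers", 0.45
        return "Node 7: younger, no prior purchases", 0.90

    print(segment(email_clicks=5, age=30, prior_purchases=0))  # ('Node 3: highly engaged', 0.85)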

    Sample Confusion Matrix & Performance Metrics

Confusion Matrix

                  Predicted 1    Predicted 0
    Actual 1          210             40
    Actual 0           50            200

    • Sensitivity / Recall for Responders: 0.8400 (84.00%)
    • Specificity for Non-Responders: 0.8000 (80.00%)
    • Accuracy: 0.8200 (82.00%)
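These three figures follow directly from the four cells of the matrix; the short sketch below recomputes them from the counts above.

    # Recomputing the metrics from the confusion-matrix counts above.
    TP, FN = 210, 40   # actual responders predicted 1 / predicted 0
    FP, TN = 50, 200   # actual non-responders predicted 1 / predicted 0

    sensitivity = TP / (TP + FN)                 # recall for responders
    specificity = TN / (TN + FP)                 # correct rate for non-responders
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    print(f"Sensitivity: {sensitivity:.4f}")  # 0.8400
    print(f"Specificity: {specificity:.4f}")  # 0.8000
    print(f"Accuracy:    {accuracy:.4f}")     # 0.8200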

    Comments

    The classification tree demonstrates solid predictive performance, correctly identifying 84% of actual responders and 80% of actual non-responders. With an overall accuracy of 82%, the model effectively distinguishes between likely responders and non-responders, making it a useful tool for targeted marketing. These results suggest that the tree can guide resource allocation toward high-probability segments, improving campaign efficiency while minimizing wasted outreach.

    Understanding Overfitting and Pruning

    Overfitting occurs when a model captures noise in the training data rather than general patterns. This leads to high accuracy on the training set but poor performance on unseen data.

    Pruning is the process of removing unnecessary branches from a decision tree to prevent overfitting. It simplifies the model by cutting back overly specific splits, which helps improve generalization.
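One common way to prune is cost-complexity pruning, where a single penalty controls how aggressively branches are removed. The sketch below uses scikit-learn's ccp_alpha parameter on a synthetic dataset; the particular alpha value is arbitrary and would normally be tuned, for example with the cross-validation discussed next.

    # Sketch: cost-complexity pruning with scikit-learn. A larger ccp_alpha
    # removes more branches; the synthetic data exists only to make this run.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)

    full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

    # The pruned tree is smaller; overly specific splits have been cut back.
    print("nodes, full:  ", full_tree.tree_.node_count)
    print("nodes, pruned:", pruned_tree.tree_.node_count)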

    Cross-Validation

    Cross-validation is a technique used to assess how a predictive model will perform on unseen data. It involves splitting the dataset into multiple subsets (folds), training the model on some folds, and validating it on the remaining fold. This process is repeated multiple times, and the average performance is calculated to estimate the model’s true predictive power.
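A minimal sketch of 5-fold cross-validation for a classification tree, again on a synthetic dataset, is shown below; only the split, train, validate, and average pattern matters here.

    # Sketch: 5-fold cross-validation of a classification tree.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)

    # Train on four folds, validate on the fifth, repeat five times, then average.
    scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                             X, y, cv=5, scoring="accuracy")
    print(scores, scores.mean())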

    Advantages of Classification Trees

    • Interpretable: Easy to explain to non-technical audiences.
    • Handles Mixed Data Types: Works with both categorical and numeric variables.
    • No Need for Scaling: Standardization not required.
    • Captures Non-linear Interactions: Models interactions between features automatically.

    Limitations of Classification Trees

    • Overfitting: Deep trees can memorize the training data.
    • Instability: Small changes in input data can lead to very different trees.
• Biased Toward Multi-Level Features: Features with many levels can dominate splits.
    • May Under-perform: Often outperformed by ensemble methods like Random Forests.

    Other Metrics to Consider:

    Lift Chart

A lift and gains chart for this example appears here; its interpretation is described below.

    Interpreting the Lift and Gains Chart

    The Lift and Gains chart visually compares the performance of a predictive model against a random model. The x-axis shows the cumulative population ranked by predicted probability (e.g., top 10%, 20%, etc.), and the y-axis shows the cumulative percentage of actual positive outcomes (e.g., churned customers).

    The Predictive Model (in dark red) shows how effective the model is at capturing positive cases early. The Random Model (black dashed line) represents a baseline where positive outcomes are randomly distributed.
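The gains curve itself is just a cumulative count over customers ranked by predicted probability. The sketch below builds those values from toy arrays of actual outcomes and predicted probabilities; the numbers are illustrative only.

    # Sketch: cumulative gains behind a lift chart, from toy outcome/probability arrays.
    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])            # 1 = positive outcome
    y_prob = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1, 0.35, 0.75])

    # Rank customers from highest to lowest predicted probability.
    order = np.argsort(-y_prob)
    cum_positives = np.cumsum(y_true[order]) / y_true.sum()       # y-axis: % of positives captured
    cum_population = np.arange(1, len(y_true) + 1) / len(y_true)  # x-axis: % of population contacted

    # The random-model baseline is simply cum_population itself.
    for pop, gain in zip(cum_population, cum_positives):
        print(f"top {pop:.0%} of customers -> {gain:.0%} of responders")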

    Kolmogorov-Smirnov (K-S) Statistic

    The K-S statistic measures the maximum vertical distance between the cumulative distributions of the predicted positives and the random model. It quantifies how well the model separates the positive from negative classes.

    In the chart above, the K-S value is approximately 0.48. K-S values typically range from 0 to 1 (or 0 to 100). Values above 0.5 are considered strong separation, with values between 0.6 and 0.75 indicating excellent model performance.
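Following the chart-based description above, the K-S value can be computed as the largest vertical gap between the model's cumulative gains curve and the random baseline. The sketch below does this with the same toy arrays as the gains sketch; the resulting number is illustrative and is not the 0.48 from the chart.

    # Sketch: K-S statistic as the maximum gap between the gains curve and the baseline.
    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
    y_prob = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1, 0.35, 0.75])

    order = np.argsort(-y_prob)
    cum_positives = np.cumsum(y_true[order]) / y_true.sum()
    baseline = np.arange(1, len(y_true) + 1) / len(y_true)   # random model

    ks = np.max(cum_positives - baseline)                     # maximum vertical separation
    print(f"K-S statistic: {ks:.2f}")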

    Summary

    While classification trees help businesses make clear yes-or-no decisions, many situations call for estimating a specific numeric value instead. For example, instead of predicting whether a customer will leave, a business might want to predict how much revenue that customer will generate over time. These kinds of continuous outcome predictions require a different type of decision tree—regression trees. In the next section, we’ll explore how regression trees operate, how they differ from classification trees, and how they can be applied to solve common forecasting and estimation problems in business analytics.


    This page titled 7.2: Classification Trees is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by Elbert L. Hearon, M.B.A., M.S..
