
7.2: Classification Trees


    Classification trees are a type of decision tree algorithm used when the response variable is categorical—such as yes/no outcomes, product types, or risk levels. These trees help classify observations into discrete groups based on predictor variables. Each internal node asks a yes-or-no question about the data, and each leaf node assigns a classification based on the path followed. They are highly intuitive and useful when communicating results to non-technical stakeholders.

    Key Concepts in Classification Trees

    Classification trees work by creating rules that partition data into subsets, with each split trying to group similar outcomes together. This structure forms a tree-like diagram with root, internal, and leaf nodes.

    What They Do:

    • Ask a sequence of yes/no questions to split the dataset.
    • Each path leads to a leaf node where a final prediction (classification) is made.
    • The tree prioritizes splits that lead to the purest groupings—where most observations belong to the same category.

    Common Use Cases:

    • Predicting customer churn (will they leave or stay?)
    • Credit risk classification (good/bad customer)
    • Loan approval decisions
    • Employee attrition predictions
    • Diagnosing illnesses based on symptoms

    Splitting Criteria for Classification

Each internal node in a classification tree is determined by finding the split that best separates the classes at that point in the data. Two criteria are commonly used, and a short computational sketch follows the list.

    1. Gini Index
      1. Measures impurity or disorder.
  2. Lower Gini values mean purer splits.
      3. Used by the CART (Classification and Regression Trees) algorithm.
    2. Entropy and Information Gain
      1. Entropy quantifies uncertainty or disorder.
      2. Information Gain measures how much entropy is reduced by a split.
      3. The best split maximizes Information Gain.
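Both criteria can be computed directly from a node's class counts. Below is a minimal Python sketch of Gini impurity, entropy, and information gain; the example class counts are made up for illustration.

    # Minimal sketch: impurity measures computed from a node's class counts,
    # e.g. [responders, non-responders]. The example numbers are hypothetical.
    import math

    def gini(counts):
        # Gini impurity: 1 minus the sum of squared class proportions (0 = pure).
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def entropy(counts):
        # Entropy in bits: -sum(p * log2(p)) over classes with p > 0.
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent, children):
        # Entropy of the parent node minus the weighted entropy of its children.
        n = sum(parent)
        weighted = sum(sum(child) / n * entropy(child) for child in children)
        return entropy(parent) - weighted

    # A 50/50 parent split into two much purer children.
    print(gini([50, 50]))                                  # 0.5, the maximum for two classes
    print(information_gain([50, 50], [[45, 5], [5, 45]]))  # about 0.53 bits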

    Business Example: Predicting Customer Response from a Marketing Campaign

    A retail company aims to boost the effectiveness of its promotional email campaigns by pinpointing which customers are most likely to respond. Using past engagement and purchase history, the model segments customers into distinct groups based on email click activity, age, and prior purchases. This allows the marketing team to target those with the highest probability of responding, allocate promotional resources more efficiently, and design tailored offers that resonate with each segment, ultimately driving higher conversion rates and improving overall campaign return on investment.
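To make the setup concrete, here is a minimal Python sketch of fitting a CART-style classification tree with scikit-learn. The column names (email_clicks, age, prior_purchases) and the tiny dataset are hypothetical, chosen only to mirror the predictors described above; this is not the data behind the tree discussed next.

    # Hypothetical sketch: fitting a classification tree to campaign-style data.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Each row: email clicks, age, prior purchases, and whether the customer responded.
    data = pd.DataFrame({
        "email_clicks":    [5, 1, 0, 7, 2, 3, 6, 0, 1, 4],
        "age":             [34, 52, 29, 41, 47, 38, 55, 23, 61, 30],
        "prior_purchases": [2, 0, 0, 3, 1, 1, 4, 0, 2, 1],
        "responded":       [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    })

    X = data[["email_clicks", "age", "prior_purchases"]]
    y = data["responded"]

    # criterion="gini" is the CART default; max_depth keeps the tree readable.
    model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
    model.fit(X, y)

    # Print the learned splits as text rules (root, internal, and leaf nodes).
    print(export_text(model, feature_names=list(X.columns)))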

    Classification Tree

A classification tree for the marketing campaign example appears here; its structure and interpretation are described below.

    Understanding the Tree Structure

    Root node
    The very first node at the top of the tree.
    Contains all records in the dataset.
    Shows the overall distribution between the two classes (in this case, 0 = non-responder, 1 = responder).
    The percentages next to 0 and 1 represent the share of each class in that node.
    The percentage at the bottom (for example, 100%) shows the proportion of the total dataset in that node.

    Terminal nodes (leaves)
    Nodes at the bottom of the tree with no further splits.
    Represent the final segments after applying all decision rules from top to bottom.
    Each terminal node has its own class distribution (0 vs 1) and a share of the total dataset.
    Used to identify which segments are high-value (above-average response) versus low-value (below-average response).

    Difference between 0 and 1
    0 = Non-responder (did not take the desired action).
    1 = Responder (took the desired action).
    The proportion of 1s in the node is the response rate for that segment.

    Bottom percentage in each node
    Indicates the percentage of the entire dataset that falls into that node.
    For example, if a terminal node says n=90; 18%, then 18% of all records are in that segment.

    Analysis of the Tree

    • Node 1 (Root): Overall response rate is 49%. This serves as the baseline for comparison.
• The first split is on email clicks < 4, likely chosen because email engagement is a strong predictor of response: customers with more clicks tend to show higher interest and thus higher response rates.
    • Node 3: Customers with 4 or more clicks have an 85% response rate, which is far above the root rate of 49%. This is the best performing segment and should be prioritized for future campaigns.
    • Node 2: Customers with fewer than 4 clicks are split by age >= 45. Age may influence response likelihood due to differences in engagement or purchasing habits.
    • Node 4: Customers aged 45 or older with fewer than 4 clicks have a 17% response rate, well below the root rate. This is a weak segment for response.
    • Node 5: Customers under 45 with fewer than 4 clicks are split by previous purchases >= 1. Past purchase history is a strong behavioral predictor.
    • Node 6: Customers under 45 with prior purchases have a 45% response rate, slightly below the root rate of 49%. This is a moderate segment worth targeting with tailored offers.
    • Node 7: Customers under 45 with no prior purchases have a 90% response rate, which is significantly above the overall rate. Despite no prior purchase history, this segment responds strongly, making it an attractive target for campaigns.

    Hypothetical Target Strategies

Based on the hypothetical classification tree presented above, the following are the optimal targeting strategies (written out as decision rules in the sketch after this list).

    • Node 3: Customers with 4 or more email clicks — highly engaged with an 85% response rate.
    • Node 7: Younger customers with no prior purchases — strong responders at 90%, ideal for first-purchase offers.
    • Node 6: Younger customers with prior purchases — moderate responders at 45%, worth re-engaging with tailored promotions.
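These segments can be read straight off the tree as nested yes/no rules. The sketch below writes them out as a plain Python function; the thresholds (4 clicks, age 45, one prior purchase) and response rates are the ones from the hypothetical tree in this section, not estimates from real data.

    # Hypothetical sketch: the example tree's decision rules as a scoring function.
    def segment(email_clicks, age, prior_purchases):
        # Return the terminal node label and its response rate from the example tree.
        if email_clicks >= 4:
            return "Node 3: highly engaged", 0.85
        if age >= 45:
            return "Node 4: older, low engagement", 0.17
        if prior_purchases >= 1:
            return "Node 6: younger, prior buyers", 0.45
        return "Node 7: younger, no prior purchases", 0.90

    print(segment(email_clicks=5, age=30, prior_purchases=0))  # ('Node 3: highly engaged', 0.85)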

    Sample Confusion Matrix & Performance Metrics

Confusion Matrix

                  Predicted 1    Predicted 0
    Actual 1          210             40
    Actual 0           50            200

    • Sensitivity / Recall for Responders: 0.8400 (84.00%)
    • Specificity for Non-Responders: 0.8000 (80.00%)
    • Accuracy: 0.8200 (82.00%)
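These three figures follow directly from the four cells of the matrix; the short sketch below recomputes them from the counts above.

    # Recomputing the metrics from the confusion-matrix counts above.
    TP, FN = 210, 40   # actual responders predicted 1 / predicted 0
    FP, TN = 50, 200   # actual non-responders predicted 1 / predicted 0

    sensitivity = TP / (TP + FN)                 # recall for responders
    specificity = TN / (TN + FP)                 # correct rate for non-responders
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    print(f"Sensitivity: {sensitivity:.4f}")  # 0.8400
    print(f"Specificity: {specificity:.4f}")  # 0.8000
    print(f"Accuracy:    {accuracy:.4f}")     # 0.8200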

    Comments

    The classification tree demonstrates solid predictive performance, correctly identifying 84% of actual responders and 80% of actual non-responders. With an overall accuracy of 82%, the model effectively distinguishes between likely responders and non-responders, making it a useful tool for targeted marketing. These results suggest that the tree can guide resource allocation toward high-probability segments, improving campaign efficiency while minimizing wasted outreach.

    Understanding Overfitting and Pruning

    Overfitting occurs when a model captures noise in the training data rather than general patterns. This leads to high accuracy on the training set but poor performance on unseen data.

    Pruning is the process of removing unnecessary branches from a decision tree to prevent overfitting. It simplifies the model by cutting back overly specific splits, which helps improve generalization.
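One common way to prune is cost-complexity pruning, where a single penalty controls how aggressively branches are removed. The sketch below uses scikit-learn's ccp_alpha parameter on a synthetic dataset; the particular alpha value is arbitrary and would normally be tuned, for example with the cross-validation discussed next.

    # Sketch: cost-complexity pruning with scikit-learn. A larger ccp_alpha
    # removes more branches; the synthetic data exists only to make this run.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)

    full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

    # The pruned tree is smaller; overly specific splits have been cut back.
    print("nodes, full:  ", full_tree.tree_.node_count)
    print("nodes, pruned:", pruned_tree.tree_.node_count)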

    Cross-Validation

    Cross-validation is a technique used to assess how a predictive model will perform on unseen data. It involves splitting the dataset into multiple subsets (folds), training the model on some folds, and validating it on the remaining fold. This process is repeated multiple times, and the average performance is calculated to estimate the model’s true predictive power.
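A minimal sketch of 5-fold cross-validation for a classification tree, again on a synthetic dataset, is shown below; only the split, train, validate, and average pattern matters here.

    # Sketch: 5-fold cross-validation of a classification tree.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)

    # Train on four folds, validate on the fifth, repeat five times, then average.
    scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                             X, y, cv=5, scoring="accuracy")
    print(scores, scores.mean())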

    Advantages of Classification Trees

    • Interpretable: Easy to explain to non-technical audiences.
    • Handles Mixed Data Types: Works with both categorical and numeric variables.
    • No Need for Scaling: Standardization not required.
    • Captures Non-linear Interactions: Models interactions between features automatically.

    Limitations of Classification Trees

    • Overfitting: Deep trees can memorize the training data.
    • Instability: Small changes in input data can lead to very different trees.
• Biased Toward Multi-Level Features: Features with many levels can dominate splits.
    • May Under-perform: Often outperformed by ensemble methods like Random Forests.

    Other Metrics to Consider:

    Lift Chart

A lift and gains chart for this example appears here; its interpretation is described below.

    Interpreting the Lift and Gains Chart

    The Lift and Gains chart visually compares the performance of a predictive model against a random model. The x-axis shows the cumulative population ranked by predicted probability (e.g., top 10%, 20%, etc.), and the y-axis shows the cumulative percentage of actual positive outcomes (e.g., churned customers).

    The Predictive Model (in dark red) shows how effective the model is at capturing positive cases early. The Random Model (black dashed line) represents a baseline where positive outcomes are randomly distributed.
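The gains curve itself is just a cumulative count over customers ranked by predicted probability. The sketch below builds those values from toy arrays of actual outcomes and predicted probabilities; the numbers are illustrative only.

    # Sketch: cumulative gains behind a lift chart, from toy outcome/probability arrays.
    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])            # 1 = positive outcome
    y_prob = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1, 0.35, 0.75])

    # Rank customers from highest to lowest predicted probability.
    order = np.argsort(-y_prob)
    cum_positives = np.cumsum(y_true[order]) / y_true.sum()       # y-axis: % of positives captured
    cum_population = np.arange(1, len(y_true) + 1) / len(y_true)  # x-axis: % of population contacted

    # The random-model baseline is simply cum_population itself.
    for pop, gain in zip(cum_population, cum_positives):
        print(f"top {pop:.0%} of customers -> {gain:.0%} of responders")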

    Kolmogorov-Smirnov (K-S) Statistic

    The K-S statistic measures the maximum vertical distance between the cumulative distributions of the predicted positives and the random model. It quantifies how well the model separates the positive from negative classes.

    In the chart above, the K-S value is approximately 0.48. K-S values typically range from 0 to 1 (or 0 to 100). Values above 0.5 are considered strong separation, with values between 0.6 and 0.75 indicating excellent model performance.
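Following the chart-based description above, the K-S value can be computed as the largest vertical gap between the model's cumulative gains curve and the random baseline. The sketch below does this with the same toy arrays as the gains sketch; the resulting number is illustrative and is not the 0.48 from the chart.

    # Sketch: K-S statistic as the maximum gap between the gains curve and the baseline.
    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
    y_prob = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.4, 0.6, 0.1, 0.35, 0.75])

    order = np.argsort(-y_prob)
    cum_positives = np.cumsum(y_true[order]) / y_true.sum()
    baseline = np.arange(1, len(y_true) + 1) / len(y_true)   # random model

    ks = np.max(cum_positives - baseline)                     # maximum vertical separation
    print(f"K-S statistic: {ks:.2f}")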

    Summary

    While classification trees help businesses make clear yes-or-no decisions, many situations call for estimating a specific numeric value instead. For example, instead of predicting whether a customer will leave, a business might want to predict how much revenue that customer will generate over time. These kinds of continuous outcome predictions require a different type of decision tree—regression trees. In the next section, we’ll explore how regression trees operate, how they differ from classification trees, and how they can be applied to solve common forecasting and estimation problems in business analytics.


    This page titled 7.2: Classification Trees is shared under a CC BY 4.0 license and was authored, remixed, and/or curated by Elbert L. Hearon, M.B.A., M.S..
