Classification trees. This problem concerns the construction of classification trees. Each split is typically made to minimise the cost function
\[
C(T) = \sum_{m = 1}^{\lvert T \rvert} N_m Q_m \,,
\] where \(T\) denotes the tree object, \(T\) represents the total number of terminal nodes, \(N_m\) is the number of training data points in the region corresponding to the \(m\)th terminal node, and \(Q_m\) is the node impurity measure for the same region.
In class, we have introduced two common measures: misclassification error and the Gini index.
Consider a two-class classification problem with two inputs and a training dataset illustrated in Figure 1. There are infinitely many possible splits that could be made, but we restrict ourselves to splits at integer values within the region containing the training data.
Let \(y \in \{1,\ldots,K\}\), \(x \in \Re^p\), \[
y = f(x) + \epsilon
\]
Tree based methods partition covariate space into regions and return a constant output per region
Given a partition \(R_1, \ldots, R_m\) of \(\Re^p\), a decision tree outputs the region majority label the majority label \(\hat{y}_j\) for each region \(R_j\): \[
\begin{aligned}
\hat f(x) &= \arg\max_k \, \hat p_k(x) \\
\hat p_k(x) &= \sum_{j = 1}^m \mathbb{1}\{x \in R_j\} \hat{p}_{j,k}\\
\hat{p}_{j,k} &= \frac{1}{|R_j|} \sum_{x_i \in R_j} \mathbb{1}\{y_i = k \}
\end{aligned}
\]
Partitions are obtained by series of axis-aligned splits
When splitting a region \(R_j\) into two \(R_{j_1}\), \(R_{j_2}\), we seek the covariate and split point that minimises \[
N_{j_1} Q_{j_1} + N_{j_2} Q_{j_2}
\] among all available splits
Approach is top-down, greedy:
Recursively partition regions \(R_j\)
Split is solely determined by optimal split at current point
If we use the misclassification error as the node impurity measure, what is the resulting classification tree with two split points obtained through recursive binary splitting?
Discuss the suboptimality of this solution.
Show
2-split decision tree with misclassification error rate
Know
Trees partition covariate space along principal axes
par(mar =c(0, 0, 0, 0))plot(0, 0, type ="n", xlim =c(-5, 10), ylim =c(-5, 10),axes =FALSE, xlab ="", ylab ="")rect(4, 7, 6, 9, col ="grey", border ="black")text(5, 8, expression(X[1] >1), col ="white")rect(1, 2, 3, 4, col ="grey", border ="black")text(2, 3, expression(X[2] >1), col ="white")rect(7, 2, 9, 4, col ="grey20", border ="black")text(8, 3, "black", col ="white")arrows(5, 7, 2, 4, length =0.1)arrows(5, 7, 8, 4, length =0.1)text(2.5, 5.5, expression(X[1] <=1))text(7.5, 5.5, expression(X[1] >1))rect(-1, -3, 1, -1, col ="grey20", border ="black")text(0, -2, "black", col ="white")rect(3, -3, 5, -1, col ="lightblue", border ="black")text(4, -2, "blue", col ="white")arrows(2, 2, 0, -1, length =0.1)arrows(2, 2, 4, -1, length =0.1)text(0, 0.5, expression(X[2] <=1))text(4, 0.5, expression(X[2] >1))
Question 1.c.
Consider the first split point. If we use the Gini index as the node impurity measure, calculate the cost \(C(T)\) for the following two candidate splits:
Understand how \(0\)-\(1\) loss leads to suboptimal splits
Structured approach for exam setting
Question 2
Random forests. Show that for the average of \(B\) identically distributed variables \(Z_1,\ldots,Z_B\), such that \(\text{var}(Z_i) = \sigma^2\) and \(\text{cov}(Z_i, Z_i) / \sqrt{\text{var}(Z_i)}\text{var}(Z_j)= \rho\) for all \(1 \leq i \neq j \leq B\), \[
\text{var} \left[\frac{1}{B} \sum_{i = 1}^B Z_i \right] = \rho \sigma^2 + (1 - \rho) \sigma^2 \frac{1}{B} \,.
\]
\(\approx\)\(2/3\) of the original sample appear in each bootstrap sample in expectation
Trees are going to be correlated due to similarity of bootstrap data
If we introduce randomness in the tree building process by randomly restricting the number of features to split on, we reduce correlation \(\rho\)
The term \(\rho \sigma^2\) decreases, while \((1-\rho) \sigma^2 / B\) increases; can control the latter by increasing the number of bootstrap samples to reduce overall variance
Learnings
Variance manipulations from simple definitions
Understanding random forest decorrelation
Question 3
Variance and bias trade-off of regression trees. Consider \(y_i = f(x_i) + \epsilon_i\), for \(i = 1,\ldots,n\) where \(x_1, \ldots, x_n \in [0,1]^p\) and \(\epsilon_1, \ldots, \epsilon_n\) are i.i.d. with mean zero and variance at most \(\sigma^2\). Assume that \(f\) satisfies \(\lvert f(x) - f(x')\rvert \leq L \| x - x' \|_2\), for all \(x,x' \in [0,1]^p\) for some constant \(L \geq 0\).
Divide \([0,1]\) into intervals of equal length \(h > 0\). Thus, \([0,1]^p\) is divided into \(j\) non-overlapping \(p\)-dimensional regions \(B_1,\ldots,B_J\). Fix an arbitrary \(x \in [0,1]^p\) and define \[
\hat f(x) = \frac{1}{N(x)} \sum_{i = 1}^n y_i \mathbb{1}\{x_i \in B(x)\} \,,
\] where \(B(x)\) is the \(p\)-dimensional region containing \(x\) and \[
N(x) = \sum_{i = 1}^n \mathbb{1}\{x_i \in B(x) \} \,.
\] Show that \[
\mathrm{E}\left[\left(f(x) - \hat f(x) \right)^2 \right] \leq L^2 p h^2 + \frac{\sigma^2}{N(x)} \,.
\]
\[
\mathrm{E}(N_B) = n h^p = n \left(\frac{1}{K} \right)^p = n \exp\{-\log(K) p \} \,.
\]
The number of points in a region decreases polynomially fast with the number of regions for a fixed dimension (Bias variance tradeoff)
For a fixed number of splits and sample size, the number of points in a region decreases exponentially fast with growing dimension (Curse of dimensionality)
Learnings
Bias-variance tradeoff for regression trees and relation to number of partitions
Get comfortable with MSE derivation in tricky settings