ST443 Machine Learning and data mining

Week 9


Philipp Sterzinger

p.sterzinger@lse.ac.uk

27 November 2025

Question 1

Classification trees. This problem concerns the construction of classification trees. Each split is typically made to minimise the cost function

\[ C(T) = \sum_{m = 1}^{\lvert T \rvert} N_m Q_m \,, \] where \(T\) denotes the tree object, \(T\) represents the total number of terminal nodes, \(N_m\) is the number of training data points in the region corresponding to the \(m\)th terminal node, and \(Q_m\) is the node impurity measure for the same region.

In class, we have introduced two common measures: misclassification error and the Gini index.

Consider a two-class classification problem with two inputs and a training dataset illustrated in Figure 1. There are infinitely many possible splits that could be made, but we restrict ourselves to splits at integer values within the region containing the training data.

Question 1

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = factor(c(rep("white", 3), rep("black", 10)))
)

grid_shapes <- list(
  list(type = "line", x0 = 1, x1 = 1, y0 = 0, y1 = 4,
       line = list(color = "grey", width = 1)),
  list(type = "line", x0 = 2, x1 = 2, y0 = 0, y1 = 4,
       line = list(color = "grey", width = 1)),
  list(type = "line", x0 = 3, x1 = 3, y0 = 0, y1 = 4,
       line = list(color = "grey", width = 1)),
  list(type = "line", x0 = 0, x1 = 4, y0 = 1, y1 = 1,
       line = list(color = "grey", width = 1)),
  list(type = "line", x0 = 0, x1 = 4, y0 = 2, y1 = 2,
       line = list(color = "grey", width = 1)),
  list(type = "line", x0 = 0, x1 = 4, y0 = 3, y1 = 3,
       line = list(color = "grey", width = 1))
)

p <- plot_ly() %>%
  add_trace(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(width = 2, color = "black")
    ),
    showlegend = FALSE
  ) %>%
  add_trace(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE
  ) %>%
  add_segments(
    x = 2, xend = 2,
    y = 0, yend = 4,
    line = list(dash = "dash", width = 2),
    visible = FALSE,
    showlegend = FALSE
  ) %>%
  add_segments(
    x = 0, xend = 4,
    y = 2, yend = 2,
    line = list(dash = "dash", width = 2),
    visible = FALSE,
    showlegend = FALSE
  ) %>%
  layout(
    xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    margin = list(t = 90)
  )

p

Decision trees

Let \(y \in \{1,\ldots,K\}\), \(x \in \Re^p\), \[ y = f(x) + \epsilon \]

  • Tree based methods partition covariate space into regions and return a constant output per region

  • Given a partition \(R_1, \ldots, R_m\) of \(\Re^p\), a decision tree outputs the region majority label the majority label \(\hat{y}_j\) for each region \(R_j\): \[ \begin{aligned} \hat f(x) &= \arg\max_k \, \hat p_k(x) \\ \hat p_k(x) &= \sum_{j = 1}^m \mathbb{1}\{x \in R_j\} \hat{p}_{j,k}\\ \hat{p}_{j,k} &= \frac{1}{|R_j|} \sum_{x_i \in R_j} \mathbb{1}\{y_i = k \} \end{aligned} \]

Tree splits

Overall cost of a given partition

\[ C(T) = \sum_{m = 1}^{\lvert T \rvert} N_m Q_m \,, \]

  • Partitions are obtained by series of axis-aligned splits

  • When splitting a region \(R_j\) into two \(R_{j_1}\), \(R_{j_2}\), we seek the covariate and split point that minimises \[ N_{j_1} Q_{j_1} + N_{j_2} Q_{j_2} \] among all available splits

  • Approach is top-down, greedy:

    • Recursively partition regions \(R_j\)
    • Split is solely determined by optimal split at current point

Tree splits

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = factor(c(rep("white", 3), rep("black", 10)))
)

blue_rect1 <- list(
  type = "rect",
  x0 = 0, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)

black_rect1 <- list(
  type = "rect",
  x0 = 1, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)


line_v1 <- list(
  type = "line",
  x0 = 1, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)



shapes_split <- list(
  blue_rect1, 
  black_rect1, 
  line_v1
)

p <- plot_ly() %>%
  add_trace(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(width = 2, color = "black")
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  layout(
    xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE,
    #   scaleanchor = "x",
      scaleratio = 1
    ),
    shapes = list(),
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.05, y = 1.15,
        xanchor = "left", yanchor = "top",
        pad = list(t = 5, r = 10),
        buttons = list(
          list(
            label = "Split off",
            method = "relayout",
            args = list(list(shapes = list()))
          ),
          list(
            label = "Split on",
            method = "relayout",
            args = list(list(shapes = shapes_split))
          )
        )
      )
    ),
    margin = list(t = 80)
  )
p

Node impurity measures

Misclassification error rate

\[ Q_j = \frac{1}{N_j} \sum_{x_i \in R_j} \mathbb{1} \{y_i \neq \hat{y}_{j}\} = 1 - \hat{p}_{j, \hat{y}_j} \]

Gini index

\[ Q_j = \sum_{k = 1}^K \hat p_{j,k} (1 - \hat{p}_{j,k}) \]

Cross entropy

\[ Q_j =- \sum_{k = 1}^K \hat p_{j,k} \log(\hat{p}_{j,k}) \]

Node impurity measures

In binary classification, can look at how the loss changes for different node impurity measures:

Code
ps <- seq(0,1,by =0.001) 

mer <- pmin(ps, 1-ps) 
gini <- 2 * ps * (1 - ps) 
crss <- - ps * log(ps) - (1 - ps) * log(1 - ps) 

plot(ps, mer, type = "l", col = "black", lwd = 2, main = "Node impurity measures")
lines(ps, gini, col = "blue", lwd = 2)
lines(ps, crss / (2 * log(2)), col = "green", lwd = 2)


legend("topright", lty = c(1,1,1), col = c("black", "blue", "green"), legend = c("MER", "Gini", "x-entropy"))

Question 1.a.

Provide an optimal classification tree with two split points to minimise the training misclassification loss

Show

  • 2-split tree with zero training error under MER

Know

  • Trees partition covariate space along principal axes

  • Restrict ourselves to integer valued splits

Question 1.a.

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = factor(c(rep("white", 3), rep("black", 10)))
)

blue_rect1 <- list(
  type = "rect",
  x0 = 0, x1 = 2,
  y0 = 2, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)


black_rect1 <- list(
  type = "rect",
  x0 = 0, x1 = 4,
  y0 = 0, y1 = 2,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

black_rect2 <- list(
  type = "rect",
  x0 = 2, x1 = 4,
  y0 = 2, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v1 <- list(
  type = "line",
  x0 = 2, x1 = 2,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)


line_h <- list(
  type = "line",
  x0 = 0, x1 = 2,
  y0 = 2, y1 = 2,
  line = list(width = 2, dash = "dash", color = "black")
)

shapes_split <- list(
  blue_rect1, 
  black_rect1, black_rect2, 
  line_v1, line_h
)

p <- plot_ly() %>%
  add_trace(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(width = 2, color = "black")
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  layout(
     xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE,
    #   scaleanchor = "x",
      scaleratio = 1
    ),
    shapes = list(),
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.05, y = 1.15,
        xanchor = "left", yanchor = "top",
        pad = list(t = 5, r = 10),
        buttons = list(
          list(
            label = "Split off",
            method = "relayout",
            args = list(list(shapes = list()))
          ),
          list(
            label = "Split on",
            method = "relayout",
            args = list(list(shapes = shapes_split))
          )
        )
      )
    ),
    margin = list(t = 80)
  )

p

Question 1.b

If we use the misclassification error as the node impurity measure, what is the resulting classification tree with two split points obtained through recursive binary splitting?

Discuss the suboptimality of this solution.

Show

  • 2-split decision tree with misclassification error rate

Know

  • Trees partition covariate space along principal axes

  • Restrict ourselves to integer valued splits

  • \(Q_j = \frac{1}{N_j} \sum_{x_i \in R_j} \mathbb{1} \{y_i \neq \hat{y}_{j}\} = 1 - \hat{p}_{j, \hat{y}_j}\)

  • Overall cost of split: \(N_{j_1} Q_{j_1} + N_{j_2} Q_{j_2}\)

Question 1.b

For Misclassification error rate: \[ \begin{aligned} N_{j_1} Q_{j_1} + N_{j_2} Q_{j_2} &= N_{j_1} \frac{1}{N_{j_1}} \sum_{x_i \in R_{j_1}} \mathbb{1} \{y_i \neq \hat{y}_{j_1}\} + N_{j_2} \frac{1}{N_{j_2}} \sum_{x_i \in R_{j_2}} \mathbb{1} \{y_i \neq \hat{y}_{j_2}\} \\ &= \sum_{x_i \in R_{j_1}}\mathbb{1} \{y_i \neq \hat{y}_{j_1}\} + \sum_{x_i \in R_{j_2}} \mathbb{1} \{y_i \neq \hat{y}_{j_2}\} \\ &= \text{Misclassified points in } R_{j_1} \text{and } R_{j_2} \end{aligned} \]

  • just count number of misclassified points resulting from split

Question 1.b

Split on \(X_1\)

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = c(rep("white", 3), rep("black", 10)),
  stringsAsFactors = FALSE
)

thresholds <- c(1, 2, 3)

compute_mer_x1 <- function(t) {
  left <- df$x <= t
  right <- df$x > t
  left_tab <- table(df$class[left])
  right_tab <- table(df$class[right])
  left_maj <- names(left_tab)[which.max(left_tab)]
  right_maj <- names(right_tab)[which.max(right_tab)]
  pred <- ifelse(df$x <= t, left_maj, right_maj)
  mis <- sum(pred != df$class)
  list(error = mis, leftMajor = left_maj, rightMajor = right_maj)
}

res_x1 <- lapply(thresholds, compute_mer_x1)

mer_x1 <- data.frame(
  threshold = thresholds,
  error = c(2, 3, 3),
  stringsAsFactors = FALSE
)

mer_x2 <- data.frame(
  threshold = thresholds,
  error = c(3, 3, 3),
  stringsAsFactors = FALSE
)

split_shapes <- lapply(seq_along(thresholds), function(i) {
  t <- thresholds[i]
  leftMaj <- res_x1[[i]]$leftMajor
  rightMaj <- res_x1[[i]]$rightMajor
  leftCol <- if (leftMaj == "white") "rgba(0,0,255,0.15)" else "rgba(0,0,0,0.12)"
  rightCol <- if (rightMaj == "white") "rgba(0,0,255,0.15)" else "rgba(0,0,0,0.12)"
  list(
    list(
      type = "rect",
      xref = "x",
      yref = "y",
      x0 = 0, x1 = t,
      y0 = 0, y1 = 4,
      fillcolor = leftCol,
      opacity = 1,
      line = list(width = 0),
      layer = "below"
    ),
    list(
      type = "rect",
      xref = "x",
      yref = "y",
      x0 = t, x1 = 4,
      y0 = 0, y1 = 4,
      fillcolor = rightCol,
      opacity = 1,
      line = list(width = 0),
      layer = "below"
    ),
    list(
      type = "line",
      xref = "x",
      yref = "y",
      x0 = t, x1 = t,
      y0 = 0, y1 = 4,
      line = list(color = "black", width = 2, dash = "dash")
    )
  )
})

p_top <- plot_ly() %>%
  add_markers(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(color = "black", width = 2)
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_markers(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  )

p_bottom <- plot_ly(
  data = mer_x1,
  x = ~threshold,
  y = ~error,
  type = "scatter",
  mode = "lines",
  name = "Split on X1",
  showlegend = TRUE
) %>%
  add_trace(
    data = mer_x2,
    x = ~threshold,
    y = ~error,
    type = "scatter",
    mode = "lines",
    name = "Split on X2",
    showlegend = TRUE,
    line = list(dash = "dot")
  ) %>%
  add_trace(
    data = mer_x1[1, ],
    x = ~threshold,
    y = ~error,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = TRUE
  ) %>%
  add_trace(
    data = mer_x1[2, ],
    x = ~threshold,
    y = ~error,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = FALSE
  ) %>%
  add_trace(
    data = mer_x1[3, ],
    x = ~threshold,
    y = ~error,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = FALSE
  )

p <- subplot(p_top, p_bottom, nrows = 2, shareX = TRUE, heights = c(0.65, 0.35))

slider_steps <- lapply(seq_along(thresholds), function(i) {
  vis <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
  vis[4 + i] <- TRUE
  list(
    method = "update",
    args = list(
      list(visible = vis),
      list(shapes = split_shapes[[i]])
    ),
    label = as.character(thresholds[i])
  )
})

p <- p %>%
  layout(
    xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE
    ),
    xaxis2 = list(
      title = "Split (X1)",
      range = c(0.5, 3.5),
      dtick = 1
    ),
    yaxis2 = list(
      title = "Misclassification error"
    ),
    shapes = split_shapes[[1]],
    sliders = list(
      list(
        active = 0,
        currentvalue = list(prefix = "Split at X1 = "),
        pad = list(t = 50),
        steps = slider_steps
      )
    ),
    margin = list(t = 90)
  )

p

Question 1.b

Split on \(X_2\)

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = c(rep("white", 3), rep("black", 10)),
  stringsAsFactors = FALSE
)

thresholds <- c(1, 2, 3)

compute_mer_x2 <- function(t) {
  low <- df$y <= t
  high <- df$y > t
  low_tab <- table(df$class[low])
  high_tab <- table(df$class[high])
  low_maj <- names(low_tab)[which.max(low_tab)]
  high_maj <- names(high_tab)[which.max(high_tab)]
  pred <- ifelse(df$y <= t, low_maj, high_maj)
  mis <- sum(pred != df$class)
  list(error = mis, lowMajor = low_maj, highMajor = high_maj)
}

res_x2 <- lapply(thresholds, compute_mer_x2)

mer_x1 <- data.frame(
  threshold = thresholds,
  error = c(2, 3, 3),
  stringsAsFactors = FALSE
)

mer_x2 <- data.frame(
  threshold = thresholds,
  error = c(3, 3, 3),
  stringsAsFactors = FALSE
)

split_shapes_x2 <- lapply(seq_along(thresholds), function(i) {
  t <- thresholds[i]
  lowMaj <- res_x2[[i]]$lowMajor
  highMaj <- res_x2[[i]]$highMajor
  lowCol <- if (lowMaj == "white") "rgba(0,0,255,0.15)" else "rgba(0,0,0,0.12)"
  highCol <- if (highMaj == "white") "rgba(0,0,255,0.15)" else "rgba(0,0,0,0.12)"
  list(
    list(
      type = "rect",
      xref = "x",
      yref = "y",
      x0 = 0, x1 = 4,
      y0 = 0, y1 = t,
      fillcolor = lowCol,
      opacity = 1,
      line = list(width = 0),
      layer = "below"
    ),
    list(
      type = "rect",
      xref = "x",
      yref = "y",
      x0 = 0, x1 = 4,
      y0 = t, y1 = 4,
      fillcolor = highCol,
      opacity = 1,
      line = list(width = 0),
      layer = "below"
    ),
    list(
      type = "line",
      xref = "x",
      yref = "y",
      x0 = 0, x1 = 4,
      y0 = t, y1 = t,
      line = list(color = "black", width = 2, dash = "dash")
    )
  )
})

p_top <- plot_ly() %>%
  add_markers(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(color = "black", width = 2)
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_markers(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  )

p_bottom <- plot_ly(
  data = mer_x2,
  x = ~threshold,
  y = ~error,
  type = "scatter",
  mode = "lines",
  name = "Split on X2",
  showlegend = TRUE
) %>%
  add_trace(
    data = mer_x1,
    x = ~threshold,
    y = ~error,
    type = "scatter",
    mode = "lines",
    name = "Split on X1",
    showlegend = TRUE,
    line = list(dash = "dot")
  ) %>%
  add_trace(
    data = mer_x2[1, ],
    x = ~threshold,
    y = ~error,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = TRUE
  ) %>%
  add_trace(
    data = mer_x2[2, ],
    x = ~threshold,
    y = ~error,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = FALSE
  ) %>%
  add_trace(
    data = mer_x2[3, ],
    x = ~threshold,
    y = ~error,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = FALSE
  )

p <- subplot(p_top, p_bottom, nrows = 2, heights = c(0.65, 0.35), shareX = FALSE)

slider_steps <- lapply(seq_along(thresholds), function(i) {
  vis <- rep(TRUE, 7)
  vis[5:7] <- FALSE
  vis[4 + i] <- TRUE
  list(
    method = "update",
    args = list(
      list(visible = vis),
      list(shapes = split_shapes_x2[[i]])
    ),
    label = as.character(thresholds[i])
  )
})

p <- p %>%
  layout(
    xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE
    ),
    xaxis2 = list(
      title = "Split (X2)",
      range = c(0.5, 3.5),
      dtick = 1
    ),
    yaxis2 = list(
      title = "Misclassification error"
    ),
    shapes = split_shapes_x2[[1]],
    sliders = list(
      list(
        active = 0,
        currentvalue = list(prefix = "Split at X2 = "),
        pad = list(t = 50),
        steps = slider_steps
      )
    ),
    margin = list(t = 90)
  )

p

Question 1.b

Split 1 1 2 3
X₁ 2 3 3
X₂ 3 3 3
Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = factor(c(rep("white", 3), rep("black", 10)))
)

blue_rect1 <- list(
  type = "rect",
  x0 = 0, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)

black_rect1 <- list(
  type = "rect",
  x0 = 1, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v1 <- list(
  type = "line",
  x0 = 1, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)


shapes_split <- list(
  blue_rect1, 
  black_rect1, 
  line_v1
)

p <- plot_ly() %>%
  add_trace(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(width = 2, color = "black")
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  layout(
     xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE,
    #   scaleanchor = "x",
      scaleratio = 1
    ),
    shapes = list(),
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.05, y = 1.15,
        xanchor = "left", yanchor = "top",
        pad = list(t = 5, r = 10),
        buttons = list(
          list(
            label = "Split off",
            method = "relayout",
            args = list(list(shapes = list()))
          ),
          list(
            label = "Split on",
            method = "relayout",
            args = list(list(shapes = shapes_split))
          )
        )
      )
    ),
    margin = list(t = 80)
  )

p
Code
par(mar = c(0, 0, 0, 0))
plot(0, 0, type = "n", xlim = c(0, 10), ylim = c(0, 10),
     axes = FALSE, xlab = "", ylab = "")

rect(4, 7, 6, 9, col = "grey20", border = "black")
text(5, 8, expression(X[1] > 1), col = "white")

rect(1, 2, 3, 4, col = "lightblue", border = "black")
text(2, 3, "blue")

rect(7, 2, 9, 4, col = "grey20", border = "black")
text(8, 3, "black", col = "white")

arrows(5, 7, 2, 4, length = 0.1)
arrows(5, 7, 8, 4, length = 0.1)

text(2.5, 5.5, expression(X[1] <= 1))
text(7.5, 5.5, expression(X[1] > 1))

Question 1.b

Region \(X_1 \leq 1\)

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = c(rep("white", 3), rep("black", 10)),
  stringsAsFactors = FALSE
)

thresholds <- c(1, 2, 3)

compute_node_split <- function(t) {
  in_node <- df$x <= 1
  lower <- in_node & df$y <= t
  upper <- in_node & df$y > t
  maj <- function(mask) {
    tab <- table(df$class[mask])
    if (length(tab) == 0) return(NA_character_)
    names(tab)[which.max(tab)]
  }
  lower_major <- maj(lower)
  upper_major <- maj(upper)
  list(lowerMajor = lower_major, upperMajor = upper_major)
}

split_info <- lapply(thresholds, compute_node_split)

node_shapes <- lapply(seq_along(thresholds), function(i) {
  t <- thresholds[i]
  lowerMajor <- split_info[[i]]$lowerMajor
  upperMajor <- split_info[[i]]$upperMajor
  lowerCol <- if (identical(lowerMajor, "white")) "rgba(0,0,255,0.15)" else "rgba(0,0,0,0.12)"
  upperCol <- if (identical(upperMajor, "white")) "rgba(0,0,255,0.15)" else "rgba(0,0,0,0.12)"
  list(
    list(
      type = "rect",
      xref = "x",
      yref = "y",
      x0 = 0, x1 = 1,
      y0 = 0, y1 = t,
      fillcolor = lowerCol,
      line = list(width = 0),
      layer = "below"
    ),
    list(
      type = "rect",
      xref = "x",
      yref = "y",
      x0 = 0, x1 = 1,
      y0 = t, y1 = 4,
      fillcolor = upperCol,
      line = list(width = 0),
      layer = "below"
    ),
    list(
      type = "line",
      xref = "x",
      yref = "y",
      x0 = 1, x1 = 1,
      y0 = 0, y1 = 4,
      line = list(color = "black", width = 2, dash = "dot")
    ),
    list(
      type = "line",
      xref = "x",
      yref = "y",
      x0 = 0, x1 = 1,
      y0 = t, y1 = t,
      line = list(color = "black", width = 2, dash = "dash")
    )
  )
})

mer_x1 <- data.frame(
  threshold = thresholds,
  mer = c(1, 1, 1) / 10
)

mer_x2 <- data.frame(
  threshold = thresholds,
  mer = c(0, 0, 1) / 3
)

p_top <- plot_ly() %>%
  add_markers(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(color = "black", width = 2)
    ),
    hoverinfo = "none",
    showlegend = FALSE
  ) %>%
  add_markers(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    hoverinfo = "none",
    showlegend = FALSE
  )

p_bottom <- plot_ly(
  data = mer_x2,
  x = ~threshold,
  y = ~mer,
  type = "scatter",
  mode = "lines",
  name = "Split on X2 in region X1 <= 1",
  showlegend = TRUE
) %>%
  add_trace(
    data = mer_x1,
    x = ~threshold,
    y = ~mer,
    type = "scatter",
    mode = "lines",
    name = "Split on X1 in region X1 <= 1",
    line = list(dash = "dot"),
    showlegend = TRUE
  ) %>%
  add_trace(
    data = mer_x2[1, ],
    x = ~threshold,
    y = ~mer,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = TRUE
  ) %>%
  add_trace(
    data = mer_x2[2, ],
    x = ~threshold,
    y = ~mer,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = FALSE
  ) %>%
  add_trace(
    data = mer_x2[3, ],
    x = ~threshold,
    y = ~mer,
    type = "scatter",
    mode = "markers",
    marker = list(size = 10),
    showlegend = FALSE,
    hoverinfo = "none",
    visible = FALSE
  )

p <- subplot(p_top, p_bottom, nrows = 2, heights = c(0.65, 0.35), shareX = FALSE)

slider_steps <- lapply(seq_along(thresholds), function(i) {
  vis <- rep(TRUE, 7)
  vis[5:7] <- FALSE
  vis[4 + i] <- TRUE
  list(
    method = "update",
    args = list(
      list(visible = vis),
      list(shapes = node_shapes[[i]])
    ),
    label = as.character(thresholds[i])
  )
})

p <- p %>%
  layout(
    xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE
    ),
    xaxis2 = list(
      title = "Split threshold on X2 (in region X1 <= 1)",
      range = c(0.5, 3.5),
      dtick = 1
    ),
    yaxis2 = list(
      title = "Misclassification error rate"
    ),
    shapes = node_shapes[[1]],
    sliders = list(
      list(
        active = 0,
        currentvalue = list(prefix = "Split at X2 = "),
        pad = list(t = 50),
        steps = slider_steps
      )
    ),
    margin = list(t = 90)
  )

p

Question 1.b

Region \(X_1 > 1\)

Split 2 1 2 3
X₁ 1 1 1
X₂ 1 1 1

Region \(X_1 \leq 1\)

Split 2 1 2 3
X₁ - - -
X₂ 0 0 1

Question 1.b

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = factor(c(rep("white", 3), rep("black", 10)))
)

blue_rect1 <- list(
  type = "rect",
  x0 = 0, x1 = 1,
  y0 = 1, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)

black_rect1 <- list(
  type = "rect",
  x0 = 1, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

black_rect2 <- list(
  type = "rect",
  x0 = 0, x1 = 1,
  y0 = 0, y1 = 1,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v1 <- list(
  type = "line",
  x0 = 1, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)

line_h1 <- list(
  type = "line",
  x0 = 0, x1 = 1,
  y0 = 1, y1 = 1,
  line = list(width = 2, dash = "dash", color = "black")
)


shapes_split <- list(
  blue_rect1, 
  black_rect1, 
  black_rect2, 
  line_v1, 
  line_h1
)

p <- plot_ly() %>%
  add_trace(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(width = 2, color = "black")
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  layout(
     xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE,
    #   scaleanchor = "x",
      scaleratio = 1
    ),
    shapes = list(),
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.05, y = 1.15,
        xanchor = "left", yanchor = "top",
        pad = list(t = 5, r = 10),
        buttons = list(
          list(
            label = "Split off",
            method = "relayout",
            args = list(list(shapes = list()))
          ),
          list(
            label = "Split on",
            method = "relayout",
            args = list(list(shapes = shapes_split))
          )
        )
      )
    ),
    margin = list(t = 80)
  )

p
Code
par(mar = c(0, 0, 0, 0))
plot(0, 0, type = "n", xlim = c(-5, 10), ylim = c(-5, 10),
     axes = FALSE, xlab = "", ylab = "")

rect(4, 7, 6, 9, col = "grey", border = "black")
text(5, 8, expression(X[1] > 1), col = "white")

rect(1, 2, 3, 4, col = "grey", border = "black")
text(2, 3, expression(X[2] > 1), col = "white")

rect(7, 2, 9, 4, col = "grey20", border = "black")
text(8, 3, "black", col = "white")

arrows(5, 7, 2, 4, length = 0.1)
arrows(5, 7, 8, 4, length = 0.1)

text(2.5, 5.5, expression(X[1] <= 1))
text(7.5, 5.5, expression(X[1] > 1))

rect(-1, -3, 1, -1, col = "grey20", border = "black")
text(0, -2, "black", col = "white")

rect(3, -3, 5, -1, col = "lightblue", border = "black")
text(4, -2, "blue", col = "white")

arrows(2, 2, 0, -1, length = 0.1)
arrows(2, 2, 4, -1, length = 0.1)

text(0, 0.5, expression(X[2] <= 1))
text(4, 0.5, expression(X[2] > 1))

Question 1.c.

Consider the first split point. If we use the Gini index as the node impurity measure, calculate the cost \(C(T)\) for the following two candidate splits:

  1. \(R_1 = \{X: X_1 \leq 1\}\) and \(R_2 = \{X: X1 > 1\}\)
  2. \(R_1 = \{X: X_1 \leq 2\}\) and \(R_2 = \{X: X1 > 2\}\)

Among the two candidate splits in 1 and 2, where should we make the first split?

Show

  • preferable split

Know

  • Partition regions

  • \(Q_j = \sum_{k = 1}^K \hat p_{j,k} (1 - \hat{p}_{j,k})\)

  • Overall cost of split: \(N_{j_1}Q_{j_1} + N_{j_2}Q_{j_2}\)

Question 1.c.

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = factor(c(rep("white", 3), rep("black", 10)))
)

blue_rect1 <- list(
  type = "rect",
  x0 = 0, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)


black_rect1 <- list(
  type = "rect",
  x0 = 1, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v1 <- list(
  type = "line",
  x0 = 1, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)


shapes_split <- list(
  blue_rect1, 
  black_rect1, 
  line_v1
)


blue_rect2 <- list(
  type = "rect",
  x0 = 0, x1 = 2,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)


black_rect2 <- list(
  type = "rect",
  x0 = 2, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v2 <- list(
  type = "line",
  x0 = 2, x1 = 2,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)


shapes_split2 <- list(
  blue_rect2, 
  black_rect2, 
  line_v2
)

p <- plot_ly() %>%
  add_trace(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(width = 2, color = "black")
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  layout(
    xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE,
      #   scaleanchor = "x",
      scaleratio = 1
    ),
    shapes = list(),
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.05, y = 1.15,
        xanchor = "left", yanchor = "top",
        pad = list(t = 5, r = 10),
        buttons = list(
          list(
            label = "Split 1",
            method = "relayout",
            args = list(list(shapes = shapes_split))
          ),
          list(
            label = "Split 2",
            method = "relayout",
            args = list(list(shapes = shapes_split2))
          )
        )
      )
    ),
    margin = list(t = 80)
  )

p

Question 1.c

In a \(2\)-class classification problem:

\[ \begin{aligned} Q_j &= \sum_{k = 1}^K \hat p_{j,k} (1 - \hat{p}_{j,k}) \\ &= \hat p_{j,1} (1 - \hat{p}_{j,1}) + \hat p_{j,2} (1 - \hat{p}_{j,2}) \\ &= \hat p_{j,1} (1 - \hat{p}_{j,1}) + (1- \hat p_{j,1}) (1 - (1 - \hat{p}_{j,1})) \\ &= 2 \hat p_{j,1} (1 - \hat{p}_{j,1}) \,. \end{aligned} \]

Question 1.c

Split 1

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = factor(c(rep("white", 3), rep("black", 10)))
)

blue_rect1 <- list(
  type = "rect",
  x0 = 0, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)


black_rect1 <- list(
  type = "rect",
  x0 = 1, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v1 <- list(
  type = "line",
  x0 = 1, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)


shapes_split <- list(
  blue_rect1, 
  black_rect1, 
  line_v1
)


blue_rect2 <- list(
  type = "rect",
  x0 = 0, x1 = 2,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)


black_rect2 <- list(
  type = "rect",
  x0 = 2, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v2 <- list(
  type = "line",
  x0 = 2, x1 = 2,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)


shapes_split2 <- list(
  blue_rect2, 
  black_rect2, 
  line_v2
)

p <- plot_ly() %>%
  add_trace(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(width = 2, color = "black")
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  layout(
    xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE,
      #   scaleanchor = "x",
      scaleratio = 1
    ),
    shapes = list(),
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.05, y = 1.15,
        xanchor = "left", yanchor = "top",
        pad = list(t = 5, r = 10),
        buttons = list(
          list(
            label = "Split 1 off",
            method = "relayout",
            args = list(list(shapes = list()))
          ),
          list(
            label = "Split 1 on",
            method = "relayout",
            args = list(list(shapes = shapes_split))
          )
        )
      )
    ),
    margin = list(t = 80)
  )

p
  • \(R_1\): \(Q_1 = 2 \cdot (1/3) \cdot(1 - 1/3) = 4 / 9\)
  • \(N_1: 3\)
  • \(R_2\): \(Q_2 = 2 \cdot (1/10) \cdot(1 - 1/10) = 18/ 100\)
  • \(N_2: 10\)
  • \(C(T) = 3 \cdot 4 / 9 + 10 \cdot 18 / 100 \approx 3.13\)

Question 1.c

Split 2

Code
library(plotly)

df <- data.frame(
  x = c(0.5, 1.5, 0.5,
        0.5, 1.5, 3.5,
        1.5, 2.5, 3.5,
        2.5, 3.5,
        2.5, 3.5),
  y = c(3.5, 3.5, 2.5,
        0.5, 0.5, 0.5,
        1.5, 1.5, 1.5,
        2.5, 2.5,
        3.5, 3.5),
  class = factor(c(rep("white", 3), rep("black", 10)))
)

blue_rect1 <- list(
  type = "rect",
  x0 = 0, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)


black_rect1 <- list(
  type = "rect",
  x0 = 1, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v1 <- list(
  type = "line",
  x0 = 1, x1 = 1,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)


shapes_split <- list(
  blue_rect1, 
  black_rect1, 
  line_v1
)


blue_rect2 <- list(
  type = "rect",
  x0 = 0, x1 = 2,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "blue",
  opacity = 0.15
)


black_rect2 <- list(
  type = "rect",
  x0 = 2, x1 = 4,
  y0 = 0, y1 = 4,
  line = list(width = 0),
  fillcolor = "black",
  opacity = 0.08
)

line_v2 <- list(
  type = "line",
  x0 = 2, x1 = 2,
  y0 = 0, y1 = 4,
  line = list(width = 2, dash = "dash", color = "black")
)


shapes_split2 <- list(
  blue_rect2, 
  black_rect2, 
  line_v2
)

p <- plot_ly() %>%
  add_trace(
    data = subset(df, class == "white"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle-open",
      size = 16,
      line = list(width = 2, color = "black")
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = subset(df, class == "black"),
    x = ~x, y = ~y,
    type = "scatter", mode = "markers",
    marker = list(
      symbol = "circle",
      size = 16,
      color = "black"
    ),
    showlegend = FALSE,
    hoverinfo = "none"
  ) %>%
  layout(
    xaxis = list(
      title = "X1",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "X2",
      range = c(0, 4),
      dtick = 1,
      zeroline = FALSE,
      showgrid = FALSE,
      #   scaleanchor = "x",
      scaleratio = 1
    ),
    shapes = list(),
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.05, y = 1.15,
        xanchor = "left", yanchor = "top",
        pad = list(t = 5, r = 10),
        buttons = list(
          list(
            label = "Split 2 off",
            method = "relayout",
            args = list(list(shapes = list()))
          ),
          list(
            label = "Split 2 on",
            method = "relayout",
            args = list(list(shapes = shapes_split2))
          )
        )
      )
    ),
    margin = list(t = 80)
  )

p
  • \(R_1\): \(Q_1 = 2 \cdot (1/2) \cdot(1 - 1/2) = 1 / 2\)
  • \(N_1: 6\)
  • \(R_2\): \(Q_2 = 2 \cdot 0 \cdot(1 - 0) = 0\)
  • \(N_2: 7\)
  • \(C(T) = 6 \cdot 1/2 + 7 \cdot 0 = 3\)

Question 1.c

N1 p1 N2 p2 Q1 Q2 C
Split 1 3 1/3 10 1/10 4/9 18/100 3.13
Split 2 6 1/2 7 0 1/2 0 3

Learnings

  • Understand tree building process
  • Understand how \(0\)-\(1\) loss leads to suboptimal splits
  • Structured approach for exam setting

Question 2

Random forests. Show that for the average of \(B\) identically distributed variables \(Z_1,\ldots,Z_B\), such that \(\text{var}(Z_i) = \sigma^2\) and \(\text{cov}(Z_i, Z_i) / \sqrt{\text{var}(Z_i)}\text{var}(Z_j)= \rho\) for all \(1 \leq i \neq j \leq B\), \[ \text{var} \left[\frac{1}{B} \sum_{i = 1}^B Z_i \right] = \rho \sigma^2 + (1 - \rho) \sigma^2 \frac{1}{B} \,. \]

Show

  • variance formula

Know

  • \(Z_i\) identically distributed

  • \(\text{var}(c Z) = c^2 \text{var}(Z)\)

  • \(\text{var}(Z) = \mathrm{E}[Z^2] - \mathrm{E}[Z]^2\)

  • \(\text{cov}(Z,Y) = \mathrm{E}[ZY] - \mathrm{E}[Z]\mathrm{E}[Y]\)

Question 2

\[ \begin{aligned} \text{var} \left[\frac{1}{B} \sum_{i = 1}^B Z_i \right] &= \frac{1}{B^2} \text{var} \left[\sum_{i = 1}^B Z_i \right] \\ &= \frac{1}{B^2} \left(\mathrm{E}\left[\left(\sum_{i = 1}^B Z_i\right)^2\right] - \mathrm{E}\left[\sum_{i = 1}^B Z_i\right]^2 \right) \\ &= \frac{1}{B^2} \left(\mathrm{E}\left[\left(\sum_{i = 1}^B Z_i\right)^2\right] - B^2\mathrm{E}\left[Z_i\right]^2 \right) \\ &= \frac{1}{B^2}\mathrm{E}\left[\left(\sum_{i = 1}^B Z_i\right)^2\right] - \mathrm{E}\left[Z_i\right]^2 \\ \end{aligned} \]

Question 2

\[ \begin{aligned} \mathrm{E}\left[\left(\sum_{i = 1}^B Z_i\right)^2\right] &= \mathrm{E}\left[\left(\sum_{i = 1}^B Z_i\right)\left(\sum_{i = 1}^B Z_i\right)\right] \\ &= \mathrm{E}\left[\sum_{i = 1}^B \sum_{j = 1}^B Z_i Z_j \right] \\ &= \mathrm{E}\left[\sum_{i = 1}^B Z_i^2 + \sum_{i = 1}^B \sum_{j \neq i} Z_i Z_j \right] \\ &= \mathrm{E}\left[\sum_{i = 1}^B Z_i^2 \right] + \mathrm{E}\left[\sum_{i = 1}^B \sum_{j \neq i} Z_i Z_j \right] \\ &= \mathrm{E}\left[\sum_{i = 1}^B Z_i^2 \right] + \sum_{i = 1}^B \sum_{j \neq i} \mathrm{E}\left[Z_i Z_j \right] \\ &= B \mathrm{E}[Z_i^2] + B(B-1) \mathrm{E}[Z_i Z_j] \,. \end{aligned} \]

Question 2

\[ \begin{aligned} \text{var} \left[\frac{1}{B} \sum_{i = 1}^B Z_i \right] &= \frac{1}{B^2} \left(B \mathrm{E}[Z_i^2] + B(B-1) \mathrm{E}[Z_i Z_j] - B^2 \mathrm{E}\left[Z_i\right]^2\right) \\ &= \frac{1}{B^2} \left(B \left(\mathrm{E}[Z_i^2] - \mathrm{E}[Z]^2 \right) + B(B-1) \left(\mathrm{E}[Z_i Z_j] - \mathrm{E}[Z_i]^2\right)\right) \\ &= \frac{1}{B^2} \left(B \sigma^2 + B(B-1)\text{cov}(Z_i,Z_j)\right) \\ &= \frac{1}{B^2} \left(B \sigma^2 + B(B-1)\sigma^2 \rho\right) \\ &= \frac{1}{B^2} \left(B \sigma^2 - B \sigma^2 \rho + B^2\sigma^2 \rho\right) \\ &= \frac{1}{B^2} \left(B^2\sigma^2 \rho + (1-\rho) \sigma^2 B\right) \\ &= \rho \sigma^2 + (1-\rho)\sigma^2 \frac{1}{B} \,. \end{aligned} \]

Question 2

Discussion

  • In Bagging, we build trees \(T_1, \ldots, T_B\) from bootstrap samples \(B_1,\ldots, B_B\)

  • Probability that any point of the original sample appears at least once in a bootstrap sample:

\[ p_n = 1 - \left(1- \frac{1}{n} \right)^n \longrightarrow 1 - e^{-1} \approx 0.63, \quad \text{as } n \to \infty \,. \]

  • \(\approx\) \(2/3\) of the original sample appear in each bootstrap sample in expectation

  • Trees are going to be correlated due to similarity of bootstrap data

  • If we introduce randomness in the tree building process by randomly restricting the number of features to split on, we reduce correlation \(\rho\)

  • The term \(\rho \sigma^2\) decreases, while \((1-\rho) \sigma^2 / B\) increases; can control the latter by increasing the number of bootstrap samples to reduce overall variance

Learnings

  • Variance manipulations from simple definitions
  • Understanding random forest decorrelation

Question 3

Variance and bias trade-off of regression trees. Consider \(y_i = f(x_i) + \epsilon_i\), for \(i = 1,\ldots,n\) where \(x_1, \ldots, x_n \in [0,1]^p\) and \(\epsilon_1, \ldots, \epsilon_n\) are i.i.d. with mean zero and variance at most \(\sigma^2\). Assume that \(f\) satisfies \(\lvert f(x) - f(x')\rvert \leq L \| x - x' \|_2\), for all \(x,x' \in [0,1]^p\) for some constant \(L \geq 0\).

Divide \([0,1]\) into intervals of equal length \(h > 0\). Thus, \([0,1]^p\) is divided into \(j\) non-overlapping \(p\)-dimensional regions \(B_1,\ldots,B_J\). Fix an arbitrary \(x \in [0,1]^p\) and define \[ \hat f(x) = \frac{1}{N(x)} \sum_{i = 1}^n y_i \mathbb{1}\{x_i \in B(x)\} \,, \] where \(B(x)\) is the \(p\)-dimensional region containing \(x\) and \[ N(x) = \sum_{i = 1}^n \mathbb{1}\{x_i \in B(x) \} \,. \] Show that \[ \mathrm{E}\left[\left(f(x) - \hat f(x) \right)^2 \right] \leq L^2 p h^2 + \frac{\sigma^2}{N(x)} \,. \]

Question 3

Code
library(plotly)

n_div <- 4
grid_breaks <- seq(0, 1, length.out = n_div + 1)
n_plane_traces <- 3 * (length(grid_breaks) - 2)

g <- seq(0, 1, length.out = 40)
x_grid <- outer(g, g, function(x, y) x)
y_grid <- outer(g, g, function(x, y) y)

fig <- plot_ly()

for (x0 in grid_breaks[-c(1, length(grid_breaks))]) {
  fig <- fig %>%
    add_surface(
      x = matrix(x0, nrow = length(g), ncol = length(g)),
      y = y_grid,
      z = x_grid,
      showscale = FALSE,
      opacity = 0.1,
      colorscale = list(
        c(0, "lightblue"),
        c(1, "lightblue")
      ),
      hoverinfo = "skip",
      showlegend = FALSE,
      visible = TRUE
    )
}

for (y0 in grid_breaks[-c(1, length(grid_breaks))]) {
  fig <- fig %>%
    add_surface(
      x = x_grid,
      y = matrix(y0, nrow = length(g), ncol = length(g)),
      z = y_grid,
      showscale = FALSE,
      opacity = 0.1,
      colorscale = list(
        c(0, "lightblue"),
        c(1, "lightblue")
      ),
      hoverinfo = "skip",
      showlegend = FALSE,
      visible = TRUE
    )
}

for (z0 in grid_breaks[-c(1, length(grid_breaks))]) {
  fig <- fig %>%
    add_surface(
      x = x_grid,
      y = y_grid,
      z = matrix(z0, nrow = length(g), ncol = length(g)),
      showscale = FALSE,
      opacity = 0.1,
      colorscale = list(
        c(0, "lightblue"),
        c(1, "lightblue")
      ),
      hoverinfo = "skip",
      showlegend = FALSE,
      visible = TRUE
    )
}

set.seed(123)
n_points <- 500
pts <- data.frame(
  x = runif(n_points),
  y = runif(n_points),
  z = runif(n_points)
)

pts$y_val <- with(
  pts,
  sin(2 * pi * x) * cos(2 * pi * y) +
    0.5 * sin(2 * pi * z) +
    rnorm(n_points, sd = 0.1)
)

pts$ix <- findInterval(pts$x, grid_breaks, rightmost.closed = TRUE)
pts$iy <- findInterval(pts$y, grid_breaks, rightmost.closed = TRUE)
pts$iz <- findInterval(pts$z, grid_breaks, rightmost.closed = TRUE)

all_cubes <- expand.grid(
  ix = 1:n_div,
  iy = 1:n_div,
  iz = 1:n_div
)
cube_means <- aggregate(y_val ~ ix + iy + iz, data = pts, FUN = mean)
cube_df <- merge(all_cubes, cube_means, by = c("ix", "iy", "iz"), all.x = TRUE)
names(cube_df)[names(cube_df) == "y_val"] <- "cube_mean"
cube_df$cube_mean[is.na(cube_df$cube_mean)] <- mean(pts$y_val)

cmin <- min(cube_df$cube_mean)
cmax <- max(cube_df$cube_mean)

i_faces <- c(0, 0, 4, 4, 0, 0, 3, 3, 0, 0, 1, 1)
j_faces <- c(1, 2, 5, 6, 1, 5, 2, 6, 3, 7, 2, 6)
k_faces <- c(2, 3, 6, 7, 5, 4, 6, 7, 7, 4, 6, 5)

for (row in seq_len(nrow(cube_df))) {
  ix <- cube_df$ix[row]
  iy <- cube_df$iy[row]
  iz <- cube_df$iz[row]
  v <- cube_df$cube_mean[row]
  
  x0 <- grid_breaks[ix]
  x1 <- grid_breaks[ix + 1]
  y0 <- grid_breaks[iy]
  y1 <- grid_breaks[iy + 1]
  z0 <- grid_breaks[iz]
  z1 <- grid_breaks[iz + 1]
  
  xv <- c(x0, x1, x1, x0, x0, x1, x1, x0)
  yv <- c(y0, y0, y1, y1, y0, y0, y1, y1)
  zv <- c(z0, z0, z0, z0, z1, z1, z1, z1)
  
  fig <- fig %>%
    add_trace(
      type = "mesh3d",
      x = xv,
      y = yv,
      z = zv,
      i = i_faces,
      j = j_faces,
      k = k_faces,
      intensity = rep(v, length(xv)),
      intensitymode = "vertex",
      cmin = cmin,
      cmax = cmax,
      colorscale = "Viridis",
      opacity = 0.4,
      showscale = row == 1,
      colorbar = list(title = "Mean f(x) per cube"),
      hoverinfo = "skip",
      name = if (row == 1) "Predictions" else "Predictions",
      showlegend = row == 1,
      visible = FALSE
    )
}

n_cube_traces <- nrow(cube_df)

fig <- fig %>%
  add_markers(
    data = pts,
    x = ~x,
    y = ~y,
    z = ~z,
    marker = list(
      size = 2,
      color = "rgb(120,120,120)"
    ),
    name = "Data",
    visible = TRUE
  )

axis_template <- list(
  range = c(0, 1),
  zeroline = FALSE,
  showgrid = FALSE,
  tickvals = grid_breaks,
  ticktext = grid_breaks
)

fig <- fig %>%
  layout(
    scene = list(
      xaxis = axis_template,
      yaxis = axis_template,
      zaxis = axis_template,
      aspectmode = "cube"
    ),
    updatemenus = list(
      list(
        type = "buttons",
        direction = "right",
        x = 0.1,
        y = 1.1,
        showactive = TRUE,
        buttons = list(
          list(
            label = "Show predictions",
            method = "update",
            args = list(
              list(
                visible = c(
                  rep(TRUE, n_plane_traces),
                  rep(TRUE, n_cube_traces),
                  TRUE
                )
              )
            )
          ),
          list(
            label = "Hide predictions",
            method = "update",
            args = list(
              list(
                visible = c(
                  rep(TRUE, n_plane_traces),
                  rep(FALSE, n_cube_traces),
                  TRUE
                )
              )
            )
          )
        )
      )
    )
  )

fig
  • Corresponds to a regression tree with data-independent splits
  • Simplified settings allows us to analyse basic behaviour of tree
  • Regions can be thought of as the expected splits if data is uniformly distributed
  • Treat features as nonrandom, conclusions can be extended to random features

Question 3

Show

  • MSE decomposition

Know

  • \(\mathrm{E}[(f(x) - \hat{f}(x))^2] = \mathrm{E}[\mathrm{E}[\hat{f}(x)] - f(x)]^2 + \text{var}[\hat{f}(x)]\)

  • Definition of \(\hat f\)

  • Lipschitz continuity of \(f\)

  • Partition of \([0,1]^p\)

Question 3

Bias

\[ \begin{aligned} \mathrm{E}\left[ \hat f(x) \right] &= \mathrm{E}\left[\frac{1}{N(x)} \sum_{i = 1}^n y_i \mathbb{1}\{x_i \in B(x) \} \right] \\ &= \frac{1}{N(x)} \sum_{i = 1}^n \mathrm{E}\left[y_i\right] \mathbb{1}\{x_i \in B(x) \} \\ &= \frac{1}{N(x)} \sum_{i = 1}^n \mathrm{E}\left[f(x_i) + \epsilon_i \right] \mathbb{1}\{x_i \in B(x) \} \\ &= \frac{1}{N(x)} \sum_{i = 1}^n f(x_i) \mathbb{1}\{x_i \in B(x) \} \end{aligned} \]

Question 3

Bias

\[ \begin{aligned} \mathrm{E}\left[ \hat f(x) \right] - f(x) &= \frac{1}{N(x)} \sum_{i = 1}^n f(x_i) \mathbb{1}\{x_i \in B(x) \} - f(x) \\ &= \frac{1}{N(x)} \sum_{i = 1}^n f(x_i) \mathbb{1}\{x_i \in B(x) \} - \frac{1}{N(x)} \sum_{i = 1}^n f(x) \mathbb{1}\{x_i \in B(x) \} \\ &= \frac{1}{N(x)} \sum_{i = 1}^n \mathbb{1}\{x_i \in B(x) \} (f(x_i) - f(x)) \\ & \leq \frac{1}{N(x)} \sum_{i = 1}^n \mathbb{1}\{x_i \in B(x) \} \lvert f(x_i) - f(x)\rvert \\ & \leq \frac{1}{N(x)} \sum_{i = 1}^n \mathbb{1}\{x_i \in B(x) \} L \| x_i - x \|_2 \end{aligned} \]

Question 3

Bias

For \(x_i \in B(x)\): \[ \begin{aligned} \| x_i - x \|_2^2 &= \sum_{j = 1}^p (x_{i,j} - x_j)^2 \\ &\leq p \underset{1 \leq j \leq p}{\max} \, (x_{i,j} - x_j)^2 \\ &\leq p h^2 \end{aligned} \]

Thus \[ \begin{aligned} \mathrm{E}\left[ \hat f(x) \right] - f(x) &\leq \frac{1}{N(x)} \sum_{i = 1}^n \mathbb{1}\{x_i \in B(x) \} L \| x_i - x \|_2 \\ &\leq \frac{1}{N(x)} \sum_{i = 1}^n \mathbb{1}\{x_i \in B(x) \} L \sqrt{p} h \\ &= L \sqrt{p} h \,. \end{aligned} \]

Question 3

Variance

\[ \text{var}(\hat{f}(x)) = \frac{1}{N(x)^2} \sum_{i = 1}^n \text{var}(y_i \mathbb{1}\{ x_i \in B(x)\}) \]

\[ \begin{aligned} \text{var}(y_i \mathbb{1}\{ x_i \in B(x)\}) &= \mathrm{E}[(y_i \mathbb{1}\{ x_i \in B(x)\})^2] - \mathrm{E}[y_i \mathbb{1}\{ x_i \in B(x)\}]^2 \\ &= \mathbb{1}\{ x_i \in B(x)\}^2\mathrm{E}[y_i^2] - \mathrm{E}[y_i]^2\mathbb{1}\{ x_i \in B(x)\} \\ &= \mathbb{1}\{ x_i \in B(x)\} \left(\mathrm{E}[y_i^2] - \mathrm{E}[y_i]^2 \right) \\ &= \mathbb{1}\{ x_i \in B(x)\} \text{var}(y_i) \\ &\leq \mathbb{1}\{ x_i \in B(x)\} \sigma^2 \,. \end{aligned} \]

Thus

\[ \text{var}(\hat{f}(x)) \leq \frac{1}{N(x)^2} \sum_{i = 1}^n \mathbb{1}\{ x_i \in B(x)\} \sigma^2 = \frac{\sigma^2}{N(x)} \,. \]

Question 3

MSE

\[ \mathrm{E}[(f(x) - \hat{f}(x))^2] = \mathrm{E}[\mathrm{E}[\hat{f}(x)] - f(x)]^2 + \text{var}[\hat{f}(x)] \leq \underbrace{L \sqrt{p} h }_{\text{Bias}^2} + \underbrace{\frac{\sigma^2}{N(x)}}_{\text{variance}} \]

  • If we increase the number of splits, i.e. smaller \(h\), then bias decreases
  • With fixed number of data points, as \(h\) becomes smaller, the number of data points per region decreases, so variance increases

Question 3

Variance and curse of dimensionality

  • Let the features be uniformly distributed on \([0,1]^p\)
  • Let \(K = 1 / h\) be the number of regions
  • Let \(N_B\) be the number of points in a particular region \(B\)

\[ \mathrm{E}(N_B) = \mathrm{E}\left[\sum_{i = 1}^n \mathbb{1}\{x_i \in B\} \right] = n \Pr(x_i \in B) \]

\[ \Pr(x_i \in B) = \prod_{j = 1}^p \Pr(x_{i,j} \in B_j) = \prod_{j = 1}^p \Pr(U_j \in [0,h]) =\prod_{j = 1}^n h = h^p \]

\[ \mathrm{E}(N_B) = n h^p \]

Question 3

Variance and curse of dimensionality

\[ \mathrm{E}(N_B) = n h^p = n \left(\frac{1}{K} \right)^p = n \exp\{-\log(K) p \} \,. \]

  • The number of points in a region decreases polynomially fast with the number of regions for a fixed dimension (Bias variance tradeoff)
  • For a fixed number of splits and sample size, the number of points in a region decreases exponentially fast with growing dimension (Curse of dimensionality)

Learnings

  • Bias-variance tradeoff for regression trees and relation to number of partitions
  • Get comfortable with MSE derivation in tricky settings