\(\hat{\beta} = \arg \underset{\beta \in \Re^p}{\max} \, l(\beta; y, X)\)
In R this is done with the glm function
Code
logistic_fit <-glm(mpg01 ~ horsepower + weight, family =binomial(link ="logit"), data = Auto.train)
Logistic regression
Classification
Recall that \(\log \left(\mu /(1 - \mu) \right) = x^\top \beta\) so that \(\displaystyle \mu = (1 + \exp \{-x^\top \beta \})^{-1}\)
Given ML estimate \(\hat{\beta}\) and data point \(x\), we can estimate \(\Pr(Y = 1\mid x)\) as \[ \hat{\mu} = (1 + \exp \{ -x^\top \hat{\beta} \})^{-1}\,.\]
Classify data point \(x\) according to \[\mathbb{1} \left\{\hat{\mu} > 1/2 \right\} \,,\] where the function \(\mathbb{1}\{\mathcal{A} \}\) is one when \(\mathcal{A}\) is true and zero otherwise.
Logistic regression
Probability surface
Code
#load required libraries library(plotly)# Create a grid of horsepower and weight valueshorsepower_ticks <-seq(min(Auto.test$horsepower), max(Auto.test$horsepower), length.out =100)weight_ticks <-seq(min(Auto.test$weight), max(Auto.test$weight), length.out =100)background_grid <-expand.grid(horsepower = horsepower_ticks, weight = weight_ticks)# Predict probabilities for the gridbackground_grid$predicted_prob <-predict(logistic_fit, newdata = background_grid, type ="response")# Predict classifications for the actual dataAuto.test$predicted_class <-as.numeric(predict(logistic_fit, newdata = Auto.test, type ="response") >0.5)Auto.test$correct <- Auto.test$predicted_class == Auto.test$mpg01Auto.test$height <-ifelse(Auto.test$mpg01 ==1, 1, 0)plot_ly() %>%layout(autosize =FALSE, width =1000, height =500) %>%add_surface(x =~horsepower_ticks, y =~weight_ticks, z =t(matrix(background_grid$predicted_prob, nrow =length(horsepower_ticks), ncol =length(weight_ticks))), colorscale =list(c(0, 1), c ("#F8766D", "#00BFC4")),showscale =TRUE,colorbar =list(title ="Probability \n high fuel", tickvals =c(.25, 0.5, .75), ticktext =c("0.25","0.5","0.75")),hoverinfo ="x+y+z" ) %>%add_markers(data = Auto.test[Auto.test$mpg01 ==0& Auto.test$correct, ], x =~horsepower, y =~weight, z =~height, marker =list(color ="#F8766D", symbol ="circle",size =5,line =list(width =0.5, color ="#F8766D") ),name ="Low fuel – Correct", hoverinfo ="x+y+z",visible =FALSE,key ="low_fuel_correct" ) %>%add_markers(data = Auto.test[Auto.test$mpg01 ==0&!Auto.test$correct, ], x =~horsepower, y =~weight, z =~height, marker =list(color ="#F8766D", symbol ="cross",size =5,line =list(width =0.5, color ="#F8766D") ),name ="Low fuel – Incorrect", hoverinfo ="x+y+z",visible =FALSE,key ="low_fuel_incorrect" ) %>%add_markers(data = Auto.test[Auto.test$mpg01 ==1& Auto.test$correct, ], x =~horsepower, y =~weight, z =~height, marker =list(color ="#00BFC4", symbol ="circle",size =5,line =list(width =0.5, color ="#00BFC4") ),name ="High fuel – Correct", hoverinfo ="x+y+z",visible =FALSE,key ="high_fuel_correct" ) %>%add_markers(data = Auto.test[Auto.test$mpg01 ==1&!Auto.test$correct, ], x =~horsepower, y =~weight, z =~height, marker =list(color ="#00BFC4", symbol ="cross",size =5,line =list(width =0.5, color ="#00BFC4") ),name ="High fuel – Incorrect", hoverinfo ="x+y+z",visible =FALSE,key ="high_fuel_incorrect" ) %>%layout(scene =list(xaxis =list(title ="horsepower"),yaxis =list(title ="weight"),zaxis =list(title ="Probability") ),title =" ",updatemenus =list(list(type ="buttons",buttons =list(list(method ="restyle",args =list("visible", list(TRUE, FALSE, FALSE, FALSE, FALSE)),label ="Surface Only" ),list(method ="restyle",args =list("visible", list(TRUE, TRUE, TRUE, TRUE, TRUE)),label ="Show All" ),list(method ="restyle",args =list("visible", list(TRUE, TRUE, FALSE, FALSE, FALSE)),label ="Low fuel – Correct" ),list(method ="restyle",args =list("visible", list(TRUE, FALSE, TRUE, FALSE, FALSE)),label ="Low fuel – Incorrect" ),list(method ="restyle",args =list("visible", list(TRUE, FALSE, FALSE, TRUE, FALSE)),label ="High fuel – Correct" ),list(method ="restyle",args =list("visible", list(TRUE, FALSE, FALSE, FALSE, TRUE)),label ="High fuel – Incorrect" ) ) ) ) )
Logistic regression
Classification boundary
Code
# Create a grid of horsepower and weight valueshorsepower_ticks <-seq(min(Auto.test$horsepower), max(Auto.test$horsepower), length.out =100)weight_ticks <-seq(min(Auto.test$weight), max(Auto.test$weight), length.out =100)background_grid <-expand.grid(horsepower = horsepower_ticks, weight = weight_ticks)# Predict probabilities for the gridbackground_grid$predicted_prob <-predict(logistic_fit, newdata = background_grid, type ="response")# Predict classifications for the actual dataAuto.test$predicted_class <-predict(logistic_fit, newdata = Auto.test, type ="response") >0.5Auto.test$correct <-as.numeric(Auto.test$predicted_class) == Auto.test$mpg01# Calculate the decision boundary lineslope <-coef(logistic_fit)[2] / (-coef(logistic_fit)[3])intercept <-coef(logistic_fit)[1] / (-coef(logistic_fit)[3])# Plot the decision boundary and data points with color gradient backgroundggplot() +geom_tile(data = background_grid, aes(x = horsepower, y = weight, fill = predicted_prob), alpha =0.5) +scale_fill_gradient(low ="#F8766D", high ="#00BFC4", name ="Probability high fuel") +geom_point(data = Auto.test, aes(x = horsepower, y = weight, color =as.factor(mpg01), shape = correct), size =2) +scale_color_manual(values =c("1"="#00BFC4", "0"="#F8766D"), labels =c("Low fuel", "High fuel")) +scale_shape_manual(values =c("TRUE"=16, "FALSE"=4), labels =c("Correct", "Incorrect")) +geom_abline(intercept = intercept, slope = slope, linetype ="dashed", color ="black", size =1, alpha =0.6) +theme_minimal() +labs(title =" ", x ="horsepower", y ="weight", color ="Fuel economy", shape ="Classification") +theme(legend.position ="right")
Exercises
Predicting fuel economy
In this problem, we will develop a model to predict whether a given car gets high or low gas mileage based on the Auto dataset from the ISLR package.
Create a binary variable, mpg01, that contains a \(1\) if mpg contains a value above its median, and a \(0\) if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
Code
library(ISLR)mpg01 <-ifelse(Auto$mpg >median(Auto$mpg), 1, 0)Auto$mpg01 <- mpg01head(Auto, n =3)
Split the data into a training set and a test set.
Code
train <- (Auto$year %%2==0) # if the year is eventest <-!trainAuto.train <- Auto[train,]Auto.test <- Auto[test,]mpg01.test <- mpg01[test]
Predicting fuel economy
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
Last week, we saw how to use glmnet to get Ridge and LASSO estimates in a linear model. We can use these estimators for classification too. In this case our estimates of the class probabilities become
\[
\Pr \left(Y = 1 \mid x \right) = x^\top \hat{\beta} \,,
\] where \(\hat{\beta}\) is either the ridge or LASSO estimate. This is called a linear probability model. We can conduct classification just as with logistic regression. However, this linear relationship may not always be appropriate (why?).
Instead, we can use glmnet to add a ridge or LASSO peanlty to our logistic regression model, by specifying family = "binomial" in cv.glmnet.
Year Lag1 Lag2 Lag3
Min. :2001 Min. :-4.922000 Min. :-4.922000 Min. :-4.922000
1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500 1st Qu.:-0.640000
Median :2003 Median : 0.039000 Median : 0.039000 Median : 0.038500
Mean :2003 Mean : 0.003834 Mean : 0.003919 Mean : 0.001716
3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.596750
Max. :2005 Max. : 5.733000 Max. : 5.733000 Max. : 5.733000
Lag4 Lag5 Volume Today
Min. :-4.922000 Min. :-4.92200 Min. :0.3561 Min. :-4.922000
1st Qu.:-0.640000 1st Qu.:-0.64000 1st Qu.:1.2574 1st Qu.:-0.639500
Median : 0.038500 Median : 0.03850 Median :1.4229 Median : 0.038500
Mean : 0.001636 Mean : 0.00561 Mean :1.4783 Mean : 0.003138
3rd Qu.: 0.596750 3rd Qu.: 0.59700 3rd Qu.:1.6417 3rd Qu.: 0.596750
Max. : 5.733000 Max. : 5.73300 Max. :3.1525 Max. : 5.733000
Direction
Down:602
Up :648
Fit LDA and logistic regression models to the train dataset and evaluate their predictons in the test dataset. Fit LDA and logistic regression models to the train dataset and evaluate their predictons in the test dataset. Try to find a model based on some of the covariates such that it has as good accuracy rate as possible.