A statistician argues that logistic regression is correctly named as a regression model, not a classification model. The key insight is that regression should be defined as predicting E[Y|X] — the conditional expectation — rather than by whether the response variable is numeric or categorical. Since logistic regression predicts P(Y=1|X), which equals E[Y|X] for binary outcomes, it is genuinely a regression. Classification only occurs when a threshold is applied on top of those predicted probabilities. The post also critiques the data science convention of defining regression vs. classification purely by the type of response variable, noting that an overfitted decision tree predicting numeric values shows no regression-to-the-mean effect and arguably shouldn't be called a regression at all. R code examples illustrate both logistic regression and the regression-to-the-mean phenomenon.

7m read timeFrom r-bloggers.com
Post cover image
Table of contents
What data scientists think regression isWhat statisticians think regression isWhat regression actually is

Sort: