Thursday, March 5, 2020

We Can Learn Logistic Regression From the Wrong Introduction

NOTE: This entry is translated from the original entry, written in Japanese on May 23, 2018.

I saw this tweet.

He strongly criticized the following article from Data Science Central. What did he mean?

Why Logistic Regression should be the last thing you learn when becoming a Data Scientist

What is the point at issue?

First, I tried to draw the same plot with R. Although the author doesn't explain what the boundary curve means, I read between the lines and used the logistic curve1 defined as follows: $$\begin{aligned} y= & \frac{1}{1+\exp(-x)}. \end{aligned}$$
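For reference, here is a rough Python equivalent of what I drew (the original used R; the center and scale values below are placeholders, not the ones I actually used):

```python
import numpy as np
import matplotlib.pyplot as plt

# Logistic curve with an adjustable center x0 and scale s (see footnote 1);
# the default values below are placeholders, not the ones used for the original plot.
def logistic(x, x0=0.0, s=1.0):
    return 1.0 / (1.0 + np.exp(-(x - x0) / s))

x = np.linspace(-10, 10, 200)
plt.plot(x, logistic(x), label="logistic curve")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```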

On the other hand, you may have seen the following figure introduced as logistic regression. The objective variable $y$ takes one of two values, 0 or 1, because logistic regression is a binary classification algorithm. But the figure above shows a wide range of $y$ values. That is clearly contradictory.

Referring to more advanced theory, we can extend the target variable from binary to integer counts, which is what we call Poisson regression. However, this plot displays decimal numbers, not only integers (and of course this is an introduction to logistic regression, so the author should have explained that). In addition, the legend on the right says there are only two labels, True and False.

So I suspect that $y$ denotes not the target variable but an explanatory variable, and that the target variable is binary and indicated by the two colors, True and False. We can then conclude that this figure shows a scatter plot of two explanatory variables $x$ and $y$, colored by the binary target values, and that the 'boundary' curve is the decision boundary. In other words, the author claims that logistic regression can draw a decision boundary shaped like a logistic curve.

Really?

Let's check whether a logistic regression model can really classify the data with a curve like the one in the figure above. I used the scikit-learn library to fit the data and plotted the result:

To compare with the figure above, I colored the points by the predicted values from the logistic model, not the target values, and highlighted the misclassified points with larger triangles. We can see that there are some incorrect predictions around the bottom-left and top-right areas.
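If you want to try this yourself, the following is a minimal sketch of that kind of experiment; the data here are synthetic and only stand in for the ones I actually used (the real notebook is in the gist linked at the end):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: two explanatory variables x, y, and a binary label
# split by a rescaled logistic curve, mimicking the figure in the article.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(500, 2))
boundary = 10.0 / (1.0 + np.exp(-X[:, 0])) - 5.0   # logistic curve rescaled to the data range
true_label = X[:, 1] > boundary

model = LogisticRegression()
model.fit(X, true_label)
pred = model.predict(X)

# Color by the model's predictions and mark misclassified points with larger triangles.
wrong = pred != true_label
plt.scatter(X[~wrong, 0], X[~wrong, 1], c=pred[~wrong].astype(int), s=10)
plt.scatter(X[wrong, 0], X[wrong, 1], c=pred[wrong].astype(int), marker="^", s=80)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```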

Conclusion

A logistic regression cannot draw its own boundary as a logistic curve. We need a bit of mathematical explanation (no black box!). With two explanatory variables $x,y$ and three coefficients $a,b,c$, a dependent variable $z$ is defined as $$\begin{aligned} z= & ax+by+c.\end{aligned}$$ The range of $z$ is unbounded because it depends on the explanatory variables and the coefficients. Next, we transform $z$ with the logistic curve: $$\begin{aligned} p &= \mathit{logistic}(z)\\ &= \frac{1}{1+\exp(-z)}. \end{aligned}$$ The variable $p$, the output of the logistic function, always lies between zero and one. This is an important characteristic of the curve. Now we can interpret $p$ as the probability that the observation is classified as True, so we obtain the prediction by the following rule: $$\begin{aligned} \mathit{prediction}= & \begin{cases} \mathit{True} & \text{if }p\geq0.5\\ \mathit{False} & \text{if }p<0.5 \end{cases}.\end{aligned}$$
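In code, the prediction rule looks like this (a small sketch, with $a,b,c$ taken as already-estimated coefficients):

```python
import numpy as np

# The prediction rule described above, written out explicitly;
# a, b, c stand for coefficients already estimated by the model.
def predict(x, y, a, b, c):
    z = a * x + b * y + c            # linear combination: unbounded range
    p = 1.0 / (1.0 + np.exp(-z))     # logistic transform: squashed into (0, 1)
    return p >= 0.5                  # True iff the estimated probability is at least 0.5
```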

The classification boundary in terms of $x,y$ is where $p=0.5$. Applying the logit function, we obtain the following boundary condition expressed in $x, y$: $$\begin{aligned} \mathit{logit}(0.5)=0= & ax+by+c,\end{aligned}$$ where $\mathit{logit}(p)$ is the logit function, the inverse of the logistic function, $\mathit{logit}(p)=\log(p/(1-p))$. This equation describes a line. The boundary is clearly straight, not a curve. Logistic regression belongs to the category of linear classification models; thus logistic regression cannot draw a nonlinear boundary2.
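You can check this with the model fitted earlier; here is a sketch of how to recover and plot the boundary line (it assumes the `model` object from the snippet above):

```python
# Continuing from the fitted scikit-learn model above: a and b are stored in
# model.coef_, c in model.intercept_, and solving ax + by + c = 0 for y gives
# a straight line, not a curve.
a, b = model.coef_[0]
c = model.intercept_[0]

xs = np.linspace(-5, 5, 100)
ys = -(a * xs + c) / b
plt.plot(xs, ys, "k--", label="decision boundary (p = 0.5)")
plt.legend()
plt.show()
```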

I have published the above programs as a Jupyter notebook on gist: https://gist.github.com/Gedevan-Aleksizde/91be05eb10323aaa234e4351d7ed4db2

P.S. Mr. R Bohn posted a critical comment making some of the same points as my explanation above. The author dodged the point and responded by saying how difficult it is to convey mathematical understanding to a CEO or other non-experts. OK, I understand the difficulty myself because I work with non-experts, as he says. But I can't understand why he argued that we should teach beginners with such a mistaken and nonsensical introduction.


1: I adjusted its center and scale

2: For advanced readers: in fact, you can have logistic regression draw a nonlinear boundary by applying some nonlinear transformation to the explanatory variables (e.g., log, square, and so on), as sketched below. However, the author didn't mention such advanced techniques at all. Besides, I cannot find a known transformation that draws a logistic-shaped boundary. In any case, his introduction is misleading.
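As a rough sketch of what such a transformation looks like (this is only illustrative, reuses the synthetic `X` and `true_label` from the earlier snippet, and does not reproduce a logistic-shaped boundary):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Adding squared and interaction terms of the original features: the boundary is
# still linear in the transformed space, but curved in the original (x, y) plane.
nonlinear_model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
nonlinear_model.fit(X, true_label)
```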