Logistic Regression
- Feature vector: $x \in \mathbb{R}^{n}$; label $y \in \{0, 1\}$
- We want to find the weights $\theta \in \mathbb{R}^{n}$
- Classification is done using the Sigmoid function:
$$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$
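A minimal sketch of this hypothesis in Python (the helper names `sigmoid` and `predict`, and the example values, are mine for illustration, not from the post):

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """h_theta(x): estimated probability that y = 1 given x."""
    return sigmoid(x @ theta)

theta = np.array([0.5, -1.0])   # illustrative weights
x = np.array([2.0, 1.0])        # illustrative feature vector
p = predict(theta, x)           # theta^T x = 0, so p = 0.5
print(p, "-> class", int(p >= 0.5))
```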
Likelihood Approach
Assume
$$P(y=1 \mid x; \theta) = h_{\theta}(x), \qquad P(y=0 \mid x; \theta) = 1 - h_{\theta}(x)$$
More compactly,
$$p(y \mid x; \theta) = \left( h_{\theta}(x) \right)^{y} \left( 1 - h_{\theta}(x) \right)^{1-y}$$
Cost/Likelihood Function
Assuming the training examples are independent, the likelihood of the parameters is
$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \left( h_{\theta}(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_{\theta}(x^{(i)}) \right)^{1 - y^{(i)}}$$
Log-likelihood function
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log \left( 1 - h_{\theta}(x^{(i)}) \right)$$
We want to maximize the log-likelihood, or equivalently minimize the negative log-likelihood.
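A sketch of the negative log-likelihood under these definitions (the `eps` clipping guard and the toy data are my additions, not from the post):

```python
import numpy as np

def neg_log_likelihood(theta, X, y):
    """-l(theta) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for every row of X
    eps = 1e-12                            # avoid log(0) at saturated h
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # toy design matrix
y = np.array([1, 0, 1])
print(neg_log_likelihood(np.zeros(2), X, y))  # 3 * log(2) ~ 2.079
```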
Graphical illustration of the cost function
- cost = $-\log(h_{\theta}(x))$, when $y = 1$
- cost = $-\log(1 - h_{\theta}(x))$, when $y = 0$
Solve using the LMS-style update rule:
$$\theta_{j} := \theta_{j} + \alpha \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x_{j}^{(i)}$$
Note: Unlike linear regression, here $h_{\theta}(x)$ is a nonlinear function of $\theta$ and $x$, so this is not the same algorithm even though the update looks identical.
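A minimal batch gradient-descent sketch of this update (the learning rate, iteration count, $1/m$ scaling, and toy data are my illustrative choices):

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, n_iters=2000):
    """Minimize the negative log-likelihood by gradient descent.

    Each step applies theta := theta + lr * mean_i (y_i - h_i) x_i,
    the same form as the LMS update but with the nonlinear h.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        theta += lr * X.T @ (y - h) / len(y)
    return theta

# Toy 1-D problem with an intercept column: y = 1 iff the feature > 0.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic(X, y)
print((1.0 / (1.0 + np.exp(-X @ theta)) >= 0.5).astype(int))  # [0 0 1 1]
```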
Metrics
Accuracy works well for problems where the positive and negative classes are not skewed. However, if one of the classes is skewed, accuracy can be very misleading. Note that we usually assign the class with fewer samples as ‘1’ and the other as ‘0’.
E.g. a data set of 2000 samples with 10 positive and 1990 negative. If we always predict $\hat{y} = 0$, disregarding all input, we will have an accuracy of $1990/2000 = 99.5\%$, which seems quite high; however, the model hasn’t learned anything.
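A quick sketch of that failure mode:

```python
import numpy as np

# 2000 samples: 10 positive ('1'), 1990 negative ('0').
y_true = np.array([1] * 10 + [0] * 1990)
y_pred = np.zeros_like(y_true)        # always predict the majority class

print(np.mean(y_true == y_pred))      # 0.995 -- high accuracy, zero learning
```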
Some other metrics that we can think of are:
- Precision/ Recall
- F1 score
- ROC curve
- AUC
One way to remember these abbreviations is:
TP = True Positive = Truly predicted as Positive
FN = False Negative = Falsely predicted as Negative
Precision/Recall
- Precision: Of all the samples predicted as positive, what percentage are truly positive.
- Recall: Of all the actual positives, what percentage are predicted as positive.
In terms of the counts above:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
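These two formulas as a small sketch (plain Python, no library assumed; the function name and toy labels are mine):

```python
def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN); 1 = positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5)
```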
F1 score
The F1 score is a single value that combines both precision and recall. A plain average doesn’t work well because we want to give more weight to the lower of the two scores; the F1 score is their harmonic mean:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
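A sketch contrasting the plain average with the harmonic mean (the example precision/recall values are illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

p, r = 0.9, 0.1                 # good precision, terrible recall
print((p + r) / 2)              # 0.5  -- the average looks deceptively OK
print(f1_score(p, r))           # 0.18 -- F1 is dragged down by the low recall
```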
ROC Curve
A very good explanation of ROC and AUC is here. The ROC curve is basically the True Positive Rate plotted against the False Positive Rate as the classification threshold is varied.
AUC is the Area under the ROC curve.
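Since AUC also equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (ties counting one half), it can be computed directly from pairwise comparisons. A sketch of that equivalent rank formulation (the O(P·N) comparison is fine for illustration, not for large data):

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as P(score of random positive > score of random negative),
    counting ties as one half."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 1, 0, 0, 0]
s = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2]   # one positive ranked below a negative
print(roc_auc(y, s))                 # 8/9 ~ 0.889
```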