Saturday, June 16, 2018

Logistic Regression & Metric

Logistic Regression

  • Feature vector $x \in \mathbb{R}^{n}$
  • We want to find the weights $\theta$
  • Classification is done using the sigmoid function:

$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$


Likelihood Approach

$$P(y=1 \mid x; \theta) = h_\theta(x), \qquad P(y=0 \mid x; \theta) = 1 - h_\theta(x)$$

More compactly,

$$P(y \mid x; \theta) = h_\theta(x)^{y} \, \big(1 - h_\theta(x)\big)^{1-y}$$

Cost/Likelihood Function

$$L(\theta) = \prod_{i=1}^{N} P\big(y^{(i)} \mid x^{(i)}; \theta\big) = \prod_{i=1}^{N} h_\theta\big(x^{(i)}\big)^{y^{(i)}} \big(1 - h_\theta(x^{(i)})\big)^{1 - y^{(i)}}$$

Log-likelihood function

$$\ell(\theta) = \sum_{i=1}^{N} y^{(i)} \log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big)$$

We want to maximize the log-likelihood, or equivalently minimize the negative log-likelihood.

Graphical illustration of the cost function

  • cost = $-\log\big(h_\theta(x)\big)$, when $y=1$

  • cost = $-\log\big(1 - h_\theta(x)\big)$, when $y=0$

Solve using an LMS-style update rule (stochastic gradient ascent on the log-likelihood):

$$\theta_j := \theta_j + \alpha \big( y^{(i)} - h_\theta(x^{(i)}) \big) x_j^{(i)}$$

Note: Unlike linear regression, here $h_\theta(x)$ is a nonlinear function of $\theta$ and $x$, even though the update rule has the same form.
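
As a sanity check, here is a minimal NumPy sketch of this update rule (the toy data, learning rate, and iteration count are illustrative, not from the original notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Maximize the log-likelihood by gradient ascent.

    X: (N, d) feature matrix, y: (N,) labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)            # h_theta(x) for every sample
        grad = X.T @ (y - h)              # gradient of the log-likelihood
        theta += lr * grad / len(y)       # ascent step (LMS-like update)
    return theta

# toy usage (illustrative data; first column acts as the intercept)
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(theta, preds)
```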

Metric

Accuracy works well when the numbers of positive and negative samples are not skewed. However, if one of the classes is skewed, accuracy can be very misleading. Note that we usually assign the class with few samples as ‘1’ and the other as ‘0’.

E.g. a data set of 2000 samples with 10 positive and 1990 negative. If we always predict ‘0’, disregarding the input, we will have an accuracy of $1990/2000 = 99.5\%$, which seems quite high; however, the model doesn’t learn anything.

Some other metrics that we can think of are:

  • Precision/ Recall
  • F1 score
  • ROC curve
  • AUC

One way to remember this is,
TP = True Positive = Truly predicted as Positive
FN = False Negative = Falsely predicted as Negative

Precision/Recall

  • Precision: Of all the samples predicted as positive, what percentage are truly positive. $\text{Precision} = \frac{TP}{TP + FP}$
  • Recall: Of all the actual positives, what percentage are predicted as positive. $\text{Recall} = \frac{TP}{TP + FN}$

Using the figure above:

F1 score

F1 score is a single value that combines both precision and recall. A plain average doesn’t work well because we want to give more weight to the lower of the two (precision and recall) scores, so the harmonic mean is used instead: $F_1 = \frac{2PR}{P + R}$.

ROC Curve

A very good explanation of ROC and AUC is here. The ROC curve is basically the True Positive Rate vs. the False Positive Rate as the classification threshold is varied.

AUC is the Area under the ROC curve.  
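
Here is a small sketch tying these metrics together, computed by hand and cross-checked against scikit-learn (the labels and scores below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])   # skewed toward the negative class
y_score = np.array([0.1, 0.2, 0.3, 0.2, 0.4, 0.6, 0.7, 0.8, 0.4, 0.1])
y_pred  = (y_score >= 0.5).astype(int)

# TP / FP / FN by hand, then the derived metrics
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)          # of predicted positives, how many are truly positive
recall    = tp / (tp + fn)          # of actual positives, how many did we catch
f1        = 2 * precision * recall / (precision + recall)

# the same numbers from scikit-learn, plus the area under the ROC curve
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
print(precision, recall, f1, roc_auc_score(y_true, y_score))
```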

Bias-Variance Tradeoff

Actual model:

$$y = f(x) + \epsilon$$

where $\epsilon$ is the irreducible error, with $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$.

Estimation:

We approximate $f(x)$ with $\hat{f}(x)$.

Bias and Variance of the Estimation:

$$\text{Bias}[\hat{f}(x)] = E[\hat{f}(x)] - f(x)$$

$$\text{Var}[\hat{f}(x)] = E\left[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\right]$$

Here, $E[f(x)] = f(x)$, since $f(x)$ is deterministic.


MSE of the estimation as a function of the bias and variance of the estimation:

$$E\left[\big(y - \hat{f}(x)\big)^2\right] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$
Example of bias-variance tradeoff


  • Overfitting is low bias, high variance
  • Underfitting is high bias, low variance

  • Red Line: Bias is high. Variance is low.
  • Blue Line: Bias is low. Variance is high.
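
A hedged numerical illustration of the tradeoff (synthetic sine data and polynomial degrees chosen only to show the effect; not from the original notes): a degree-1 fit underfits (high bias), a degree-15 fit overfits (high variance), and both end up with a higher average test error than a moderate fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def fit_and_test_error(degree, n_train=30, n_test=200, noise=0.3):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = true_f(x_tr) + noise * rng.standard_normal(n_train)
    x_te = rng.uniform(0, 1, n_test)
    y_te = true_f(x_te) + noise * rng.standard_normal(n_test)
    coefs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    return np.mean((np.polyval(coefs, x_te) - y_te) ** 2)

# average the test MSE over many resampled training sets
# (NumPy may warn that the high-degree fit is poorly conditioned)
for degree in (1, 3, 15):
    mse = np.mean([fit_and_test_error(degree) for _ in range(50)])
    print(f"degree {degree:2d}: avg test MSE = {mse:.3f}")
```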


Properties: Model Complexity

  • The more complex the model, the lower the bias.
  • The more complex the model, the higher the variance.


Properties: Regularization

  • The lower the regularization parameter $\lambda$, the lower the bias, the higher the variance.
  • The higher the regularization parameter $\lambda$, the higher the bias, the lower the variance.


Properties: Number of Samples

  • Increasing the sample size will decrease the variance.
  • If a learning algorithm is suffering from high bias (under-fit), getting more training data will not (by itself) help much.

  • If a learning algorithm is suffering from high variance, getting more training data is likely to help.

Summary

  • Getting more training examples fixes high variance (overfit model, use more training examples)
  • Using smaller sets of features fixes high variance (overfit model, use a smaller number of features)
  • Getting additional features fixes high bias (underfit model, get more features)
  • Adding polynomial features (i.e. a more complex model) fixes high bias (underfit model, increase model complexity)
  • Increasing $\lambda$ fixes high variance (overfit model, increase $\lambda$)
  • Decreasing $\lambda$ fixes high bias (underfit model, decrease $\lambda$)

So, for overfit model (low bias, high variance):

  • Increase sample size
  • Reduce number of features
  • Increase regularization

For underfit model (high bias, low variance):

  • Get more features
  • Increase model complexity
  • Decrease regularization

Quick Review of CNN

This is a summary of the Coursera course Convolutional Neural Networks.

How convolution works

For horizontal edge detection, we can use the filter:

$$\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}$$

How the edge detector works is clear from the following figure:
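
To make this concrete, here is a minimal NumPy sketch (a toy 6x6 image with a bright top half and a dark bottom half; 'valid' convolution, stride 1):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation (no padding, stride 1, square inputs)."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# top half bright, bottom half dark -> a horizontal edge in the middle
image = np.vstack([np.full((3, 6), 10.0), np.zeros((3, 6))])
horizontal_edge = np.array([[ 1,  1,  1],
                            [ 0,  0,  0],
                            [-1, -1, -1]])
print(conv2d_valid(image, horizontal_edge))   # large values along the edge rows
```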

Equations of Size

  • Image: $n \times n$
  • Filter: $f \times f$

Then:

  • Output: $(n - f + 1) \times (n - f + 1)$

With Padding

If the image has a padding of $p$ on all sides, then:

  • Output: $(n + 2p - f + 1) \times (n + 2p - f + 1)$

With Stride

Instead of rolling over each pixel, hop $s$ steps.

  • Output: $\left( \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1 \right)$

Padding Types

  • Valid: No padding ($p = 0$)
  • Same: The idea is to make the input size and the output size the same. So, use the above equation to determine the padding size that makes the input and output sizes the same; for stride 1 this gives $p = \frac{f - 1}{2}$ (see the helper function below).
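
A small helper that encodes the size equations above (assuming a square input and filter; 'same' padding here assumes stride 1 and an odd filter size):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the output size equal to the input size (s = 1, odd f)."""
    return (f - 1) // 2

assert conv_output_size(6, 3) == 4                        # valid convolution
assert conv_output_size(6, 3, p=same_padding(3)) == 6     # 'same' convolution
assert conv_output_size(7, 3, p=0, s=2) == 3              # with stride
```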

Convolution over Volume

  • Note: the number of channels in the image must match the number of channels in the filter.
  • We sum over all dimensions (including channels), so each filter produces a single-channel output.

Multiple Filters

Example of CNN

  • Note: each convolutional layer also contains a bias term and a non-linear activation (e.g. ReLU).

Below is an example of a simple CNN:

Layers in a CNN

  1. Convolutional Layer: Convolution with a filter (the number of channels of the filter and of the input must be the same); the input might be padded, and the stride might be more than 1. Similar to a typical NN, the convolutional layer also has a bias + non-linear activation.
  2. Pooling Layer
  3. Fully Connected Layer
  4. 1X1 convolution

Pooling Layer

  • Has two hyperparameters: stride $s$, filter size $f$
  • The pooling layer has no parameters to learn
  • Works well in CNNs; however, why it works so well is not well understood
  • Because there is nothing to learn, it is very cheap to compute.

Size

Input: $n_H \times n_W \times n_C$
Hyperparameters: $f$, $s$
Output: $\left( \left\lfloor \frac{n_H - f}{s} \right\rfloor + 1 \right) \times \left( \left\lfloor \frac{n_W - f}{s} \right\rfloor + 1 \right) \times n_C$

Note: the number of input channels and the number of output channels are the same, because the same pooling operation is applied to each channel independently.

Example: Max Pool

Example: Average Pool
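
Since the figures are not reproduced here, a minimal NumPy sketch of both pooling operations on a single channel (f = 2, s = 2; the example matrix is made up):

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pool a single channel of shape (n_H, n_W) with an f x f window and stride s."""
    n_h, n_w = x.shape
    out_h, out_w = (n_h - f) // s + 1, (n_w - f) // s + 1
    agg = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = agg(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 6., 8.],
              [3., 1., 1., 0.],
              [1., 2., 2., 4.]])
print(pool2d(x, mode="max"))   # [[6., 8.], [3., 4.]]
print(pool2d(x, mode="avg"))   # [[3.5, 4.25], [1.75, 1.75]]
```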

Different CNN

  • Classic
    • LeNet-5 (LeCun 1998)
    • AlexNet (2012)
    • VGG-16 (2015), VGG-19
  • Recent
    • ResNet (very deep 152 layers) (2015)
    • Inception (uses 1x1 convolution) (2014)

1x1 Convolution

(Network in Network, Lin et al. 2013)
  • Adds non-linearity to the network
  • Helps reduce the number of channels if the number of channels becomes too large

An example of a 1x1 CONV for a 1-channel image is as follows:

It seems like it just multiplies the image by a constant. However, in the case of multiple channels, we can think of it as a fully connected layer applied across the channels at each spatial position, as shown in the figure below:

Example use-case to reduce number of channels:

Here, we have 32 1x1 conv filters, each of dimension 1x1x192 (192 because the number of channels of the input and of the filter has to be the same).
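
A shape-level sketch of this channel reduction with NumPy (random tensors, purely illustrative): a 1x1 convolution is just a fully connected layer over the channels, applied at every spatial position.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))      # input volume: H x W x C_in
w = rng.standard_normal((192, 32))          # 32 filters, each of shape 1 x 1 x 192
b = rng.standard_normal(32)

# 1x1 convolution == per-pixel matrix multiply over the channel dimension, then ReLU
out = np.maximum(np.einsum("hwc,cf->hwf", x, w) + b, 0.0)
print(out.shape)   # (28, 28, 32) -- channels reduced from 192 to 32
```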

LeNet

AlexNet

  • Similar to LeNet, but much bigger (60M params compared to 60K)

VGG-16

  • Simplified architecture using the same kind of operations over and over.
  • The main downside is that it is a pretty big network, with ~138M params.
  • The two operations are:
    • Convolution: 3x3 filters, s=1, Same padding
    • Pooling: 2x2, s=2

ResNet

  • The issue with very deep networks is the exploding/vanishing gradient problem. ResNet makes use of ‘skip connections’ (or ‘shortcuts’) to help with this problem.

Inception Net

Two key ideas:

  • Inception Block: (try out everything you want)
  • Bottleneck Layer: Reduces computational cost. E.g. the direct convolution shown below

    needs 120M multiplications. In contrast, using a bottleneck layer as shown below:

    only needs 12.5M multiplications (see the arithmetic check after this list).
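
A quick arithmetic check of those two numbers, assuming the usual course example (a 28x28x192 input, a 5x5 CONV producing 28x28x32, and a 1x1 bottleneck down to 16 channels), which is consistent with the 120M and ~12.5M figures above:

```python
# direct 5x5 CONV: every output value needs 5*5*192 multiplications
direct = 28 * 28 * 32 * (5 * 5 * 192)          # ~120.4M multiplications

# bottleneck: 1x1 CONV down to 16 channels, then 5x5 CONV up to 32 channels
step1 = 28 * 28 * 16 * (1 * 1 * 192)           # ~2.4M
step2 = 28 * 28 * 32 * (5 * 5 * 16)            # ~10.0M
bottleneck = step1 + step2                     # ~12.4M multiplications

print(f"direct: {direct/1e6:.1f}M, bottleneck: {bottleneck/1e6:.1f}M")
```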

CART & XGBoost

CART (Classification & Regression Trees)

CART partitions the space of all joint predictor variable values into disjoint regions $R_j$, $j = 1, \ldots, J$ (see top right-hand figure). The output of the model in each of these disjoint regions is a constant $c_j$. That is, the predictive rule is: $x \in R_j \Rightarrow f(x) = c_j$.

The output of the overall model can be written in a more general way as follows:

$$f(x) = \sum_{j=1}^{J} c_j \, I(x \in R_j)$$

Here, $I(\cdot)$ is the indicator function: $I(x \in R_j) = 1$ if $x \in R_j$, and $0$ otherwise.

At each step, the CART algorithm needs:

  • Splitting Variable $j$
  • Splitting Point $s$

We have observations $(x_i, y_i)$ for $i = 1, \ldots, N$. If we minimize the squared error, $\sum_i (y_i - f(x_i))^2$, the best $\hat{c}_j$ for region $R_j$ turns out to be the average of all the $y_i$ in that region: $\hat{c}_j = \text{ave}(y_i \mid x_i \in R_j)$.

For a regression tree, to find the best variable to split on and the best splitting point, we use a greedy approach. Say we consider variable $j$ to split on at point $s$. We get two regions:

$$R_1(j, s) = \{ x \mid x_j \le s \}, \qquad R_2(j, s) = \{ x \mid x_j > s \}$$

Then we minimize the following:

$$\min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \right]$$

We choose the splitting variable $j$ and splitting point $s$ that minimize the above objective function. Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions.
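
A minimal sketch of this greedy search for a single split (squared-error criterion, exhaustive over variables and observed thresholds; the toy data is made up):

```python
import numpy as np

def best_split(X, y):
    """Find the (variable, threshold) minimizing the total squared error
    of predicting each region's mean. X: (N, d), y: (N,)."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if sse < best[2]:
                best = (j, s, sse)
    return best

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.9])
print(best_split(X, y))   # splits around x <= 3, as expected
```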

Boosting and Stagewise Additive Model

Boosting builds an additive model of weak learners. If $b(x; \gamma)$ is a weak learner, then we combine $M$ such learners in a stagewise manner, i.e. at each stage $m$ we optimize only the new learner $b(x; \gamma_m)$, while keeping all the previously added learners fixed. This significantly reduces the chances of overfitting.

For example, boosting for a regression problem turns out to be the problem of repeatedly fitting the residuals, as shown below. At stage 0, the residual is the output itself, i.e. $r_{i0} = y_i$. Here, $\nu$ is the shrinkage factor used to shrink down the contribution of each tree, which also helps with overfitting.

For a more general framework, the forward stagewise additive modeling algorithm looks like this:

In the boosted tree model, the weak learner is a CART tree, and the algorithm is solved in the forward stagewise manner:

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i, \, f_{m-1}(x_i) + T(x_i; \Theta_m)\big)$$

in order to find the regions and constants of the next tree, $\Theta_m = \{ R_{jm}, c_{jm} \}$.

At each step $m$, the solution tree $T(x; \hat{\Theta}_m)$ is the one that maximally reduces the above objective, given the current model $f_{m-1}$ and its fits $f_{m-1}(x_i)$.

To compare the above with a gradient descent algorithm: in gradient descent, we initialize with some value at the first step, then in each subsequent step we move in the negative direction of the gradient.
The tree predictions here are analogous to the components of the negative gradient. The main difference is that in the tree fit, the components of the "gradient" vector are constrained to be the predictions of a decision tree, whereas in gradient descent, the negative gradient is the unconstrained maximal descent direction. In a gradient boosting machine, at the $m$-th iteration, we train a tree $T(x; \Theta_m)$ whose predictions are as close as possible to the negative gradient in a least-squares sense:

$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big( -g_{im} - T(x_i; \Theta) \big)^2$$

where the gradient is defined as

$$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$$

The following table shows the gradients of the commonly used loss functions:

We can see that for squared-error regression, the negative gradient is just the ordinary residual $y_i - f(x_i)$.

The following shows a more general version of the algorithm:

In summary, GBM involves three elements:

  1. A differentiable loss function to be optimized.
  2. A weak learner to make predictions; in this case, CART trees.
  3. An additive model to add weak learners to minimize the loss function. Trees are added one at a time, and existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees (see the sketch below).
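
A minimal sketch of this residual-fitting loop, using shallow scikit-learn regression trees as the weak learners (squared-error loss, so the negative gradient is just the residual; the shrinkage value and synthetic data are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_trees=100, nu=0.1, max_depth=2):
    """Gradient boosting for squared-error regression: fit each tree to the residuals."""
    f0 = np.mean(y)                      # initial constant model
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        residual = y - pred              # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += nu * tree.predict(X)     # shrunken update
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees, nu=0.1):
    # nu must match the value used during fitting
    return f0 + nu * np.sum([t.predict(X) for t in trees], axis=0)

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
f0, trees = gbm_fit(X, y)
print(np.mean((gbm_predict(X, f0, trees) - y) ** 2))   # training MSE
```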

XGBoost

XGBoost adds an explicit regularization term and a second-order (Taylor) approximation of the loss to the above formulation. Objective:

$$\text{obj} = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

With $K$ trees, the model output is:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

Therefore, the objective function becomes:

$$\text{obj} = \sum_{i=1}^{N} l\left(y_i, \sum_{k=1}^{K} f_k(x_i)\right) + \sum_{k=1}^{K} \Omega(f_k)$$

Here, we cannot use methods such as SGD to find the $f_k$ (since they are trees, instead of just numerical vectors). So the solution is additive training:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

For a regression problem, we minimize the residual error at each step, as shown below:

For a more general case, we use the second-order Taylor series expansion:

$$l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) \approx l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)$$

Then the objective function can be approximated as:

$$\text{obj}^{(t)} \approx \sum_{i=1}^{N} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

Here,

$$g_i = \partial_{\hat{y}_i^{(t-1)}} \, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}} \, l\big(y_i, \hat{y}_i^{(t-1)}\big)$$

Removing the constant terms, the objective function becomes:

$$\sum_{i=1}^{N} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

Example: Regression (again)

For the squared-error loss $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$:

Then, $g_i = 2\big(\hat{y}_i^{(t-1)} - y_i\big)$ and $h_i = 2$.

We can re-formulate the tree-building mechanism based on the above expansion. We define a tree by a vector of scores in its leaves, $w$, and a leaf index mapping function $q(x)$ that maps an input instance to a leaf, so that $f_t(x) = w_{q(x)}$.

For the following tree,

The mapping function $q(x)$ looks like:

We can define (one possible definition) the regularization term based on the above definition, where $T$ is the number of leaves:

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

Then, for the above example, we have,

We can define the instance set in leaf $j$ as $I_j = \{ i \mid q(x_i) = j \}$. For example, for the following input instances:

the instance set looks like: , , .
We can re-group the objective by each leaf:

$$\text{obj}^{(t)} = \sum_{i=1}^{N} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T$$

This is a sum of $T$ independent quadratic functions of the $w_j$. Defining $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, and using the fact that a quadratic $G w + \frac{1}{2}(H + \lambda) w^2$ is minimized at $w^* = -\frac{G}{H + \lambda}$ with minimum value $-\frac{1}{2} \frac{G^2}{H + \lambda}$, we have:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

An exhaustive search to find the optimum tree at step $t$ would mean enumerating all possible tree structures, computing $\text{obj}^*$ for each, and picking the best one, which is intractable.

So, we can use a greedy algorithm that grows the tree one split at a time, choosing at each node the split with the largest gain:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
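
A minimal sketch of the leaf weight $w_j^*$ and the split gain used by that greedy search (squared-error loss as in the regression example above, so $g_i = 2(\hat{y}_i - y_i)$ and $h_i = 2$; $\lambda$ and $\gamma$ are the regularization hyperparameters, and the toy numbers are made up):

```python
import numpy as np

def leaf_weight(g, h, lam):
    """Optimal leaf score w* = -G / (H + lambda) for the instances in one leaf."""
    return -np.sum(g) / (np.sum(h) + lam)

def leaf_score(g, h, lam):
    """Contribution -G^2 / (2 * (H + lambda)) of one leaf to the objective."""
    return -np.sum(g) ** 2 / (2 * (np.sum(h) + lam))

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Reduction in the objective from splitting one leaf into left/right children."""
    gl, hl = g[left_mask], h[left_mask]
    gr, hr = g[~left_mask], h[~left_mask]
    gain = (leaf_score(g, h, lam)
            - leaf_score(gl, hl, lam)
            - leaf_score(gr, hr, lam))
    return gain - gamma          # split only if the gain is positive

# toy example: squared-error loss at some boosting step
y      = np.array([1.0, 1.2, 5.0, 5.2])
y_pred = np.array([3.0, 3.0, 3.0, 3.0])
g = 2 * (y_pred - y)             # first-order gradients
h = np.full_like(y, 2.0)         # second-order gradients (constant for squared error)
left = np.array([True, True, False, False])
print(leaf_weight(g, h, lam=1.0), split_gain(g, h, left, lam=1.0, gamma=0.0))
```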

Summary of XGBoost

Tree Size in each tree

A tree of size $J$ (i.e. with $J$ terminal nodes) can involve at most $J - 1$ variables, so it can capture interaction effects up to order $J - 1$; the tree size thus controls the interaction order of the model.