This is a summary of the Coursera course Convolutional Neural Networks.
How convolution works
For horizontal edge detection, we can use the filter
$\begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix}$
How the edge detector works is clear from the following figure:
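In addition to the figure, here is a minimal numpy sketch of a valid convolution applying the horizontal edge filter to a toy image (the helper name and the test image are my own, not from the course):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid convolution (really cross-correlation, as in most DL frameworks)."""
    n, _ = image.shape
    f, _ = kernel.shape
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# Bright top half, dark bottom half -> strong response along the horizontal edge.
image = np.vstack([np.full((3, 6), 10.0), np.zeros((3, 6))])
horizontal_edge = np.array([[ 1,  1,  1],
                            [ 0,  0,  0],
                            [-1, -1, -1]], dtype=float)
print(conv2d_valid(image, horizontal_edge))
```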
Equations of Size
- Image: $n \times n$
- Filter: $f \times f$
Then:
- Output: $(n - f + 1) \times (n - f + 1)$
With Padding
If the image has a padding of $p$ on all sides, then:
- Output: $(n + 2p - f + 1) \times (n + 2p - f + 1)$
With Stride
Instead of sliding over every pixel, hop $s$ pixels at a time.
- Output: $\left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n + 2p - f}{s} + 1 \right\rfloor$
Padding Types
- Valid: No padding
- Same: the idea is to make the output size equal to the input size. Using the equation above with $s = 1$, the required padding is $p = \frac{f - 1}{2}$ (see the sketch below).
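A small Python sketch (my own helper, not from the course) that applies the output-size formula and the 'same' padding rule:

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    return floor((n + 2 * p - f) / s + 1)

def same_padding(f):
    """Padding that keeps the output size equal to the input size (odd f, s = 1)."""
    return (f - 1) // 2

print(conv_output_size(6, 3))            # valid: 4
print(conv_output_size(6, 3, p=1))       # padded: 6
print(conv_output_size(7, 3, p=0, s=2))  # stride 2: 3
print(same_padding(3))                   # 1
```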
Convolution over Volume
- Note: the number of channels in the image must match the number of channels in the filter.
- At each position, we sum over the filter's height, width, and all channels, so one filter produces a single-channel output (see the sketch below).
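A minimal numpy sketch of convolution over volume; the function and variable names are illustrative, not from the course:

```python
import numpy as np

def conv_volume(image, kernel):
    """One multi-channel filter over a multi-channel image -> one output channel."""
    n_h, n_w, n_c = image.shape
    f, _, kc = kernel.shape
    assert n_c == kc, "filter channels must match image channels"
    out = np.zeros((n_h - f + 1, n_w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum over height, width, and channels of the filter window.
            out[i, j] = np.sum(image[i:i + f, j:j + f, :] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))   # e.g. an RGB patch
kernel = rng.standard_normal((3, 3, 3))  # 3x3 filter with 3 channels
print(conv_volume(image, kernel).shape)  # (4, 4)
```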
Multiple Filters
- Applying $n_C'$ different filters and stacking their outputs gives a result of size $(n - f + 1) \times (n - f + 1) \times n_C'$.
Example of a CNN
- Note: each convolutional layer also contains a bias term and a non-linear activation (e.g. ReLU).
Below is an example of a simple CNN:
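Alongside the figure, here is a hedged Keras sketch of a small CNN with the typical CONV → POOL → CONV → POOL → FC → softmax pattern (the layer sizes are illustrative, not the exact network from the figure):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, kernel_size=5, strides=1, padding="valid", activation="relu"),
    layers.MaxPooling2D(pool_size=2, strides=2),
    layers.Conv2D(16, kernel_size=5, strides=1, padding="valid", activation="relu"),
    layers.MaxPooling2D(pool_size=2, strides=2),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.summary()
```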
Layers in a CNN
- Convolutional Layer: convolution with a filter (the number of channels of the filter and of the input must match); the input might be padded and the stride might be greater than 1. As in a typical NN, the convolutional layer also adds a bias and applies a non-linear activation.
- Pooling Layer
- Fully Connected Layer
- 1x1 Convolution
Pooling Layer
- Has two hyperparameters: stride $s$ and filter size $f$
- The pooling layer has no parameters to learn
- Works well in CNNs, although why it works so well is not fully understood
- Because there is nothing to learn, it is very cheap
Size
Input: $n_H \times n_W \times n_C$
Hyperparameters: $f$, $s$
Output: $\left\lfloor \frac{n_H - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{n_W - f}{s} + 1 \right\rfloor \times n_C$
Note: the number of input channels and the number of output channels are the same, because pooling is applied to each channel independently.
Example: Max Pool
Example: Average Pool
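To complement the figures, a small numpy sketch of 2x2 max and average pooling with stride 2 on a single channel (the toy input and helper name are my own):

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pool one channel with an f x f window and stride s."""
    n_h, n_w = x.shape
    out_h, out_w = (n_h - f) // s + 1, (n_w - f) // s + 1
    out = np.zeros((out_h, out_w))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = reduce_fn(x[i * s:i * s + f, j * s:j * s + f])
    return out

x = np.array([[1, 3, 2, 1],
              [1, 9, 1, 1],
              [5, 6, 1, 2],
              [7, 8, 2, 3]], dtype=float)
print(pool2d(x, mode="max"))  # [[9. 2.] [8. 3.]]
print(pool2d(x, mode="avg"))  # [[3.5  1.25] [6.5  2.  ]]
```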
Different CNN Architectures
- Classic
  - LeNet-5 (LeCun, 1998)
  - AlexNet (2012)
  - VGG-16 (2015), VGG-19
- Recent
  - ResNet (very deep, 152 layers) (2015)
  - Inception (uses 1x1 convolutions) (2014)
1x1 Convolution
(Network in Network, Lin et al., 2013)
- Adds non-linearity to the network
- Helps reduce the number of channels when there are too many
An example of a 1x1 convolution on a 1-channel image is as follows:
It looks as if it just multiplies the image by a constant. However, with multiple channels, we can think of it as a fully connected layer applied across the channels, i.e. each output value is a weighted combination of the $n_C$ channel values at that position, as shown in the figure below:
Example use case for reducing the number of channels:
Here, we have 32 1x1 conv filters, each of dimension 1x1x192 (192 because the number of channels of the input and of the filter must match).
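A hedged Keras sketch of this channel-reduction use case, shrinking a 28x28x192 volume to 28x28x32 with 32 filters of size 1x1x192:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 28, 28, 192))  # a batch of one 28x28x192 volume
conv_1x1 = layers.Conv2D(filters=32, kernel_size=1, activation="relu")
print(conv_1x1(x).shape)                # (1, 28, 28, 32)
```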
LeNet
AlexNet
- Similar to LeNet, but much bigger (~60M parameters compared to ~60K)
VGG-16
- Simplified architecture that uses the same kinds of operations over and over (see the sketch after this list).
- Main downside: it is a pretty big network, with ~138M parameters.
- The two operations are:
  - Convolution: 3x3 filters, $s = 1$, 'same' padding
  - Pooling: 2x2, $s = 2$
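A hedged Keras sketch of this pattern; the filter counts below follow the first two VGG-16 stages (64, then 128), and the remaining stages and fully connected layers are omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

vgg_start = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(64, 3, strides=1, padding="same", activation="relu"),
    layers.Conv2D(64, 3, strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2, strides=2),   # 224 -> 112
    layers.Conv2D(128, 3, strides=1, padding="same", activation="relu"),
    layers.Conv2D(128, 3, strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=2, strides=2),   # 112 -> 56
    # ... further CONV/POOL stages and the fully connected layers follow
])
vgg_start.summary()
```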
ResNet
- The issue with very deep networks is the exploding/vanishing gradient problem. ResNet uses 'skip connections' (or 'shortcuts') to help with it: the activation from an earlier layer is added to the output of a later layer before its activation, i.e. $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$, as sketched below.
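A hedged Keras sketch of a residual block (filter count and input shape are illustrative); 'same' padding keeps the shapes compatible for the addition:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    shortcut = x                                               # a[l]
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)           # z[l+2]
    y = layers.Add()([y, shortcut])                            # z[l+2] + a[l]
    return layers.Activation("relu")(y)                        # a[l+2]

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```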
Inception Net
Two key ideas:
- Inception block: apply several operations (e.g. 1x1, 3x3, and 5x5 convolutions and max pooling) in parallel and concatenate their outputs ('try out everything you want')
- Bottleneck layer: reduces computational cost. E.g. convolving a 28x28x192 volume directly with 32 5x5 filters ('same' padding, giving a 28x28x32 output)
needs 120M multiplications. In contrast, using a bottleneck layer (a 1x1 convolution down to 28x28x16 followed by the 5x5 convolution up to 28x28x32), as shown below:
only needs 12.5M multiplications.
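A quick back-of-the-envelope check of these counts in Python, counting one multiplication per filter weight per output value:

```python
def conv_cost(out_h, out_w, out_c, f, in_c):
    """Multiplications = (number of output values) x (f * f * in_c per value)."""
    return out_h * out_w * out_c * f * f * in_c

# Direct: 28x28x192 -> 5x5 conv, 32 filters -> 28x28x32
direct = conv_cost(28, 28, 32, 5, 192)
# Bottleneck: 1x1 conv down to 16 channels, then 5x5 conv up to 32 channels
bottleneck = conv_cost(28, 28, 16, 1, 192) + conv_cost(28, 28, 32, 5, 16)

print(f"{direct / 1e6:.1f}M")      # ~120.4M
print(f"{bottleneck / 1e6:.1f}M")  # ~12.4M
```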