Convolutional Neural Network
Mainly used for pattern and object recognition in image data.
A CNN usually consists of the following layers:
Convolution
A convolutional layer performs a convolution operation using two matrices: the image and the feature detector. The feature detector is usually of size 3x3, but other sizes such as 5x5 or 7x7 are also common in some cases.
The following example shows an input matrix of size 7x7 and a feature detector of size 3x3. To simplify the illustration, the input image and feature detector are boolean matrices.
The convolution layer applies a sliding window to the input. Put simply: the operation counts the overlapping ones of the window and the feature detector and saves this number in a so-called feature map. In the first iteration no overlap is found, so the first field of the feature map contains a zero.
The second iteration shows a single overlapping one, so the second field receives a one. And so on.
After a couple of iterations, a position is found with the maximum overlap of four ones for this feature detector. The final result of the convolution operation for the example image and feature detector looks like this.
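The counting described above can be written down in a few lines of NumPy. This is only a minimal sketch: the 7x7 boolean input and the 3x3 diagonal feature detector below are made-up examples for illustration, not the matrices from the figure.

```python
import numpy as np

def convolve2d_valid(image, detector):
    """Slide the detector over the image and, for boolean inputs,
    count the overlapping ones at each position (no padding, stride 1)."""
    ih, iw = image.shape
    kh, kw = detector.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1), dtype=int)
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * detector)  # number of matching ones
    return out

# Hypothetical 7x7 boolean input and 3x3 feature detector.
image = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
])
detector = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

feature_map = convolve2d_valid(image, detector)
print(feature_map)  # 5x5 feature map of overlap counts
```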
The layer normally consists of multiple feature detectors, creating multiple feature maps.
For an interactive live demonstration of image kernels and the convolution operation, have a look at http://setosa.io/ev/image-kernels/
Pooling
The convolution layer extracts features like edges and areas from an image. To allow for slight distortions, a pooling layer is usually inserted right after it. There are several different pooling operations; the most common are max pooling, min pooling and mean pooling.
As an example, we show the pooling operation performed by a max pooling layer of size 2x2 with a stride of 2. A window of size 2x2 slides over the matrix and the maximum value is written to the output.
The first window contains the values 0, 1, 0, 1, so its maximum is 1, which is inserted into the resulting matrix. The stride value determines the number of fields by which the window is shifted in each iteration.
Since we chose a stride of 2, the window is shifted by 2 fields and the second iteration takes the maximum of 0, 0, 1, 1, which is also 1.
The next shift by 2 results in a window that overlaps the border of the matrix, which is allowed. The maximum of the two observed values, 0 and 0, is 0.
Since the stride is 2, we then shift the window down by 2 rows and observe a maximum of 4. This is repeated until the end of the matrix.
The pooling operation is applied to each of the feature maps. The pooling layer is of great importance: a 2x2 pooling layer with stride 2 reduces the size of the matrices by about 75%, allows for the already mentioned slight distortions of features, and helps prevent overfitting.
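A max pooling layer of this kind can be sketched in NumPy as well, assuming that windows overlapping the border simply contain fewer values. The 5x5 example matrix is again made up for illustration.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """2x2 max pooling with stride 2; windows that overlap the border
    are allowed and simply contain fewer values."""
    h, w = feature_map.shape
    out_h = int(np.ceil(h / stride))
    out_w = int(np.ceil(w / stride))
    out = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for r in range(out_h):
        for c in range(out_w):
            window = feature_map[r * stride : r * stride + size,
                                 c * stride : c * stride + size]
            out[r, c] = window.max()  # keep only the strongest activation
    return out

# Made-up 5x5 feature map, pooled down to 3x3.
fmap = np.array([
    [0, 1, 0, 0, 2],
    [1, 1, 0, 1, 0],
    [0, 0, 4, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 2, 0, 0, 1],
])
print(max_pool2d(fmap))
```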
Flatten & Fully Connected Layer
The flatten layer concatenates the values of all pooled feature maps into one long one-dimensional vector, which serves as input for one or more fully connected layers.
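Putting the pieces together, a typical stack of these layers might look as follows in tf.keras. This is only a sketch: the 28x28 grayscale input, the number of feature detectors and the number of output classes are arbitrary assumptions for illustration, not values from this text.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                          # assumed grayscale input
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),      # 32 feature detectors of size 3x3
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2),  # 2x2 max pooling, stride 2
    tf.keras.layers.Flatten(),                                  # concatenate pooled maps into one vector
    tf.keras.layers.Dense(128, activation='relu'),              # fully connected layer
    tf.keras.layers.Dense(10, activation='softmax'),            # assumed 10 output classes
])

model.summary()
```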
Additional Reading
- Yann LeCun et al., 1998, Gradient-Based Learning Applied to Document Recognition
- Jianxin Wu, 2017, Introduction to Convolutional Neural Networks
- C.-C. Jay Kuo, 2016, Understanding Convolutional Neural Networks with A Mathematical Model
- Kaiming He et al., 2015, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
- Dominik Scherer et al., 2010, Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition
- Adit Deshpande, 2016, The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3)
- Rob DiPietro, 2016, A Friendly Introduction to Cross-Entropy Loss
- Peter Roelants, 2016, How to implement a neural network Intermezzo 2