Image Classification using Convolutional Neural Networks (CNNs)

All you need to know about deep neural networks for image classification.

Sahan Dissanayaka
10 min read · Jan 7, 2023

This article focuses on giving a well-rounded intuition of CNN architectures so that any reader can gain a full understanding of their components, usage, and future directions.

What is classification and why is it important?

Classification is a fundamental problem of digital image analysis and computer vision. In image classification, an algorithm analyses the spectral patterns present at the pixel level of the image in order to identify the correct ‘class’ of the image. When the dataset provides many target classes, the classification task estimates the probability that the image belongs to each of the ‘classes’ in the given set. Such a class is typically a label explicitly provided by the practitioner. Integrating computational techniques for image classification has removed the deadly dull traditional practice of checking and classifying images manually, and it currently outperforms the human cognitive ability for classification [1]. Image classification is used in almost every industrial application, from simple face recognition to complex autonomous driving.

How Humans & Computers Perform Image Classification

Intuitively, the human visual system performs classification by dividing the image captured through the lens of the eye into small sub-images and analysing them one by one. By assembling those sub-images, the human brain processes and interprets the scene. When a computer sees an image, it sees an array of pixel values: an array of integers whose dimensions depend on the resolution of the image. If the image contains more than one colour channel, such an array is created for each channel. Each array element takes a value from 0 to L − 1, where L denotes the number of intensity levels determined by the bit depth of the image pixel (e.g., L = 256 for an 8-bit image), and describes the pixel intensity at that position.
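A minimal NumPy sketch of this representation, using a toy 4×4 image (values chosen arbitrarily for illustration):

```python
import numpy as np

# A toy 4x4 grayscale image: each entry is a pixel intensity in [0, L-1].
# For an 8-bit image, L = 256, so intensities range from 0 to 255.
gray = np.array([
    [  0,  64, 128, 255],
    [ 32,  96, 160, 224],
    [ 16,  80, 144, 208],
    [  8,  72, 136, 200],
], dtype=np.uint8)
print(gray.shape)   # (4, 4) -> height x width

# A colour image adds one such array per channel.
rgb = np.stack([gray, gray, gray], axis=-1)
print(rgb.shape)    # (4, 4, 3) -> height x width x channels
```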

Figure 1: Representation of an Image with pixel intensity values [Photo Credits]

Just as humans classify images by recognising the distinguishing features of an object, a computer tries to learn a mechanism that differentiates the given images and discovers the unique features that place each one in a certain class. Image classification algorithms do this by looking for low-level features such as edges and curves, then building up to more abstract features through a series of image-processing operations such as convolution, thresholding, and transformation.

Why are CNNs special for image classification?

Deep neural network (NN) architectures mimic how the human brain works and its connectivity patterns. Just as the human brain consists of billions of neurons, NNs also have neurons arranged in a specific way across many layers; the word deep conveys the complex nature of this structure. Throughout the literature, researchers have proposed many varieties of NNs based on their structure. Among them, the Convolutional Neural Network (CNN), or ConvNet, is a special type of multi-layer deep neural network specialised for extracting image features, and it has achieved wide success in image classification tasks [1]. The main reason to move to CNNs instead of standard neural networks is scalability: although a standard network produces adequate results on smaller images with fewer colour channels, it does not work well when the image size and complexity grow. Standard NNs also become increasingly prone to overfitting over time.

Figure 2: Biological Inspiration of Convolutional Neural Networks [Photo Credits]

Like other NNs, the Convolutional Neural Network architecture is analogous to how the visual system of mammals is designed. A CNN’s neurons are arranged like those of the visual cortex, the brain area responsible for processing visual stimuli [1]. To be specific, CNNs take biological inspiration from the visual cortex, which consists of small regions of cells that are sensitive to specific regions of the visual field. Hubel and Wiesel established this concept in 1962 through a series of experiments. CNNs implement this biological inspiration with their convolutional layers. These layers have the advantage of covering the entire visual field, avoiding the piecemeal image-processing problem of traditional deep neural network architectures, which tend to process images in reduced-resolution pieces. Compared with the older networks, a CNN delivers better performance with image inputs, and also with speech or audio signal inputs [1–3].

Convolutional Neural Networks Architecture

CNN architectures are mainly used to extract features. The use of CNNs for image classification dates back to the 1980s, but until the early 2000s few applications were built with them because of limited computational power [2]. A typical CNN architecture has three core layers for the image classification task: a convolutional layer, a pooling layer, and a fully connected layer. Let’s dive into each of these layers.

  1. Convolutional Layer

The convolutional layer is the most vital component of the whole CNN architecture, and it performs the main computational load of the algorithm. This operation uses a kernel (or filter): it computes a dot product between two matrices, where one matrix is the image (or a receptive field of the original image) and the second is the kernel, which consists of a set of learnable parameters. The resulting image is usually known as a feature map.

Figure 3: Representation of the original image, a 3×3 kernel, and the resulting feature map

The kernel is spatially smaller than the image, but it extends through the image’s full depth. This means that if the image is composed of three (RGB) channels, the kernel’s height and width will be spatially small, but its depth extends across all three channels. The following figure shows how a three-channel (RGB) image is convolved with a 3×3 kernel and how the feature map is generated mathematically.

Figure 4 : Convolutional operation applied for 3-channel RGB image [Photo Credits]
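To make the sliding dot product concrete, here is a minimal NumPy sketch of a naive ‘valid’ convolution (technically cross-correlation, as most CNN frameworks implement it), with a kernel spanning all channels:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' convolution over an (H, W, C) image with a (k, k, C) kernel.

    Returns an (H-k+1, W-k+1) feature map.
    """
    H, W, C = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            # Dot product between the kernel and the receptive field,
            # summed over height, width, and all channels.
            out[i, j] = np.sum(image[i:i+k, j:j+k, :] * kernel)
    return out

rgb = np.random.rand(5, 5, 3)      # toy 5x5 RGB image
kernel = np.random.rand(3, 3, 3)   # 3x3 kernel spanning 3 channels
print(conv2d(rgb, kernel).shape)   # (3, 3) feature map
```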

The convolution operation is applied over the whole image, receptive field by receptive field, from left to right and top to bottom; during the forward pass the kernel slides across the whole region. It produces an activation map (also known as a feature map): a two-dimensional representation of the kernel’s response at each spatial position of the original image. The sliding operation also defines a parameter called the stride, which denotes the sliding (hop) size; the greater the stride, the smaller the feature map. Convolution reduces the dimensions of the output compared with the input image, and the following equations give the resulting size after convolution. A padding technique can be used to avoid this shrinking as the image passes through the convolutional layers.

input image size = n × n

kernel size = k × k

output image size = (n − k + 1) × (n − k + 1)
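The general version of this formula, including stride s and padding p, is ⌊(n + 2p − k)/s⌋ + 1; a small sketch:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Standard output-size formula for a convolution (or pooling) layer."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(28, 3))              # 26: plain 3x3 convolution
print(conv_output_size(28, 3, padding=1))   # 28: 'same' padding preserves size
print(conv_output_size(28, 2, stride=2))    # 14: 2x2 pooling with stride 2
```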

Selection of the kernel type and size is application-dependent [6]. Kernels like Sobel are very good at detecting vertical and horizontal edges, whereas blurring kernels are not suited to edge detection. Kernel size also affects the performance of the image classification task: smaller kernels extract more information from the input image and capture highly localised characteristics, whereas larger kernels extract less information and shrink the layer dimensions drastically, often resulting in poorer performance. It is therefore generally recommended to use smaller kernels, which tend to yield better performance in image classification.
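For illustration, here is a small sketch of the two classic 3×3 Sobel kernels applied at one position of a toy image containing a vertical edge; these are hand-crafted kernels for demonstration, whereas a CNN would learn such weights during training:

```python
import numpy as np

# Hand-crafted Sobel kernels: classic edge detectors.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])   # responds to vertical edges
sobel_y = sobel_x.T                # responds to horizontal edges

# Toy image with a vertical edge: dark left half, bright right half.
img = np.array([[0, 0, 255, 255]] * 4, dtype=float)

# A single convolution position is enough to see the response.
patch = img[0:3, 0:3]
print(np.sum(patch * sobel_x))   # 1020: strong response, patch straddles the edge
print(np.sum(patch * sobel_y))   # 0: no horizontal edge present
```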

2. Pooling layer

This layer is responsible for compressing the image by reducing the size of the feature map. Technically, it reduces the variance, complexity, and computational power required to process the image. The pooling layer accomplishes this through different pooling mechanisms; max pooling and average pooling are the most common. As the following figure depicts, the neighbourhood size is reduced after pooling is applied. As with the convolutional layer, the pooling operation requires kernel-size and stride hyperparameters.

Figure 5: Resulting image after applying pooling to a 4×4 input image

As the name implies, max pooling takes the maximum over a defined sub-region of the feature map. In effect it selects the brightest pixel in each sub-region, so the result is a downsampled feature map that retains the brightest responses of the previous one.

Average (or mean) pooling considers all pixel values in each sub-region of the feature map and takes their average. Unlike max pooling, it does not discard the less prominent values entirely; instead, it retains all of the information at a reduced magnitude. As a result, average pooling smooths the image, with the drawback of reducing its sharpness.

Some applications use min pooling instead of the above two techniques; it is the inverse of max pooling and yields a darker version of the feature map.
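A minimal NumPy sketch of both common pooling operations, assuming a 2×2 window with stride 2 (the usual non-overlapping setting):

```python
import numpy as np

def pool2d(feature_map, k=2, stride=2, mode="max"):
    """Minimal pooling sketch: non-overlapping windows when stride == k."""
    H, W = feature_map.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = reduce(window)
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 9., 1.],
               [3., 4., 6., 8.]])
print(pool2d(fm, mode="max"))   # [[6. 4.] [7. 9.]]  -> keeps brightest pixels
print(pool2d(fm, mode="mean"))  # [[3.75 2.25] [4. 6.]] -> smoothed values
```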

3. Fully Connected Layer

This layer is inherited from traditional neural networks and consists of a large number of neurons arranged as a multi-layer perceptron. Because each neuron is connected to every neuron in the next layer, it is called a fully connected layer. Its main task is to take the features extracted by the previous convolutional and pooling layers and perform the classification. It is preceded by a flatten step, which converts the n-dimensional output of the previous layers into a one-dimensional column vector; for example, a 10×10 RGB image (10×10×3) is converted into a vector of size 300. An activation function in the output layer then decides the target class of the given image.

Figure 6: Formation of a Fully Connected Layer [Credits]
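A one-line NumPy sketch of the flattening step for the 10×10×3 example above:

```python
import numpy as np

rgb = np.random.rand(10, 10, 3)   # toy 10x10 RGB feature map
flat = rgb.reshape(-1)            # flatten to a 1-D vector
print(flat.shape)                 # (300,) -> fed into the dense layers
```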

4. Activation functions

An activation function determines the output of a neuron in the network; ‘transfer function’ is another popular name for it. It maps the input features into a desired range of values, such as 0 to 1 or −1 to 1. Activation functions can broadly be divided into two types, linear and non-linear. Non-linear activations, such as softmax, ReLU, and sigmoid, are the ones heavily used in classification tasks [5].

Softmax activation is a popular choice for multi-class classification applications [2]. This activation function ensures that the output probabilities of the CNN sum to 1, which is useful for scaling the model’s outputs into probabilities and selecting the most likely target class for the input image. The fundamental difference between standard normalization and the softmax function is that, while standard normalization merely rescales values to between 0 and 1, softmax exponentially amplifies the largest values, effectively emphasising the argmax, which plain rescaling does not. The softmax function is mathematically denoted as follows:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j), for each class i = 1, …, K
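A minimal NumPy sketch of this formula, using the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by max(z) before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)          # [0.659 0.242 0.099] (approx.) -> class 0 is most likely
print(probs.sum())    # 1.0 -> a valid probability distribution
```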

ReLU (Rectified Linear Unit) activation introduces non-linearity into the network, producing non-linear decision boundaries. One of the major reasons CNNs obtain such accuracies is their ability to learn non-linear features [2], and it has been observed that CNNs with ReLU activation train faster than those with other activations [2]. Sigmoid activation is usually better for binary classification problems, and softmax can be viewed as a generalisation of sigmoid to multiple classes.

Figure 7: Activation function diagram & equation of Sigmoid and ReLU [Photo Credits]
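Both functions from the figure above are one-liners in NumPy; a small sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # negative inputs clamp to 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes inputs into (0, 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))      # [0. 0. 2.]
print(sigmoid(z))   # [0.119 0.5 0.881] (approx.)
```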

[Usage] Design a CNN Architecture for MNIST data

The last few sections progressively presented the critical components of a CNN architecture, so it is now straightforward to assemble those components into a real-world application. Accordingly, the following figure demonstrates how the image classification task is applied to the MNIST dataset [4], which consists of 28×28 grey-level images of handwritten digits from 0 to 9.

Figure 8: Classifying MNIST handwritten digits with two convolutional layers and two max-pooling layers, followed by a flatten step and two fully connected layers for the classification [Photo Credits]

The input layer is fed 28×28×1 images; since they are in grey-level form, there is only one channel. Next, two convolutional layers are applied as the primary operators, together with two max-pooling layers. The convolutional layers use a 3×3 filter with a stride of 1, and the max-pooling layers use a 2×2 window with a stride of 2. Padding is applied to preserve the image size. At the end, the flatten layer comprises 3136 (64×7×7) features, which a fully connected layer reduces to 128; a final layer with a softmax function carries out the classification and outputs the predicted digit from 0 to 9.
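Below is a minimal Keras sketch of this architecture. The 32 and 64 filter counts are assumptions consistent with the 3136 (64×7×7) flattened features mentioned above, and the optimizer and loss are illustrative choices rather than details from the original figure:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Two conv + max-pool stages, a flatten layer, one dense layer of 128 units,
# and a 10-way softmax output, as described in the text.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # grey-level MNIST images
    layers.Conv2D(32, (3, 3), strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2), strides=2),               # 28x28 -> 14x14
    layers.Conv2D(64, (3, 3), strides=1, padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2), strides=2),               # 14x14 -> 7x7
    layers.Flatten(),                                     # 64 * 7 * 7 = 3136 features
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),               # digits 0-9
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```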

State-of-the-art CNN architectures

As technology has advanced, CNN architectures have also grown with many new features since Alex Krizhevsky and colleagues introduced AlexNet, the winning entry of the 2012 ImageNet challenge [7]. Many researchers have since presented CNN architectures that achieve more accurate results in image classification; popular examples are VGG-16 [8], ResNet [9], and Xception [10]. You can visit this link to inspect the current state-of-the-art CNN architectures and the latest research papers published for each architecture.

Conclusion

CNNs are a handy architecture for image classification compared with many other algorithms. Today, CNNs have become indispensable for classification tasks and appear in almost every technological arena: Google uses CNNs in DeepMind systems and Google Photos search, Facebook uses them in its tagging algorithms, and Amazon in its product recommendations. They will continue to grow, and many ongoing studies will help bring image classification to the next level using CNNs.

References

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015, doi: 10.1038/nature14539.

[2] W. Rawat and Z. Wang, “Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review,” Neural Computation, vol. 29, no. 9, pp. 2352–2449, 2017, doi: 10.1162/neco_a_00990.

[3] S.-J. Lee, T. Chen, L. Yu, and C.-H. Lai, “Image Classification Based on the Boost Convolutional Neural Network,” IEEE Access, vol. 6, pp. 12755–12768, 2018, doi: 10.1109/access.2018.2796722.

[4] L. Deng, “The MNIST Database of Handwritten Digit Images for Machine Learning Research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.

[5] “CNN Explainer,” poloclub.github.io. https://poloclub.github.io/cnn-explainer/ (accessed Nov. 07, 2022).

[6] “Image Filtering,” lodev.org. https://lodev.org/cgtutor/filtering.html (accessed Nov. 07, 2022).

[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems, vol. 25, 2012. [Online]. Available: https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html

[8] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv, 2014, doi: 10.48550/arXiv.1409.1556.

[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv, 2015, doi: 10.48550/arXiv.1512.03385.

[10] F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Chollet_Xception_Deep_Learning_CVPR_2017_paper.html
