Enhancing Convolutional Neural Network using Hu’s Moments

The Convolutional Neural Network (CNN) is a powerful deep learning method that is mostly used in image classification and image recognition applications. It has achieved acceptable accuracy in these fields, but it still suffers from some limitations. One of these limitations is that a CNN is not invariant to transformations of the input data such as rotation, scaling and skew. In this paper we present an approach that uses Hu’s moments to improve CNN performance with respect to this invariance limitation. The Hu’s moments of an image are weighted averages of the image’s pixel intensities, which produce statistics about the image, and these moments are invariant to image transformations. This means that, even if some changes are made to the image, it will produce almost the same moment values. The main idea behind the proposed approach is to extract the Hu’s moments of the image, concatenate them with the flattened vector, and feed the new vector to the fully connected layer. The experimental results show that acceptable loss, accuracy, precision, recall and F1 score have been achieved on three benchmark datasets: the MNIST handwritten digits dataset, the MNIST fashion dataset and the CIFAR10 dataset.
Keywords—CNN; image transformations; invariant; Hu’s moments


I. INTRODUCTION
Convolutional Neural Networks (CNNs) have achieved acceptable accuracy in classifying images, but they still suffer from some limitations [1], [17], [21]. One of these limitations is the lack of spatial invariance to transformations of the input data [14]. Most present approaches use dataset augmentation to address this issue [2], [4], [21], but this requires a larger number of model parameters and more training data, and may result in significantly increased training time and a larger chance of under- or overfitting [14], [25]. The effect of this issue is even more obvious when dealing with domain-specific problems. For example, in medical imaging datasets, rotation can be extraneous due to the symmetric nature of some biological assemblies. However, the scale is constant during the imaging process and should not be deemed a nuisance factor. Moreover, scale invariance can decrease performance if object size is informative, for example when distinguishing healthy cells from cancer cells [15], [28].
Equivariance and invariance are sometimes used interchangeably, but the terms are different. "Equivariance" means varying in a similar or "equivalent proportion", while "invariance" means "no variance at all" [6]. More formally, a function ƒ is equivariant with respect to a transformation Ƭ if ƒ(Ƭ(x)) = Ƭ(ƒ(x)). This means that applying the transformation to x and then applying ƒ is equivalent to applying the transformation to the result ƒ(x). Invariance is a special case of equivariance. A function ƒ is invariant with respect to a transformation Ƭ if ƒ(Ƭ(x)) = ƒ(x). This means that the result of ƒ does not change when a transformation is applied to the input image [27], [8].
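The two definitions above can be checked on a toy one-dimensional signal. The following is a minimal sketch under stated assumptions: the transformation Ƭ is a circular shift, the equivariant function is a wrap-around sliding filter (a stand-in for a convolution layer), and the invariant function is a global maximum. The helper names `shift` and `circular_correlate` are illustrative and do not come from the paper.

```python
def shift(signal, k):
    """Circularly shift a 1-D list k positions to the right (the transformation T)."""
    k %= len(signal)
    return signal[-k:] + signal[:-k]


def circular_correlate(signal, kernel):
    """Slide `kernel` over `signal` with wrap-around (a toy convolution layer)."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


x = [0, 1, 3, 1, 0, 0, 0, 0]
kernel = [1, -1]

# Equivariance: filtering the shifted input equals shifting the filtered input.
filtered_then_shifted = shift(circular_correlate(x, kernel), 2)
shifted_then_filtered = circular_correlate(shift(x, 2), kernel)
is_equivariant = filtered_then_shifted == shifted_then_filtered

# Invariance: the global maximum does not change when the input is shifted.
is_invariant = max(shift(x, 2)) == max(x)
```

The filter output moves together with the input, so it is equivariant but not invariant; only the global maximum stays the same after the shift.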
A CNN is translation equivariant by nature because of its convolution operations [32]: the filters are convolved over the whole input image in order to detect its features. So, even if an object is shifted, it will still be detected regardless of its position in the image. Pooling operations can also make a CNN rotation equivariant, but only if the object is rotated slightly; as the degree of rotation increases, the CNN may fail to classify the object correctly. Although a CNN is equivariant to translation and slight rotation, it is not translation, scaling or rotation invariant [5], [7], [24], [26], [32], [31].
The lack of transformation invariance in image classification might cause issues in fields such as robotics and autonomous cars. Because of the movement of the robot or the car, the received images might be distorted, translated, scaled or rotated. Therefore, even if the robot or the car is trained to recognize an object, it might fail to do so, which might cause problems [17], [3], [35]. Image classification is also important in surveillance systems for detecting unusual activities. Here, the invariance problem might cause harm either by classifying a harmless object as a threat, which raises a false alarm, or by classifying a threat as a safe object, which might lead to a breach of the system [17]. In health care, image classification is used to classify a patient's medical images in order to help with diagnosis. The invariance problem in this application might lead to a misdiagnosis of the patient's condition.
The main objective and contribution of this work is to enhance the CNN with respect to the invariance limitation, in order to achieve higher accuracy in image classification, by using the Hu's moments of the image [23]. The Hu's moments of an image are weighted averages of the image's pixel intensities, which produce statistics about the image, and these moments are invariant to image transformations [18], [36]. This means that, even if some changes are made to the image or the shape outline gets slightly thicker, it will produce almost the same moment values [9]. Therefore, the Hu's moments of the image can be fed to the CNN in order to make it invariant to image transformations. The remainder of this paper is organized as follows: Section 2 reviews previous work in this field, Section 3 explains the basics of CNN, Section 4 presents the proposed approach, Section 5 shows the experimental results, Section 6 discusses the approach, and Section 7 concludes the paper.
130 | Page www.ijacsa.thesai.org

II. RELATED WORK
Mahesh et al. [18], [19] and Tahmasbi et al. [29] proposed approaches to solve the invariance problem of CNN using Zernike moments. Mahesh et al. [18], [19] proposed a technique that uses Zernike moments in a CNN to discriminate between face and non-face patterns, and to perform gender classification using facial expression recognition. Their main contribution is the use of Zernike moments as an initial filter, in order to expose unique features of the image that might help to distinguish face from non-face images and to classify gender. They achieved an accuracy of 100% in distinguishing face from non-face images, but that is not as impressive as it sounds, because discriminating between faces and non-faces is no longer a hard problem in computer vision [10]. In facial expression recognition, they achieved an accuracy of 87.22%. The main drawback of their work is feature loss: the use of a filter based on Zernike moments might lead to feature loss in some cases.
McNeely-White et al. [21], Anselmi et al. [2] and Bruna & Mallat [4] studied the invariance and equivariance of CNN representations to input image transformations. McNeely-White et al. [21] estimated the linear relationships between representations of the original and transformed images. Although they achieved good results, their work is considered data augmentation, and it is not a solid solution to the invariance problem of the CNN.
Cohen & Welling [7], Gens & Domingos [11] and Mallat [20] analyzed the behavior of linear representations in relation to symmetry groups, resulting in feature maps that are more invariant to these symmetry groups. Cohen & Welling [7] revealed that the entire class of such models can be understood mathematically. Although they proved their concept mathematically, their approach still suffers from the asymmetric world that we live in, as described in their own words: "Our approach should also deal better with (approximately) symmetric objects, for which it is not possible to unambiguously estimate pose and motion (what is the pose of a circle?)." Also, their current model is not suitable for dealing with large images, and they consider it a proof of concept.
Jaderberg et al. [14], Hinton [13] and Tieleman [30] introduced a self-contained module for neural networks. Jaderberg et al. [14] performed spatial transformations of features by using a localization network, a parametrized sampling grid, and spatial transformer networks. As in [21], they used data augmentation which, as previously mentioned, is not a robust solution to the invariance problem.
Hinton et al. [13] and Hinton et al. [26] proposed a novel CNN architecture built up of capsules. These capsules contain groups of neurons that are responsible for the instantiation parameters of an entity, such as pose, velocity and albedo; the capsules then represent information in a hierarchical form.
The basic theory of their work is that every entity is made up of several smaller entities, so each capsule will try to predict the output of the higher layer capsules, and the capsules which have a greater agreement with the higher layer will be coupled to the parent even more through a positive feedback loop.
Although this work is impressive, it has some shortcomings. The authors have not stated how the weights "W" are learned. Also, the algorithm introduces an additional hyperparameter "r", which means more computational complexity. Although the algorithm achieved state-of-the-art accuracy on the MNIST dataset, it fails to perform as well on the CIFAR-10 dataset.
Cheng et al. [5], Girshick et al. [12] and Zhang et al. [34] proposed methods to make CNN rotation invariant. Cheng et al. [5] added a rotation-invariant layer and a Fisher discriminative layer to the CNN in order to make it rotation invariant. These layers try to learn the objects' rotations based on the class, so the network can predict the rotation of an object when it recognizes it. They applied their algorithm to some well-known CNNs such as VGG and AlexNet and achieved high accuracy, but their work addresses only rotation invariance; it does not solve the translation or scaling invariance problem in CNN.
Laptev et al. [15], Su et al. [28] and Wu et al. [33] proposed a framework that combines prior knowledge of nuisance variations with data when training the network. Laptev et al. [15] formulated a set of transformations and generated multiple images based on these transformations. These transformed images are then passed through the initial layers of the network and through the TI-POOLING operator to form transformation-invariant features. Although they achieved transformation invariance by pooling transformed feature maps, this added huge computational complexity to the network because of the forward and backward passes for each element.
Worrall et al. [32], Vedaldi [16] and Memisevic & Hinton [22] presented a CNN that is equivariant to patch-wise shifting and continuous 360° rotation. Worrall et al. [32] reconstructed the regular CNN filters by using derivations from complex harmonics, returning a maximal response and orientation for every receptive field patch. Using these derived filters, the CNN can be invariant to translation and rotation but not scaling. Their work also has the disadvantage of a higher per-filter computational cost, as all the filters in the CNN must be derived and reconstructed.
To the best of our knowledge, most of the studies reviewed in the literature either solved the invariance problem of the CNN partially or used data augmentation. In this work, we propose a general approach that solves the problem with no data augmentation.

III. CONVOLUTIONAL NEURAL NETWORK
The Convolutional Neural Network (CNN) [17] is a major deep learning model that is mostly used in image classification and image recognition tasks due to its convolutional architecture.
Generally, CNN consists of the following phases:

Phase 1: Feature extraction
In this phase, a number of filters (kernels) are used to scan the input image in order to extract features from it, for example vertical edges, horizontal edges and corners.

Phase 2: Non-linearity activation
After the filters scan the input image, each filter produces an image (feature map) that contains the extracted features. This output must then go through a mathematical function called the activation function. In this work, the activation function is the Rectified Linear Unit (ReLU), which simply converts all negative values to 0 and keeps positive values unchanged, as shown in equation (1).
$$f(x) = \max(0, x) \tag{1}$$
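The ReLU activation described above can be illustrated on a small feature map. This is only a sketch: the nested-list representation and the function name `relu` are assumptions made for illustration, not part of the authors' implementation.

```python
def relu(feature_map):
    """Apply ReLU element-wise: negative values become 0, positive values are kept."""
    return [[max(0.0, value) for value in row] for row in feature_map]


# Hypothetical 2x2 feature map produced by one filter.
feature_map = [[-1.5, 2.0], [0.0, -3.0]]
activated = relu(feature_map)
```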

Phase 3: Pooling
Similar to the convolutional layer, the pooling layer is responsible for reducing the spatial size of the convolved feature. This dimensionality reduction decreases the computational power required to process the data.
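As a rough illustration of how pooling reduces the spatial size, the sketch below applies max pooling with a 2×2 window and stride 2. The text does not state which pooling type or window size is used, so both are assumptions, as is the function name `max_pool_2x2`.

```python
def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2; a trailing odd row or column is dropped."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Hypothetical 4x4 feature map; pooling halves each spatial dimension.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
```

The 4×4 input becomes a 2×2 output, so the later layers process a quarter of the values.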

Phase 4: Dropout
Dropout is used to reduce CNN overfitting by randomly turning off neurons, so that the CNN can take different paths during training. An n-layer fully connected neural network (ignoring bias) can be defined as:

Phase 5: Input vector extraction
In this phase, the CNN converts the 2D matrix into a 1D vector so that it can be fed into the neural network.

Phase 6: Network training using the fully connected neural network
A fully connected layer is a neural network that provides the final classification of an image based on matrix multiplication operations, weights and biases. The input of this phase is the flattened vector extracted in the previous phase, and the output is the predicted classification. A CNN can have several fully connected layers, where the output of each layer is the input of the next. The objective of a fully connected layer is to take the results of the convolution/pooling process and use them to classify the image into a label. The output of convolution/pooling is flattened into a single vector of values, each of which represents a probability that a certain feature belongs to a label. For example, if the image is of a cat, features representing things like whiskers or fur should have high probabilities for the label "cat". Fig. 1 shows an example of how the flattened vector is fed to the fully connected layer.
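Phases 5 and 6 can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the feature maps, weights and biases are made-up values, a single fully connected layer is used, and a softmax output is assumed because the text does not name the output activation. The names `flatten`, `dense` and `softmax` are illustrative.

```python
import math


def flatten(feature_maps):
    """Turn a list of 2-D feature maps into a single 1-D vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(vector, weights, biases):
    """One fully connected layer; `weights` holds one row per output neuron."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Convert raw scores into class probabilities (assumed output activation)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical 2x2 feature maps, flattened to a vector of length 8.
feature_maps = [[[1.0, 0.0], [0.0, 2.0]], [[0.5, 0.5], [0.0, 1.0]]]
flat = flatten(feature_maps)

# Made-up weights for a two-class problem; each row must have len(flat) entries.
weights = [[0.1] * 8, [-0.1] * 8]
biases = [0.0, 0.0]
probabilities = softmax(dense(flat, weights, biases))
```

Each weight row must have exactly as many entries as the flattened vector, which is why the vector length has to be known before the fully connected layer is built.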

IV. PROPOSED APPROACH
The main objective of our work is to make the CNN invariant to image transformations, in order to achieve higher accuracy in image classification, by using the Hu's moments of images [23]. The Hu's moments of an image are weighted averages of the image's pixel intensities, which produce statistics about the image, and these moments are invariant to image transformations [18], [36]. This means that, even if some changes are made to the image, it will produce almost the same moment values. Therefore, the Hu's moments of the image can be fed to the fully connected neural network in order to enhance the CNN with respect to its lack of invariance to image transformations.
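The invariant features can be obtained using central moments, which are defined as follows [23], [36]:
$$\mu_{pq} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \bar{x})^p (y - \bar{y})^q f(x, y)\,dx\,dy$$
where $p, q = 0, 1, 2, \ldots$ and
$$\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}},$$
with $m_{pq} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^p y^q f(x, y)\,dx\,dy$ denoting the raw moments of the image. The pixel point $(\bar{x}, \bar{y})$ is the centroid of the image $f(x, y)$. The central moments $\mu_{pq}$ computed using the centroid of the image $f(x, y)$ are equivalent to the moments $m_{pq}$ of an image whose center has been shifted to the centroid. Therefore, the central moments are invariant to image translations. Scale invariance can be obtained by normalization [36].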
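The translation invariance of the central moments can be checked numerically. The sketch below is an illustration only: it uses discrete sums over pixels, a toy binary shape on an 8×8 canvas, and hypothetical helper names (`raw_moment`, `central_moment`, `place`) that are not taken from the paper.

```python
def raw_moment(image, p, q):
    """Raw moment m_pq: sum of x^p * y^q * intensity over all pixels."""
    return sum(
        (x ** p) * (y ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


def central_moment(image, p, q):
    """Central moment mu_pq: like m_pq, but measured from the image centroid."""
    m00 = raw_moment(image, 0, 0)
    x_bar = raw_moment(image, 1, 0) / m00
    y_bar = raw_moment(image, 0, 1) / m00
    return sum(
        ((x - x_bar) ** p) * ((y - y_bar) ** q) * value
        for y, row in enumerate(image)
        for x, value in enumerate(row)
    )


def place(shape, top, left, size=8):
    """Draw `shape` on an empty size x size canvas at the given offset."""
    canvas = [[0] * size for _ in range(size)]
    for r, row in enumerate(shape):
        for c, value in enumerate(row):
            canvas[top + r][left + c] = value
    return canvas


# Toy shape (assumption) drawn at two different positions.
shape = [[1, 1, 0], [1, 0, 0], [1, 0, 0]]
original = place(shape, top=0, left=0)
translated = place(shape, top=4, left=3)

orders = [(p, q) for p in range(4) for q in range(4) if p + q <= 3]
max_difference = max(
    abs(central_moment(original, p, q) - central_moment(translated, p, q))
    for p, q in orders
)
```

The raw moments of the two images differ because the shape has moved, while `max_difference` stays at floating-point noise level.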
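The normalized central moments are defined as follows:
$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad \gamma = \frac{p + q}{2} + 1, \qquad p + q = 2, 3, \ldots$$
Based on the normalized central moments, Hu introduced seven moment invariants [23]:
$$
\begin{aligned}
\phi_1 &= \eta_{20} + \eta_{02} \\
\phi_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
\phi_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
\phi_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
\phi_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
&\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
\phi_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
\phi_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
&\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}
$$
The adopted research methodology comprises the following steps, as shown in Fig. 2.
Hu's moments concatenation should make the flattened vector more informative and descriptive. The new vector is fed to the fully connected network; the CNN is therefore trained to see the Hu's moments, extent and solidity values alongside the feature vector. These values affect the neurons' activations in the network in order to achieve transformation invariance. Fig. 3 shows an example of Hu's moments concatenation.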
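The concatenation step can be sketched as follows. This is a minimal pure-Python illustration, not the authors' TensorFlow implementation: `hu_moments` and `concat_with_hu` are hypothetical helper names, the 6×6 binary image is a toy example, and `flat_features` is a placeholder for the flattened CNN output. The extent and solidity values mentioned in the text are omitted here.

```python
def hu_moments(image):
    """Seven Hu invariants of a 2-D list of pixel intensities (discrete sums)."""
    pixels = [
        (x, y, value)
        for y, row in enumerate(image)
        for x, value in enumerate(row)
        if value
    ]
    m00 = sum(value for _, _, value in pixels)
    x_bar = sum(x * value for x, _, value in pixels) / m00
    y_bar = sum(y * value for _, y, value in pixels) / m00

    def eta(p, q):
        """Normalized central moment: mu_pq / mu_00^((p + q) / 2 + 1)."""
        mu = sum((x - x_bar) ** p * (y - y_bar) ** q * v for x, y, v in pixels)
        return mu / m00 ** ((p + q) / 2 + 1)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03          # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03  # recurring differences
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]


def concat_with_hu(flat_features, image):
    """Append the seven Hu invariants to an already flattened feature vector."""
    return list(flat_features) + hu_moments(image)


# Toy binary image (assumption): an asymmetric corner shape.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
rotated = [list(row) for row in zip(*image[::-1])]   # rotated by 90 degrees
shifted = [[0] * 6] + [row[:] for row in image[:-1]]  # moved down by one row

original_hu = hu_moments(image)
rotated_hu = hu_moments(rotated)
shifted_hu = hu_moments(shifted)

flat_features = [0.2, 0.0, 1.3, 0.7]  # placeholder for the CNN's flattened output
fc_input = concat_with_hu(flat_features, image)
```

The three moment vectors agree up to floating-point error, while the concatenated vector is seven entries longer than the flattened features, so the fully connected layer must be sized accordingly.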

V. EXPERIMENTAL RESULTS
We implemented the proposed approach using the Python TensorFlow platform in a Google Colab notebook. We tested our approach on three datasets: the MNIST handwritten digits dataset, the MNIST fashion dataset and the CIFAR10 dataset. Finally, we compared our results with the work of [18], which uses Zernike moments (ZM) as an initial filter to extract invariant features of the images, as mentioned above, by implementing their approach on the three datasets. ZM are projections of an image onto the complex Zernike polynomials, which are orthogonal over the unit circle. So, a radius must be provided in order to calculate the ZM of the image. Therefore, we used the degrees 45° and 90° to extract the ZMs of the images.
Our approach achieved better loss, accuracy, precision, recall and F1 score than the work of [18] on the three datasets. The use of Zernike moments as initial filters led to feature loss, which degraded the loss, accuracy, precision, recall and F1 score. On the other hand, adding Hu's moments to the flattened vector led to a more discriminative and more informative vector and therefore to better performance.
Table I and Table II show the results of our approach compared with the results of the approach of [18] on the MNIST handwritten digits dataset. Fig. 4, Fig. 5 and Fig. 6 illustrate the loss, accuracy, precision, recall and F1 score, and they show that our approach achieved better performance than the approach of [18].
Table III, Table IV and Fig. 7, 8 and 9 show the results of our approach on the MNIST fashion dataset; compared with the approach of [18], we achieved better results.
Finally, we tested our approach and the approach of [18] on the CIFAR10 dataset, and we achieved better performance than their approach, as shown in Table V and Table VI and

VI. DISCUSSION
CNN suffers from the problem of not being invariant to image transformations. To the best of our knowledge, most previous studies either solved this problem partially or used data augmentation. Our approach uses Hu's moments to make the flattened vector more descriptive, so that when it is fed to the fully connected layer it leads to a better classification regardless of the image transformations. Concatenating the moments with the flattened vector was challenging, since the vector size increases and must match the input size of the fully connected layer.
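The size bookkeeping mentioned above can be made explicit with a small calculation. The sketch below assumes a hypothetical architecture (one 3×3 convolution without padding, one 2×2 pooling step, 32 filters) that is not taken from the paper; only the seven extra entries for Hu's moments follow from the text. The function names are illustrative.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling step on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def fc_input_size(image_size, layers, n_filters, n_extra=7):
    """Length of the flattened vector plus the concatenated extra features.

    `layers` is a list of (name, kernel, stride) tuples; the name is only a label,
    since convolution and pooling share the same size formula.
    """
    size = image_size
    for _name, kernel, stride in layers:
        size = output_size(size, kernel, stride)
    return size * size * n_filters + n_extra


# Hypothetical layer stack (assumption): 3x3 convolution, then 2x2 pooling.
layers = [("conv", 3, 1), ("pool", 2, 2)]
mnist_fc_input = fc_input_size(28, layers, n_filters=32)  # 28x28 MNIST images
cifar_fc_input = fc_input_size(32, layers, n_filters=32)  # 32x32 CIFAR10 images
```

Computing this value before building the network ensures that the first fully connected layer is created with an input size that already accounts for the appended moments.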

VII. CONCLUSION
This paper presents an approach to enhance the CNN with respect to the invariance problem by using Hu's moments. The approach concatenates Hu's moments with the flattened vector before feeding it to the fully connected layer, in order to make the vector more discriminative and more informative. In this study we implemented our approach and compared our work with the work of Mahesh et al. on three datasets: the MNIST handwritten digits dataset, the MNIST fashion dataset and the CIFAR10 dataset. The results show that our method gave the best results on all metrics, namely loss, accuracy, precision, recall and F1 score. The main limitation of our work is the fixed size of the flattened vector: the size of the vector must be precalculated and predefined so that it matches the input size of the fully connected layer.