Robust Convolutional Neural Networks for Image Recognition

Image recognition has recently become a vital task, and it is addressed with several methods. One of the most widely used is the Convolutional Neural Network (CNN). However, some tasks depend on small features that are an essential part of the problem, and classification with a CNN is then inefficient because most of those features diminish before reaching the final classification stage. This work analyzes and explores the essential parameters that influence model performance. Furthermore, several prior contemporary models are combined to introduce a new, stronger model. Finally, a new CNN architecture is proposed that achieves state-of-the-art classification results on several challenging benchmarks. The experiments are conducted on the MNIST, CIFAR-10, and CIFAR-100 datasets. The experimental results show that the proposed approach outperforms most contemporary approaches.

Keywords: Convolutional Neural Network; Image recognition; Multiscale input images


I. INTRODUCTION
Convolutional Neural Networks (CNNs) have been widely used in many real-world applications, including face recognition [1,2], image classification and recognition [3][4][5][6], and object detection [7], because they are among the most efficient methods for extracting critical features for non-trivial tasks. A CNN consists of a pipeline of several alternating layers. Unlike a conventional neural network, a CNN has three different types of layers as its constituent elements: convolutional layers, subsampling layers, and fully connected layers. There are also some intermediate layers between these main layers, which are shown later. For a given task, images are passed into the CNN to be processed. Passing images through the several squashing functions incorporated within the CNN layers can discard critical information used for recognition, and some of the small features disappear after a few layers. The reason is that the CNN architecture itself imposes such restrictions. Specifically, both convolutional layers and max-pooling layers diminish small features. To implement a robust model, small features must survive through many stages of the CNN. To alleviate the weaknesses inherited from former CNN models, this work explores different parameters that influence how long features survive. A deeper analysis of convolutional and max-pooling layers is presented, and we then introduce a model that gives small features a better chance to survive until the final stage of the CNN, specifically directly before the fully connected layer.
The rest of the paper is organized as follows. Section II presents prior work, covering the most recent contemporary approaches. Section III introduces the motivation and contribution of this work, answering what is proposed and why. Section IV presents the deployment of different CNN architectures and describes the structures considered. Finally, the experimental setup and the conclusion are presented.

II. RELATED WORK
The most prominent recent work using CNNs is that of Alex Krizhevsky et al. [8], who used a CNN for the challenging ImageNet classification task. Various other techniques were proposed later to enhance CNN performance, as demonstrated in [9,10,11,12]. Recently, a large body of work has been proposed to improve image recognition accuracy using different methods. These methods target a variety of applications such as image recognition [13,14,15,16,8], object detection [17,18,19], scene labeling [20], segmentation [21,22], and a variety of other tasks [23,24,25]. In addition, image recognition can be accomplished using other approaches; for example, Pedro F. Felzenszwalb et al. [26] proposed a method for image recognition using Deformable Part Models (DPM). Further works use different DPM-based strategies, as demonstrated in [27,28,29]. A variety of other methods are used for image classification, such as SVM [30,31,32,33], boosting [34], spatial pyramid matching [35], and other works described in [36][37][38][39].

III. MOTIVATION AND CONTRIBUTION
The state of the art in image recognition on CIFAR-10, CIFAR-100, and MNIST is achieved using different techniques, as proposed in [40,41,42]. This work shares some common procedures with prior works. The first step is applying pre-processing, such as local contrast normalization, to the input images. However, several factors can influence model performance and degrade model accuracy. Thus, in this work NIN is adopted after reducing its shortcomings. The weaknesses of a general CNN used for image classification stem from various factors, such as the CNN's depth, the width of the network, filter sizes, and network topology. All of these are vital factors that can strongly affect recognition accuracy. Consequently, to reduce the weaknesses inherited from the CNN architecture, this work endeavors to alleviate the shortcomings of former networks by eliminating most of the limitations described earlier. Therefore, in this work the most recent and most efficient methods are combined for non-trivial object recognition tasks. A variety of techniques is explored to enhance image recognition. The starting point is the models proposed in [3,17]; both have several definite advantages over prior models, as elucidated later. Both models are adapted in this work for image recognition. In addition, extensive work is devoted to exploring the impact of different parameters that can drastically influence model performance. The resulting model is designed mainly to overcome the drawbacks of prior deep neural network architectures used for image recognition. Finally, a robust CNN architecture is proposed at the end of this work. It achieves superior results compared with existing models.
The CNN architectures adapted here for image recognition were originally proposed for image classification [3] and object detection [17]. They are considered robust deep neural network models. It is worth mentioning that SPPnet, proposed in [17], is used in this work to provide multi-scale input to the image recognition model. To the best of our knowledge, this is the first time that image classification benchmarks such as CIFAR-10, CIFAR-100, and MNIST are trained with such a method. Providing multi-resolution input images to the CNN improves its accuracy drastically, as shown later. Furthermore, effort is devoted to investigating the most influential parameters. Carefully chosen parameters can be better suited to the recognition model. Different model architectures are extensively analyzed and investigated. After obtaining the best-suited parameters, a robust model is proposed to enhance recognition performance. The proposed CNN architecture achieves the best results and outperforms most existing models. The proposed model is compared with prior efficient works, specifically with prior deep neural network models. In addition, the experiments are conducted on different benchmarks for evaluation purposes, mainly the CIFAR-10, CIFAR-100, and MNIST datasets.

IV. DEPLOYING DIFFERENT CNN ARCHITECTURES FOR IMAGE RECOGNITION
As illustrated earlier, this work principally adopts two deep neural network models, NIN and SPPnet, explored in [3] and [17] respectively, and implements a new unified model. The next sections explore in depth how incorporating both models affects the network architecture and classification performance. The unified model is attractive because it reduces some of the weaknesses inherited from former models. Both architectures are therefore explored in the next sections to show the model's robustness on image classification.
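The reason SPPnet can accept multi-scale input is that spatial pyramid pooling max-pools every feature map into a fixed grid of bins at several pyramid levels, so feature maps of different sizes produce a vector of the same length. The following is a minimal plain-Python sketch of that idea, not the implementation used in this work; the function names, the pyramid levels (1, 2, 4), and the toy feature maps are illustrative assumptions.

```python
import math


def spp(feature_maps, levels=(1, 2, 4)):
    """Max-pool each H x W map into n x n bins for every n in `levels`.

    Assumes H and W are at least max(levels), so that no bin is empty.
    The pyramid levels are an illustrative assumption, not the paper's setting.
    """
    out = []
    for fmap in feature_maps:  # one H x W map per channel
        h, w = len(fmap), len(fmap[0])
        for n in levels:
            for i in range(n):
                r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
                for j in range(n):
                    c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                    out.append(max(fmap[r][c]
                                   for r in range(r0, r1)
                                   for c in range(c0, c1)))
    return out


def make_map(h, w):
    """Toy feature map whose entries increase row by row."""
    return [[float(r * w + c) for c in range(w)] for r in range(h)]


# Two inputs with different spatial sizes, three channels each.
small = spp([make_map(6, 6)] * 3)
large = spp([make_map(9, 11)] * 3)
print(len(small), len(large))  # both have 3 * (1 + 4 + 16) entries
```

Because the output length depends only on the number of channels and the pyramid levels, the fully connected layers that follow can stay fixed while the input resolution varies.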

A. Pipeline Steps of image classification
The basic CNN architecture is depicted in Fig. 1. It fundamentally consists of a series of stages. Part (a) presents multi-scale images to the network. Providing multi-resolution input is an essential step to gain higher accuracy. Part (b) trains the network with the fed images. After different scales are chosen for the input images, they are fed to the CNN to extract features at different resolutions, which increases the chance that small features are enlarged. It is worth mentioning that using multi-scale input images is a method shown in [17] to increase object detection accuracy; here we apply it to the image recognition task. Finally, part (c) classifies and scores the input pattern. To look deeper into the operations performed by the CNN, the following steps are applied:
1) Input images are pre-processed using the method of Goodfellow et al. [6] to prepare them for the next step.
2) After pre-processing, input images are fed to the CNN. In this work a new architecture is proposed, as shown in Fig. 1. A robust and efficient implementation called Caffe [43] is used for this purpose. It is a very fast implementation that can process huge amounts of data efficiently, and it is flexible enough to be adapted easily to different CNN architectures. The final layer of the CNN produces an n-dimensional feature vector used for the final classification results, where n is the number of classes for a given dataset.
3) A soft-max layer is used for the final scoring output; the length of the final feature vector must match the number of classes.
www.ijacsa.thesai.org
It is worth mentioning that the dropout technique demonstrated in [12] is also used in this work to increase model performance by improving the internal parameters and yielding a more solid model. The accuracy achieved using the CNN depicted in Fig. 1 is 0.9953, 0.83, and 0.528 on MNIST, CIFAR-10, and CIFAR-100 respectively. This model achieves results competitive with the most recent works. The next section provides a deeper analysis and investigation to explore and propose a more robust model.
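The scoring step and the dropout regularizer mentioned above can be illustrated with a short plain-Python sketch. It is not the Caffe code used in this work: the function names and the example scores are hypothetical, and the dropout shown is the common "inverted" variant (survivors are rescaled during training), which is an assumption rather than a detail given in the text.

```python
import math
import random


def softmax(scores):
    """Turn n class scores into n probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(activations, p_drop, rng):
    """Inverted dropout (assumed variant): zero each unit with probability
    p_drop and scale the surviving units by 1 / (1 - p_drop)."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Hypothetical scores for a 3-class problem (n = 3).
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))

rng = random.Random(0)
dropped = dropout([1.0] * 1000, 0.5, rng)
print(predicted_class, sum(probs))
```

The length of the list passed to `softmax` plays the role of n, the number of classes, so it would be 10 for MNIST and CIFAR-10 and 100 for CIFAR-100.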

B. Exploring different CNN architectures
The network proposed in Fig. 1 achieves results competitive with prior works. In addition, it outperforms the work in [44]; specifically, it surpasses deep neural network approaches. Moreover, it achieves results competitive with many other approaches. These encouraging results motivate a deeper investigation of influential parameters and the exploration of a more robust model. In this part, the adopted models are used for further investigation, and more effort is put into finding a more appropriate architecture for image classification. A stronger CNN architecture for image recognition is proposed in this section. It achieves state-of-the-art results on the given benchmarks. The additional parameters that can influence model performance are discussed next. This work proposes a new topology for the CNN architecture. Fig. 2 depicts the proposed model, which differs drastically from the one implemented and explored in Fig. 1. The proposed model inherits some strengths from NIN. Instead of using the conventional connection between convolutional layers, as described in [12,9,10,11], the connection proposed in NIN is incorporated in this work to gain more accuracy on image classification. The size of the CNN is kept the same as depicted in Fig. 1. The merit of this CNN architecture is that it combines more than one effective method, namely multi-scale input images and the nonlinear transformation between convolutional layers demonstrated in [3], as shown in Fig. 2.
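The NIN-style connection between convolutional layers can be pictured as a small per-pixel network applied across channels, which amounts to a 1x1 convolution followed by a nonlinearity. The sketch below shows that operation in plain Python; it is not the Caffe implementation used in this work, and the function name `conv1x1_relu`, the toy weights, and the choice of ReLU as the nonlinearity are illustrative assumptions.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def conv1x1_relu(maps, weights, biases):
    """Apply a 1x1 convolution across channels, then ReLU.

    maps:    list of C_in feature maps, each an H x W list of lists
    weights: list of C_out rows, each holding C_in weights
    biases:  list of C_out biases
    """
    height, width = len(maps[0]), len(maps[0][0])
    out = []
    for w_row, b in zip(weights, biases):
        out.append([[relu(b + sum(wk * maps[k][r][c]
                                  for k, wk in enumerate(w_row)))
                     for c in range(width)]
                    for r in range(height)])
    return out


# Toy input: two 2x2 channels (hypothetical values).
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[1.0, 1.0], [1.0, 1.0]]]
# Two output channels: (channel0 - channel1) and (-channel0).
y = conv1x1_relu(x, weights=[[1.0, -1.0], [-1.0, 0.0]], biases=[0.0, 0.0])
print(y)
```

Each output pixel mixes only the channel values at the same spatial position, so the spatial size of the feature maps is unchanged while a nonlinear transformation is inserted between ordinary convolutions.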
To look deeper inside the CNN and investigate the most critical parameters that influence model performance, Fig. 3 shows both the convolutional and the sub-sampling layers of a CNN. The alternation between these kinds of layers quickly shrinks the input images after a few stages, which loses vital information useful for the final classification stage. This matters particularly here because this work deals with small images, as described later. All the benchmarks used in this work have image sizes of at most 32x32 pixels. Consequently, small features are no longer available after a few stages. Therefore, a new CNN architecture is proposed in this work, as shown in Fig. 4. The new model uses connections that differ from the standard connections of a conventional CNN. Some layers receive their inputs not only directly from the layer below but also from two and three layers below. The reason for this kind of connection is that small features within the input images can survive longer and become part of the final scoring results. Furthermore, the first layers of the CNN extract global features of input objects, but as the images advance toward the final fully connected layers, more accurate features are extracted.
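One way to picture a layer that receives input from one, two, and three layers below is to bring the earlier feature maps to a common spatial size and stack them as extra channels. The plain-Python sketch below does this with 2x2 max-pooling and channel concatenation. Both choices, along with all function names and toy sizes, are assumptions made for illustration; the text does not specify how the connections in Fig. 4 are merged.

```python
def max_pool2x2(fmap):
    """Non-overlapping 2x2 max-pooling of one H x W map (H, W even)."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [[max(fmap[2 * r][2 * c], fmap[2 * r][2 * c + 1],
                 fmap[2 * r + 1][2 * c], fmap[2 * r + 1][2 * c + 1])
             for c in range(w)]
            for r in range(h)]


def match_size(fmap, target_h):
    """Pool repeatedly until the map is no taller than target_h.
    Assumes sizes differ by powers of two."""
    while len(fmap) > target_h:
        fmap = max_pool2x2(fmap)
    return fmap


def fuse(layers_below, target_h):
    """Concatenate the channels of several earlier layers (assumed merge rule)."""
    fused = []
    for maps in layers_below:
        fused.extend(match_size(m, target_h) for m in maps)
    return fused


def blank(size):
    return [[0.0] * size for _ in range(size)]


# Three earlier layers with 2, 3, and 4 channels at sizes 8x8, 4x4, 2x2.
layer1 = [blank(8), blank(8)]
layer1[0][5][6] = 9.0  # a single small feature in the earliest layer
layer2 = [blank(4) for _ in range(3)]
layer3 = [blank(2) for _ in range(4)]

fused = fuse([layer1, layer2, layer3], target_h=2)
print(len(fused), fused[0])
```

In this toy example the isolated value placed in the earliest layer is still present in the fused input, which is the behaviour the direct connections from lower layers are meant to encourage.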

C. Exploring Different CNN Sizes
In order to analyze precisely the influence of different CNN architectures, a new CNN architecture is proposed and its parameters are carefully selected, because the same CNN architecture might work well for some tasks and inadequately for others. Hence, in this part different deep model architectures that can fit image recognition are investigated.

In order to evaluate the proposed architecture models, extensive experiments are conducted on different challenging datasets. The most popular datasets are used for evaluation: MNIST, CIFAR-10, and CIFAR-100 are the benchmarks used in this work. To convey the challenge associated with those datasets, the next parts explain the related details, such as image size and the number of image samples. It is worth mentioning that data augmentation is not used in these experiments.

A. MNIST dataset
MNIST [18] is a dataset of handwritten digits 0-9. The dataset consists of 60000 samples; 50000 samples are used for training and the rest are used for testing. All samples have the same size, 28x28 pixels. The pixels are scaled to lie in [0, 1] before training. No other preprocessing or data augmentation is used in this work. The first CNN, named network1, has the structure 192C-192S-256C-256S-192C-192C-200F-128F-10-soft-max, where C stands for a convolutional layer, S for a subsampling layer, and F for a fully connected layer. For this dataset, the mini-batch size is 128 images. The test accuracy of network1 on MNIST is 0.9961 (99.61%). This result is superior to the results in [44]. Network2, which has the structure 192C-192S-256C-256S-384C-256C-192C-192S-400F-128F-10-soft-max, achieves a lower result than network1, possibly because MNIST does not require a large network. The result achieved using network2 is 0.9958 on MNIST. A summary of the best published results on MNIST is shown in Table II. From Table II, both introduced models achieve better results than those obtained by Hayder et al. in [44] using their hybrid training algorithm, Hybrid PSO-SGD, which combines Particle Swarm Optimization and Stochastic Gradient Descent.
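The structure notation used above (for example 192C-192S-...-10-soft-max) can be read mechanically. The following short Python sketch parses the two structure strings into a list of layers; the helper names `parse_structure` and `count_kind` are hypothetical and exist only to illustrate the notation.

```python
KINDS = {"C": "convolution", "S": "subsampling", "F": "fully connected"}

NETWORK1 = "192C-192S-256C-256S-192C-192C-200F-128F-10-soft-max"
NETWORK2 = "192C-192S-256C-256S-384C-256C-192C-192S-400F-128F-10-soft-max"


def parse_structure(spec):
    """Split a structure string into (layer kind, size) pairs."""
    tokens = spec.replace("soft-max", "softmax").split("-")
    layers = []
    for token in tokens:
        if token == "softmax":
            layers.append(("soft-max", None))
        elif token.isdigit():
            # A bare number is the output layer, one unit per class.
            layers.append(("output", int(token)))
        else:
            # A number followed by C, S, or F.
            layers.append((KINDS[token[-1]], int(token[:-1])))
    return layers


def count_kind(layers, kind):
    """Count how many layers of a given kind appear."""
    return sum(1 for name, _ in layers if name == kind)


net1 = parse_structure(NETWORK1)
net2 = parse_structure(NETWORK2)
for name, net in (("network1", net1), ("network2", net2)):
    print(name, {kind: count_kind(net, kind) for kind in KINDS.values()})
```

Read this way, the number before each letter is the number of feature maps (or units, for F layers), and the bare 10 before soft-max is the number of output classes.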

B. CIFAR-10 Dataset
CIFAR-10 consists of 10 classes of natural 32x32 RGB images, with 50,000 samples for training and 10,000 samples for testing [19]. The structure of network1 is used first for evaluation, and the same training steps are followed as for MNIST. The accuracy achieved on this dataset is 86.73%.
With network2, the test accuracy on CIFAR-10 is 88.13%, which is higher than that of network1; CIFAR-10 is a more challenging dataset than MNIST and thus requires a more complex structure. From Table III, it is evident that the proposed method surpasses the other state-of-the-art works.

C. CIFAR-100
CIFAR-100 is one of the most challenging datasets, with 100 classes. Its images are similar to those of CIFAR-10, including their size. The main difference is that the number of image samples per class is much smaller than in CIFAR-10. The training set has 50,000 examples in total, so each class has only 500 training samples. The test set has 10,000 samples. As for CIFAR-10, the pixels are scaled to lie in [0, 1] before training. Since CIFAR-100 is similar to CIFAR-10, the same CNN settings are used for both networks.
Table IV shows the final results achieved using the two proposed models. Network1 achieves 53.52% test accuracy on CIFAR-100, while network2 achieves a higher accuracy of 59.85%.
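The data handling described for the datasets above (pixels scaled to [0, 1], mini-batches of 128 images) can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this work: it assumes 8-bit pixel values in 0..255, assumes the mini-batch size reported for MNIST also applies to the 50,000-sample training sets, and uses hypothetical helper names.

```python
def scale_pixels(image):
    """Scale 8-bit pixel values (assumed range 0..255) to [0, 1]."""
    return [p / 255.0 for p in image]


def mini_batches(samples, batch_size=128):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


scaled = scale_pixels([0, 128, 255])

# Stand-in sample indices for a 50,000-image training set.
batches = list(mini_batches(list(range(50000))))
print(len(batches), len(batches[0]), len(batches[-1]))
```

With 50,000 training samples and a batch size of 128, one pass over the data takes 391 mini-batches, the last of which holds the remaining 80 samples.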

Method                        Reference #   Accuracy
CONV. NET + PROBOUT           [45]          61.86%
Baseline + learned tree       [46]          63.15%
NOMP encoder                  [47]          60.8%
Stochastic Pooling            [23]          57.49%
NIN                           [3]           64.32%
Smooth Pooling Regions        [48]          56.29%
Beyond Spatial Pyramids       [49]          54.23%
Maxout Networks               [5]           61

In future work, more effort will be devoted to exploring more powerful networks to handle more challenging tasks. Further improvements may be achieved by combining additional techniques in the final model. Future work could also report the time consumption of each method and whether it is suitable for real-time applications. The implemented models can also be re-adapted for object detection tasks.

[Table I residue, partially recovered layer entries: 256x1x1, str:1, ReLU; 384x1x1, str:1, ReLU; 256x1x1, str:1, ReLU; 192x3x3, str:1, ReLU; 3x2; 3x2; SPP layer]

CNN architectures are therefore explored to find those best suited to image recognition and classification. Two model architectures are used in our experiments; they are shown in Table I. In addition to the structure given in Table I, each network has two additional fully connected layers built on top of the final max-pooling layer. Finally, a soft-max layer is built on top of the final fully connected layer and is used for the final scoring results. The two CNN architectures detailed in Table I are called network1 and network2. Network1 is smaller than network2: it consists of three convolutional layers and three max-pooling layers.
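To see how quickly a 32x32 input shrinks in a network with three convolutional and three max-pooling layers, the standard output-size formula, floor((n + 2p - k) / s) + 1, can be applied stage by stage. The Python sketch below does this under loudly stated assumptions: 3x3 convolutions with stride 1 and padding 1 (which keep the size), 3x3 max-pooling with stride 2, and rounding down. None of these settings is confirmed by the text, and frameworks that round up would give slightly larger maps.

```python
def out_size(n, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling stage (rounding down)."""
    return (n + 2 * pad - kernel) // stride + 1


def trace(n, stages):
    """Return the spatial size after every stage, starting from n."""
    sizes = [n]
    for kernel, stride, pad in stages:
        n = out_size(n, kernel, stride, pad)
        sizes.append(n)
    return sizes


CONV = (3, 1, 1)  # assumed: 3x3 convolution, stride 1, padding 1
POOL = (3, 2, 0)  # assumed: 3x3 max-pooling, stride 2, no padding

# Three convolution + pooling pairs applied to a 32x32 input.
sizes = trace(32, [CONV, POOL] * 3)
print(sizes)
```

Under these assumptions the 32x32 input is reduced to 3x3 after the third pooling stage, which illustrates why features that occupy only a few pixels are at risk of vanishing before the fully connected layers.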

TABLE I. TWO CNN ARCHITECTURES. THE ABBREVIATION CON REFERS TO CONVOLUTION. XxYxY: X REPRESENTS THE NUMBER OF FEATURE MAPS AND Y IS THE KERNEL SIZE. LRN AND RELU ARE ABBREVIATIONS FOR LOCAL RESPONSE NORMALIZATION AND RECTIFIED LINEAR UNIT, RESPECTIVELY.

TABLE IV. TEST SET ACCURACY RATES ON CIFAR-100 DATASET

In this work, image recognition using deep neural networks is introduced. Different model architectures are proposed by incorporating prior CNN designs. Specifically, both NIN and SPPnet are incorporated in a single unified model that achieves superior results compared with former results. A new model is then presented that outperforms prior work and achieves state-of-the-art results on the datasets. Different model architectures are also introduced, and the parameters that can influence model performance are discussed extensively, together with a deeper exploration of the parameters suited to a CNN recognition model. For evaluation, the experiments are conducted on challenging datasets: MNIST, CIFAR-10, and CIFAR-100.

FUTURE WORK