A Survey on Computer Vision Architectures for Large Scale Image Classification using Deep Learning

Deep learning is advancing day by day, from image classification to language understanding tasks. In particular, convolutional neural networks have been revived and have shown strong performance in multiple fields such as natural language understanding, signal processing, and computer vision. The translational-invariance property of convolutions has been a huge advantage in computer vision, allowing feature invariances to be extracted appropriately. When trained using back-propagation, these convolutions have proven able to outperform the various hand-engineered machine vision models; hence, a clear understanding of current deep learning methods is crucial. Over the years, convolutional neural networks have attained state-of-the-art performance in computer vision when applied to humongous data. Hence, in this survey, we detail a set of state-of-the-art models in image classification, from the birth of convolutions to present ongoing research. Each state-of-the-art model evolved in successive years is illustrated with its architecture schema, implementation details, parametric tuning and performance. It is observed that neural architecture construction, i.e. a supervised approach to the image classification problem, has evolved into data construction with cautious augmentations, i.e. a self-supervised approach. This evolution from neural architecture construction to augmentation construction is illustrated, and appropriate suggestions to improve performance are provided. Additionally, the implementation details and the appropriate sources for the execution and reproducibility of results are tabulated.

Keywords—Image classification; deep learning; computer vision survey; convolution neural networks; ImageNet dataset


II. CONTRIBUTION
The contributions of this survey to the existing literature are described as follows:
1) Firstly, a prerequisite introduction to convnets is provided, and the successive advancements and the individual parameters involved in the architectures are detailed.
2) The evolution of convnets from their beginning is explained, and the sequential state-of-the-art advancements in image classification utilizing convnets are elaborated in detail.
3) Finally, a set of recommendations is provided to enhance neural architectures, obtain further state-of-the-art performance and pave a path to future advancements.

III. ORGANIZATION OF THE SURVEY
The organization of this survey is described in three phases. Further, Fig. 1 describes the complete flow of this survey.
1) The first phase gives a complete description of convolutional neural networks, specifically describing the components involved in convolutions, with corresponding visual illustrations. This section provides a clear intuition of the working of convnets along with a glimpse of the terminology used. Finally, the advantages and disadvantages are provided to understand where convnets perform best and where they fail.
2) In the second phase, a clear understanding of the state-of-the-art networks is provided. Each architecture is described in detail, covering the method employed and the hyperparameters tuned for variant settings. This gives the reader insight into the flow and evolution of convnets and the developing aspects of current research. 3) The final phase provides suggestions for constructing a novel architecture that offers good transferability of features with low computational expense by considering various factors.

IV. THE CONVOLUTION NEURAL NETWORKS
First, the mathematical intuition of convolutional neural networks is discussed, and the first implementation of convnets is described. Next, the components involved in the construction of a convolution architecture are described, followed by a set of properties of convnets. Finally, the advantages and the disadvantages carried by convolutional neural networks are specifically mentioned [61,62].

A. The Initiation of Convnets
The convnets are inspired by the convolution theorem. Convolution is a combining operation between two functions of a real-valued argument,

Conv(y) = ∫ f(x) · g(y − x) dx

In the equation above, Conv(.) denotes the convolution operation, typically written as Conv(y) = (f * g)(y), where * denotes convolution. The first function, f(.), is a probability density function referred to as the input; the second argument, g(.), is referred to as the kernel. These mathematical notions helped in building the first convolutional neural network.
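As a minimal illustration, the discrete analogue of this operation can be sketched in a few lines of NumPy; the function name and array sizes below are illustrative and not taken from the surveyed papers.

import numpy as np

def conv2d_valid(image, kernel):
    # Discrete 2D convolution (no padding, stride 1); the kernel is flipped, as in the definition above.
    kh, kw = kernel.shape
    k = np.flipud(np.fliplr(kernel))
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

feature_map = conv2d_valid(np.random.rand(5, 5), np.random.rand(3, 3))  # yields a 3x3 feature map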
The first convolutional neural network was observed in the literature by LeCun, Y. et al. [63]. The main objective of this research was to implement a convnet to recognize handwritten postal zip codes. The model was trained using backpropagation and was successively able to extract variant features. The complete architecture has one input layer, two convolution layers and two fully connected layers. This first work helped revolutionize convnets to a greater extent.
Subsequently, work by LeCun, Y. et al. [64] implemented a multi-layered NN by training the model end-to-end using backpropagation, which helped to learn and implement gradient-based optimization. In addition to the previous work, this work implemented a graph transformer network for language understanding, which utilizes convnets trained with global techniques. The convolution architecture proposed is known as LeNet-5, which had 4 convolution layers and 3 fully connected layers; the final fully connected layer, i.e. the final activations, uses Gaussian connections. This initial conceptualization of convnets produced rigorous outcomes once large computational devices became available, leading to state-of-the-art performance year after year in the large-scale visual recognition challenge (ILSVRC).
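As a rough, hedged sketch of such an early convnet, the following PyTorch snippet builds a LeNet-style model; the layer sizes and the use of tanh are illustrative and do not reproduce the exact LeNet-5 configuration described in [64].

import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    # Two convolution + pooling stages followed by fully connected layers (illustrative sizes).
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # 32x32 -> 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # 14x14 -> 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNetStyle()(torch.randn(1, 1, 32, 32))  # output shape: (1, 10)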

B. Components in Convnets
There are a set of components involved in convnets, and these help in understanding the terminology regarding convnets. A visual illustration of the individual components is provided accordingly.

a) Kernel: The kernel is a grid or matrix that convolves over the input.

b) Stride: The stride is the step taken after each convolution, i.e., the number of steps the kernel moves over the input.

c) Feature Map: The feature map is the output activation obtained after completion of the convolution operation.

d) Padding: Padding is the process of filling the borders of the input equally in every dimension, i.e., mathematically the input is surrounded by zeros, eventually increasing the size of the input.

In Fig. 2 these components involved in convolution are explained in detail. The blue component, of size 2x2, is the input. The grey 3x3 matrix refers to the kernel. The dotted square grid bordered around the input is the padding. The dark green component projected on top of the input is the obtained feature map. The convolution operation is carried out by moving the kernel over the input: the kernel applies a dot product on the input, the resulting values are aggregated using a sum, and an element of the feature map is obtained. This process is iterated until the complete input has been convolved. In a convnet, the kernel size determines the shape of the kernel performing the convolution operation, and the number of kernels determines how many distinct kernels, with varying values, are applied. Padding is generally used to produce output of similar dimensions. In the next section, the advantages and disadvantages involved in convolutions are detailed. A further explanation that considers varying situations and alters the above-mentioned components is provided in [65].
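These components map directly onto the arguments of a standard convolution layer. The snippet below is a small, hedged illustration (the channel counts and input size are arbitrary) of how kernel size, stride and padding determine the feature-map size via out = floor((in + 2*padding - kernel)/stride) + 1.

import torch
import torch.nn as nn

# Illustrative values: 1 input channel, 8 kernels of size 3x3, stride 2, zero-padding of 1.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 1, 28, 28)          # one 28x28 single-channel input
feature_maps = conv(x)

# Expected spatial size: floor((28 + 2*1 - 3) / 2) + 1 = 14
print(feature_maps.shape)               # torch.Size([1, 8, 14, 14])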

C. Pros and Cons of Convnets
The convnets do have certain abilities that provide higher performance on multi-domain tasks. Even with these definite advantages, convolutions carry a set of disadvantages, which are discussed in detail below. a) Disadvantages: • Rotational sensitivity: Convnets cannot extract the features of an entity in an input that has been rotated, unless the objects in the image are rotationally symmetric. Hence, to overcome this, many techniques such as augmentation are employed: horizontal flips, vertical flips and angular rotations are applied to individual images so that features can be extracted even under rotational changes (a small augmentation sketch follows this list).
• Time-variant signals: Convolutions lack the ability to understand a signal whose pattern varies over time, as in a non-linear system. This can cause problems in speech, specifically in acoustic detection, but this problem is not seen in image recognition.
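As noted above, rotational sensitivity is usually mitigated with augmentation; a minimal sketch using torchvision, with illustrative angles and probabilities, could look like the following.

from torchvision import transforms

# Illustrative augmentation pipeline: flips and small rotations expose the network
# to flipped and rotated variants of each training image.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
])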

V. STATE-OF-THE-ART ARCHITECTURES

A. Alex-Net
Krizhevsky et al. [66] proposed an end-to-end trainable deep convolutional network for large-scale image classification, i.e., on IN12. They observed the limitations of using classical ML methods for image classification and developed an eight-layered deep NN with 5 conv layers and 3 fully connected layers. The kernel size and stride employed are clearly illustrated in Fig. 3.
Firstly, they used ReLU [67] as the non-linearity to forward activations from one layer to another, observing a speed-up of convergence when ReLU is used. Second, they used GPUs for training their network, in which two GPUs compute in parallel and communicate with each other at certain layers. This improved the performance of the model by reducing the T-1 error and T-5 error by 0.017 and 0.012 respectively. Next, for normalization, a technique named local response normalization (similar to the normalization of [68]) is applied to the conv layers, with its parameters tuned during the validation procedure. This improved the performance of the model by reducing the T-1 error and T-5 error by 0.014 and 0.012 respectively. Next, the overlapping pooling technique is utilized, in which pooling windows overlap rather than covering only adjacent, non-overlapping pixels; this is achieved by making the pooling stride smaller than the pooling window. This reduced the T-1 and T-5 error rates by 0.004 and 0.003 respectively.
The constructed architecture has 60 M parameters, as mentioned in Fig. 3. To obtain good generalization, a sequence of steps was taken to reduce the problem of overfitting in the network. Firstly, data augmentation is done: the samples for an image are increased either by translating the image in horizontal (or vertical) directions or by changing the colour intensities of its pixels; the latter is done by considering the principal components of the image colours and weighting the pixels accordingly. This procedure led to an increment of T-1 accuracy by 1%. Secondly, two dropout layers [69] with a drop ratio of 50% are attached to the fully connected layers (the last two layers excluding the class activations), i.e. 50% of the neurons are randomly deactivated during training, while all neurons are active at test time.
The complete model was trained for 90 epochs. It was optimized using SGD with an initial learning rate of 10^-2, a momentum of 9 × 10^-1 and a batch size of 128 (batches considered per iteration). When no improvement was observed on the validation set, the learning rate was decreased by a factor of 10. The model achieved a T-1 error rate of 37.5% and a T-5 error rate of 17%. Further, the model was altered in various ways and different accuracy scores were obtained; these details are tabulated in Table I.
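As a hedged sketch of this training recipe, the snippet below wires up the hyperparameters reported above; the torchvision AlexNet class is used only as a stand-in for the original two-GPU implementation.

import torch
from torchvision.models import alexnet

model = alexnet(num_classes=1000)        # stand-in for the original two-GPU network

# SGD with lr = 1e-2, momentum = 0.9, batch size 128, 90 epochs, and the learning
# rate divided by 10 whenever the validation error stops improving.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

# inside the training loop (per epoch):
#     ... train over batches of 128 images ...
#     scheduler.step(validation_loss)    # decays the lr by 10x on a validation plateau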

B. Ze-Net
Matthew D. Zeiler et al. [70] proposed a convolutional NN very similar to Alex-Net, visualizing the feature maps and kernels for a better understanding of the internal computations of convnets. They developed an eight-layered deep NN with 5 conv layers and 3 fully connected layers. The kernel size and stride employed are clearly illustrated in Fig. 4.
The visualization model is designed with an encoder and a decoder, which extract latent representations and reconstruct the image respectively. Several conv layers are used to extract spatial features, with ReLU as the activation throughout the network. The decoder helps in unpooling the visual representations through switch variables, which memorize the pooled locations in the encoder structure and map them onto the decoder structure. Further, to observe feature extraction, translation, scaling and rotation experiments are performed, in which the convnets were invariant to translation and scaling but not to rotation. Finally, to observe the localization ability of convnets, a part of the image containing an important feature is occluded; it is observed that the performance of the convnets degrades significantly under such occlusion.
The model employed is very similar to AlexNet with two variations: the filter size of the first layer is reduced from 11x11 to 7x7, and the convolution stride is reduced from 4 to 2. Augmentation is performed by subtracting the per-pixel mean from the input and using 10 different sub-crops, with techniques such as horizontal flips. The SGD optimizer was initialized with a learning rate of 0.01, and a momentum of 0.9 was used for faster training. The bias components were initialized with zero, and 50% of the neurons in the densely connected layers are dropped during the training process. The model acquired T-1 and T-5 error rates of 36% and 14.7% respectively. Further, the network pre-trained on ImageNet was applied to the Caltech-101 dataset, obtaining an accuracy score of 83.8% for 15 images per class and 86.5% when increased to 30 images per class.

C. OverFeat
This work was inspired by the standard concepts that brought good improvements to the field of classification [71][72][73][74]. Sermanet et al. [75] proposed a framework applying CNNs not only to classification but also to detection and localization. The novel localization criterion in this work is obtained by capturing and aggregating multiple object-boundary predictions. When the localization task was performed on ImageNet, the best-performing OverFeat model secured first position in the 2013 challenge.
The main objective of OverFeat is to perform classification while simultaneously locating and detecting objects with a single conv architecture. A novel method is used to detect and localize the bounding boxes of the image predicted by the neural architecture. By combining various localization predictions, the detection process acquires good features, performance is increased, and training time can be reduced. This method not only requires less computation but also provides greater performance with higher accuracy scores. The complete OverFeat model rests on three ideas, and the methodology is implemented accordingly: 1) Initially, a conv net is applied at variant locations captured in the specified image; sequentially, a sliding-window approach is applied at different scales. This eventually helped provide a better classification model, but the localization performance was degraded. 2) The system was trained not only to produce a distribution over the set of categories but also to improve localization by properly constructing the size of the bounding box to capture the region of interest for that category. 3) Lastly, a proof of concept was provided for predicting a specific category at individual locations.
The implementation works by training a conv net using a sliding window as the decision box, choosing the centre pixel and classifying it as belonging to a definite object. An advantage of this method is that the contours utilized for localization need not be rectangular. The disadvantage of the model is that it requires numerous pixel-level labels, which in turn increases computational cost. This work was the first implementation of localization and detection for ImageNet using a unified framework. For localization and detection, OverFeat allows the model to guess the label of the specified object five times; if a guess with probability 0.5 or above matches the ground-truth label, the object is assigned to that class. The five-guess pattern is chosen to identify the correct object in the presence of multiple objects without labels.
During the construction of OverFeat, a set of hyperparameters was tuned, and these are mentioned individually. The optimizer used in this method is SGD with an initial learning rate of 5 × 10^-2. Momentum was used to speed up the training procedure (an initial momentum of 0.6). Weight decay for L2 regularization is initialized as 10^-5. ReLU is used as the activation function at almost every layer. The initial five layers of the model are very similar to AlexNet, with ReLU activations and successive (max-)pooling layers. Despite many similarities, certain differences are to be noted. No local response normalization is utilized in this work, as it did not improve performance. The pooling layers implemented do not overlap, as this gave better performance. Further, using a small stride in the first two layers yields better invariances, i.e., a large stride speeds up the training process but performance in terms of accuracy can be degraded.

D. VGGNets
Simonyan et al. [76] worked on deep neural networks of different depths to study how accuracy changes with the depth of the network. The depth of the neural networks proposed in this paper varied from 11 to 19 layers. Six different networks were used by the authors to study how the models perform under different configurations. The kernel size, or receptive field, is set to 3x3 rather than 5x5 or 7x7, because a smaller receptive field helps in capturing the details of the image in a more specific way and uses fewer parameters. The six networks built in this paper are named A, A-LRN, B, C, D and E, and they differ by depth. A and A-LRN are two networks with a depth of 11 layers; the only difference is that A-LRN uses Local Response Normalization (LRN) to check how accuracy varies when LRN is added. It was observed that adding LRN to the network did not do much to improve the accuracy score. B has a depth of 13 layers, and C is an extension of B with 3 extra 1x1 convolutional layers. D and E have 16 and 19 layers in their network configurations respectively.
In the aforementioned networks, a max-pooling layer is present after a few convolutional layers or a block of these layers. Inside each block, there is a combination of 3x3 and 1x1 convolutional layers accordingly. The input image is of size 224x224 pixels, which is downsampled by the convolution and max-pooling layers; the extracted features are then passed into the dense layers for the classification or detection task. These architectures used Stochastic Gradient Descent (SGD) with 0.9 momentum and a batch size of 256. Dropout was used in two fully connected layers, followed by a dense layer and a softmax layer to predict the class of the image. The learning rate was set to 10^-2 and was decreased by a factor of 10 whenever the accuracy saturated. Training of these networks was completed after 74 epochs, during which the learning rate was decreased by a factor of 10 three times in total. First, the networks (A, A-LRN, B) were trained on a single scale of 256, and the remaining networks (C, D, E) were trained using multi-scale images (scale jittering) with the scale ranging from 256 to 512. It was observed that the performance of these networks improved significantly with the use of scale jittering and with increasing depth; the E convnet obtained a top-5 validation error of 8%, which is a competitive score.
To further assess the capabilities of the networks, the VGG team used scale jittering even more aggressively on the train-test set, and convnets D and E obtained a top-5 validation error of 7.5%. Multiple crops were also used in the next experiment and compared with the dense evaluation method; from this experiment, it is concluded that the multiple-crop method outperforms the dense method. The testing method shown by the VGG team was very different from the previously mentioned works: the fully connected layers were converted into convolutional layers so that the network could be applied to a whole image, producing a map of class scores that is pooled into a single vector of per-class scores. This vector is pushed into the softmax layer to get the prediction score.
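The conversion of fully connected layers into convolutions for dense evaluation can be sketched as follows; the feature extractor is an untrained torchvision VGG-16 used as a stand-in, and the exact handling of the score map is illustrative.

import torch
import torch.nn as nn
from torchvision.models import vgg16

backbone = vgg16()                                   # untrained stand-in; its feature maps downsample by 32x
fc_as_conv = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(),  # first FC layer (7x7x512 -> 4096) rewritten as a conv
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(), # second FC layer as a 1x1 conv
    nn.Conv2d(4096, 1000, kernel_size=1),            # classifier FC layer as a 1x1 conv
)

x = torch.randn(1, 3, 320, 320)                      # a test image larger than the 224x224 training crops
class_map = fc_as_conv(backbone.features(x))         # spatial map of class scores, here (1, 1000, 4, 4)
scores = class_map.mean(dim=(2, 3))                  # pool the map into a single vector of class scores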
An ensemble of all these convnets was made; the seven-network ensemble has a test error of 7.3%, and the ensemble of the D and E convnets has a test error of 6.8%. The two-convnet ensemble secured second place in the ILSVRC-2014 challenge, but the margin to GoogLeNet (first place) was very small. In single-network performance, the VGG architecture outperforms all other architectures (even GoogLeNet) by a margin of 0.9%.

E. Google-Net and InceptionV2
Szegedy et al. [77] presented a deep learning model built around an inception module. Google-Net's main focus was to develop a deep neural network architecture with a low computational expense. As the network goes deeper, the arithmetic operations performed by the model increase, and this gives scope for new errors when computing gradients. For these reasons, the authors suggest creating a sparse network rather than a fully connected network; the goal is simply to find, through a sparse network, optimal weights that can approximate or predict an image. Translation invariance is added in this work by building the network from convolutional layers of multiple sizes (1x1, 3x3 and 5x5). Pooling is also applied to the input, and these activations are concatenated; the filter combination is guided by correlation statistics, since simply stacking up layers would increase computational expense.
With this understanding, a new inception module is created in which the dimensions are reduced by bottleneck convolutions, i.e., 1x1 convolution kernels applied to the input; it is observed that the lower-dimensional space preserves the information of the corresponding image. These dimensionality-reduced inception modules are then stacked on each other, with a max-pooling layer of stride 2 occasionally applied between the modules.
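A minimal sketch of such a dimension-reduced inception block is given below; the branch widths are illustrative rather than the exact GoogLeNet configuration.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Parallel 1x1, 3x3, 5x5 and pooling branches with 1x1 bottlenecks, concatenated channel-wise.
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(), nn.Conv2d(96, 128, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(), nn.Conv2d(16, 32, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1), nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

out = InceptionBlock(192)(torch.randn(1, 192, 28, 28))   # -> (1, 64+128+32+32, 28, 28)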
The proposed Google-Net architecture consists of 22 layers in total. An ensemble of 7 such models was created and tested on ILSVRC-14 for classification as well as detection. Fig. 6 shows the architecture of GoogleNet. In the middle of the network, auxiliary classifiers were attached to a few inception modules; this helped to generalize more precisely. These classifiers contain a 1x1 convolutional layer with 128 filters, stacked with a fully connected layer of 1024 neurons, followed by a dropout layer and a SoftMax layer that provides the probabilities of each class to predict the image class. Every layer in this network uses the ReLU non-linearity for the activation of each neuron.
Seven distinct networks were built based on the new inception module and trained on the ImageNet dataset with different learning rates and sampling methodologies. The probabilities of all these networks were averaged to obtain the output. With this ensemble method, Google-Net obtained a top-5 error rate of 6.67% on the test and validation sets. An ensemble of 6 models was used in the ILSVRC 2014 detection challenge, which achieved an mAP of 43.9%, securing first place in both the classification and detection challenges. From this work, it can be deduced that sparse networks can be useful in deep neural networks for learning deep representations of the image while using fewer computational resources. Refer to Fig. 7 for a detailed understanding of the architecture.
A certain problem observed while training deep neural networks, covariate shift, is addressed and solved by implementing the Batch Normalization (BN) procedure. This paradigm was proposed by Sergey Ioffe and Christian Szegedy [78][79][80]. The BN procedure implemented on an Inception ensemble with an image resolution of 224x224 produced a T-1 accuracy score of 79.9% and a T-5 accuracy score of 95.1%.

F. InceptionV3
Szegedy et al. [81] took the aforementioned Inception architecture and scaled the convolution layers to provide higher performance while decreasing computational expense. This upgraded version of GoogleNet maintains suitably factorized convolutions with aggressive regularization throughout the network. The authors illustrate the work by defining a set of principles, derived from experimentation on different datasets and many architectures, and then scale the conv layers with optimization techniques. The principles of the network are:
• A cautious, gradual decrease in representation size is preferable to aggressive bottleneck layers at the beginning of the network.
• Higher-dimensional representations in the network are easier to process; piling up the activations in a conv network helps in extracting invariant features.
• Even though pooling provides faster learning, spatial aggregation in the network can be done over lower-dimensional embeddings while holding the representational features without much loss.
• The width and depth of the network must be optimally selected with a balanced criterion.
Generally, a 5x5 or larger conv layer can capture the activations of the previous layer over a wide area, but reducing the filter size decreases the number of parameters, the training time and the computational cost of the network. The inception module originally consists of 5x5 conv layers; the authors replace each 5x5 conv layer with two 3x3 conv layers, as shown in Fig. 8, with a 28% relative gain. However, reducing the filter size further (below 3x3) risks a loss of expressiveness, so the authors came up with the idea of asymmetric convolutions. The concept of asymmetric convolution is that any nxn convolution can be replaced by a 1xn convolution followed by an nx1 convolution; as n increases, the computational saving increases. The 3x3 convolutions in the network are replaced by 1x3 and 3x1 convolutions, as shown in Fig. 9; this method reduces the computational cost by 33%. The activation maps of the network filters are widened to get rid of the representational bottleneck. The network consists of 42 layers and has about 2.5x the computation of GoogLeNet.
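Both factorizations described above can be sketched as follows; the channel counts are illustrative.

import torch
import torch.nn as nn

in_ch, out_ch = 64, 64

# Two stacked 3x3 convs cover the same 5x5 receptive field as a single 5x5 conv, with fewer parameters.
five_by_five_equivalent = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
    nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
)

# Asymmetric factorization: a 3x3 conv replaced by a 1x3 conv followed by a 3x1 conv.
asymmetric = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
    nn.Conv2d(out_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
)

x = torch.randn(1, in_ch, 17, 17)
print(five_by_five_equivalent(x).shape, asymmetric(x).shape)   # both keep the 17x17 spatial size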
The model uses SGD as the optimizer with a batch size of 32 and is trained across 100 epochs with a learning rate of 0.045. The model achieves state-of-the-art results with T-1 and T-5 error rates of 21.2% and 5.6% respectively. By ensembling 4 Inception-v3 models, they obtained T-1 and T-5 error rates of 17.2% and 3.58% respectively.

G. Inception-v4, Inception-ResNet
Szegedy et al. [82] extended the idea of Inception-v3 by combining it with residual connections, which accelerates training; residual connections had achieved state-of-the-art performance in the 2015 ILSVRC challenge. Fig. 10 shows the whole architecture with residual connections inside an Inception network. There are filter-expansion layers (1x1 convolutions with no activation) inside the network. Batch normalization is omitted on top of the residual summations, and overall the number of inception blocks was increased. During experimentation, the authors found that if the network has more than 1000 filters, the residual variants become unstable and the network "dies" early in training; neither increasing the batch size nor lowering the learning rate helped. Scaling down the residuals before adding them to the activations of the preceding layer stabilized the training process.
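The residual-scaling trick can be sketched as below; the scale factor of 0.2 and the branch structure are illustrative choices, not the exact Inception-ResNet blocks.

import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    # Adds a scaled-down residual branch to the input: out = relu(x + scale * F(x)).
    def __init__(self, channels, scale=0.2):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.scale = scale

    def forward(self, x):
        return torch.relu(x + self.scale * self.branch(x))

y = ScaledResidualBlock(64)(torch.randn(1, 64, 16, 16))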
Using RMSProp [83] as the optimizer with a learning rate of 0.045, they achieved T-1 and T-5 error rates of 19.9% and 4.9% respectively on ILSVRC 2012 with Inception-ResNet-v2 as the base model. By combining three residual models and one Inception-v4, they achieved a T-5 error rate of 3.08% on the ImageNet classification challenge.

H. ResNext
Saining Xie et al. [84] developed a model succeeding the ResNet model, known as "ResNeXt". This model was the first runner-up in the ILSVRC 2016 competition. It introduces an extra dimension called cardinality, alongside the depth and width of the network. The ResNeXt blocks follow two rules: first, blocks producing spatial maps of the same size keep the same width and filter size; second, to maintain the complexity of the network, the width is multiplied by 2 whenever the spatial maps are downsampled. This model takes fewer parameters when compared to existing ResNets, with around 4.2 × 10^9 FLOPs.
Each block in the ResNeXt network has the same number of internal dimensions. ResNeXt-50 (32x4d) indicates a bottleneck width of four dimensions with 32 paths (cardinality = 32). Compared with the Inception-ResNet block, the ResNeXt block is designed with less effort per path and can be implemented in the different forms illustrated in Fig. 12. The third form of the network is chosen because it uses grouped convolutions and is much faster than the other two forms. The grouped convolution consists of 32 groups, each with an input and output of 4 dimensions. The experiments were carried out by increasing either the cardinality or the width, each of which increases the FLOPs by a factor of 2. Increasing the cardinality reduces the error by 1.3% to 20.7%, which is better than increasing the width of the network. ResNeXt-101 (64x4d) obtained a T-1 error rate of 20.4% and a T-5 error rate of 5.3% with an image size of 224x224. They also evaluated this model on a different dataset, ImageNet-5K, obtaining an error rate of 40.1%, which reduces the error by 2.3% when compared to ResNet-101.
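A minimal sketch of a ResNeXt-style bottleneck block using grouped convolutions follows; the channel sizes mirror the 32x4d description above but are otherwise illustrative.

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    # 1x1 reduce -> 3x3 grouped conv (cardinality groups) -> 1x1 expand, plus an identity connection.
    def __init__(self, channels=256, cardinality=32, bottleneck_width=4):
        super().__init__()
        inner = cardinality * bottleneck_width          # 32 * 4 = 128
        self.body = nn.Sequential(
            nn.Conv2d(channels, inner, 1), nn.ReLU(),
            nn.Conv2d(inner, inner, 3, padding=1, groups=cardinality), nn.ReLU(),
            nn.Conv2d(inner, channels, 1),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

y = ResNeXtBlock()(torch.randn(1, 256, 14, 14))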

I. Dual Path Networks
Y. Chen et al. [85] proposed the Dual Path Network (DPN) architecture. It is a combination of a residual network (ResNet) and a densely connected network (DenseNet). The proposed architecture takes the feature re-usage from ResNet and the feature exploration from DenseNet while maintaining low complexity and parameter efficiency. The DPN analysis uses higher-order recurrent neural networks (HORNNs), which benefit from sharing weights throughout the network, and shows that ResNets and DenseNets can both be expressed in the HORNN framework. By optimizing the network, they achieved state-of-the-art results on ImageNet-1k.
To understand the connection between the two networks, they formulate the HORNN update as

h^j = p^j [ Σ_{q=0}^{j-1} R_q^j ( h^q ) ]

where h^q is the hidden state of the RNN at a particular step q, j indicates the current step, R_q^j(.) is the function that extracts features from the state at step q, and p^j(.) is the transformation applied to the aggregated features. When R_q^j(.) and p^j(.) do not share weights across steps, the same features may be extracted multiple times, which can lead to feature redundancy; this is one of the drawbacks of DenseNet-style connectivity. ResNet, in turn, has the problem of limited exploration of new features, since its extraction functions are shared across steps.

J. NASNets
Zoph et al. [86] contributed a new search space for constructing neural architectures by transferring what is learned on a smaller dataset to a larger one. This research introduces a new regularization method (known as scheduled drop-path) for the models developed through the proposed search space, which improves generalization. The efficient model developed through this search space attains SOTA results in classification (IN-12 dataset). Additionally, utilizing the Faster R-CNN framework, the learned representations captured by the best model attain SOTA on the COCO dataset. The proposed NAS (Neural Architecture Search) [87] implements a reinforcement learning strategy to optimize the configurations for designing a good neural architecture. This method uses 2 different cells with a similar structure but separate weights: the normal cell and the reduction cell, shown in Fig. 14. The normal cell's input and output are of the same dimensions, whereas the reduction cell reduces the input dimensions to half (i.e. a stride of 2 is applied). These cells provide a faster and more efficient search with appropriate generalization. The NAS shown in Fig. 14 has a controller block, a recurrent neural network that predicts multiple architectures with associated probabilities. A small network (the child) is then trained to convergence to obtain an accuracy score. The gradients of the predicted probabilities are scaled by the attained accuracy and used to update the controller. As seen in Fig. 14, the cells have two hidden states, whose inputs are passed from the outputs of the preceding cells; if there are no previous cells, each hidden state takes the image as input. The architecture is formed by predicting the subsequent convolutions that can be formed using those two hidden states. The complete algorithm for NAS is given in [87]. Instead of random search, NAS provides a reinforcement learning strategy to construct a deep architecture; random search fails to provide significant results even on the CIFAR-10 dataset. Fig. 15 illustrates the NAS search architecture.
The architectures that attained the greatest performance on ImageNet as well as CIFAR-10 are shown in Fig. 16. The controller is trained with the PPO criterion [88], with the learning rate set to 35 × 10^-5. All the convolution activations use ReLU as the non-linearity with successive batch normalization layers. Additionally, bottleneck convolutions, i.e. 1x1 convolutions, were employed, and RMSProp was used as the optimizer. The best-performing model takes a 331x331 input image size and attains a T-1 accuracy score of 82.7% and a T-5 accuracy score of 96.2% with 88.9 million parameters. As a note, for object detection, NASNet features used within Faster R-CNN obtained a state-of-the-art mAP of 43.1%.

K. PNASNets
C. Liu et al. [89] proposed a network search approach related to the reinforcement-learning-based NAS above but combining different algorithms. Sequential model-based optimization (SMBO) is used, which searches for structures of increasing complexity while simultaneously learning a surrogate model to guide the search. Compared with the previous method, this approach is up to 5 times more efficient within the same search space. The search algorithm finds the best conv "cell"; each cell includes a certain number of blocks, each consisting of two input tensors and a combination operator. These blocks are stacked, and because the cell rather than the whole network is searched, the approach transfers easily from one dataset to another. The search space is explored with a heuristic approach that starts with a basic model and increases complexity as the search goes on. The detailed architecture of the model is shown in Fig. 17.
The architectural details of PNASNet are:
• Starting from simple structures, the training of the models becomes faster and the search progresses quickly.
• A surrogate (proxy) model is queried to predict the quality of candidate structures before they are fully trained.
• The search space is factorized into a product of smaller search spaces, which gives the advantage of finding good cells far more efficiently than searching over full architectures.

L. EfficientNet
M. Tan and Quoc V. Le proposed a model obtained by scaling the depth of the network, the width of the network and the resolution of the image [90]. The model is developed by using a combination of MobileNets and ResNets. By scaling those parameters, the model achieves better performance with less computational cost. The scaling is done in such a way that a constant ratio is maintained between the dimensions; this scaling method is known as compound scaling, and it is illustrated in Fig. 18. Scaling a convnet under varying resource budgets is difficult, so they increase the depth, width and image resolution by factors of P^k, Q^k and R^k respectively, where P, Q and R are constant coefficients determined by a small grid search. By increasing the depth of the network, a convnet can apprehend more complicated features; however, the problem of vanishing gradients arises and the network becomes harder to train, so scaling the depth with the coefficient P keeps the network balanced. The next constraint is balancing the width of the network, which is usually done in very small models; increasing the width can capture more fine-grained features and takes less time for training, but their experiments show that ever-wider networks give diminishing accuracy gains. The resolution of the image is scaled by the factor R, since higher-resolution images take more time for training. The proposed method enhances accuracy and optimizes FLOPs. Depth, width and resolution coefficients of 1.4, 1.2 and 1.3 respectively were found to be accurate, at 2.3B FLOPs.
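A small sketch of the compound-scaling rule, using this section's symbols P, Q and R (the coefficient values are the ones quoted above; the exponent k controls how much extra compute is spent):

# Compound scaling sketch: depth, width and resolution grow together as P**k, Q**k and R**k.
P, Q, R = 1.4, 1.2, 1.3          # coefficients quoted above (depth, width, resolution)

def scale_config(base_depth, base_width, base_resolution, k):
    # Return a scaled (depth, width, resolution) configuration for compound coefficient k.
    depth = round(base_depth * (P ** k))
    width = round(base_width * (Q ** k))
    resolution = round(base_resolution * (R ** k))
    return depth, width, resolution

print(scale_config(base_depth=18, base_width=64, base_resolution=224, k=2))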
The model (EfficientNet-B7) achieved T-1 and T-5 accuracies of 84.4% and 97.1% respectively with 66M parameters, which is 8.4x smaller than the previous state-of-the-art network.

M. FixResNeXt
Touvron et al. [91] performed augmentation trials to acquire better generalization by choosing appropriate train and test resolutions for a network. Their experiments show that a lower training resolution combined with a higher testing resolution improves performance considerably while reducing the training computational cost. This procedure was implemented on ResNeXt-101, outperforming the existing models and obtaining state-of-the-art ImageNet performance in 2019. A significant improvement was observed when the training and testing pipelines were fine-tuned separately. A joint optimization is done by scaling the train and test resolutions equivalently while maintaining the individual RoC (region of classification) sampling. To overcome the distribution shift, the last layers of the model are prioritized for fine-tuning while varying the crop resolution. A detailed analysis is performed in which the crop resolution is increased at the testing phase while, during training, RoC sampling is done appropriately. This eventually acquires good generalization by using lower train resolutions and higher test resolutions. Halving the training resolution reduces computation roughly 3-fold, which in turn speeds up the training procedure, and the memory saved allows larger training batches, which also benefits performance. A further modification is made to the model by adjusting the activation statistics of the layer preceding the global average pooling (GAP) layer. When these techniques are implemented on ResNet-50 and the test size is varied, the following results are obtained (CR denotes crop resolution): with a CR of 64, the model obtained an accuracy score of 29.4% on ImageNet; increasing the resolution to 128, the model obtained 65.4%; and a higher accuracy score of 78.4% was obtained for a CR of 288.
It is observed that increasing the test resolution further (beyond 288) causes the accuracy score to gradually decay. Even after assigning an appropriate test resolution, a set of skewed activations was observed, which was addressed by two methods: first, a parametric adaptation, and second, adaptation by fine-tuning. Hence, the parameters of the architecture are addressed in detail with experimental results. Instead of a single train-test split, 10-fold cross-validation is used, reporting the mean and standard deviation for each run.
During the training process, extra training data was provided for most of the implementations. The best-performing model (ResNeXt-101) has 829 M parameters. While training ResNet-50, the learning rate was initialized at 0.1 and decayed by a factor of 10 every 30 epochs. Initially, 512 samples were fed into the network as a batch, with horizontal flips, colour jittering and random resized crops as augmentation. The experimentation was performed on eight Tesla V100 GPUs, and a set of 80 CPUs was used alongside the GPUs. The experimentation was carried out on standard pre-trained networks such as ResNet-50, ResNeXt-101 and PNASNet. Large-network classification was done by completely fine-tuning PNASNet-5-Large with a train resolution of 331x331, which obtained the highest T-1 and T-5 accuracies of 83.7% and 98.0% at a 480x480 test resolution. ResNeXt-101 was trained at a 224x224 resolution to obtain a state-of-the-art accuracy of 86.4% with a 320x320 test image resolution. Further, this method was effective even on various transfer learning tasks, obtaining state-of-the-art performance on the CUB-200-2011 and Birdsnap datasets.

N. NoiseStudent
Q. Xie et al. [92] applied self-training for large-scale image classification. This approach is based on a student-teacher learning paradigm. First, an EfficientNet is trained on a set of labelled ImageNet images (as a teacher model) and then produces pseudo labels by evaluating a different dataset consisting of 300 million unlabelled images. Second, a larger EfficientNet model is taken as the student model, and this is trained on the combined labels, i.e., the pseudo-labelled and labelled images. Next, the student model replaces the teacher and this process is iterated to attain significant performance. Note that the teacher model is not noised and its pseudo labels are produced through a standard supervised approach, whereas for the student model noise components such as dropout, stochastic depth and random augmentations are applied. These implementations help the student model generalize better than the teacher model.
There are certain hyperparameters involved in tuning the model. The batch size is assigned as 2048 by default; varying batch sizes of 512, 1024 and 2048 were tried with the EfficientNet model, and all turned out to have the same performance. The student model was trained for 350 epochs, and smaller student models were trained for 700 epochs. The noise applied to the student model uses a dropout of 50%; random augmentation is applied with two operations at a magnitude of 27; and the survival probability for stochastic depth is set to 0.8. The Noisy Student model beats the previous state-of-the-art, BiT-Large, with a 0.9% increase in accuracy, i.e., the best-performing Noisy Student model acquired 88.4% T-1 accuracy and 98.7% T-5 accuracy. This model has 480 million parameters, approximately half the computational resources of the previous state-of-the-art, and is trained with 300 million unlabelled samples taken from the JFT dataset. The best-performing model uses EfficientNet-L2 as the backbone for the Noisy Student approach, as mentioned in Fig. 19. Further, the importance of adding a noise component when training the student is discussed and evaluated: without noise, the training signal tends to vanish, since the student simply imitates the teacher and attains near-zero cross-entropy loss; the T-1 accuracy obtained on ImageNet in this setting is 83.9%, a large gap from the proposed method and from the current state-of-the-art. Training the two models in this student-teacher fashion on the two disjoint pools of data thus improves performance to a great extent.
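The iterative self-training loop can be illustrated with a small, runnable toy analogue; here scikit-learn models stand in for the teacher and student EfficientNets, and Gaussian input noise stands in for dropout, stochastic depth and random augmentation.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Toy data: a small labelled pool and a large unlabelled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_lab, y_lab, X_unlab = X[:200], y[:200], X[200:]

teacher = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)        # 1) train the (un-noised) teacher

for _ in range(3):                                                    # iterate the student-teacher loop
    pseudo = teacher.predict(X_unlab)                                 # 2) pseudo-label the unlabelled pool
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, pseudo])
    noisy_X = X_all + np.random.normal(scale=0.1, size=X_all.shape)   # 3) noise the student's inputs
    student = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(noisy_X, y_all)
    teacher = student                                                 # 4) the student becomes the new teacher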

O. BiT (Big Transfer)
Kolesnikov et al. [93] performed transfer learning for large-scale image recognition, aiming to improve hyperparameter tuning and sample efficiency. The parameters are tuned cautiously by focusing on certain components for various vision tasks to improve performance with feature reproducibility. To provide greater performance, transferability is established on large-scale vision tasks, and transfer learning is performed to produce three variant models: BiT-Small, BiT-Medium and BiT-Large. BiT-Small is pre-trained on the standard ImageNet-1K dataset, BiT-Medium on the larger ImageNet-21K dataset with 21 thousand labels, and finally BiT-Large utilizes the JFT dataset consisting of around 300 million samples and approximately 1.2 labels per sample. A set of tricks is considered by understanding certain components to attain higher performance for a neural network; two necessary groups of components for building an effective neural architecture are addressed, namely upstream and downstream components.
Upstream components: Upstream components apply to pre-training. The components considered during upstream pre-training are scale, group normalization and weight standardization. Properly adjusting these components leads to a lower computational budget and greater efficiency; further, group normalization and weight standardization enabled faster training with large batches.
Downstream components: Downstream components are applied when fine-tuning on a target visual task. Here, a heuristic rule is applied that discards computationally expensive hyperparameter searches. Simple image pre-processing techniques are used, such as resizing the input to a square, taking a random smaller crop, and performing a horizontal flip at training time. The parameters tuned while pre-training the model, upstream and downstream, are discussed independently. Most of the BiT models utilize ResNet-V2 as the backbone architecture for transferability. The upstream models use SGD as the optimizer, initializing the learning rate at 3x10^-2, with a momentum of 0.9 added for faster convergence. The input samples were isotropically resized to 224x224. The small and medium models were trained for 90 epochs, with the learning rate reduced by a factor of 10 after 30, 60 and 80 epochs; the large model was trained by decaying the learning rate after 25%, 57.5%, 75% and 92.5% of the training progress. Similarly, for the downstream task, SGD was used as the optimizer with an initial learning rate of 0.03 and a momentum of 0.9 to aid convergence, and the input shapes were resized appropriately to the context of the dataset. In the large-scale visual classification challenge, the T-1 accuracy obtained by the BiT-Large model on ImageNet-1K is 87.54% (with a standard deviation of 0.02). It remained a state-of-the-art model not only for ImageNet but also for multiple standard datasets such as CIFAR-10, CIFAR-100, Pets, Flowers and VTAB. Further, BiT was analysed on object detection using RetinaNet as the backbone, attaining a state-of-the-art average precision of 43.8, an improvement of around 7.3% owing to BiT transferability.
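A short sketch of the upstream combination of group normalization and weight standardization mentioned above; the group count and layer sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    # Conv2d with weight standardization: each filter's weights are normalized to zero mean and unit variance.
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride, self.padding, self.dilation, self.groups)

block = nn.Sequential(
    WSConv2d(64, 128, kernel_size=3, padding=1),
    nn.GroupNorm(num_groups=32, num_channels=128),   # group normalization instead of batch normalization
    nn.ReLU(),
)
y = block(torch.randn(2, 64, 32, 32))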

P. ViT (Visual Transformer)
Dosovitskiy et al. [94] applied a transformer, the standard neural architecture for natural language processing, to computer vision tasks, bringing self-attention to large-scale visual recognition. This vision transformer was able to reach the present state-of-the-art at a lower computational cost than convnets. The transformer has demonstrated its performance across variant fields. A Vision Transformer (ViT) is trained by appropriately constructing the input embeddings to the transformer so that it extracts visual representations: the patch embeddings are obtained by reshaping the 2D image into a sequence of 2D patches, and these embeddings are fed into the transformer encoder. The ViT model and the models considered for comparison were trained with certain parameters. The optimizer used is Adam with the initial learning rate set to the default (0.001); further, β1 and β2 were set to 0.9 and 0.999 respectively. A weight decay of 0.1 was applied, which helped performance. For fine-tuning, the model was run with a batch size of 512 and SGD as the optimizer, with a small momentum applied to improve training speed. A maximum dropout of 0.1 was used for the ViT model trained on the large ImageNet dataset. The self-attention provided by the transformer helps combine the features extracted in the lower layers while focusing on the definite set of entities residing in the image.
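The patch-embedding step described above can be sketched as follows; the patch size, embedding dimension and encoder depth are illustrative, and the class token and position embeddings of the full ViT are omitted for brevity.

import torch
import torch.nn as nn

# Split a 224x224 image into 16x16 patches and linearly project each patch to an embedding.
patch, dim = 16, 768
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # equivalent to a per-patch linear layer

x = torch.randn(1, 3, 224, 224)
tokens = to_patches(x).flatten(2).transpose(1, 2)                   # (1, 196, 768): 14x14 patch tokens
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=2)
features = encoder(tokens)                                          # self-attention over the patch tokens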

VI. SUGGESTIONS FOR ARCHITECTURE CONSTRUCTION
Observing the state-of-the-art literature on convnets, certain factors stand out for the construction of a novel architecture with greater performance and lower computational cost. These factors come from analysing the minute parameters that yield a better model. The performance of various SOTA models is presented in Table II.

A. Architecture Tips
The architecture tips cover the factors that influence the development of a resilient architecture that extracts invariant features. A larger kernel size in the beginning layers of the convolution causes loss of information, which degrades performance but speeds up the training process. Similarly, the higher the stride, the faster the model trains, but accuracy degrades accordingly. Without adding residual connections, developing a model just by increasing the depth can lead to the problem of degradation. A network architecture without bottleneck activations can explode in terms of computational cost; hence, a set of bottleneck activations should be inserted into the network. Varying the dimensionality of the receptive field can provide invariant features. An architecture trained on very little data cannot perform well on most unseen samples. Hence, solutions to these problems are explicitly provided below for building a resilient convnet.
• A small receptive field provides a set of variant abstract features which carry detailed invariances.
• A smaller stride can eventually provide good representation by reducing the loss of the information through excessive pooling.
• To avoid the problem of degradation, residual connections can be applied accordingly. Further, this improves the performance and also reduces the computational cost for deeper architectures (a sketch of such a block follows this list).
• A set of bottleneck connections can provide a generic feature representation and reduce computational effort while developing a convnet width-wise.
• The asymmetric receptive fields with an appropriate bottleneck layer provide a greater representation of features.
• Finally, a model trained on multiple tasks with an appropriate set of samples can eventually improve in terms of performance acquiring state-of-the-art without much effort in parametric tuning.
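Several of these tips can be combined into a compact residual bottleneck block; the sketch below (all sizes illustrative) uses a small 3x3 receptive field, 1x1 bottlenecks and an identity skip connection.

import torch
import torch.nn as nn

class BottleneckResidual(nn.Module):
    # 1x1 reduce -> 3x3 conv -> 1x1 expand, with an identity skip connection to avoid degradation.
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1), nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1), nn.BatchNorm2d(bottleneck), nn.ReLU(),
            nn.Conv2d(bottleneck, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

y = BottleneckResidual()(torch.randn(1, 256, 28, 28))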

B. Optimization Tips
The optimization tips concern developing good representations in convnets by altering the hyperparameters and indicating their correct usage. The hyperparameters widely used in the deep learning paradigm to observe a conventional change in model behaviour during stochastic optimization are described in detail.
• Dropout: Dropout helps in generalizing the model by deactivating a random set of neurons during training and using all of them at validation or test time. Hence, selecting the percentage of dropout is crucial. According to present implementations, most research uses 50%, but it can be varied from 30-50%; choosing a value in this interval provides good generalization in densely connected networks.
• Normalization: Local response normalization, implemented in AlexNet, did not perform well in most instances; as it has a huge number of hyperparameters, it is a complicated technique to apply. The batch normalization technique that followed provided a great deal of success in convnets by solving the problem of covariate shift. It is mostly utilized in present research, as it includes very few parameters to tune and works globally for variant architectures. Next, some problems in batch normalization were addressed and overcome by group normalization, which shows only a minute performance difference when applied to smaller tasks but clear benefits when applied at large scale. Hence, group normalization can be used while developing a deeper model, and for a small architecture batch normalization and group normalization work equivalently.
• Lastly, selecting an optimizer and scheduling the learning rate is still tedious. Hence, most research uses SGD with a learning rate that varies with the problem and a momentum chosen by observing the convergence. For building small-scale convnets, the Adam optimizer with a small learning rate and a large batch size gives good performance, whereas for training a large-scale model the parameters may vary with the architecture and the choice of dataset (an example setup is sketched after this list).
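As a small illustration of these optimization tips, a typical setup might look like the following; the values are illustrative defaults, not prescriptions.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

# SGD with momentum and a step schedule, typical for larger-scale training ...
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
schedule = torch.optim.lr_scheduler.StepLR(sgd, step_size=30, gamma=0.1)   # decay the lr by 10x every 30 epochs

# ... or Adam with a small learning rate for a small-scale convnet.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dropout in the 30-50% range for densely connected layers, as suggested above.
classifier_head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(256, 10))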

VII. CONCLUSION
A detailed survey of the previous state-of-the-art has been conducted. Additionally, a section explicitly gives an intuition for developing a good model with high performance and low computational cost, illustrating how a resilient architecture can be developed by tuning specific hyperparameters, which is insightful for developing deep models.
Further, a set of details not covered in this survey is held as our future direction. There are variant models developed between these high-performance models that are not mentioned in this work. A set of small-scale models that resolve problems in convolutions (e.g., Capsule Networks) is not described explicitly. A detailed implementation framework that can reduce the effort of utilizing these architectures is also not provided. These are taken as challenges for subsequent research, and designing a framework overhauling these problems is chosen as the future scope of work.