Landmark Recognition Model for Smart Tourism using Lightweight Deep Learning and Linear Discriminant Analysis

—Scene recognition algorithms are crucial for the development of landmark recognition models. A landmark recognition model is one of the main modules in the architecture of an intelligent tour guide system for the smart tourism industry. However, recognizing tourist landmarks in public places is challenging due to the common structure and complexity of scene objects such as buildings, monuments and parks. Hence, this study proposes a super lightweight and robust landmark recognition model that combines Convolutional Neural Network (CNN) and Linear Discriminant Analysis (LDA) approaches. The landmark recognition model was evaluated with several pretrained CNN architectures for feature extraction. Several feature selection and machine learning algorithms were then also evaluated to produce a super lightweight and robust landmark recognition model. The evaluations were performed on the UMS landmark dataset and the Scene-15 dataset. The experiments found that EfficientNet (EFFNET) with a CNN classifier is the best combination of feature extractor and classifier. EFFNET-CNN achieved 100% and 94.26% classification accuracy on the UMS-Scene and Scene-15 datasets, respectively. Moreover, the feature dimensions produced by EFFNET are more compact than those of the other features and were further reduced by more than 90% using Linear Discriminant Analysis (LDA) without jeopardizing classification performance; in fact, performance improved.


I. INTRODUCTION
Scene recognition is crucial for the development of many software applications in areas such as intelligent robotics, autonomous driving and intelligent video surveillance. Moreover, scene recognition is a basic component in accomplishing any object detection task [1]. The basic goal of scene recognition is to label all scene photos, whether outdoor or indoor, semantically and properly.
The magnificent scenery as well as the beautiful and historical landmarks of certain places have become attraction factors for tourists to come and visit. In this context, a software application equipped with intelligent landmark detection based on a scene recognition algorithm can be developed to serve useful tasks. For instance, a tourist may get useful information and recommendations based on the detected landmark, such as nearby food attractions, transportation and accommodation. Besides, the application may assist the tourist agent while guiding tourists visiting the attractions. However, scene recognition is a challenging task due to the difficulty of distinguishing the common structure of public scene objects such as buildings, monuments, parks, beaches and so on [2]. Scene images might also be captured from different angles, which triggers high intra-class difference problems [3].
Deep learning and transfer learning based classification is the emerging approach in many machine learning tasks [4]-[6]. In scene recognition, pretrained CNN models based on the ResNet50 architecture have been adopted [4], [5]. Although the classification accuracies obtained were good (92.17% and 94.4%), ResNet50 produced features of larger dimensionality. Consequently, many studies in other domains have adopted variants of the EfficientNet (EFFNET) CNN architecture, such as masked face recognition [7], smoke detection [8], chest X-ray screening [9]-[11] and fake face video detection [12], due to its exceptional classification performance and lightweight features.
The key contribution of this paper is a proposed super lightweight landmark recognition model trained with a Convolutional Neural Network (CNN) to address the challenge of distinguishing the common public structure of landmark scenes. The features were extracted using the pretrained CNN model of EfficientNet (EFFNET), which produced the lightest features compared to the other CNN models. Afterwards, the Linear Discriminant Analysis (LDA) feature selection algorithm was adopted, which significantly reduced the dimensionality of the features without sacrificing classification performance and even improved it. Training the recognition model with the CNN was also very efficient, as it required a very small number of epochs to complete while yielding the best classification performance.
The remainder of the paper is organized as follows: Section II reviews previous studies on scene recognition. Section III describes the methodology in more detail. Section IV presents the experimental results. The conclusions and directions for future studies are presented in Section V.

II. RELATED WORKS

Scene recognition is a subset of object recognition and can be treated as a classification problem to serve certain purposes. The problem is to describe the content or objects that exist in outdoor or indoor scene images. Scene recognition algorithms have been adopted in many areas of computing, such as human-computer interaction, robotics, smart surveillance systems and autonomous driving [1]. Scene recognition has also been studied for the tourism industry, assisting tourists or tourist guides in recognizing attractive places or landmarks. Examples include Monulens [13], a real-time mobile-based landmark recognition system, Smart Travelling [14], used to recognize tourist attractions, nearby events, police stations and hospitals, Augmented Reality (AR) based landmark detection [15], and a system to distinguish a large number of landmarks. All the aforementioned applications used handcrafted features such as Histogram of Oriented Gradients (HOG), Scale Invariant Feature Transform (SIFT) and Bag of Features (BoF), together with traditional machine learning approaches such as the Support Vector Machine (SVM). Recent works on scene recognition have shifted to deep learning-based approaches, as tabulated in Table I. The study conducted in [16] established the Places dataset to benchmark the performance of scene recognition algorithms; it is denser in terms of density and diversity of scene images in comparison to other scene recognition benchmark datasets such as SUN, Scene-15 and MIT Indoor67.
The scene recognition algorithm trained on the Places dataset outperformed the accuracy of the same algorithm trained on the ImageNet dataset for all scene recognition benchmark datasets. The evaluations were carried out using CNN-based features and a linear SVM as the classifier. The problems of high density and diversity of scene images, as well as determining whether scene images contain landmark objects, were also addressed in [2]. A metric learning-based approach was proposed in which the CNN is trained with a curriculum learning technique and an updated version of the Center loss to overcome large variations in scene images. The existence of landmark objects in scene images is determined by calculating the distance between the image embedding vector and one or more centroids per class. Besides landmark diversity, scene recognition algorithms also face high inter-class similarity, where numerous landmarks have very similar building or architectural designs. To overcome this problem, a CNN model based on ResNet50 was adopted in [4] to classify tourist attraction places in Jakarta, Indonesia, such as the Cathedral Church, Jakarta Old Town, Istiqlal Mosque and Maritime Museum. A ResNet-based model also demonstrated exceptional performance in [5] via the proposed Scene-RecNet method for classifying aerial scene views such as airports, forests and rivers. Scene-RecNet was more versatile and stable because the features are adjusted and modified in the convolutional and fully connected layers, which eventually improved processing speed, reduced storage space and yielded good recognition accuracy. Table II shows the summary of previous studies that have adopted deep learning approaches, specifically transfer models.
The study conducted in [17] addressed the problem of land-use classification in hilly and mountainous areas by using ensemble learning approaches to improve the overall classification accuracy and by optimizing the number of classes to solve the classification accuracy problem for coniferous forest. The bagging-based CNN using the Bagging (Bootstrap AGGregatING) ensemble classifier is capable of overcoming the problem of unstable procedures, in which minor differences in the data have a great impact on classification. The optimization of the number of classes was carried out using spectral clustering (SC), which divides data into subsets based on similarity. The pre-trained LeNet CNN architecture was used for feature extraction. A pretrained CNN architecture was also proposed in [18] for automatic screening of COVID-19. Specifically, two pretrained CNN architectures, ResNet50 and VGG16, were fused with the combination of Moment Invariant methods, which improved the performance of previous COVID-19 classification models. It is also worth noting that many previous studies adopted variants of the EfficientNet (EFFNET) CNN architecture for extracting features from X-rays to detect lung-related diseases. A variant of EFFNET, namely EFFNETB0 with Bi-LSTM, was proposed in [9] to detect COVID-19 from chest X-ray images faster, with high accuracy and at low cost. Along with that, the features from EFFNETB0 were fused with DenseNet121 features and the LAB and CIE color spaces. The model training was performed with a Bi-LSTM classifier that yielded the best classification accuracy compared to the other ensemble classifiers. Similar techniques were also used in [11] to detect COVID-19 from lung X-rays.
Another variant of EFFNET, EFFNETB2, was found to be the most effective variant in [10] for reducing the class imbalance problem when diagnosing pneumonia from chest X-rays. Fine-tuning the EFFNET architecture provides desirable impacts, reducing computational effort and battery usage. The evaluation of several EFFNET variants was also carried out in [12] to detect fake face videos on social media websites. Based on the evaluation, optimal detection performance was obtained using EFFNET B4 and B5, while classification accuracy dropped when using EFFNET B6 and B7. Next, EFFNET with a linear SVM was used in [7] to address the issue of image complexity in recognizing face mask wearing. In that study, the classification accuracy of EFFNET outperformed the other CNN models, DENSENET201, NASNETLARGE and INCEPTIONRESNETV2, with a very light feature size. The lightness of the features produced by EFFNET was also utilized in [8] through a proposed novel lightweight smoke detector for detecting fire in its early stages. A module for smoke region segmentation was also proposed in that study, where an encoder-decoder approach with atrous separable convolutions was investigated.
According to the comprehensive survey conducted in [1], the top three performing recognition approaches fall under Patch Feature Encoding, Discriminative Region Detection and Hybrid Deep Models. Specifically, CNN-based feature extraction using ResNet-152, AlexNet and SE-ResNeXt-101 recorded significant performance on Scene-15, Sports-8, Indoor-67 and SUN-397.
Based on the discussion of the previous studies, it can be summarized that pretrained CNN architectures are flexible and capable of providing robust recognition performance in various fields and domains. The CNN architecture is flexible because its layers and parameters can be easily fine-tuned to fit the requirements of the data so that optimum performance can be achieved. In particular, the EFFNET-based CNN architecture has so far proven decent performance both in classification and in producing lightweight features. Therefore, the use of EFFNET might also be extended to the domain of scene recognition to overcome the issue of high inter-class similarity in scene images.

III. METHODOLOGY
This section describes the methodology undertaken to carry out this research, as depicted in Fig. 2. The methodology consists of four parts which are data acquisition, feature extraction, feature selection and model training.

A. Experimental Setup
The experiments in this study were performed using Python libraries within the Spyder 4.2.2 and PyCharm 2020.3.3 (Community Edition) software tools. Specifically, the feature extractions and classifications were performed using the Scikit-learn and Keras libraries.

B. Scene Recognition Model Training
The landmark recognition model training consists of four main steps which are data acquisition, feature extraction, feature selection and classification model training.

1) Data acquisition:
The images for the UMS Landmark Dataset were captured with a Nikon D7100 camera at a resolution of 6000 × 4000 pixels between 10.00 a.m. and 11.00 a.m. Fig. 1 shows image samples of the popular landmarks in UMS [20]. This dataset has been made public and is available for download on the Kaggle website [21]. These landmarks are popular tourist attractions for sightseeing and photography. Aside from this dataset, the public Scene-15 dataset [22] for scene recognition benchmarking was also evaluated in order to test the efficacy of the landmark recognition algorithm. This dataset contains 15 scene categories, comprising outdoor and indoor sceneries. There were 200 to 400 images in each category, with an average resolution of 300 × 250 pixels.

2) Feature extraction:
Feature extraction is the process of transforming the representation of the data into meaningful semantics for determining the category of the data in classification. In this work, feature extraction was carried out using a transfer learning approach: the features of the images were extracted by re-using the model weights of pre-trained Convolutional Neural Network (CNN) models. Transfer learning reduces the time it takes to train a neural network model and decreases generalization error. The extracted features of an image form a vector of values that the model uses to characterize the image; these features were used in designing a new model.
Many previous studies have shown that the NASNetMobile model performs well, such as in the classification of rice diseases with an accuracy of 85.9% [29], ECG signal classification for cardiac examination with an accuracy of 97.1% [30], lung nodule classification from CT lung images with an accuracy of 88.28% [31] and skin lesion classification from dermoscopic images with an accuracy of 88.28% [32]. For on-device and embedded applications, MobileNetV2 offers a low-latency, low-computation architecture; for instance, MobileNetV2 was used as an embedded food recognizer [33].
The pretrained CNN models were built with various layer types. In this work, two EFFNET layers were chosen to generate the feature matrices, namely top_conv and avg_pool. The model weights used in EFFNET were trained on ImageNet, and the two layers produced 62,720 and 1,280 feature dimensions, respectively. On the other hand, avg_pool was the layer selected to generate the 2,048 feature dimensions of the RESNET152 model. Both NASNetMobile and MobileNetV2 produced 1,000 feature dimensions.
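As a concrete sketch of this extraction step, the following Keras snippet reproduces the two EFFNET feature sizes. EfficientNetB0 with a 224 × 224 input is assumed here (the text does not name the EFFNET variant); it is a plausible choice because its top_conv output (7 × 7 × 1280 = 62,720 values) and avg_pool output (1,280 values) match the dimensions quoted above. `weights=None` is used only so the sketch runs without downloading the ImageNet weights used in the paper.

```python
import numpy as np
from tensorflow.keras.applications import EfficientNetB0

# EfficientNetB0 backbone (assumed variant). The paper used ImageNet weights
# (weights="imagenet"); weights=None here only avoids the download.
top_conv = EfficientNetB0(weights=None, include_top=False,
                          input_shape=(224, 224, 3))          # ends at last conv block
avg_pool = EfficientNetB0(weights=None, include_top=False,
                          pooling="avg", input_shape=(224, 224, 3))

images = np.random.rand(4, 224, 224, 3).astype("float32")     # dummy image batch

# top_conv features: (7, 7, 1280) maps -> 62,720 values per image when flattened.
f_top = top_conv.predict(images, verbose=0).reshape(len(images), -1)
# avg_pool features: 1,280 values per image.
f_avg = avg_pool.predict(images, verbose=0)

print(f_top.shape, f_avg.shape)  # (4, 62720) (4, 1280)
```

These flat feature matrices are what the downstream classifiers consume.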
The extracted features consist of a one-dimensional (1D) feature matrix which is fed into the traditional machine learning classifiers and the 1D CNN classifier (Conv1D). To work with the 2D CNN classifier (Conv2D), the 1D feature matrix was reshaped into a 2D feature matrix. The avg_pool and top_conv layers in EFFNET produced (16, 16, 5) and (16, 16, 245) output shapes after being reshaped, respectively. Meanwhile, the avg_pool layer of RESNET152 produced a (32, 32, 2) feature shape after being reshaped, and the prediction layers of NASNetMobile and MobileNetV2 generated a (2, 2, 250) feature shape. The feature shape represents the height, width and depth of the images, which enables the edge and color spatial features to be detected.
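The reshaping step amounts to plain array reshapes; a minimal numpy sketch with the dimensionalities quoted in the text:

```python
import numpy as np

# Flat (1D) feature vectors with the dimensionalities reported in the text.
flat = {
    "EFFNET avg_pool":  np.zeros(1280),
    "EFFNET top_conv":  np.zeros(62720),
    "RESNET152":        np.zeros(2048),
    "NASNet/MobileNet": np.zeros(1000),
}

# Target (height, width, depth) shapes for the Conv2D classifier.
target = {
    "EFFNET avg_pool":  (16, 16, 5),     # 16 * 16 * 5   = 1,280
    "EFFNET top_conv":  (16, 16, 245),   # 16 * 16 * 245 = 62,720
    "RESNET152":        (32, 32, 2),     # 32 * 32 * 2   = 2,048
    "NASNet/MobileNet": (2, 2, 250),     # 2 * 2 * 250   = 1,000
}

reshaped = {name: vec.reshape(target[name]) for name, vec in flat.items()}
for name, arr in reshaped.items():
    print(name, arr.shape)
```

Each target shape must multiply out to the original vector length, which is why the four models require different 2D layouts.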

3) Classification model:
The extracted 1D (Conv1D) features were fed into the traditional machine learning classifiers and the 1D Convolutional Neural Network classifier. On the other hand, the Conv2D training features produced by EFFNET were fed into a 2D Convolutional Neural Network classifier ending in a fully connected layer. Table VII shows all the layers, the output shapes and the total parameters for EFFNET (avg_pool), EFFNET (top_conv) and RESNET152. A CNN possesses a convolution layer with several filters to perform the convolution operation, followed by a ReLU activation, a pooling layer and a fully connected layer. The ReLU layer produces the rectified feature map by performing the rectification operation element-wise. The rectified feature map is then fed into a pooling layer; pooling is a down-sampling operation that reduces the dimensions of the feature map. By flattening the two-dimensional arrays of the pooled feature map, the pooling layer turns them into a single, long, continuous, linear vector. When the flattened matrix from the pooling layer is given as input, a fully connected layer classifies the images.
The dataset underwent training and testing phases to create the classification model. In CNN training, an epoch refers to one pass of the model over the whole training dataset, whereas the batch size is the small amount of data used in each training step. A suitable number of epochs needs to be chosen such that a small gap between test and training error is observed; when an inappropriate number of epochs is chosen, underfitting or overfitting occurs.
The learning rate determines how much the weights are updated in each step of the optimization method. A fixed learning rate was used, and Adam was chosen as the optimizer.
Dropout is an effective regularization strategy for deep neural networks to avoid overfitting. The method removes units from a neural network with a desired probability; a default value of 0.5 was set in this experiment. The loss function measures the success of classification by defining the distance between two data points; in this experiment, the categorical cross-entropy loss function was used.
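Putting the pieces of this subsection together, a hedged sketch of the Conv1D classifier in Keras, with a ReLU convolution, pooling, flattening, dropout of 0.5, the Adam optimizer and categorical cross-entropy, might look as follows; the filter count and kernel size are illustrative assumptions, not values reported in the paper:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 15    # e.g. Scene-15 (assumption for illustration)
FEATURE_DIM = 1280  # EFFNET avg_pool feature length

# Conv1D classifier: convolution + ReLU -> pooling -> flatten -> dropout ->
# fully connected softmax output, as described above.
model = models.Sequential([
    layers.Input(shape=(FEATURE_DIM, 1)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),  # filters/kernel illustrative
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dropout(0.5),                                  # dropout rate from the text
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Fixed learning rate via the Adam optimizer, categorical cross-entropy loss.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 15)
```

Training would then call `model.fit` on the extracted feature matrix with one-hot encoded labels for the chosen number of epochs.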

4) Feature selection:
Feature selection plays an important role in improving the performance of a recognition model by reducing the feature dimensionality and transforming the features into meaningful features [34], [35]. Meaningful features are characterized as features that are more salient, less prone to overfitting and reduce the training execution time, which eventually improves the accuracy performance [36]. In this work, Principal Component Analysis (PCA) [37], Linear Discriminant Analysis (LDA) [38], Boruta [39] and Recursive Feature Elimination (RFE) [40] were evaluated. Table VIII shows the number of features selected after performing the feature selection algorithms. Unlike PCA, LDA and RFE, Boruta provides an automatic mechanism for determining the number of features, so manual parameter configuration to determine the number of selected features is not required. Meanwhile, the number of components selected for LDA must be set to at most the number of classes minus one. For PCA and RFE, experiments were conducted to test three configurations with different percentages of features selected: 70%, 40% and 10%.
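A minimal Scikit-learn sketch of the LDA and PCA configurations described above, using random stand-in features (the class count of 15 mirrors Scene-15; all other values are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_classes = 15
X = rng.normal(size=(300, 1280))   # stand-in for EFFNET avg_pool features
y = np.arange(300) % n_classes     # stand-in labels, every class present

# LDA can keep at most (n_classes - 1) components, i.e. 14 here:
# a >98% reduction from the original 1,280 dimensions.
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (300, 14)

# PCA keeping 10% of the original dimensionality, one of the tested settings.
pca = PCA(n_components=128)
X_pca = pca.fit_transform(X)
print(X_pca.shape)  # (300, 128)
```

The hard cap of n_classes − 1 components is what makes LDA so aggressive a reducer on high-dimensional CNN features.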

5) Classification model performance metrics:
The model's overall performance on the testing set was measured using the accuracy metric. Assume that CM is an n × n confusion matrix, with n equal to the total number of scene categories. The actual category is indicated by the rows of CM, while the predicted category is indicated by the columns. Let CM(i, j) denote the value of the cell in row i and column j, with i, j = 1, 2, ..., n. The accuracy metric is then defined by the following equation:

Accuracy = (Σ_{i=1}^{n} CM(i, i)) / (Σ_{i=1}^{n} Σ_{j=1}^{n} CM(i, j))
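A short numpy sketch of this accuracy computation on a toy 3-class confusion matrix:

```python
import numpy as np

# A toy 3-class confusion matrix: rows = actual, columns = predicted.
CM = np.array([
    [50,  2,  3],
    [ 4, 45,  1],
    [ 0,  5, 40],
])

# Accuracy = sum of the diagonal (correct predictions) / sum of all cells.
accuracy = np.trace(CM) / CM.sum()
print(accuracy)  # 0.9
```

Here 135 of the 150 predictions lie on the diagonal, giving an accuracy of 0.9.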

IV. FINDINGS
This section presents the analysis of the experiment results, comprising feature extraction, classification and feature selection performance. The first part of this section discusses the classification performance evaluation; the second part discusses the feature dimension sizes, shapes and the number of epochs used in CNN training, followed by the performance analysis for feature selection. In comparison to the Scene-15 dataset, most of the algorithms performed well on the UMS landmark dataset, as shown in Table IX. As the UMS landmark dataset had a higher image resolution, the quality of the collected images likely influenced the result. The bar charts in Fig. 3, Fig. 4, Fig. 5 and Fig. 6 compare the performance of the features and classifiers employed on the UMS landmark and Scene-15 datasets. The classification accuracy of the various features on the various classifiers is shown in Fig. 3. EFFNET with the avg_pool layer is the best feature owing to its perfect accuracy on all classifiers except MLP. To demonstrate its efficacy, Fig. 4 shows the classification accuracy of the various classifiers on the various features. Except for NASNetMobile, CNN 1D and GBDT were found to be resilient to a variety of features, attaining 100% classification accuracy on all of them. In contrast, CNN 2D performed poorly with NASNetMobile and MobileNetV2, most likely because their 2D feature shapes were incompatible with the CNN 2D classifier. EFFNET-based features performed well across many classifiers on the Scene-15 dataset, apart from GBDT and MLP, as shown in Fig. 5. RESNET152, NASNetMobile and MobileNetV2, on the other hand, produced less discriminative features. Fig. 6 shows that LSVM and CNN 1D perform consistently across all features and worked exceptionally well with EFFNET features. GBDT and MLP, on the other hand, only achieved 67.61% and 43.39% accuracy, respectively.
Moreover, CNN 2D and SGD only worked well with EFFNET features. Overall, the best classification accuracy on the Scene-15 dataset was 94.26%, obtained using the CNN 1D classifier and EFFNET (avg_pool) features. Based on the study conducted in [1], RESNET152 indeed yielded the best performance on Scene-15, Sports-8, Indoor-67 and SUN-397; however, the results of the experiment in this paper reveal that EFFNET performs better on the Scene-15 dataset. Next, the confusion matrix of classification accuracy is illustrated in Fig. 7. As plotted in Fig. 7, a few scene images were miscategorized. For instance, category 1 (office) was classified as category 5 (store), category 7 (tall building) as category 11 (coast), category 9 (street) as category 3 (living room), and category 13 (mountain) and category 12 (open country) were also confused with category 9. This shows that the high inter-class similarity classification problem still exists due to the appearance diversity of scene photos. Table X and Table XI show the detailed classification reports for both datasets. Precision is the ability of a classifier not to label a negative instance as positive; it is described as the ratio of true positives to the sum of true positives and false positives. Recall is the capacity of a classifier to find all instances that are positive. It is described as the ratio of true positives to the sum of true positives and false negatives for each class. A weighted harmonic mean of recall and precision is used to compute the F1 score, with the best result being 1.0 and the worst being 0.0. Due to the inclusion of precision and recall in their computation, F1 scores are usually lower than accuracy measurements. Support is the number of instances of the class that occur in the particular dataset.
Imbalanced support in the training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the need for stratified sampling or rebalancing.
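These per-class metrics can be reproduced with Scikit-learn; the labels below are toy values for illustration only:

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy ground-truth and predicted labels for a 3-class problem (hypothetical).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]

# Per-class precision, recall, F1 and support, as reported in Tables X and XI.
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred)
print(support.tolist())  # [3, 3, 4] -- class sizes in the dataset

# Full per-class report.
print(classification_report(y_true, y_pred))
```

For class 2, three of its four instances are predicted correctly, so its recall is 0.75, matching the ratio definition above.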

B. Features Shape and Number of Epoch
The extracted features were reshaped into 1D and 2D representations, as can be seen in Table XIII. The 1D feature shape was fed to the LSVM, CNN 1D, GBDT, SGD and MLP classifiers, whereas the 2D feature shape was fed to the CNN 2D classifier. For both datasets, Table XII and Table XIII show the features' shapes as well as the best number of epochs for training the CNN. As seen in Table XII, EFFNET generated the largest 1D features (62,720) using the top_conv layer. NASNetMobile and MobileNetV2, on the other hand, generated the smallest number of features (1,000). The best classification accuracy could be obtained with only 30 epochs via CNN 1D for all the features, whereas the number of epochs for training the CNN 2D was higher, except for MobileNetV2. Based on Table XIII, the number of epochs required for training the CNN classifiers on the Scene-15 dataset was larger than on the UMS landmark dataset; the CNN 2D required up to 150 epochs. Fig. 8, Fig. 9, Fig. 10 and Fig. 11 present the graphs of model accuracy and model loss over the number of epochs for EFFNET and MobileNetV2 using the CNN 2D and CNN 1D classifiers. To determine the appropriate number of epochs for each CNN architecture, evaluations were made at 30, 60, 90, 120 and 150 epochs. Using 120 epochs, EFFNET with the avg_pool layer managed to obtain the best classification performance with a very small gap between training and test model loss, as can be seen in Fig. 9. On the other hand, a slightly larger gap between training and test model loss can be observed for EFFNET using the top_conv layer, with stagnant model accuracy despite the larger number of epochs used, as shown in Fig. 11.
In summary, feature extraction using EFFNET with the avg_pool and top_conv layers, combined with the CNN and SVM classifiers, can be considered the best option in this context, each with its own merits. The EFFNET avg_pool layer produces a light feature size, which requires less computational effort for storage and classification. Meanwhile, the EFFNET top_conv layer, even though it produces a larger feature size, requires a very small number of epochs to train the CNN classifier to a high classification accuracy. Thus, the model trained using EFFNET avg_pool features with the CNN 1D classifier could be deployed in the development of a Landmark Recognition System.

C. Effect of Feature Selection
The performance of the feature selection methods PCA, LDA, Boruta and RFE on the UMS and Scene-15 datasets is discussed in this section. The feature selections were applied to the EFFNET, RESNET152, NASNetMobile and MobileNetV2 features.

1) UMS dataset:
The three PCA variations, as shown in Table XIV, mirror the varying proportions of features selected, as seen in Table IX. As shown in Table XIV, the baseline refers to the results achieved in the prior experiment without any feature selection treatment. Based on the overall results in Table XIV, the treatment of PCA had a positive effect on the majority of the features, as it retained accuracy performance and even produced a slight improvement on all the features, especially with the MLP classifier. On the flip side, the accuracy performance of GBDT was slightly affected regardless of the features used. As shown in Table XVII, the performance of MLP on RESNET152 substantially improved from 0.12 to 0.64. On the other hand, RFE failed completely on NASNetMobile and MobileNetV2, resulting in a significant fall in the accuracy of all classifiers used.
Next, the detailed analysis of feature selection performance on each feature and machine learning classifier is shown in Fig. 12, Fig. 13, Fig. 14 and Fig. 15. As shown in Fig. 12, except for MLP, all classifiers performed remarkably well on EFFNET features across all feature selections. The EFFNET features become more compatible with MLP when PCA or BORUTA is applied, as the accuracies increased by 55% and 89%, respectively. On RESNET152, a pattern of feature selection performance similar to EFFNET can be observed, as shown in Fig. 13. In fact, regardless of which feature selection was employed, the accuracy of SGD improved. When PCA and RFE were used with MLP, a positive effect on accuracy was also noticed. According to the graph in Fig. 14, GBDT's performance appeared consistent across all feature selections, but the performance of the other classifiers dropped when RFE was applied. The best performance of LSVM and SGD could be seen when LDA was used. On MobileNetV2, CNN performed very well with all the feature selections, and GBDT was slightly incompatible with PCA. As with NASNetMobile, LDA also improved the accuracy of LSVM and SGD. The summary of feature selection performance across features and classifiers for the UMS dataset is shown in Fig. 16. PCA was found to be the most robust feature selection method, since its performance was consistent across the various features and classifiers. However, when accuracy and feature size were taken into account, LDA's performance was the most significant. Meanwhile, if execution time is not a major concern and automatic feature selection is one of the selection criteria, BORUTA could be considered. Aside from that, the results of Tables XII, XIII and XIV imply that EFFNET provides the best and most stable features. The best classifiers were GBDT and CNN, which consistently excelled across a variety of feature selections. Table XVII shows the performance analysis of PCA on the Scene-15 dataset.
Overall, PCA did not enhance classification accuracy considerably. SGD and MLP were the only two classifiers that performed better with PCA. For instance, EFFNET-SGD accuracy increased from 0.68 to 0.94, whereas NASNetMobile's classification accuracy increased from 0.39 to 0.63.

2) Scene-15 dataset:
The accuracy performance of the LDA and BORUTA treatments, as compared to no feature selection treatment (baseline), can be seen in Table XVIII. As depicted in Table XVIII, LDA performed excellently on most features and classifiers, except EFFNET-GBDT, NASNetMobile-GBDT and MobileNetV2-GBDT. In contrast, BORUTA did not increase the accuracy of nearly all features, and a slight drop in accuracy was even observed. The analysis of RFE accuracy performance is shown in Table XIX. The pattern of the data presented in Table XIX indicates that RFE had little impact on improving almost all feature representations; however, positive effects of RFE can be seen on EFFNET-SGD, EFFNET-MLP and MobileNetV2-MLP. Fig. 17 to Fig. 20 show a detailed analysis of feature selection performance for each feature and machine learning classifier. Based on the graph shown in Fig. 17, the transformation of the EFFNET features using LDA improved the classification accuracy of LSVM, CNN, SGD and MLP. In addition, PCA, BORUTA and RFE brought significant effects on the accuracies of MLP and SGD. As for RESNET152, as shown in Fig. 18, there was a tremendous increase in accuracy when LDA was used to transform the features for CNN, LSVM and SGD. The remaining feature selection techniques, PCA, BORUTA and RFE, seemed to have less positive impact on the accuracies. Similarly, in Fig. 19 and Fig. 20, LDA still outperformed PCA, BORUTA and RFE on all classifiers except GBDT. For NASNetMobile, PCA demonstrated a slight improvement in the accuracies of CNN, SGD and MLP. There were no positive effects on the LSVM, CNN, SGD and MLP accuracies for BORUTA and RFE. Fig. 21 shows the summary of feature selection performance on the Scene-15 dataset.
LDA was the best feature selection technique for the Scene-15 dataset, since it not only worked with a wide range of features and classifiers but also improved classification accuracy significantly. BORUTA and RFE, on the other hand, had no substantial impact on classification performance. Owing to its consistent performance across numerous feature selections, it can also be inferred that EFFNET provides the best features and LSVM is the best classifier.

V. CONCLUSION AND FUTURE WORKS
This paper evaluated several transfer learning approaches and feature selection algorithms for an effective and super lightweight landmark recognition model. A landmark recognition model was trained on features extracted using pretrained CNN architectures and classified with machine learning classifiers. The new UMS landmark dataset was created, and the landmark recognition model was also evaluated on the Scene-15 dataset. The findings showed that the EFFNET CNN architecture with a CNN classifier was the best feature extractor and classifier in this study. EFFNET-CNN achieved 100% and 94.26% accuracy on the UMS landmark and Scene-15 datasets, respectively. Moreover, the features created by EFFNET were more compact than those of the other models. Furthermore, based on the evaluation of several feature selection algorithms, LDA was determined to be the best feature selection technique, vastly reducing feature dimensionality by 99.69% for the UMS landmark dataset and 98.90% for the Scene-15 dataset while maintaining good accuracy. However, although a super lightweight landmark recognition model was produced, it must undergo an extra pre-processing step to reduce the dimensionality of the features, which imposes additional computational cost. Therefore, suggested future works are to evaluate the effect of the proposed dimensionality reduction technique on the computational cost of the algorithms and to test it on additional benchmark datasets.