Detection of Android Malware App through Feature Extraction and Classification of Android Image

Android apps face security risks due to the rapid growth of Android devices, and the Android ecosystem poses many challenges for malware detection. Traditional static, dynamic, and hybrid techniques require a high rate of human intervention to detect Android malware. Most current techniques also face significant challenges: inspection of Android Package Kit (APK) file structures, increased complexity, high processing power, more storage space, and much human intervention. This paper proposes Machine Learning (ML) based algorithms to detect Android malware apps through feature extraction and classification of grayscale images. In the proposed approach, most of the files of an APK, such as the multiDex, resources, certificate, and manifest files, are transformed into a grayscale image, and an image algorithm extracts the local features of the image. Different ML models then classify the local features using multiple images of malware families. This approach deals with obfuscation attacks, in which malicious code can hide in any file of the APK. The proposed approach reached an accuracy of up to 96.86%, and its computation time did not exceed that of the existing techniques. The proposed work has high classification accuracy, low complexity, and low validation loss.


I. INTRODUCTION
The Android operating system (OS) is the most popular OS in the smart device ecosystem. Because of smart devices, every Android user depends closely on Android Application Package (APK) files. Android users now share sensitive information and perform banking operations and e-shopping; location information, user identity, and data privacy are also involved. Device security is therefore the biggest challenge and a severe issue in Android. A survey report by GDATA in 2019 [1] showed that 1,852,170 Android malware samples were detected in the first half of 2019; according to these data, an Android malware sample is detected every 8 seconds. A statistical report states that eight out of ten Android devices are infected by malicious apps [2]. Another research report stated that Android accounted for 86% of the total device market in 2017. The popular GlobalStats website showed that Android-based devices accounted for 73% of total device sales in 2019 [3]. Due to the popularity of Android devices, Android apps are targeted more than other kinds of apps. According to one mobile malware evolution report by Kaspersky for 2018 [4], 5,321,142 apps were installed on devices, 151,359 mobile apps were detected as Trojans, and 60,176 were detected as mobile ransomware.
Android users are threatened by different types of malware families, such as downloaders, bankers, and hidden ads, some of which are distributed through the Google Play store [5]. Most large-scale attacks target the Android OS. Hackers mainly focus on attacking the games, banking, academic, and e-shopping domains. Many malicious apps are published in these domains, where a gap exists between app development and the number of detection works. Third-party stores host untrusted apps, and most gaming apps contain adware because of the repackaging technique [6]: with repackaging tools, an attacker disassembles the original app, adds malicious code to the original code, reassembles the app, and uploads it to a third-party store. Identifying malicious apps is therefore a challenging task and a severe issue. Most existing techniques are static or dynamic, and most identify Android malware on the basis of behavior and signatures.
Static techniques do not require running the apps; they disassemble the code and extract features of the apps for identification. The dynamic approach always needs to run the application and identifies Android malware through behavior and signatures. These techniques have significant drawbacks: they require more computing power, resources, and space [7]. Dynamic analysis has also been evaded by some powerful and intelligent malware [8]. Moreover, existing dynamic and static techniques rely on manual human intervention and need domain expertise in reverse engineering [9]. Existing approaches used a single classes.dex file, but apps now contain multiple class files, or multiDex files [10], which have not been converted into grayscale images to detect malware.
Our proposed work takes care of all essential files of the APK, namely the multiDex (MD), resources.ARSC (RS), Manifest.xml (MX), and certificate (CR) files, to detect Android malware. Existing approaches need human intervention to separate the dex, Manifest.xml, resources.ARSC, and certificate files before converting them into a grayscale image [11][12].

META-INF: This directory is essential in Android apps; it holds the information about the signature.
Lib: The lib directory holds the native libraries for specific device architectures, such as armeabi-v7a and x86.
Res: Keeps resources, such as images, that are not compiled into resources.arsc.
Assets: Raw information about resources.
AndroidManifest.xml: Meta information about the app, such as the version, content, and name of the APK file.
Multiple classes.dex: The main and necessary files of the app, which run the Java class methods on the device.
Resources.arsc: All compiled resources used by the app on the device.
Android apps are developed from Java .class files, which the DX tool converts into DEX files. The DEX and manifest files are essential in an APK; a DEX file consists of data structures, and the interpreter uses the data types that belong to them. All static reverse engineering tools use the DEX files to reassemble apps for reengineering, and multiple methods have been proposed to protect the DEX files. Our proposed approach does not require any human intervention, does not require separating files, and does not need reverse engineering to find the different types of files. Because it works without any reverse engineering operation, it takes less computation power and time to detect Android malware. Our approach used the DREBIN and AMD datasets, containing 10560 apps (5000 benign and 5560 malicious). The grayscale image datasets, each containing 10560 samples (5000 benign and 5560 malicious), were constructed from diverse files in the APK collections. Firstly, all benign and malicious APKs are converted into grayscale images; a block diagram is depicted in Fig. 2. Secondly, local features are extracted from the images using image-based feature extraction techniques such as SIFT, SURF, and ORB. Thirdly, the BOVW approach converts the multiple local feature descriptor vectors of an image into a single feature vector to feed into ML classifiers. Finally, the extracted global and local features are classified with ML techniques such as AdaBoost, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF). The proposed approach works on the raw bytes of grayscale images; its main advantage is that it requires neither reengineering operations nor the construction of different types of datasets. The main disadvantage of existing approaches is that they require human intervention.
Our proposed approach is free from human intervention and reengineering operations. Many ML algorithms have been developed for the detection of malware apps. The most common challenge in Android malware detection is the obfuscation attack: malicious code can be hidden in any file of the APK, which is very dangerous for Android malware detection. The novelty of our work is that reverse engineering is no longer needed to obtain all files of the APK; the entire APK file structure is converted directly into a grayscale image. Most existing techniques transform separate files into grayscale images for image-based Android malware detection. Existing methods do not consider multiple DEX, Shared Object (SO), META-INF, lib files, etc.; they observe only the manifest, a single DEX file, and resource files. The functions of the multiDEX (MD), Resources.ARSC (RS), Manifest.xml (MX), and certificate (CR) files of the APK are explained separately because they are used to detect Android malware. The proposed methodology to detect Android malware is represented in Fig. 1.

II. RELATED WORK
Many researchers have worked on Android malware detection; some of this work is reviewed in this section. One approach was designed to analyze suspicious behaviors and detect resource abuse [13]. Its major drawback is the need to decompile the app and embed hook code; it uses runtime events to track and monitor logging. The SafeDroid static framework statically analyzes the DEX (Dalvik Executable) files and extracts binary feature vectors to train various ML classifiers [14]. Other work uses multiple features, namely system calls, app permissions, and system events [15]; these features train RF classifiers to determine whether an app is malicious. Some approaches distinguish malicious from normal apps on the basis of permission patterns [16], with the required permissions extracted statically. The most popular permissions are registered into classes [17] that define whether a permission is benign or malicious, and the permission classes of an app determine whether the app is benign or malicious. Moreover, a data mining technique constructed permission patterns to determine whether an Android app is malicious or benign; here, the authors applied a bi-clustering method to the used permissions. The authors also used Android app package information and permissions to train KNN, Linear Discriminant (LD) function, and Radial Basis Function (RBF) network classifiers. Application Programming Interface (API) system calls integrated with permissions [18] have been used as features to train an RF classifier for Android app classification. A very lightweight method detects Android malware through ML and dataflow-related API system calls [19]. In [20], n-gram sequences are used to extract features from the opcodes of malicious and benign apps.
This approach used a limited number of features to train RF and Support Vector Machine (SVM) classifiers. The approach in [21] installs the Android application (APK) on Android devices to extract dynamic features such as network behavior, memory consumption, computation power, time-space, battery, and binder; these features are used to classify malware. The dynamic approach in [22] captures the network traffic of running Android applications from different Android devices and correlates this traffic with malware URLs and with DNS service network traffic to detect malware. The approach in [23] uses fog computing to reduce the load and dynamically enhance the computation power available to detect Android malware. Another approach [24] applies API system calls and network behaviors collectively to detect Android malware. In [25], the authors present multiple network-behavior and emulator-based dynamic experiments to analyze Android malware. An extension kit embedded in the Android operating system has been proposed [26] to deal with confused deputy attacks (a genuine APK is manipulated into communicating with a trusted application through Inter-Process Communication (IPC)). A permissions-based policy was enhanced in [27] with runtime tracking and communication link analysis under a pre-defined policy to prevent malicious behavior. Moreover, a signature set constructed from network logs has been correlated with permissions-based methods [28] for Android app classification. A recent approach [29] uses reverse engineering techniques to decompile the APK, extract the source code, and convert it into a grayscale image; the constructed image dataset is used to train a convolutional neural network (CNN) to detect malicious apps. API system calls and semantic information have been used to train a Long Short-Term Memory (LSTM) [30] model to classify Android malware.
Moreover, a hybrid approach combines a CNN and a deep autoencoder (DAE) [31] to detect Android malware. The hybrid scheme in [32] extracts dynamic and static behavior features to train a deep learning model. The approach in [33] extracts four features, namely permissions, rate of permissions, system events, and API system calls, to train a collective RF classifier. DREBIN [34] is a static analysis approach that used the same malicious apps as our experimental work (5,560 malicious apps). This method gathers as many features of an app as possible and embeds them in a joint vector space; the large number of features increases the complexity of the determination. The work in [35] proposed classification based on dependency graphs: the features extracted from the dependency graph form a semantic feature set built from the weighted contextual APIs of the graph, and a similarity metric identifies apps with the same behaviors. Sensitive and important API calls are assigned weights according to the Android malware family [36]. A function call graph (FCG) is built for every app, and from each FCG a sensitive API call-related graph (SARG) is constructed; the SARG contains the sensitive API call nodes and their parent nodes. Multiple machine learning approaches are then trained to classify the common behavior of each malware family. Moreover, a hexadecimal representation is extracted from the source code and converted into RGB images [37]; the RGB image dataset is used to train a CNN classifier to classify Android malware. Furthermore, in [38] the Android application (APK) is converted into grayscale images, and features extracted from the grayscale images are used to train an RF classifier. In [39], features are extracted from 2D opcode sequences and assigned weights based on their occurrence; the weight values are converted into grayscale images. Image-based detection has so far been applied only to a limited extent in the Android malware domain.
Local and global feature extraction from the entire APK is more effective than existing approaches. Our work converts the APK into a grayscale image without any reverse engineering tools. It does not require separating the files of the APK, such as the resource, multiDex, manifest, and certificate files. Moreover, it does not require human intervention; most existing techniques share the common issues of human intervention and of extracting the source code with reverse engineering tools.

III. METHODOLOGY
This section describes the proposed model in full detail. The first subsection briefly describes the constructed dataset, the second describes the extracted features, and the last describes the training of the Machine Learning (ML) classifiers. The primary objective of our proposed approach is to detect Android malware. The DREBIN dataset contains the most famous malware, such as DroidKungFu, GingerMaster, GoldDream, and Fake Installer. Many researchers and various institutions have used the DREBIN dataset to analyze and investigate Android malware. Our proposed models use the DREBIN dataset because it has 179 different Android malware families, which makes it an appropriate dataset for investigation.

A. Transformation APK into Grayscale Images
The Android APK files are converted into grayscale images [40]. In this article, the malware images are constructed from the files of the Android app in the malware APK. The APK is transformed into a sequence of 8-bit vectors, which is then transformed into a grayscale image: every 8-bit substring is converted into a decimal value between 0 and 255 that serves as one pixel, as shown in Fig. 2. Any digital file on a memory device is stored as a stream of bits, '0' and '1'. The model reads every APK file as a binary stream, groups every eight bits, and stores them in a new file with an image file extension.
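The byte-to-pixel mapping described above can be illustrated with a minimal sketch. The image width of 256 pixels, the zero padding of the last row, the PGM output format, and the function names are assumptions made for illustration; the paper does not state them.

```python
import math
from pathlib import Path


def bytes_to_grayscale(data, width=256):
    """Map each byte (0-255) to one pixel; pad the last row with zeros."""
    if not data:
        raise ValueError("empty input")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def write_pgm(pixels, path):
    """Store the pixel rows as a binary PGM (P5) grayscale image."""
    height, width = len(pixels), len(pixels[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    body = b"".join(bytes(row) for row in pixels)
    Path(path).write_bytes(header + body)


def apk_to_image(apk_path, image_path, width=256):
    """Read the whole APK as a raw byte stream and save it as an image."""
    data = Path(apk_path).read_bytes()
    write_pgm(bytes_to_grayscale(data, width), image_path)
```

Because the whole APK is read as one byte stream, no unpacking or decompilation step is involved, which matches the motivation given in the text.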

B. Local Features Extraction
A local feature is a defined image object (basically a cluster of pixels or a small blob in the image) [41]. The local features of an image are its most salient points, which define the image descriptor vectors (DV), or feature vectors. The set of feature vectors is computed by different types of algorithms. Our proposed approach used three algorithms to extract the local features: Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB). These methods are well known in the malware domain for their good accuracy.
1) SIFT: SIFT locates interest points from the difference of the responses at two nearby scales:

$$D(x, y, \rho) = L(x, y, k\rho) - L(x, y, \rho) \qquad (1)$$

where L(x, y, ρ) is the Laplacian of Gaussian (LoG) at position (x, y) and scale ρ, L(x, y, kρ) is the LoG at position (x, y) and scale kρ, and kρ is a scale slightly larger than ρ. SIFT describes each interest point found with Eq. 1 by a 128-dimensional descriptor. Each feature extracted from the input images by SIFT is matched to its k nearest neighbors. SIFT is used in applications ranging from object recognition to panorama stitching. As a result, the system is insensitive to the ordering, positioning, scale, and illumination of the images. The Laplacian is a two-dimensional isotropic measure of the second spatial derivative of an image. The Laplacian of Gaussian highlights areas of rapid intensity change and is often used in zero-crossing edge detectors. In our system, the image is first smoothed with an approximation of a Gaussian filter to reduce the sensitivity to noise.
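The difference between the responses at scales kρ and ρ can be sketched in plain Python. This is an illustration only: it assumes that L denotes the Gaussian-smoothed image, as in the standard SIFT formulation, and the factor k = √2, the 3ρ kernel truncation, and the edge replication are assumptions not taken from the paper.

```python
import math


def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(image, sigma):
    """Separable Gaussian blur with edge replication (rows, then columns)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2

    def blur_rows(img):
        out = []
        for row in img:
            n = len(row)
            out.append([
                sum(kernel[j + radius] * row[min(max(i + j, 0), n - 1)]
                    for j in range(-radius, radius + 1))
                for i in range(n)
            ])
        return out

    def transpose(img):
        return [list(col) for col in zip(*img)]

    return transpose(blur_rows(transpose(blur_rows(image))))


def difference_of_gaussians(image, rho, k=math.sqrt(2)):
    """Response at scale k*rho minus response at scale rho, per pixel."""
    wide = blur(image, k * rho)
    narrow = blur(image, rho)
    return [[w - n for w, n in zip(wr, nr)] for wr, nr in zip(wide, narrow)]
```

On a flat image the difference vanishes, while an isolated bright pixel produces a strong response at its location; extrema of this response across position and scale are the candidate interest points.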
2) SURF: Speeded Up Robust Features (SURF) [42] is a faster algorithm that can replace SIFT. It is fast and robust for similarity comparison and is invariant to similarity transformations of images. SURF plays a vital role in real-time tracking and recognition of objects. Its main merits are the approximation by box filters and the calculation of integral images. Additionally, it uses the determinant of the Hessian matrix for both location and scale. The Hessian matrix locates image key points accurately and with good performance. At a point x and scale ρ, the Hessian matrix of the Gaussian-filtered image is

$$H(x, \rho) = \begin{bmatrix} S_{xx}(x, \rho) & S_{xy}(x, \rho) \\ S_{xy}(x, \rho) & S_{yy}(x, \rho) \end{bmatrix} \qquad (2)$$

where $S_{xx}(x, \rho)$ is the response of the Gaussian derivative kernel at point x in the image, and similarly for $S_{xy}(x, \rho)$ and $S_{yy}(x, \rho)$. The descriptor describes the distribution of Haar-wavelet responses in the horizontal and vertical directions within the neighborhood of the interest point and has 64 dimensions. We used integral images to speed up the system. Additionally, using only 64 descriptor dimensions reduces the time for feature computation and matching and increases robustness. To be invariant to rotation, a reproducible orientation is identified for each interest point. For this purpose, the Haar-wavelet responses are computed in the vertical and horizontal directions in a circular neighborhood of radius 6s around the interest point, where s is the scale at which the interest point was detected. Our proposed approach again uses integral images for fast filtering: only six operations are needed to compute the response in the horizontal or vertical direction at any scale.
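The role of integral images in fast box filtering can be shown with a small sketch. It is an illustration under assumptions: the function names and the window convention (top-left corner plus size) are hypothetical, and the Haar response shown is the simplest two-box variant rather than the exact filter of any particular SURF implementation.

```python
def integral_image(image):
    """ii[r][c] holds the sum of image[0..r-1][0..c-1].

    The result has one extra leading row and column of zeros."""
    rows, cols = len(image), len(image[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += image[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of rows top..bottom-1 and columns left..right-1 with four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def haar_x(ii, r, c, size):
    """Horizontal Haar-wavelet response of a size x size window at (r, c):
    sum of the right half minus sum of the left half."""
    half = size // 2
    left = box_sum(ii, r, c, r + size, c + half)
    right = box_sum(ii, r, c + half, r + size, c + size)
    return right - left
```

The cost of `box_sum` does not depend on the window size, which is why the Haar responses can be evaluated at any scale with a fixed, small number of operations.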

3) ORB: Oriented FAST and Rotated BRIEF (ORB) is a high-speed keypoint detector and descriptor [43]; it modifies the BRIEF descriptor considerably to improve performance. ORB detects the keypoints in images by using the FAST algorithm and also uses the Harris corner measure to detect key points. Moreover, it uses multiscale features with a 32-byte BRIEF-based descriptor built from binary tests of the form

$$\tau(S; x, y) = \begin{cases} 1, & S(x) < S(y) \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

where S is the smoothed patch in the image and S(x) is the intensity of S at point x in Eq. 3. In the implementation of the FAST algorithm, the kernel windows are extracted from single line buffers. The center pixel is subtracted from each pixel on the circle, and the result is compared with the minContrast value; whenever the required number of consecutive pixels exceeds the threshold, the center is marked as a corner. For the circle region, the sum-of-absolute-differences (SAD) metric is evaluated; only differences that exceed the minimum contrast threshold contribute to the metric. As a result, the algorithm detects a light center pixel surrounded by dark pixels, or a dark center pixel surrounded by light pixels, as a corner with a high metric. The Harris algorithm uses five image filters and three circular windows and evaluates two gradients. The computation of the eigenvalue of the Harris matrix uses three multipliers and three adders and is pipelined to optimize performance.
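The FAST segment test and the SAD metric described above can be sketched as follows. The 16-pixel circle of radius 3, the contrast threshold of 20, and the required run of 9 consecutive pixels are common FAST settings assumed here for illustration; the paper does not give its parameter values, and the function names are hypothetical.

```python
# Offsets (dx, dy) of the 16 pixels on a circle of radius 3, in order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def longest_circular_run(flags):
    """Length of the longest run of True values, treating the list as circular."""
    if all(flags):
        return len(flags)
    best = run = 0
    for flag in flags + flags:  # doubling the list handles wrap-around
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def fast_corner_score(image, r, c, min_contrast=20, min_run=9):
    """Return the SAD metric if pixel (r, c) passes the segment test, else 0.

    The pixel must lie at least 3 pixels away from the image border."""
    center = image[r][c]
    diffs = [image[r + dy][c + dx] - center for dx, dy in CIRCLE]
    brighter = [d > min_contrast for d in diffs]
    darker = [d < -min_contrast for d in diffs]
    if max(longest_circular_run(brighter), longest_circular_run(darker)) < min_run:
        return 0
    # Only differences above the contrast threshold contribute to the metric.
    return sum(abs(d) for d in diffs if abs(d) > min_contrast)
```

A bright pixel on a dark background yields a high score because all 16 circle pixels are darker than the center, whereas a flat region yields 0.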

C. Machine Learning (ML) Classification
Our proposed models used four Machine Learning models, namely AdaBoost, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF), to classify the local features extracted from the grayscale images.

1) K-Nearest Neighbors (KNN): K-Nearest Neighbors (KNN) is a supervised ML model used for the classification of input data. It takes data points that are already classified into multiple classes and computes the class label for a new input data point. The method classifies an object according to the closest training examples in the feature space. K denotes the number of nearest neighbors, and an unknown data point is assigned to the class that is most common among its K nearest neighbors. KNN uses a minimum-distance search to find the nearest neighbors. Selecting the number of nearest neighbors is essential to obtain an optimized KNN model.
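The voting rule can be written in a few lines. The sketch below is a plain-Python illustration with Euclidean distance and K = 3 as assumed defaults; the toy feature vectors and labels in the usage note are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

For example, with three training points near the origin labeled "benign" and three near (5, 5) labeled "malicious", a query at (0.2, 0.1) is labeled "benign". Every prediction compares the query with all training points, which is one reason the cost of KNN grows with the number of samples and input variables.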
2) Support Vector Machine (SVM): The Support Vector Machine (SVM) is also a supervised ML algorithm. The model learns from past input data and predicts the output for future data. The primary purpose of SVM is classification, but it is also used for regression. The SVM algorithm chooses as support vectors the extreme points of the dataset, i.e., the points of each class closest to the separating hyperplane, and it places the hyperplane as far as possible from these support vectors. The distance margin is defined as the distance between the support vectors of the different classes. It is calculated as the sum of D+ and D-, where D+ is the minimum distance from the hyperplane to the closest positive point and D- is the minimum distance from the hyperplane to the closest negative point. The main aim of SVM is to find the maximum distance margin, which gives the optimal hyperplane, and the optimal hyperplane gives the best classification. For non-linear data, which produce a small or zero distance margin, a linear SVM misclassifies. In that scenario, SVM uses kernel functions, which map the low-dimensional features into a high-dimensional feature space.
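The distance margin D+ + D- can be computed directly for any candidate hyperplane. The following sketch only evaluates the margin of a given hyperplane w·x + b = 0; it does not perform SVM training, and the function names are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))


def distance_margin(w, b, points, labels):
    """Return D+ + D- for labels in {+1, -1}.

    D+ is the distance to the closest positive point and D- the distance
    to the closest negative point, both measured on their own side."""
    d_plus = min(signed_distance(w, b, x)
                 for x, y in zip(points, labels) if y == 1)
    d_minus = min(-signed_distance(w, b, x)
                  for x, y in zip(points, labels) if y == -1)
    return d_plus + d_minus
```

Comparing two hyperplanes on the same data shows why the orientation matters: the one with the larger margin separates the classes with more room, and SVM training searches for the hyperplane that maximizes this quantity.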
3) Random Forest (RF): Random Forest (RF) is one of the most common and powerful supervised ML algorithms. RF runs efficiently on massive datasets and predicts accurate results. The algorithm supports both classification and regression, and it builds on decision trees to enhance accuracy and flexibility. In general, the more trees in the forest, the more stable the output. More trees also reduce the risk of overfitting, i.e., of the statistical model fitting its training data exactly. RF can maintain good accuracy even when a large proportion of the data is missing. A new object is classified according to its attributes, and each decision tree gives a classification output according to its rule set.

4) AdaBoost: AdaBoost was one of the first boosting algorithms and has been applied to many problems. It constructs a strong classifier from multiple weak classifiers. Each weak classifier is a decision tree with a single split, known as a decision stump. AdaBoost puts more weight on instances that are hard to classify and less on those that are already handled well. The algorithm can solve both classification and regression problems. Many APKs are repackaged: the code is stolen by reverse engineering, and the app is reassembled under another name after adware or small scripts of malicious code are added to the repackaged APK. Such an APK changes only slightly, so the dataset contains slight noise. We found that with less noisy data, only a few hyperparameters need to be tuned to improve AdaBoost performance. With a small number of input variables, KNN models provide excellent performance, but their performance degrades as the number of input variables increases. In our dataset, the APKs contain multiple DEX files, and the whole file structure of each APK was converted into grayscale images, which increased the number of input variables and hence the memory size and complexity of the KNN model.
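The reweighting idea behind AdaBoost with decision stumps can be sketched in plain Python. This is a minimal illustration of the standard discrete AdaBoost update for labels +1 and -1, not the implementation used in the experiments; the function names and the exhaustive stump search are assumptions.

```python
import math


def best_stump(xs, ys, weights):
    """Pick the (error, feature, threshold, polarity) stump with the lowest
    weighted error. A stump predicts `polarity` if x[f] >= thr, else -polarity."""
    best = None
    for f in range(len(xs[0])):
        for thr in sorted({x[f] for x in xs}):
            for polarity in (1, -1):
                preds = [polarity if x[f] >= thr else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, f, thr, polarity)
    return best


def adaboost_fit(xs, ys, rounds=5):
    """Train `rounds` decision stumps on samples xs with labels in {+1, -1}."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, f, thr, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this weak classifier
        model.append((alpha, f, thr, polarity))
        # Raise the weight of misclassified samples, lower it for correct ones.
        preds = [polarity if x[f] >= thr else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def adaboost_predict(model, x):
    """Weighted vote of all stumps."""
    score = sum(a * (p if x[f] >= thr else -p) for a, f, thr, p in model)
    return 1 if score >= 0 else -1
```

On a one-dimensional toy set whose negative samples lie between two groups of positive samples, a single stump cannot separate the classes, but three boosted stumps can, which illustrates how weak classifiers combine into a stronger one.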

IV. PROPOSED MODELS
In our work, we propose image-based detection of Android malware using machine learning classification. In this process, the Android APK is converted into a grayscale image, image features are extracted using image processing techniques, and ML classifiers are trained to label apps as malicious or benign, as depicted in Fig. 1. The novelty of this approach is that all files of the APK are transformed into images to deal with obfuscation attacks. Most existing techniques transform only three files of the APK into an image. Their main disadvantage is that they require decompiling the APK and separating files such as the DEX (MD), ARSC (RS), Manifest.xml (MX), and certificate files. A further disadvantage is that they do not take care of multiDex files. If an app has more than 65,536 methods, multiDex files must be created [40]. If malicious code is embedded in the second or third classes.dex file, no existing algorithm detects the Android malware from the multiDex class files. Malicious code is primarily embedded in classes.dex files. Our models used three algorithms (SURF, ORB, and SIFT) to extract local feature (LF) descriptors from the grayscale image dataset. The local features extracted from each image are used, one extractor at a time, to train multiple machine learning algorithms (RF, KNN, SVM, and AdaBoost). Each image is represented by multiple descriptors: the feature extraction algorithms mentioned above output multiple vectors per image, which cannot be used directly as input to an ML algorithm. The model therefore uses the Bag of Visual Words (BOVW) to build a single feature vector from the multiple local feature descriptors [41]. BOVW uses a clustering technique to partition the extracted descriptor vectors into multiple clusters; the clustering algorithm then predicts the cluster of each descriptor.
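The BOVW step that turns a variable number of descriptors into one fixed-length vector can be sketched as follows. The sketch assumes that a codebook of cluster centres has already been obtained by some clustering technique; the tiny two-dimensional codebook in the usage note is hypothetical, and the L1 normalization is an assumption.

```python
import math


def nearest_codeword(descriptor, codebook):
    """Index of the codeword (cluster centre) closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(descriptor, codebook[i]))


def bovw_histogram(descriptors, codebook):
    """Turn a variable number of local descriptors into one fixed-length,
    L1-normalized histogram of visual-word counts."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]
```

With the codebook [[0, 0], [10, 10], [0, 10]] and the descriptors [[1, 0], [0, 1], [9, 9], [1, 9]], the histogram is [0.5, 0.25, 0.25]. The length of the histogram equals the vocabulary size regardless of how many key points an image has, so every image yields an input of the same dimension for the classifier.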

A. Accuracy Assessment
The accuracy metrics for the ML models were determined from the precision, recall, F1-score, and accuracy given in Eq. 4, 5, 6, and 7, respectively, where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative decisions. Precision is the fraction of data entries flagged as malicious that are truly Android malware.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

The recall is the fraction of malicious app samples that are correctly classified as malicious.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (5)$$

The F1-score is the harmonic mean of recall (sensitivity) and precision.

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (6)$$

The accuracy, or overall classification accuracy, is the fraction of all records, negative and positive, that are correctly classified.

$$\text{Accuracy} = \frac{TN + TP}{TN + FN + TP + FP} \qquad (7)$$

The experiment was run on an Intel Core(TM) i7-10700 CPU @ 3.8 GHz with 16 GB RAM. The experiment used the BOVW algorithm with a vocabulary of 120 codewords: the K-means technique clustered all key points collected from the created datasets into a codeword vocabulary of size 120. Moreover, the proposed model was implemented with the OpenCV and Sklearn Python libraries.
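The four metrics of Eq. 4 to 7 follow directly from the confusion-matrix counts, as the following sketch shows. The function name is hypothetical, and the counts in the usage note are illustrative values, not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

For instance, with the illustrative counts TP = 90, TN = 80, FP = 10, and FN = 20, the precision is 0.9 and the accuracy is 0.85.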
The performance of the different ML models was measured in terms of the percentages of true positive, true negative, false positive, and false negative decisions. The local feature extractors SIFT, SURF, and ORB extract key points from the image dataset. The extracted local features are used to train four well-known ML models, i.e., K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest (RF), and AdaBoost. The complete results for the combinations of ML models and local feature extractors are presented in Table I and shown in Fig. 3. The accuracy and loss on the validation set, depicted in Fig. 4, indicate that the results are sound and do not suffer from overfitting. The highest accuracy, precision, recall, and F1-score of the different ML models are displayed in Table II and Fig. 3. If the decrease in loss and the improvement in accuracy of the two sets did not follow the same trend, the process was aborted and the modeling parameters were changed to remove the overfitting problem. Finally, the accuracy of the AdaBoost model reached 96.86%. Table III compares the performance of each algorithm with traditional machine-learning approaches; its entries include [11], an accuracy of 98.75%, and the image-based approach of Jaiteg Singh [12].

VI. CONCLUSION
Our proposed model uses an image-based framework to classify an Android app as malicious or benign. Most image-based detection techniques do not consider the multiDex files of the APK and use only a single classes.DEX file for image conversion. Existing image-based malware detection techniques [11][12][23][36][37] found the maximum detection probability of malware in the classes.DEX file rather than in other files such as the Resources.ARSC, Manifest.xml, and certificate files; if hackers hide malicious code in a second or third classes.DEX file, previous approaches have no chance of detecting the malware. In our experimental work, the contents of all APK files, including every classes.DEX file, are transformed into grayscale images. We used image processing techniques, including the SIFT, SURF, and ORB models, to extract the local features of the images. The local features are classified using machine learning models (KNN, SVM, RF, and AdaBoost) to detect Android malware.
The results show that the proposed approach outperforms the existing techniques in classification accuracy and computational time. The AdaBoost detection rate reached up to 96.86%, as shown in Fig. 5, and the run time did not exceed 0.0195 s on average for each sample. In the future, we will try to use the local and global features of images of multiDEX files to classify Android malware and improve accuracy.