Intelligent Hybrid Approach for Android Malware Detection based on Permissions and API Calls

Android malware is rapidly becoming a potential threat to users. The number of Android malware is growing exponentially; they become significantly sophisticated and cause potential financial and information losses for users. Hence, there is a need for effective and efficient techniques to detect the Android malware applications. This paper proposes an intelligent hybrid approach for Android malware detection using the permissions and API calls in the Android application. The proposed approach consists of two steps. The first step involves finding the most significant permissions and Application Programming Interfaces (API) calls that leads to efficient discrimination between the malware and good ware applications. For this purpose, two features selection algorithms, Information Gain (IG) and Pearson CorrCoef (PC) are employed to rank the individual permissions and API’s calls based on their importance for classification. In the second step, the proposed new hybrid approach for Android malware detection based on the combination of the Adaptive neural fuzzy Inference System (ANFIS) with the Particle Swarm Optimization (PSO), is employed to differentiate between the malware and goodware Android applications (apps). The PSO is intelligently utilized to optimize the ANFIS parameters by tuning its membership functions to generate reliable and more precise fuzzy rules for Android apps classification. Using a dataset consists of 250 goodware and 250 malware apps collected from different recourse, the conducted experiments show that the suggested method for Android malware detection is effective and achieved an accuracy of 89%. Keywords—Android malware detection; features selection; fuzzy inference system; particle swarm optimization


INTRODUCTION
Recently, the use of smartphones in all aspects of our daily lives is increasing continuously.The global shipments of smartphones hit a record 1.4 billion in 2015 [1].This number has grown 12% compared with the last year.The massive popularity of smartphones have been accompanied with a potential increase in the number of malwares.With Android dominating 82.8% of the market in 2015 [2], Android become the main goal for mobile malware.The number of Android malware applications is increasing continuously.The total number of malware attacking the mobile devices increased more than three times in 2015, compared to that of 2014 [3].The dangerous threats targeting mobile devices in 2015 were ransomware.Malware can access all the resources in the attacked mobile device, and data stealers, like business malware.
Google's Play store is a market for Android apps, also there are many other third-party stores for Android apps.The Android apps developers use the Google's Play and third-party stores to publish the apps they developed, and make it available for download and install by users.Detecting the huge number of Android malware and isolating them from application markets is potential and great challenging issue.Very recently in 2016, a significantly sophisticated new form of Android ransomware/Android.Lockdroid.E is detected by Symantec, this variant of ransomware malware employs the accessibility tapjacking method to pose a real threat for more than 67% of Android devices [4].
Several research efforts have been presented for malware detection depending on the Android permissions used in the app.However, using Android permissions only is not enough for accurate detection of malware [5], [6].Moreover, the existence of permissions in the Android application's Manifest.xml is not evident that it has been used by app code [7], [8].On the other hand, some researches [9] consider the API level information only to get the features from big data set, but it requires a large number of features for the discrimination between malware and goodware apps.Moreover, efficient detection of the new and ever-evolving Android malware is a continues challenge.To address these challenges, this paper proposes a new hybrid method for Android malware detection based on the hybridization of the Adaptive neural fuzzy Inference System (ANFIS) with the Particle Swarm Optimization (PSO).This paper has the following contributions: 1) Finding the most significant permissions and API calls that lead to efficient categorization of malware and goodware apps.
2) Designing and implementing a new hybrid approach for Android malware detection based on the combination of (ANFIS) with (PSO).
3) An accurate dataset was collected which consists of 250 goodware and 250 malware apps from different resources including Google's play.
The rest of this paper is structured as: Section 2 presents the related work.Section 3 introduced the employed features and feature selection methods.Section 4 explains the proposed method for Android malware detection.Section 5 presents the experimental results and discussion.Section 6 includes the conclusion and future work.www.ijacsa.thesai.orgII.
RELATED WORK Android malware detection has been a very active research area in the recent years.The two main approaches for Android malware analysis are static and dynamic analysis [10].Static analysis examines the Android application without executing it.Many static analysis approaches for Android malware detection have been proposed [11]- [15].For example, Kirin's approach [12], explores the used permissions in the Android app to determine their maliciousness.Stowaway [13] examines API calls to detect Malware applications and Risk Ranker [14] statically classifies applications based on different security risks.In 2014, DREBIN proposed by Arp et al. [15], it detects the Android malware application by performing application analysis using different features such as permissions, API calls, network address, hardware access, etc. Examples for static analysis tools include Smali [16] and Androguard [17], which facilitate the static analysis of Android apps.However, the static analysis approaches are unable to detect malwares that use dynamic code loading obfuscation techniques.
The dynamic analysis examines the Android application by monitoring its behavior during execution.A number of dynamic analysis approaches have been proposed for Android malware detection [18]- [22].DroidScope and TaintDroid proposed in [21] and [22], they monitor the application during its running in a protected environment (similar to the java sand box concept), by exploring different components of the application.Dynamic analysis approaches require many resources compared with static analysis, which made them unsuitable for the limited resources of the mobile devices.
Machine learning approaches also have been proposed for Android malware detection, motivated by the problem of manually creating and updating detection models for Android malware.Aafer et al. [23] used machine learning approach to classify Android apps as malware or goodware.They compared the malware detection accuracy of four classifiers using the API calls and permissions as features for Android application.Shabtai et al. [24] used six machine learning based classifiers for Android malware detection, namely, k-Means, NB, DT, decision tree, BN, logistic regression and histogram using the Andromaly framework.Andromaly achieved a 99.9% accuracy rate using the information gain algorithm for features selection and decision tree classifier.The main disadvantage of Andromaly is using self-written Android malware applications to test it.In contrast, 250 real-world malware apps were used in our approach.
Dini et al. in [19] combined the system calls in kernel with system calls in user level.They used features set consisting of 12 system calls and the K-nearest neighbors (KNN) as classifier for Android malware detection.Dini et al. achieved an accuracy of 93% using 10 Android malwares.The main disadvantage of this approach is its inability to detect malware that use root permissions to avoid the system call.Therefore, Android API calls and permissions are used in this research rather than using system calls only.Zhao et al. in [25] used the Support Vector machine learning method to develop RobotDroid in order to detect unknown Android malware.RobotDroid depends on the hidden payment services and the privacy information leakage.They considered three malware types, namely, Plankton, DroidDream and Gemini.Consequently, this approach is limited to the considered Android malware types only.Amos et al. proposed STREAM in [26], it employs several machine learning classifiers to detect Android malware based on a set of features such as permissions , memory and battery usage.However, the features set collected from Android emulator, which is not accurate as real Android device [27].In this work, we used features collected from real-world malware and goodware apps.
Our proposed approach is related to these methods and uses comparable features for detecting malicious applications, such API calls and permissions.However, it differs in two main aspects from previous research: First, we find the most significant permissions and API calls that leads to efficient discrimination between the malware and goodware applications.Second, we design and implement a new hybrid classifier for efficient Android malware detection based on combining (ANFIS) [35] with the PSO.

A. Android Permissions
Android uses the permission system to regulate the application's access to the resources and data in the mobile device, such as camera, storage and internet access.Android permissions offer the security characteristics by restricting certain tasks an application can execute [28].The selection of the most suitable permissions corresponding to the tasks performed by the application is the ultimate responsibility of the developer.The shortage of knowledge about the permissions could guide developers to add in essential and excessive permissions to the application which may leads to over privileged Android application [29] that make users stop installing the application.Moreover, adding in essential permissions makes the application similar to malware [30], causing re-delegation attacks [31].Furthermore, the shortage of knowledge about permissions makes the user uncertain about the decision of installing the application.Currently available information about permissions is not quite enough to support the users decision regarding the installation of application [32].Since the used permissions and API calls in the application reflect the behavior of the app, it is believed that it is potentially feasible to detect the malware application by exploring the patterns of used permissions and API calls, and help both developers and users in getting better understanding of permissions and API calls.

B. Android API Calls
APIs are functions used by the developers to interact with Android operating system.Since there are many Android APIs, it is more appropriate to consider specific APIs frequently used by malware and enable them to access Smartphone's sensitive resources, rather than considering all APIs used in the app's source code.Seo et al. [33] examined malware apps and specified suspicious APIs frequently used by malware.They compared the number of suspicious APIs in malware applications with the number of suspicious APIs in goodware apps.In this paper, the suspicious APIs from the malware and goodware apps were extracted and used as part of the features set.The selected suspicious APIs are similar to the subspecies APIs defined in [33], [15], examples of the selected suspicious APIs include, API calls for accessing important information such as getSubscriberId()and getDeviceId(), and API calls for getting accessibility services which can activate several types of malicious payloads.Finding the rank of the importance of API calls and permissions in terms of categorization of malware and goodware application, and evaluating the efficiency of our proposed hybrid approach PSO-ANFIS for malware detection, are main challenges of this research work.

C. Artificial Neuro Fuzzy Inference System (ANFIS)
Artificial Neuro Fuzzy Inference System (ANFIS) is proposed by Jang et al. [34] and it became one of the wellknown researches related to fuzzy inference systems in the recent years.ANFIS associates the advanced reasoning and explanation of fuzzy logic with the powerful computation and learning capabilities of neural networks [34].There are many architectures for neuro-fuzzy networks, however, the most dominant architectural models are the Mamdani and the Takagi Sugeno (TSK).TSK model uses a single-spike as the output membership function while Mamdani model uses a fuzzy set.The ANFIS used in this paper is based on TSK model.Fig. 1 shows example for ANFIS architecture that consists of five layers with two inputs X1 and X2 and one output Y.
The rule base consists of Takagi-Sugeno fuzzy rules as follows: Where, x1 and x2 are the input for the ANFIS, conditions specified in the IF part are called antecedent, and conditions in the THEN part are called the consequence.ANFIS consists of five layers [34].
Layer One: Crisp input data entered into the ANFIS are transformed into linguistic expressions.Membership function (μ) is used to transform the inputs into linguistic expressions, there are several types of membership functions.In this paper, Gaussian membership function is used because it leads to more smooth model behavior.The Gaussian membership function is defined as follow: Where, Oi's are the outputs of this layer.σ and C represent the variance and the center of the Gaussian membership function, respectively.This layer includes parameters called antecedent parameters, which are tuned by ANFIS for obtaining results that are more accurate.
Layer 2: This layer determines the level of accuracy of the statements specified in antecedent parts, which is also called the firing strength, using the following equation.

∑ (2)
Layer 3: This layer normalizes all the firing strengths computed in the former layer (Wi), where it computes the proportion of the ith rule's firing strength to the all rules firing strengths using the following equation: Layer 4: This layer contains the consequent portion of the fuzzy rules, the influence of each rule in the final output is defined as: Where, mi, ni, and ri are the consequent parameters.The consequent parameters and the antecedent parameters in the first layer, are tuned by ANFIS in the learning process to decrease the differences between the output and the target result.
Layer 5: This is the defuzzification layer where all the rules generated for one output are collected and defuzzified into numerical outputs, according to a weighted average sum as follow:

D. The Particle Swarm Optimization (PSO)
The particle swarm optimization (PSO) is a well-known method for searching, which is grounded on the behavior of bird flocking.In PSO, each possible answer for the optimization issue can be considered as a point in the population, the point is called a particle.The particle propagates in the population with a specified speed which is tuned based on its movement knowledge and its companions' movement knowledge.Each particle is assigned a fitness value specified based on the objective function and records its current position and current best position (recorded as pbest).The pbest can be considered as the particle's own moving experience.Moreover, each particle records the global best position (recorded as gbest, which is the best value in pbest) of the total collection.The gbest can be considered as its companions' movement knowledge for the particle [35], [36].Let xi (t) represent the location of a particle i in the population at time t.Then, the particle's location is adjusted by the addition of the velocity, vi to the current location: Where, c1 and c2 represent the acceleration coefficients, r1 and r2 are random vectors.www.ijacsa.thesai.org

IV. THE PROPOSED APPROACH FOR ANDROID MALWARE DETECTION
This section explains our methodology for the detection of Android malware.First, the used dataset is introduced.Second, we used the Information Gain (IG) and Pearson CorrCoef (PC) to find the most significant permissions and API calls that leads to efficient discrimination between the malware and goodware applications.Third, the proposed hybrid PSO-ANFIS is utilized to classify the Android apps as either goodware or malware.

A. The Used Dataset
To build an accurate and reliable model capable of efficiently classifying the Android applications as malware or goodware, an accurate and well-labeled dataset is required.We used a recent data set contains 250 benign applications collected from Google's play, the free online malware detection tool VirusTotal [36] is used to confirm that the collected apps are benign.The data set also contains 250 malware applications which have been downloaded from Genome project [37] and Drebin dataset [15], all the malware apps are confirmed as malware using the VirusTotal.In this paper, we considered the permissions and API calls used in the app as features to classify the applications as either goodware or malware.

B. Features selection
In this section, the most significant permissions and API calls that leads to efficient discrimination between the malware and goodware applications are selected.For this purpose, two features ranking algorithms, namely, Information Gain (IG) and Pearson CorrCoef (PC) are employed to rank the individual permissions and API calls based on their importance for classification.Based on the static analysis method, we used free analysis tools like android-apktool, dex2jar and jd-gui to extract the API calls and permissions from the goodware and malware applications.Java language is used to develop Android applications and they are generated as Android Application Package (APK) files.APK package consists of files required for running the application.The files in the APK include the following: DalVik Executable (DEX) file: This is a file generated by the compilation of the Java source code.
Manifest file: A file holding the Android application characteristics such as permissions, the activities, and intents.
eXtensible Markup Language (XML) file: A file defining the components of the user interface layout and the used values.
Resource file: A file consists of resources needed by applications like sound files and images.

 Information Gain Ratio (IGR)
Information Gain Ratio method extracts the correlation between Android API calls and permissions, it gives the maximum scoret to the most potential permissions based on the class of malware and goodware Android apps belonging to IGR [38], the IGR equations are explained as follows: n_r(X, C) Where, gain_r (X,C) denotes the gain ratio of the feature X occurrence in the class C. Ci and |Ci| represent the occurrence of feature X in class C, the ith sub-class of C and the total number of features in Ci, respectively.

 The Pearson CorrCoef
Pearson CorrCoef computes the relation between feature X and class C by Where, R(P;A) = 0 means the independency of feature P and class A, R(P;A) = 1 means the top positive correlation of feature P and class A and R(P;A) = -1 means the top negative correlation.In our paper, R(P;A) = 1 means that the application which request the feature P is highly suspected as malware app, while R(P;A) = -1 means requesting the feature P leads to classify the applications as goodware.

C. PSO-based ANFIS
The proposed approach PSO-ANFIS intelligently combines ANFIS method and PSO algorithm for the optimization of Android malware detection, by tuning the parameters of membership function to achieve the highest malware detection accuracy.ANFIS utilizes the IF-Then fuzzy rules in mapping the inputs to outputs, the level of ANFIS accuracy is significantly affected by tuning its network structure and parameters [40].There is a strong relation between the accuracy and the used membership functions.This paper aims to tune the Gaussian membership function to improve the accuracy of ANFIS by getting the best learning parameters and minimum number of fuzzy rules.The best learning parameters which minimize the difference between the target data and ANFIS output, can be obtained by minimizing the Root Mean Square Error (RMSE) as follows: Where, α represents the number of training data items, Rm is the input value and ym is the predicted value.The Performance criterion RMSE depends on the center and the width of the Gaussian membership function (Ci, σi).The RMSE is considered as an objective function and its minimization is the research problem in this study.PSO is a potential optimization technique to enhance the performance of ANFIS [41].In this paper, PSO algorithm is utilized intelligently to realize the minimization of the objective function of ANFIS which is the error between the predicted class of Android malware and the actual class, by tuning the membership functions of ANFIS.Fig. 2 shows the structure of the proposed hybrid PSO-ANFIS approach, each learning process of ANFIS represents one particle and the parameters of the membership functions that effect the ANFIS performance represent the particle dimensions.www.ijacsa.thesai.orgThe steps of the hybrid PSO-ANFIS approach are explained as follows: Step 1: Set initial value for the number of iterations C =1, randomly initialize the position Xiand velocity Viof the i-th particle in the search space using ( 6) and ( 7), respectively.
Step 2: Compute the objective function of ANFIS using (8) and use it as a fitness function ( ) for the proposed PSO-ANFIS.
Step 3: IF the fitness function of the i-th particle in position X ( ) is better than the fitness function of the same particle in the local best position f (Pi) THEN .ELSE IF the fitness function of the i-th particle in position X ( ) is better than the fitness function of the same particle in the global best position ( ) THEN Gbest = Xi Step 4: IF the fitness function of global best position ( ) is best than stopping criteria OR the particle counter C is greater than the total number of iterations m THEN Stop ELSE Go to Step 2

V. EXPERIMENTS AND RESULTS DISCUSSION
To conduct the experiments the dataset is separated into two sets testing set and training set.The five-fold crossvalidation technique is selected to split the dataset.In order to evaluate the performance of the proposed approach Android malware detection, we selected the 24 permissions and API calls features according to their significance for the discrimination, using the Pearson CorrCoef (PC) and Information Gain (IG) algorithms.We applied our proposed approach PSO-ANFIS on the selected features.The performance of our proposed approach PSO-ANFIS is compared with the well-known system ANFIS system.ANFIS is selected for comparing our proposed PSO-ANFIS with, because it proofs efficiency in many classification tasks and it is similar to PSO-ANFIS in terms of fuzzy rules type.The classification accuracy of the two classifiers is compared in terms of the Root Mean Square Error (RMSE).Root Mean Square Error measures the dissimilarity between values obtained by the classifier and the observed values: The total accuracy obtained by the cross-validation is calculated based on the average of the RMSE in the five-folds method as follows: Where, CVA is the accuracy based on the crossvalidation, m is the number of the used folds, and Ri is the calculated RMSE of each fold.The evaluation of the fuzzy inference classifier requires the separation of its output into two main classes.To evaluate our proposed POS-ANFIS, we separated its continuous output by determining a threshold in the range between one and zero, to categorize the malware and goodware apps.In this experiment, four threshold values have been determined: 0.10, 0.25, 0.   1 above shows the most frequently used features in malware and goodware applications and their rank of importance.It is observed that using the Information gain and Pearson CorrCoef Algorithm yield the same subset of 15 features, signifying that the results of ranking are consistent.The most requested permissions are associated with accessing phone state, connecting to the Internet, checking the connectivity of the network, monitoring the device booting, writing to external storage and messages related permissions.Moreover, most of the applications requested the SYSTEM_ALERT_WINDOW permission and accessibility services APIs.They enable an application to create windows that can be shown on top of other applications and even execute tasks on them, these abilities made them an extremely attractive target for malware developers to perform tapjacking attacks.A significantly sophisticated new form of Android ransomware/Android.Lockdroid.E is detected [4], this variant of ransomware malware employs the accessibility tapjacking method to pose a real threat for more than 67% of Android devices.
Table 2 above shows the testing and training RMSE for both ANFIS and PSO-ANFIS using the features selected by the proposed dual-stage method based on the five-fold crossvalidation.A low RMSE depicts greater malware detection accuracy, so the classifier that generates lower RMSE is considered as a better classifier for the differentiation between the malware and goodware applications.It can be observed from Table 2 that the proposed PSO-ANFIS achieves lowest RMSE for both testing and training in all five-folds.Classification approach Accuracy PSO-ANFIS 89% K-ANFIS [39] 75% Droid Permission Miner [42] 87% Puma [5] 86.41%.3 above shows the performance of PSO-ANFIS and ANFIS for each threshold in terms of accuracy and error rates.We can see that the proposed classifier PSO-ANFIS achieved better results than ANFIS for all thresholds.These results confirm the suitability of PSO algorithm in minimizing the error between the predicted class of Android application and the actual class, by tuning the membership functions of ANFIS.Fig. 3 shows a comparison between the values of CVA for PSO-ANFIS and ANFIS.It is clear from Fig. 3 that the results from PSO-ANFIS are significantly better.For example, the CVA of RMSE value of PSO-ANFIS testing is 0.2792 which is smaller than the CVA value of ANFIS testing which is 0.3271.Also, the CVA of RMSE value of PSO-ANFIS training is 0.2583 which is smaller than the CVA value of ANFIS training which is 0.3181.All these results confirm that PSO-ANFIS system outperforms ANFIS.The performance of our proposed approach has been compared with other comparable approaches, including our previous research [39] and other approaches [5], [42] as shown in Table 4.

VI. CONCLUSION
Accurate detection of Android malware has been an important issue in recent years.In this study, we found that the most significant permissions and API calls lead to efficient discrimination between the malware and goodware applications.For this purpose, two features ranking algorithms, Information Gain (IG) and Pearson CorrCoef (PC) are employed to rank the individual permissions and API calls based on their importance for classification.In addition, we proposed a new hybrid method for Android malware detection based on the combination of the Adaptive neural fuzzy Inference System (ANFIS) with the Particle Swarm Optimization (PSO).Using dataset consists of 250 malware and 250 goodware collected from different recourse, the proposed approach achieved classification accuracy of 89% which is better than the classification accuracy of ANIS and other approaches.For future work ensemble of classifiers will be considered to improve the classification accuracy of Android apps.
35 and 0.40.The classification results based on the selected features are compared in terms of Accuracy (ACC) and obtained error rate.The following confusion matrix used to classify the Android application as malware (1) or goodware (0).classification of malware as malware FP: incorrect classification of malware as goodware FN: incorrect classification of goodware as malware TN: correct classification of goodware as goodware.www.ijacsa.thesai.org

TABLE . I
. MOST FREQUENTLY USED FEATURES IN MALWARE AND GOODWARE APPLICATIONS AND THEIR RANK OF IMPORTANCE

TABLE .
III. CLASSIFICATION ACCURACY OF PSO-ANFIS AND ANFIS

TABLE .
IV. THE PERFORMANCE OF THE PROPOSED APPROACH COMPARED WITH OTHER APPROACHES