Effective Malware Detection using Shapely Boosting Algorithm

Malware is a primary tool for exploiting software vulnerabilities and is therefore a threat to security. The large volume of malware being generated calls for effective detection methods, and machine learning methods are effective in detecting malware. The effectiveness of a machine learning model can be increased by analyzing how the features that build the model contribute to the detection of malware. The model can be made more robust by gaining insight into how the features contribute to the prediction for each sample fed to the trained model. In this paper, a boosting machine learning model based on LightGBM is enhanced with Shapley values to determine the contribution of the top nine features to correct classifications (true positives and true negatives) and to misclassifications (false positives and false negatives). This insight into the model can be used for effective and robust malware detection and to avoid false positives and false negatives. Comparing the top features and their Shapley value contributions for each category of sample gives insight into the reasons for misclassification and supports inductive learning, which can be transformed into rules. The prediction of the trained model can be re-evaluated with these rules to ensure effective and robust prediction and to avoid misclassification. Under 10-fold cross-validation, the model gives 98.48 at maximum and 97.45 at minimum.
Keywords: Artificial intelligence; machine learning; malware detection; Shapley value; decision plot; waterfall plot


I. INTRODUCTION
Malware is currently generated in large numbers. Open Threat Exchange [1] is a platform for the exchange of information related to computer security. The high volume of malware has causes on both the generation side and the use side. On the generation side, malware authors use tools such as polymorphic and metamorphic engines. Metamorphic engines generate malware with minor modifications of code, using techniques such as register reassignment, NOP instruction insertion, code transposition, substitution of machine-level opcodes/instructions, dead code insertion, and combinations of these techniques. Polymorphic engines generate malware using encryption, prepended data, appended data, and combinations of these techniques. The generated malware exhibits the same behavior as the old malware but can evade detection by signature-based antivirus software. Because the detection engines of many antivirus products are signature based, their signature databases need constant updates for upcoming malware.
On the use side, the number of software products has increased over time. The top ten software products with vulnerabilities are listed in Table I [2]. Software products with vulnerabilities from the top ten vendors are listed in Table II [3]. These vulnerabilities are exploited in attacks using existing or new malware. The software products include, but are not limited to, operating systems (OS), drivers for hardware devices, and software applications. The more widely used and popular a software product is, the more attacks it may attract. Hence, hackers need more malware to attack the vulnerabilities. Vulnerabilities in hardware, OS, applications, firewalls, anti-virus products, etc. may be introduced by accident. The author of [4] identifies three phases in the life cycle of a vulnerability. In the first phase, a product is released in the market. The second phase starts when a vulnerability is found in the software product. In the third phase, the vulnerability is fixed by the developer and the fix is released to the users of the software. Vulnerabilities can be systematically discovered with suitable tools. Knowing the vulnerabilities is not enough; they have to be proven by exploits and attack software (malware).
Machine learning and deep learning methods are widely used for malware detection and classification in current research. The objective of this paper is to further improve the effectiveness of a machine learning (ML) model based on a boosting algorithm, LightGBM, by overcoming the wrong predictions (misclassifications) that the ML model may make. Good ML models are made general through feature engineering and learning from a large dataset so that they can detect unknown malware. Many ML algorithms and many feature engineering techniques exist to increase the accuracy of models. Misclassification in a machine learning model is wrong identification. When the ML model fails to identify malware, the result is termed a false negative. A false negative can be very dangerous for any organization: because the malware is not detected, it can meet the objective of the attacker despite all the security solutions applied. The ML model may also declare benign software to be malware. Such occurrences are termed false positives. A false positive causes issues such as panic among users of the software, inconvenience, and non-use of the software until a trusted source declares it benign. All machine learning models misclassify some samples, without exception.
Many machine learning models for malware detection exist, and they use feature importance as part of the algorithm to identify the top features. Other methods derive feature importance through feature engineering, such as Principal Component Analysis (PCA), Redundant Feature Removal (RFR), and Haar Wavelet Transform (HWT) [6], and the Leave One Feature Out (LOFO) importance method [7].
In this paper, a novel method is proposed to identify the change in the top features that contribute to the misdetection of a future input sample, which may be malware or benign software, by a trained ML model. In addition, the method identifies the amount that the top features contribute to the misclassification of such a sample. Shapley values and visualization techniques are used to achieve these objectives. Shapley values come from classic game theory and are used to find feature importance in an ML model. Lundberg et al. [5]  Shapley values associated with them. These top features, along with their Shapley value contributions, are visualized using decision plots, waterfall plots, and force plots. Further, this work identifies the false positives and false negatives in the test part of the dataset and associates the visualizations with the change in the top features and the amount of their contribution. Having identified the top features and the amount of their contribution to misclassification, this work proposes the use of inductive learning techniques to overcome the misclassification of future samples. The present work aims to improve the effectiveness of an ML model based on LightGBM. It can be used for zero-day malware detection as well.
The gaps that this work addresses are highlighted as follows.
• Feature importance obtained from algorithms and feature engineering methods cannot identify the top features for an individual new sample that a trained machine learning model is asked to predict.
• These methods cannot determine the amount of contribution of a feature for a sample used for prediction by a trained machine learning model. Hence, they cannot associate visualizations with the amount of contribution of a feature for a new sample to be predicted by the machine learning model.
• There always remains a doubt whether a new sample under test falls within the high accuracy published for the model or among its misclassifications (false negatives or false positives).
• The inductive method proposed in this work raises the probability of correct prediction.
• A novel approach such as the one proposed in this work was not found in the literature survey. Hence, this paper opens new dimensions for increasing the probability of effective detection of a new sample by a trained model.
• Use of LightGBM, a boosting algorithm, for effective prediction of a future sample that may be malware or benign software. The proposed inductive method will avoid misclassification and improve the effectiveness of the ML model.
This paper is organized as follows. Section II presents a literature survey, followed by the methodology of malware detection and the use of Shapley values for visualization in Section III. The dataset, experimental setup, and results are outlined in Section IV. Section V concludes the paper, and Section VI contains the Appendix.

II. LITERATURE SURVEY
Malware is like any software product. It has to be distinguished from a benign software product. The methods available to distinguish and detect the malware are broadly categorized into static analysis, dynamic analysis, and hybrid analysis.

A. Static Analysis
In static analysis, the malware is not executed. Features for machine learning are extracted from the sample under consideration without running it. This has the advantage that the sample cannot infect the system used for feature extraction. All software, including malware, shared libraries, and dynamic link libraries (DLLs), has a header. For Windows, the header of an executable is termed the Portable Executable (PE) header. Features can be extracted from the PE header of Windows executables as explained in [6] [8]. In addition, features can be extracted from properties of the executable file as an object; these are termed file-related features. File-related features include, but are not limited to, a histogram of the bytes in the executable, the entropy of the complete file, the entropy of various parts of the file, strings embedded in the executable, N-grams [9] from byte code, N-grams from assembly code, N-grams from API calls, images of the hex bytecode of a file [10] [11], and images of the hex bytecode of different parts of a file. Many machine learning and deep learning models use different combinations of features derived from static analysis [12]. However, malware authors use methods such as obfuscation [13] and various types of encryption to evade feature extraction. Obfuscation and encryption methods are numerous and may be categorized as standard or non-standard (private). These shortcomings of static analysis may be overcome by dynamic analysis [14].
The authors in [15] convert the sample file to images and extract features using a trained CNN model. The extracted features are plotted using t-Distributed Stochastic Neighbor Embedding (t-SNE) to identify clusters of malware. Subsequently, they build N-grams with n from 1 to 5 from the API call sequences for six types of malware actions, such as creating or modifying files, hooking on to system services, and getting information for loading DLLs. The N-grams are used with eight distance measures to build a similarity matrix, which is used with four types of kernel functions in a Support Vector Machine (SVM). The distance measures used are Cosine, Bray-Curtis, Canberra, Manhattan, Chebyshev, Euclidean, Hamming, and Correlation. This technique can handle malware packed with a known method, because such samples can be unpacked before feature extraction, but it is deficient in handling malware packed with unknown methods.
Yousefi-Azar et al. [15] extract static features of a sample (malware or benign software) using term frequency, a technique from natural language processing. The extracted features are used with a deep learning model and an Extreme Learning Machine (ELM) for malware detection. Backpropagation results in a large feature space, which increases computational complexity. The authors multiply the term frequency by a random projection matrix to reduce the computational complexity. The balanced Android datasets Drebin and Dexshare and Windows executables from 2016 are used as the dataset. Windows executables from 2017 are tested as zero-day malware, achieving an accuracy of 95.5%.
The authors of [16] collect malware samples used in attacks on financial institutions in Brazil that affected users for over six years. They use static analysis to extract features from the PE headers of the collected samples and use Multilayer Perceptron, K-nearest neighbor (KNN), Random Forest (RF), and Support Vector Machine (SVM) classifiers to detect malware. Further, they identify the malware families using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method. Concept drift of the ML model is detected using the Drift Detection Method (DDM) and the Early Drift Detection Method (EDDM), which detect drift in the malware samples over time. The authors visualize the new malware families appearing over time and relate them to the warning and confirmation signals raised by the drift methods. They conclude that a warning signal implies a degradation of the ML model and a confirmation signal implies that the ML model needs to be updated.

B. Dynamic Analysis
In dynamic analysis, the malware is executed in a protected environment, and its behaviors and actions are observed. In a normal environment, the sample would infect the system and affect its future normal use. Hence, a protected environment is used to avoid infecting the system that conducts the malware test. The actions and behaviors of malware include, but are not limited to, adding, deleting, and modifying file names, registry entries, processes, network communication, and system configuration. Features are derived from these changes and used in machine learning models with various algorithms. Dynamic analysis is very expensive in terms of the time to execute malware, the computing resources, and the trained manpower required. Besides, malware authors employ techniques to avoid detection. One technique is to detect the virtual environment used for running the malware; if a virtual environment is detected, the malware switches off its malicious behavior and acts as benign software. Another technique is to connect to a command and control center owned by the malware authors and download the malware at a later time to take control of the target machine. If the network is not available in the virtual environment, the sample acts as benign software. Hence, trained persons are required to recognize this behavior. Hybrid analysis is used to overcome these shortcomings of dynamic analysis.
Robert et al. [17] use a large dataset of malware with the Malheur tool to learn the behavior of samples. The Malheur tool executes the samples and generates a report. Information such as the DLLs imported and the APIs used as callbacks is extracted from the report to understand the actions and behavior of malware with the help of domain experts. The domain experts make rules, and the rules are externalized to the malware detection module. The authors believe that malware will exhibit its behavior as per the framed rules and can thereby be detected. However, new types of malware may not exhibit behavior as per the framed rules, because that malware was not part of the dataset used. Hence, such unknown malware will not be detected.
103 | P a g e www.ijacsa.thesai.org
Binayak et al. [18] create a knowledge database of in-memory processes based on their use of Dynamic Link Library (DLL) sequences, using TF-IDF (Term Frequency-Inverse Document Frequency) and a learning approach based on multinomial logistic regression. A suspected process from malware uses different DLLs than the system DLLs. The DLL sequences used by in-memory processes are compared with this knowledge database to identify suspected or unwanted processes and malware.

C. Hybrid Analysis
Hybrid analysis combines static and dynamic analysis to overcome their shortcomings. Lifan Xu et al. [19] extract both static and dynamic features from an Android malware dataset and represent the features as vectors. Advanced features are derived with a Deep Neural Network (DNN) from both the original static and dynamic feature vector sets. The advanced and original features are concatenated into new vectors as input to the DNN, which is used with multiple different kernels to detect malware. The combined hybrid analysis still has the shortcomings of dynamic analysis or static analysis.
Sethi et al. [20] use features from both static and dynamic analysis of PHP, PDF, and EXE files. For dynamic analysis, the authors use the Cuckoo sandbox, a virtual environment for running executables that produces an analysis report of the actions and behavior of the executed file. The J48, SMO, and Random Forest machine learning algorithms are applied in the WEKA tool to the combined features extracted using static and dynamic analysis. They achieved 100% accuracy with J48.
The literature survey presents different methods of improving the accuracy and other performance parameters of machine learning models for malware detection through feature engineering. However, these methods do not give insight into the top features and the contribution of each feature when a trained machine learning model predicts a new sample. Hence, there is a research gap concerning insight into the top features and their contribution to the prediction of an unseen sample by a machine learning model. This work is an effort to fill the gap.

A. Shapley Value and Feature Importance
A machine learning model should be both interpretable and accurate. Interpretation of an ML model based on decision trees may rely on the decision path, on heuristic values assigned to features, or on model-agnostic methods. In this work, Shapley values are used to make the ML model interpretable. A local explanation assigns a numeric measure, or credit, to each input feature of a machine learning model based on a decision tree. These local explanations are combined to represent the global structure of an ML model based on a decision tree or an ensemble of decision trees. The ensemble may be based on a bagging algorithm such as Random Forest or a boosting algorithm such as LightGBM. The global explanation of the ML model retains the local faithfulness of the local explanations. Shapley values from game theory simultaneously satisfy local accuracy, consistency, and missingness, the three properties required of a credit score assigned to a feature in an ML model. The credit scores, or Shapley values, are computed by introducing one feature at a time into the output function of the model under a condition as in Eq. (1). Lundberg et al. [5] follow the causal do-notation formulation. It justifies the use of Shapley additive explanation (SHAP) interaction values as a richer type of local explanation and of the feature perturbation formulation.
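The idea of crediting features one at a time can be illustrated with a brute-force computation of the classical Shapley value, phi_i = sum over subsets S not containing i of |S|!(n-|S|-1)!/n! * (v(S with i) - v(S)). The sketch below is not the Tree SHAP algorithm of [5] and is not taken from this paper: the toy model, the sample `x`, and the choice to represent a missing feature by a fixed `baseline` value are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset of the other features."""
    features = list(range(n_features))
    phi = [0.0] * n_features
    for i in features:
        others = [j for j in features if j != i]
        for size in range(len(others) + 1):
            # Classical Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset S
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model and sample (not from the paper).
x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # assumption: a "missing" feature takes this value


def model(z):
    return 2.0 * z[0] + z[1] * z[2]


def value_fn(subset):
    """Model output when only the features in `subset` take their real values."""
    z = [x[j] if j in subset else baseline[j] for j in range(len(x))]
    return model(z)


phi = shapley_values(value_fn, len(x))
print(phi)
```

In this toy case the credits add up to the difference between the model output for `x` and for the baseline, which is the local accuracy property named above. Enumerating all subsets is only feasible for a handful of features; it would not scale to the 2351 features used in this work.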

B. Malware Detection Model
All samples in the dataset are Windows executables. The features are derived from the PE header of the samples and from properties of the file. Each Windows executable contains a PE header, which is explained in [6] [8] [21]. The PE header can be extracted with a Python program using the "Library to Instrument Executable Formats" (LIEF) library. The extracted features are listed in detail in Appendix A, and some of them are described here. The PE header consists of the DOS header, file header, NT header, section headers, optional header, and several directories such as the Import directory, Resource directory, Export directory, and Exception directory. The Import directory lists the Dynamic Link Libraries (DLLs) loaded by the executable and the Application Program Interfaces (APIs) used by the executable. The Resource directory lists information required by the executable such as icons, bitmaps, strings, menus, dialogs, configuration files, and version information. The Exception directory lists exception handling information. The file header of the PE header gives features such as timestamp, vsize, has_debug, has_relocations, has_signature, has_tls, has_symbol, imports, and Machine1-Machine10, listed in Appendix A. The Machine field in the file header, which represents the type of processor required, is hashed and put into one of ten bins named Machine1-Machine10. Other features that are hashed and put into several bins are named in the same way. The section headers and optional header give the section name, section size, section characteristics, and the start and end byte contents of each section. The section name is a string; it is hashed and put into 1 of 50 bins, which gives the features entry_name1-entry_name50 listed in Appendix A. The section size, section virtual size, and section characteristics values are likewise hashed and put into 1 of 50 bins, which gives the features Sec_size_1-sec_size_50, sec_vsize1-sec_vsize50, and sec_char1-sec_char50 listed in Appendix A. The entropy of the content of each section in the sample is hashed and put into 1 of 50 bins, which gives the features sec_entropy_1-sec_entropy_50 listed in Appendix A.
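The "hash and put into one of N bins" step used for these features can be sketched as follows. This is a minimal illustration of the hashing trick, not the feature extractor used in this work: the hash function (SHA-256), the plain counting of hits per bin, and the helper names `hash_to_bin` and `hashed_counts` are assumptions, and the example section names and import strings are invented.

```python
import hashlib


def hash_to_bin(token: str, n_bins: int) -> int:
    """Map a string token to a stable bin index in [0, n_bins)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bins


def hashed_counts(tokens, n_bins: int):
    """Fixed-length vector counting how many tokens fall into each bin."""
    vector = [0] * n_bins
    for token in tokens:
        vector[hash_to_bin(token, n_bins)] += 1
    return vector


# Hypothetical section names of one executable, hashed into 50 bins
# (analogous to entry_name1-entry_name50).
section_names = [".text", ".data", ".rsrc", ".text"]
entry_name = hashed_counts(section_names, 50)

# Hypothetical "DLL:API" strings hashed into 1280 bins (analogous to Imp1-Imp1280).
imports = ["kernel32.dll:CreateFileW", "advapi32.dll:RegSetValueExW"]
imp = hashed_counts(imports, 1280)

print(sum(entry_name), sum(imp))
```

Hashing gives every sample a vector of the same length regardless of how many sections or imports it has, at the cost that two different strings can land in the same bin.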
The name of a DLL in the import directory and the name of an API in that DLL are concatenated to make a string. The string is hashed and put into one of 1280 bins, which gives the features Imp1-Imp1280 listed in Appendix A. Each function name in the export directory is hashed and put into one of 128 bins, which gives the features exp1-exp128 listed in Appendix A. The file-related information used to derive features comprises the histogram of bytes, the strings, and the entropy of the hex values in each sample. A byte in the sample can take a value from 0 to 255. The histogram is the count of each byte value in the sample; the count for each byte value is put into the respective bin H1-H256 to form the features listed in Appendix A. Strings in a sample give important, insightful information about what the malware uses. Strings reveal created and modified filenames and registry-related information. Strings may also reveal IP addresses used by malware authors for communication, command and control center URLs, and signatures of malware authors and groups. All strings of five characters or more are extracted, hashed, and put into one of 104 bins, which gives the features Str1-Str104 listed in Appendix A. Encryption and packing increase the entropy (the disorder of bytes) in samples. Entropy is computed by the method described in [8]. In this method, a block of 2048 bytes is extracted and the counts of bytes are put into 16x16 bins. These operations of making a block of 2048 bytes with a window of 1024 bytes and putting the counts into 16x16 bins are repeated over the entire content of a sample. This gives the features Ben1-Ben256 listed in Appendix A. Together, the PE header and the file-related information give 2351 features. The dataset consists of malware and benign samples from January 2017. The Gradient Boosting Decision Tree (GBDT) algorithm LightGBM is selected for the experiments in this work for the following advantages.
• Feature importance of the ML model can be extracted after training of the model.

• Faster training and prediction
• Ease of computation

A. Dataset
The dataset in the proposed system is derived from [21]. It has malware data from December 2006 to December 2017. The part of the dataset from December 2006 to December 2016 contains only malware and no benign entries and is excluded for that reason. Fig. 2 shows the exclusions, filters, and pre-processing applied to the dataset to obtain the sub-dataset used in this experiment. The part of the dataset from January 2017 is used in the proposed system. The unidentified entries, which have no labels and may be malware or cleanware, are excluded from malware detection and analysis. The resulting dataset consists of 32761 malware and 17186 benign samples that appeared in January 2017. The details of the derived dataset are in Table III. Each entry in the dataset has 2351 features. These features come from the PE headers, the sections of Windows executables, the system APIs used by the executable, the APIs exported from the executable, and file-related properties. File-related properties include the histogram of the complete executable in 256 bins, the byte entropy of the executable file hashed into 256 bins, and the strings in 104 bins. Here, executable means both malware and cleanware. These features are defined in Appendix A and are used in the various diagrams in this paper. The feature names help identify the exact features that contribute to the detection of malware or cleanware and the amount of their contribution.
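The two 256-bin file-related features mentioned above can be sketched in plain Python. The byte histogram follows directly from its description. The byte-entropy histogram is one plausible reading of the 2048-byte block and 16x16 bin description, not necessarily the exact method of [8]: the sketch assumes that blocks of 2048 bytes are taken every 1024 bytes, that each byte is reduced to its high 4 bits (16 columns), and that the block entropy of 0 to 4 bits is quantized into 16 rows. The function names are hypothetical.

```python
import math


def byte_histogram(data: bytes):
    """Count of each byte value 0-255 (analogous to H1-H256)."""
    hist = [0] * 256
    for b in data:
        hist[b] += 1
    return hist


def byte_entropy_histogram(data: bytes, window: int = 2048, step: int = 1024):
    """Flattened 16x16 histogram of (block entropy bin, high nibble of byte)."""
    grid = [[0] * 16 for _ in range(16)]
    if len(data) < window:
        blocks = [data] if data else []
    else:
        blocks = [data[i:i + window]
                  for i in range(0, len(data) - window + 1, step)]
    for block in blocks:
        # Coarse byte values: keep only the high 4 bits (16 possible values).
        counts = [0] * 16
        for b in block:
            counts[b >> 4] += 1
        n = len(block)
        # Shannon entropy of the coarse values, between 0 and 4 bits.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts if c)
        row = min(int(entropy * 4), 15)  # quantize 0..4 bits into 16 rows
        for nibble, c in enumerate(counts):
            grid[row][nibble] += c
    return [value for row in grid for value in row]


# Synthetic content for illustration: every byte value repeated evenly.
sample = bytes(range(256)) * 16
hist = byte_histogram(sample)
ben = byte_entropy_histogram(sample)
print(len(hist), len(ben))
```

Evenly distributed bytes, as in packed or encrypted content, land in the highest entropy row, while a block of identical bytes lands in the lowest one.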

B. Experimental Setup
An Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz (2701 MHz, 2 cores, 4 logical processors) with 8 GB of RAM is used as the computing resource in this work.

C. Malware Detection with LightGBM
The dataset is divided into a training set and a testing set in the ratio of 70% to 30%. The model is trained with the training set and tested with the testing set. The results are in row 1 of Table IV, which gives the Accuracy, Precision, Recall, and F1-score together with the confusion matrix parameters: false negatives (FN), false positives (FP), true positives (TP), and true negatives (TN).
A decision plot can show more than one sample. Fig. 5 shows the decision plot for the first ten samples in the dataset. Values below zero, shown in blue, are benign, and values to the right of two (the purple vertical line in the graph) are malware. The seven samples shown in pink, with the specific features as shown, are malware, and the other three samples are benign. The objective of these figures is to display how the features contribute to the decision of the LightGBM model. The label of each of the 10 samples is verified against the prediction of the LightGBM GBDT algorithm, and it matches the decision plot.
• The contribution of the remaining 2342 features for the sample is +0.2, much less than the top five features.
The top features in Fig. 9 are very different from the top features of malware (true positive) in Fig. 3. In addition, the final Shapley value for the sample is down to -0.09 in Fig. 9 compared to 5.38 in Fig. 3 for malware. The start point is very low, at less than -6 in Fig. 9, compared to the start point at -1 in Fig. 3.
Fig. 12 shows the waterfall plot in Shapley value for the true negative (TN) sample in the test dataset. Fig. 13 gives the force plot in Shapley value for the true negative (TN) sample in the test dataset.
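The arithmetic behind a waterfall plot can be sketched without any plotting library: the plot starts at a base value, adds the contributions of the top features one by one, lumps the remaining features into a single row, and ends at the final value for the sample. The sketch below is an illustration only. The feature names are borrowed from this paper, but the numbers are invented, and the helper `waterfall_rows` is hypothetical. The final conversion assumes that the Shapley values are in log-odds units, which is the usual case for a binary boosted-tree classifier explained on its raw output; that assumption is not confirmed by the source.

```python
import math


def waterfall_rows(base_value, contributions, top_k=9):
    """Rank features by absolute contribution and lump the rest into one row."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    top, rest = ranked[:top_k], ranked[top_k:]
    rows = list(top)
    if rest:
        rows.append((f"{len(rest)} other features", sum(v for _, v in rest)))
    # Local accuracy: base value plus all contributions gives the final value.
    final_value = base_value + sum(v for _, v in rows)
    return rows, final_value


def sigmoid(value):
    """Map a log-odds value to a probability (assumes log-odds units)."""
    return 1.0 / (1.0 + math.exp(-value))


# Invented contributions for one sample; only the names come from the paper.
contribs = {"has_debug": 1.5, "Imp321": -0.75, "H33": 0.5,
            "str43": 0.25, "Rx_sec_num": -0.125}
rows, final_value = waterfall_rows(base_value=-1.0, contributions=contribs, top_k=3)

for name, value in rows:
    print(f"{name:>20s} {value:+.3f}")
print("final value:", final_value, "probability:", round(sigmoid(final_value), 3))
```

With 2351 features and `top_k=9`, the last row would summarize the remaining 2342 features, which is the quantity the paper reports for each category of sample.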
The top features of the waterfall plots in Shapley value from Fig. 3, Fig. 8, Fig. 10, and Fig. 12 for the FP, FN, TP, and TN samples, respectively, in the test dataset are compared in Table V. The top features are listed in the features column. For a sample in each category (false positive, false negative, true positive, and true negative), the table marks the presence of a feature with "Y" and its absence with "N". Further, it marks the topmost feature, the one with the highest value among the top features, with a "T" in each category. The probability value contributed by each feature is given in the respective columns. This table helps to conclude that there is a disjoint set of features for the samples in each category: FP, FN, TP, and TN. The topmost feature for the FN sample is has_debug, which is also present in FP and TP. The topmost feature for FP is Rx_sec_num, which contributes a very low value in the other categories of samples.
The contribution of the remaining 2342 features is significantly lower for FP and FN. For TP the value is +0.42, and for TN the value is -0.03.
These comparisons can identify the misclassified FP and FN samples and improve the efficiency of the ML model by correctly classifying an unknown sample. A few insightful rules that can be formed are as follows:
• A malware sample with a high contribution of Imp321, H33, C_char1, and str43 may be an FP sample.
• A malware sample with the highest contribution by Rx_sec_num among all the features will be an FN sample.
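The two rules above could be applied as a re-check on top of a model prediction, roughly as sketched below. This is a hypothetical illustration, not the authors' implementation: the threshold `high=0.5`, the use of absolute contribution values, the function names, and the choice to apply the second rule only to samples predicted as benign (since a false negative is by definition predicted benign) are all assumptions. The example contributions are invented.

```python
FP_MARKERS = ("Imp321", "H33", "C_char1", "str43")


def top_feature(contributions):
    """Name of the feature with the largest absolute Shapley contribution."""
    return max(contributions, key=lambda name: abs(contributions[name]))


def review_flags(predicted_malware, contributions, high=0.5):
    """Return warnings for a prediction based on the two rules listed above."""
    flags = []
    # Rule 1: predicted malware with high contributions from all FP markers.
    if predicted_malware and all(
        abs(contributions.get(name, 0.0)) >= high for name in FP_MARKERS
    ):
        flags.append("possible false positive")
    # Rule 2 (assumed to apply to benign predictions): Rx_sec_num dominates.
    if not predicted_malware and top_feature(contributions) == "Rx_sec_num":
        flags.append("possible false negative")
    return flags


# Invented per-sample contributions for illustration.
flags_fp = review_flags(True, {"Imp321": 0.9, "H33": 0.7, "C_char1": 0.6,
                               "str43": 0.8, "has_debug": 0.1})
flags_fn = review_flags(False, {"Rx_sec_num": -2.0, "has_debug": 0.3})
flags_ok = review_flags(True, {"has_debug": 1.2})
print(flags_fp, flags_fn, flags_ok)
```

A flagged sample would not be reclassified automatically in this sketch; the flag only marks the prediction for re-evaluation.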

D. k-Fold Cross-Validation
Cross-validation with k=10 is performed to obtain a less biased and less optimistic accuracy value for the LightGBM model. The test dataset is used for this 10-fold cross-validation. The results of the cross-validation are tabulated in Table VI.
They use more than three times as much malware as benign software for training and testing. They do not define ways to determine unknown malware. The authors of [6] use only one tenth as much benign software as malware. This highly unbalanced dataset lowers the probability of false positives. They consider zero-day malware to be malware that does not match a known signature, or unknown malware.
In this work, a boosting machine learning model based on LightGBM is enhanced using Shapley values to build an effective and robust machine learning model. Features derived by static analysis of the malware and benign samples in the dataset are used to build the LightGBM model. The dataset from January 2017 is used for training and prediction. Waterfall plots, decision plots, and force plots based on Shapley values helped identify the top few features. The waterfall plots demonstrated the change in features and their contribution for samples from different categories, giving insight into the ML model. Table V compared the top features that contributed to the misclassified samples. The top features for samples detected as false positives and false negatives by the trained model are analyzed, and inductive learning rules are made. The inductive learning rules can be applied to unknown, unlabeled samples to avoid misclassification into FP and FN and to ensure correct detection. These top features and their contribution may be used to overcome the misclassification of malware. The cross-validation with the test dataset gives 98.48 at maximum and 97.45 at minimum.
The work can be further extended to analyze the change in features and to derive inductive learning rules for the false positive and false negative cases of other ML models to ensure correct prediction. The Shapley values for a feature may be mapped to the probability score of the ML model. This will help to correlate the Shapley value with the probability value for a feature as a local explanation and for a sample as a whole as a global explanation (structure). Large datasets may be used to build a robust ML model and to analyze the reasons for misclassification for various families of malware such as ransomware, rootkits, and Trojan horses.