Empirical Study on Microsoft Malware Classification

Malware is a computer program that causes harm to software. Cybercriminals use malware to gain access to sensitive information exchanged via the software it infects. An important task in protecting a computer system from a malware attack is to identify whether a given piece of software is malware. Tech giants like Microsoft are engaged in developing anti-malware products. Microsoft's anti-malware products are installed on over 160M computers worldwide and examine over 700M computers monthly. This generates a huge number of data points that can be analyzed as potential malware. Microsoft has launched a challenge on the coding competition platform Kaggle.com to predict the probability of a computer system running the Windows operating system being affected by malware, given features of the Windows machine. The dataset provided by Microsoft consists of 10,868 instances with 81 features, classified into nine classes. These features correspond to files of type asm (assembly language code) as well as binary format. In this work, we build a multi-class classification model to determine which class a malware sample belongs to. We use k-Nearest Neighbours, Logistic Regression, Random Forest and XGBoost in a multi-class environment. As some of the features are categorical, we use one-hot encoding to make them suitable for the classifiers. The prediction performance is evaluated using log loss. We analyze the accuracy using only asm features, only binary features, and finally both. XGBoost outperforms the other classifiers, with a log loss of 0.078 when only asm features are considered, 0.048 when only binary features are used, and 0.03 when all features are used.

Keywords: Multi-class classification; malware detection; XGBoost


I. INTRODUCTION
Several kinds of malware can infect a computer system. The number of malware samples exceeded 800M in 2019 [1]. Detecting whether a given file is malware is an interesting research problem. Malware detection is challenging because cybercriminals continuously change the way they attack computer systems, which in turn changes the features of the malware. There is a long-standing confrontation between cyber security experts and malware creators. Machine learning algorithms can be used efficiently to identify whether a given file is malware. These algorithms require features (attributes) of malware. Malware files exist either as byte files or as assembly language files, and features can be extracted from both.
Microsoft is one of the major companies that develop anti-malware products. Microsoft has launched a challenge to detect malware on Kaggle.com [2] and has provided nearly half a terabyte of data consisting of malware files. The dataset given in [2] consists of 10,868 instances with 81 features, classified into nine classes.
Several works are available in the literature on malware classification. Ahmadi et al. and Drew et al. work on textual feature extraction from the challenge dataset [3,4]. The dataset is huge and is difficult to process on a computer with a moderate configuration. Hu et al. address the scalability of the dataset [5]. Scofield et al. utilize an entity resolution strategy that merges syntactically dissimilar features [6]. Deep learning techniques are used in [7] and [8] to classify malware based on textual features. Narayanan et al. use classifiers such as SVM, k-Nearest Neighbours and Artificial Neural Networks in their work [9]. More recent works can be found in [10].
In this work, we apply various multi-class classification algorithms to predict the class of a given malware sample. The paper is organized as follows: Section 2 describes the research problem, dataset details, feature extraction and evaluation measures. Section 3 explains the proposed approach to solve the problem. Section 4 details the experimental setup. Results are given in Section 5 along with some discussion. Conclusions are given at the end.

A. Problem Statement
Microsoft has classified malware into nine classes. Microsoft malware classification is the problem of determining the class of malware to which a given file belongs. This is a multi-class classification problem. The problem can be elaborated as follows: given a file, estimate the probability of the file belonging to each of the nine classes of malware. In multi-class classification problems, the algorithm usually predicts the class with the maximum probability as the target class. This approach is not suitable for malware classification, because the estimated probability for each class is valuable in itself. For example, suppose the probability of a file belonging to class 3 is 0.5 and to class 4 is 0.4. If the file is simply assigned to class 3 on the basis of maximum probability, we lose the information that it may also belong to class 4, which trails by only a slight margin. Therefore, our approach computes the probability of a given malware file belonging to each of the nine classes. The structure of the solution followed in this work is given in Fig. 1.
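The difference between a hard label and a full probability vector can be illustrated with a few lines of Python. This is only a sketch: the values for class 3 (0.5) and class 4 (0.4) come from the example above, while the remaining probabilities are made up so that the vector sums to 1.

```python
# Hypothetical probability estimates for one file over the nine classes.
# Only the values for class 3 and class 4 come from the text; the rest
# are invented for illustration so that the probabilities sum to 1.
probs = {1: 0.02, 2: 0.02, 3: 0.50, 4: 0.40, 5: 0.01,
         6: 0.01, 7: 0.01, 8: 0.02, 9: 0.01}

# Hard decision: keep only the most probable class.
hard_label = max(probs, key=probs.get)

# Soft decision: rank the classes and keep their probabilities.
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
top_class, top_p = ranked[0]
second_class, second_p = ranked[1]
margin = top_p - second_p

print(hard_label)               # 3
print(second_class, second_p)   # 4 0.4
print(round(margin, 2))         # 0.1
```

The hard label alone reports class 3, whereas the ranked probabilities also reveal that class 4 is a close second.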

B. Dataset Description
The dataset available at the Microsoft malware classification challenge webpage [1] has been used in this work. The organizers of this challenge have provided the training and test datasets separately. There are two kinds of files in this dataset: (1) .asm files and (2) .bytes files. The total training dataset consists of 200GB of data, of which 50GB is .bytes files and 150GB is .asm files. There are 10,868 .bytes files and 10,868 .asm files, 21,736 files in total, with nine possible class labels denoting nine types of malware. The number of files in each class is given in Table I. Fig. 2 shows the distribution of instances among the nine classes of malware in the given dataset. It shows that the problem is highly imbalanced, with 27% of instances belonging to class 3 and 0.4% of instances in class 5. Classes 4, 5 and 7 occur very infrequently, whereas classes 1, 2 and 3 occur frequently.
A box plot of asm file size is given in Fig. 3. It indicates that classes 2 and 5 have some similarity. However, the class distribution plot in Fig. 2 shows that class 2 occurs frequently, whereas class 5 is the least frequent class. This signifies that file size is useful in predicting class labels. Sample data points from both kinds of files are given in Table II.

2) Features related to asm files: There are 10,868 asm files with a total size of around 150 GB. An initial inspection of the asm files shows that they contain addresses, segments, opcodes, registers, function calls and API-related words. We have extracted 52 features from all the asm files. These features consist of file_size and bag-of-words counts for 13 prefixes, 26 opcodes, 3 keywords and 9 registers. As the files are huge, we use multi-threading with 5 threads to extract these features.

D. Evaluation Measures
1) Multi-class log-loss [17,18]: Log loss is the common evaluation measure used for multi-class classification problems. Multi-class log loss is defined as follows:

\[ \text{log loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log(p_{ij}) \]

where n is the number of instances, c is the number of classes, y_{ij} = 1 if instance i belongs to class j (and 0 otherwise), and p_{ij} is the predicted probability of instance i belonging to class j.
A perfect classifier yields a log loss of 0. The log loss increases as the predicted probabilities deviate from the true class labels. The aim of a machine learning algorithm is to minimize the log loss.
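A minimal sketch of the multi-class log loss in plain Python is given below. The function name, the 0-based class indices, the clipping constant eps and the toy labels are assumptions made for illustration; they are not taken from the experiments.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    y_true: list of true class indices (0-based), one per instance.
    probs:  list of per-instance probability lists, one entry per class.
    eps:    clipping constant (an assumption) that avoids log(0).
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        # y_ij is 1 only for the true class, so only that term survives.
        p_true = min(max(p[label], eps), 1.0 - eps)
        total += math.log(p_true)
    return -total / len(y_true)

n_classes = 9
labels = [2, 0, 4]  # three made-up instances

# A perfect classifier puts probability 1 on the true class.
perfect = [[1.0 if j == y else 0.0 for j in range(n_classes)] for y in labels]

# An uninformative classifier spreads probability uniformly.
uniform = [[1.0 / n_classes] * n_classes for _ in labels]

print(multiclass_log_loss(labels, perfect))   # very close to 0
print(multiclass_log_loss(labels, uniform))   # ln(9), about 2.197
```

For nine classes, a classifier that always predicts the uniform distribution obtains a log loss of ln 9, which gives a sense of scale for the values reported later.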
2) Confusion matrix: A confusion matrix for an n-class problem is an n × n matrix, where the columns correspond to the predicted class labels and the rows correspond to the actual class labels [19,20,21]. The main diagonal gives the correct predictions, that is, the cases where the actual values and the model predictions are the same. In the malware classification problem, the matrix is of size 9 × 9. Each cell [i,j] represents the number of instances of class i that are predicted to belong to class j. In the ideal case, the confusion matrix C is diagonal: every off-diagonal cell is zero.
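The construction of the matrix can be sketched as follows. The labels are made up and use a 3-class toy problem with 0-based class indices rather than the nine malware classes.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes matrix C where C[i][j] counts
    instances of actual class i that were predicted as class j."""
    C = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        C[actual][predicted] += 1
    return C

# Made-up labels for a 3-class toy problem (0-based class indices).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

C = confusion_matrix(y_true, y_pred, 3)
for row in C:
    print(row)

# The main diagonal holds the correctly classified instances.
correct = sum(C[i][i] for i in range(3))
print(correct)  # 5
```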

3) Precision:
Precision is the fraction of correctly predicted instances out of the total predictions for a given class [20,21]. Precision is the appropriate measure when the cost of wrongly assigning an instance to a class is high.

4) Recall:
Recall is the fraction of correct predictions among the total instances belonging to the class [20,21]. Recall is the appropriate measure when the cost of failing to identify an instance that is a member of the class is high. For example, if a patient who has cancer is not identified, it is a huge loss to the patient.
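Both measures can be read off a confusion matrix: precision for a class divides the diagonal cell by its column sum, and recall divides it by its row sum. The sketch below assumes rows are actual classes and columns are predicted classes, and uses a made-up 3-class matrix.

```python
def precision_recall(C):
    """Per-class precision and recall from a confusion matrix C,
    where C[i][j] counts actual class i predicted as class j."""
    n = len(C)
    precision, recall = [], []
    for k in range(n):
        predicted_k = sum(C[i][k] for i in range(n))  # column sum
        actual_k = sum(C[k])                          # row sum
        precision.append(C[k][k] / predicted_k if predicted_k else 0.0)
        recall.append(C[k][k] / actual_k if actual_k else 0.0)
    return precision, recall

# Made-up confusion matrix for a 3-class toy problem.
C = [[1, 1, 0],
     [0, 2, 0],
     [1, 0, 2]]

precision, recall = precision_recall(C)
print([round(p, 3) for p in precision])  # [0.5, 0.667, 1.0]
print([round(r, 3) for r in recall])     # [0.5, 1.0, 0.667]
```

Returning 0.0 for a class with no predictions or no instances is a convention chosen here, not something prescribed by the measures themselves.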
The proposed approach is explained in the next section.

III. PROPOSED APPROACH
Various machine learning algorithms are used in a multi-class environment in this work. The proposed approach is shown in Fig. 6. The algorithms used in this work are briefly explained below.

A. Random Model
In the random model, we generate the probabilities of each class shown in Table I purely at random and normalise them so that they sum to 1. A random model gives us the worst log loss value that any reasonable algorithm should produce. Any model performing worse than the random model can be immediately rejected.

B. k-Nearest Neighbours [11]
The k-NN algorithm is a lazy learning algorithm: it does not train a model in advance. The algorithm computes the distance from the test instance to the training instances and selects the k nearest ones. The class to which the majority of the k nearest neighbours belong is taken as the class of the test instance. Determining the right k is a challenge in this algorithm. Hyperparameter tuning helps us find the right k.
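The k-NN procedure can be sketched in plain Python as follows. Two choices here are assumptions: the use of Euclidean distance, and the conversion of neighbour votes into class probabilities (the fraction of the k neighbours in each class) so that a log loss can be computed. The data points are made up.

```python
import math
from collections import Counter

def knn_predict_proba(train_X, train_y, x, k, classes):
    """Estimate class probabilities for x as the fraction of its
    k nearest training instances (Euclidean distance) in each class."""
    distances = [(math.dist(row, x), label)
                 for row, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return {c: votes.get(c, 0) / k for c in classes}

# Made-up two-dimensional training data with two classes.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [1, 1, 1, 2, 2, 2]

x = (0.2, 0.2)
proba_k3 = knn_predict_proba(train_X, train_y, x, k=3, classes=[1, 2])
proba_k5 = knn_predict_proba(train_X, train_y, x, k=5, classes=[1, 2])

print(proba_k3)  # {1: 1.0, 2: 0.0}
print(proba_k5)  # {1: 0.6, 2: 0.4}

# Majority vote: the class with the highest estimated probability.
print(max(proba_k5, key=proba_k5.get))  # 1
```

No model is built in advance: all the work happens at prediction time, which is why the algorithm is called lazy.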

C. Logistic Regression [12]
Logistic regression is originally defined for binary classification problems. We use multinomial logistic regression [13], a variant of logistic regression for multi-class problems. This algorithm predicts the probability of a test instance belonging to each class in a multi-class environment.
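Multinomial logistic regression is commonly formulated as a softmax over one linear score per class. The sketch below shows only the prediction step under that formulation; the weights, biases and input vector are hypothetical values, not parameters learned in this work.

```python
import math

def softmax(scores):
    """Convert one score per class into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, biases, x):
    """weights[j] and biases[j] are the (hypothetical) parameters of class j."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)

# Hypothetical parameters for a 3-class problem with 2 features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

probs = predict_proba(weights, biases, [2.0, 0.5])
print([round(p, 3) for p in probs])
print(probs.index(max(probs)))  # 0
```

Because the output is a probability vector rather than a single label, it can be scored directly with log loss.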

D. Random Forest [14]
Random forest is an ensemble of decision trees trained with bagging. The random forest algorithm constructs n decision trees using the training data. The class label is determined by majority voting over all the constructed decision trees. The decision tree algorithm naturally handles the multi-class case as well.

E. XGBoost [15]
XGBoost is an optimized distributed gradient boosting library built on the gradient boosting framework. XGBoost provides a parallel tree boosting method, which is very fast and accurate in many cases. XGBoost is a kind of ensemble. Ensemble learning constructs a group of predictors and aggregates the predictions of the individual models. In the boosting technique, each succeeding model tries to correct the errors made by the previous models, and the models are combined using weights.
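The boosting idea, in which each new model is fitted to the errors left by the models before it, can be illustrated with a deliberately small sketch. This is not XGBoost: it is plain gradient boosting for squared-error regression with one-split trees (stumps) on a single feature, without regularization or second-order information. The data, the number of rounds and the learning rate are assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # exclude the maximum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lv) ** 2 for r in left)
               + sum((r - rv) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals left by the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    history = []  # mean squared training error after each round
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        mse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(mse)
    return predict, history

# Made-up data: a noisy step from about 1 to about 3.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

predict, history = boost(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

The training error recorded in history does not increase from one round to the next, which reflects how each added stump corrects part of the remaining error.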

Characteristics of XGBoost:
• XGBoost is used in regression as well as classification problems.
• Manages memory very efficiently for large datasets exceeding RAM.
• Supports different kinds of regularization, which help in reducing overfitting.
• Provides automatic tree pruning.
• Efficiently handles missing values.
• Takes care of outliers to some extent.
All the classification algorithms chosen are sensitive to parameters. The experimental setup and parameter settings are discussed in the next section.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 3, 2021, www.ijacsa.thesai.org

IV. EXPERIMENTAL SETUP
This section describes the parameter selection for the machine learning algorithms used in the experiments. Some of the classifiers we use are sensitive to parameters, so we perform hyperparameter tuning to fix the best parameters. The hyperparameter tuning is shown in Figs. 7 to 10. The k-NN classifier is sensitive to the value of k [16]. To find the best k, we tested the model with values of k from 1 to 15. The model gives the best log loss for k=1, as shown in Fig. 7. Therefore, we use k=1 in our experiments.
For the Random Forest classifier, we tested with the number of trees varying from 10 to 3000 (Fig. 9). With 1000 trees we achieved the best log loss and a low misclassification error. Therefore, we use 1000 trees in the random forest. We use the XGBoost classifier with 500 trees (estimators), a maximum depth of 5 and a learning rate of 0.05.
Any machine learning algorithm needs training and testing to determine the performance of the classifier. We split the dataset randomly into three parts, train, cross validation and test, with 64%, 16% and 20% of the data respectively. In other words, 20% of the data is held out for testing, and the remaining 80% is divided into training and cross validation sets.
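A random 64% / 16% / 20% split can be sketched as below. The two-stage procedure (hold out 20% for testing, then take 20% of the remainder for cross validation) and the fixed seed are assumptions; only the proportions and the number of instances, 10,868, come from the text.

```python
import random

def split_indices(n, seed=0):
    """Shuffle indices 0..n-1 and split them 64% / 16% / 20%."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # seed is an arbitrary choice
    n_test = round(0.20 * n)
    n_cv = round(0.20 * (n - n_test))  # 20% of the remaining 80% = 16% overall
    test = idx[:n_test]
    cv = idx[n_test:n_test + n_cv]
    train = idx[n_test + n_cv:]
    return train, cv, test

train, cv, test = split_indices(10868)
print(len(train), len(cv), len(test))  # 6955 1739 2174
```

Since the classes are highly imbalanced, a stratified split that preserves class proportions in each part is a common alternative, although the text describes a purely random split.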

V. RESULTS AND DISCUSSION
We experiment with the features extracted from byte files and asm files individually, and then with both combined. The following sections present the results.

A. Results on Byte Files
The log loss values on cross validation as well as test data are tabulated in Table III. The random forest classifier achieves the better log loss on cross validation data, whereas XGBoost performs best on test data, in terms of both log loss and misclassification error.
From Table IV, we can see that the precision and recall of k-NN for class 5 are low compared to other classes. We attribute this to the very small number of instances in class 5 (Fig. 2). From the precision matrix, it is understood that there is confusion between class 1 and class 5.

B. Results on Features Extracted from asm Files
The log loss values computed using features extracted from asm files are tabulated in Table V. XGBoost obtains the better log loss on test data. In addition, precision and recall for class 5 improve when asm file features are used, as shown in Table VI.

C. Results on Both Byte and asm Files
The random forest ensemble and XGBoost clearly obtain better accuracy on both asm and byte files. We have therefore used both sets of features in these two models and present the results in Table VII. When the 257 features related to byte files and the 53 features extracted from asm files are used together for training, the log loss of XGBoost improves on both cross validation and test data, from 0.048 to 0.031.

VI. CONCLUSION
In this paper, we determine the type of malware to which a given file belongs. We use a unigram model to construct a bag of words from byte files as well as asm files. Random forest and XGBoost classifiers achieve a better log loss, 0.031, than the other classifiers used in this work. Using only byte files failed to detect some classes of malware, especially class 5, for which there are few files, but the additional information from asm files succeeded in detecting malware belonging to all classes. In future, we would like to apply advanced text retrieval features on byte files to improve the log loss.