Observation of Imbalance Tracer Study Data for Graduates Employability Prediction in Indonesia

Abstract: A tracer study is a mandatory aspect of accreditation assessment in Indonesia. The Indonesian Ministry of Education requires all Indonesian universities to report graduate tracer study results to the government annually. Universities also need tracer studies to evaluate how well the learning applied in the curriculum has succeeded. One aspect to evaluate is the absorption rate of graduates into industry, so a machine learning model is needed to help university officials evaluate and understand the characteristics of their graduates and thereby inform curriculum policy. This research focuses on building a reliable machine learning model on a tracer study dataset whose format is determined by the Government of Indonesia. The dataset was obtained from the tracer study of Amikom University. SVM is tested in combination with several algorithms for handling imbalanced data: the study compares SMOTE, SMOTE-ENN, and SMOTE-Tomek combined with SVM to predict the employability of graduates. Testing was carried out with K-Fold Cross Validation; the highest accuracy and precision were produced by the SMOTE-ENN SVM model, at 0.96 and 0.89 respectively.


I. INTRODUCTION
The quality of a university can be seen from the absorption rate of its graduates into the working world, so many universities are trying to improve the quality of their graduates [1], [2]. For this reason, the Indonesian Ministry of Education requires all universities to report tracer study results annually as a measure of graduate employability. A tracer study is also a requirement for higher education accreditation set by the National Accreditation Board for Higher Education (BAN-PT) [3], [4].
We currently live surrounded by data. Data circulating around us, including tracer study data, can be collected and processed to produce new knowledge [5]. Such data can be used to improve the quality of human resources and of the curriculum, which in turn can increase the absorption of university graduates into industry [1], [6].
One of the machine learning approaches widely used to meet these needs is classification [7], [8]. Using a classification algorithm, we can predict whether a graduate is likely to be absorbed into a job quickly or not [9].
Many classification algorithms are in popular use, one of which is the Support Vector Machine (SVM). Previous research found that the SVM algorithm performs very well in predicting the employability of graduates [10]. However, the final result depends not only on the quality of the algorithm but also on the quality of the dataset it is applied to. One criterion for obtaining a reliable machine learning model is that the dataset should be balanced. There are two approaches to balancing a dataset, namely oversampling and undersampling. One oversampling algorithm that can be used is SMOTE, which has several variants; those considered here are plain SMOTE, SMOTE ENN, and SMOTE Tomek [11], [12].
This study aims to find the best method for predicting the employability of higher education alumni using the Amikom University tracer study dataset, whose attributes and format are determined by the Indonesian Ministry of Education and can be accessed at http://tracerstudy.kemdikbud.go.id/frontend/.

A. Classification
Classification is a type of machine learning task in which the computer automatically predicts the class of a data item from the given input data [7]. Classification algorithms commonly used for tracer studies include Naive Bayes, Neural Network, SVM, and Logistic Regression [9], [13], [14]. In previous works, tracer study data in Indonesia was analyzed using those classification algorithms without SMOTE or any other method for handling imbalanced data.

B. Balanced Data
A balanced dataset is one in which each class has a comparable number of samples; when the classes have significantly different numbers of samples, the dataset is called imbalanced. Imbalanced classes are a common problem in machine learning classification, where the ratio between classes is disproportionate. Class imbalance can be found in various fields, including tracer studies. Classes with more data are called majority classes and classes with less data are called minority classes [15]-[17].
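To make the notion of imbalance concrete, the following minimal sketch counts the samples per class and computes the ratio between the largest and smallest class. The function name and the toy labels are hypothetical and are not taken from the study's dataset; a ratio of 1.0 would indicate a perfectly balanced dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts and the majority/minority size ratio."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Toy labels (illustrative only): class "b" is the majority, class "a" the minority.
toy_labels = ["a"] * 10 + ["b"] * 60 + ["c"] * 30
counts, ratio = class_distribution(toy_labels)
print(dict(counts), ratio)
```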

C. Support Vector Machine
The Support Vector Machine algorithm belongs to the Supervised Learning category, which means that the data used for training already carries labels [18], [19]. In the decision-making process, the machine therefore assigns each testing sample to the label that matches its characteristics. Support Vector Machine is one of the machine learning algorithms that can be used for classification: it generates the best hyperplane, which separates the classes in the dataset [20], [21].

D. SMOTE
SMOTE is one of the algorithms that can be used to balance a dataset. It uses an oversampling approach in which synthetic data is generated from the minority class so that the minority class has the same amount of data as the majority class [15], [22]. The synthetic data is obtained from the k nearest neighbours of the minority data. In this study, three variants of the SMOTE algorithm are compared, namely SMOTE, SMOTE ENN and SMOTE Tomek. SMOTE ENN and SMOTE Tomek combine SMOTE, a balancing algorithm with an oversampling approach, with ENN and Tomek respectively, which are undersampling algorithms. ENN and Tomek delete synthetic data that is similar to the majority data, so that the resulting classes are clearly separated [11], [23].
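As an illustration of the Tomek undersampling step, a Tomek link is conventionally defined as a pair of samples from different classes that are each other's nearest neighbour. The following is a minimal brute-force sketch of detecting such pairs; the function names and toy data are hypothetical, and this is not the implementation used in the study.

```python
import math


def nearest_neighbour(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Return pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest_neighbour(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy data (illustrative only): points 0 and 1 are close but differently labelled.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.2, 5.0)]
y = [0, 1, 0, 0, 1, 1]
print(tomek_links(X, y))
```

Which member of a link is then deleted (only the majority-class sample, or both samples) is a design choice that differs between implementations.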

III. RESEARCH METHOD
The dataset used in this study was obtained from questionnaires filled out by alumni of Amikom University in 2018. The distributed questionnaires were filled out by (many) respondents and the responses were stored in CSV form. The process can be seen in the

A. Selection of Attributes and Collection of Survey Results
The first stage of this research is to collect the questionnaire results, which are then presented in CSV form so that they can be processed by the predetermined model. There are 145 columns, covering attributes such as the respondents' hard-skill level after graduation, sex, length of study, when they started searching for jobs, and many more, including the label (alumni employability). All of the attributes can be accessed at http://tracerstudy.kemdikbud.go.id/frontend/.

B. Labeling Data
Data labeling is done by taking each respondent's answer to the question "How long did it take you to get your job after graduation?" Based on that question, the labels are divided into three classes. If a student got a job before graduating from university, the data is labeled "1". If the student got a job three months or less after graduating, it is labeled "2". If the student took more than three months to get a job after graduation, it is labeled "3".
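The labeling rule can be sketched as a small function. This is only an illustration: it assumes the questionnaire answer has already been converted to a number of months relative to graduation, with negative values meaning the job was obtained before graduating. That encoding and the function name are assumptions, not part of the original questionnaire format.

```python
def employability_label(months_after_graduation):
    """Map time-to-first-job to the three classes described in the text.

    Assumption: a negative value means the job was obtained before graduation.
    """
    if months_after_graduation < 0:
        return 1  # job obtained before graduating
    if months_after_graduation <= 3:
        return 2  # job obtained three months or less after graduating
    return 3      # job obtained more than three months after graduating


# Hypothetical answers for four respondents.
print([employability_label(m) for m in (-2, 0, 3, 7)])
```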

C. Data Preprocessing
In this step, the data is preprocessed by converting string-valued data into integer form, filling empty values in all columns with zero, and deleting any entries that still contain null data. This has to be done to avoid anomalies in the mathematical modeling.
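A minimal sketch of this kind of preprocessing is shown below, using only the Python standard library. The column names and values are hypothetical, and starting the integer codes at 1 (so that 0 stays reserved for empty cells) is a design choice made for this illustration, not something stated in the study.

```python
def encode_rows(rows):
    """Fill empty cells with 0 and map each distinct string in a column
    to an integer code (codes start at 1, so 0 only ever means 'empty')."""
    codebooks = {}  # column name -> {string value: integer code}
    encoded = []
    for row in rows:
        out = {}
        for col, value in row.items():
            if value is None or value == "":
                out[col] = 0  # empty cell -> zero
            elif isinstance(value, str):
                codes = codebooks.setdefault(col, {})
                out[col] = codes.setdefault(value, len(codes) + 1)
            else:
                out[col] = value  # already numeric, keep unchanged
        encoded.append(out)
    return encoded, codebooks


# Hypothetical questionnaire rows.
rows = [
    {"sex": "F", "gpa": 3.5},
    {"sex": "M", "gpa": ""},
    {"sex": "F", "gpa": None},
]
encoded, codebooks = encode_rows(rows)
print(encoded)
print(codebooks)
```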

D. Data Balancing
In practice, classification works best with balanced data, that is, data in which each label has the same number of samples. If the labels have significantly different numbers of samples, the dataset is called imbalanced. The class with more data is the majority class and the class with less data is the minority class [24].
In this study, the SMOTE algorithm is used to overcome the imbalance. SMOTE balances the amount of data with an oversampling approach: it creates synthetic data based on the k nearest neighbours of the minority data [25].
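The interpolation idea behind SMOTE can be sketched as follows. This is a simplified illustration using only the standard library, not the implementation used in the study: the function name, the parameters `n_new`, `k` and `seed`, and the toy points are all assumptions. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority samples.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is an interpolation between two existing minority samples, it always lies inside the region already spanned by the minority class.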

E. Classification
After the data balancing process, the classification process is carried out with the Support Vector Machine algorithm.
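For reference, the standard soft-margin SVM training problem can be written as below. This is the textbook formulation, given here as an assumption about the general form only: the study does not state which kernel, regularization constant $C$, or multi-class strategy (for example one-vs-one or one-vs-rest, needed because there are three labels) was used. Here $x_i$ is a feature vector, $y_i \in \{-1, +1\}$ its binary label, $w$ and $b$ define the hyperplane, and $\xi_i$ are slack variables.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
```

A new sample $x$ is then assigned to a class according to the sign of $w^{\top} x + b$.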

F. Testing
Model testing uses K-Fold Cross Validation with 3, 5, and 10 folds. This is done so that the test is more valid and varied [26].
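The fold construction can be sketched as follows. This is a minimal, standard-library illustration of shuffled K-fold splitting; the function name, the `seed` parameter and the sample count of 100 are assumptions and do not reflect the study's actual dataset size or tooling.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With more folds, each test fold holds a smaller share of the data.
for k in (3, 5, 10):
    splits = list(k_fold_indices(100, k))
    print(k, [len(test) for _, test in splits])
```

Each sample appears in exactly one test fold, so every sample is used for testing once and for training k - 1 times.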

IV. RESULTS
In this study, we classify the normalized tracer study dataset. After collection and normalization, the dataset is divided into three classes based on when the alumni got a job: before graduating, three months or less after graduation, and more than three months after graduation. Fig. 2 shows the class counts, which are imbalanced. Fig. 3 shows that the size of the dataset changes significantly in each observation depending on the type of SMOTE used.
Three balancing algorithms are compared, namely SMOTE, SMOTE ENN and SMOTE Tomek, each applied together with the Support Vector Machine classification algorithm. The best model is determined from the average values of f1 score, accuracy, precision, and recall.
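The four evaluation measures can be computed from the true and predicted labels as sketched below. The sketch assumes macro averaging over the three classes (the per-class scores are averaged with equal weight); the averaging scheme is not stated in the text, so this is an assumption, and the function name and toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and f1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy predictions for six graduates with labels 1, 2 and 3.
scores = macro_scores([1, 1, 2, 2, 3, 3], [1, 2, 2, 2, 3, 1])
print(scores)
```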
The SMOTE algorithm is a data balancing algorithm with an oversampling approach, in which the number of minority-class samples is increased to match the majority class. Fig. 4-6 show the dataset after applying the SMOTE, SMOTE ENN and SMOTE Tomek algorithms. After processing by the SMOTE and SMOTE-Tomek algorithms, the classes have balanced amounts of data. This did not happen with SMOTE ENN, which created a more normal dataset; this matters because data with an absolute balance may sometimes result in overfitting [12].
After balancing, the data is fed to the Support Vector Machine classification algorithm, and the models are evaluated using cross-validation with 3, 5 and 10 folds, measuring accuracy, f1 score, recall and precision for each model. Tables I and II show the three-fold cross-validation experiment, which tests the f1 score, accuracy, precision, and recall of SVM with SMOTE, SMOTE-Tomek, and SMOTE-ENN. In this scenario, the average accuracy, f1 score, precision and recall using SVM alone are 0.79, 0.76, 0.83 and 0.75 respectively; the corresponding values after data balancing are given in the tables.
In the test scenario using five-fold cross-validation, shown in Tables III, IV, V and VI, the average accuracy, f1 score, precision and recall are close to perfect, which indicates overfitting. This is triggered by the share of test data in each fold being smaller than in the previous experiment.
As in the previous two experiments, in the 10-fold cross-validation experiment (Tables VII and VIII), before data balancing was applied the accuracy was 0.84, the f1 score 0.81, the precision 0.87 and the recall 0.8. After SMOTE was implemented, all four values increased; the averages are reported in the tables. However, this experiment also shows overfitting of the SVM models that use a data balancing algorithm: several folds reach perfect accuracy with all three algorithms. This happens because the test data in each fold is only 10% of the entire dataset. Overfitting occurs more often with the ENN and Tomek variants than with plain SMOTE, because applying ENN and Tomek enlarges the difference between the classes in the dataset.

V. CONCLUSION
In this study, the data balancing algorithms SMOTE and SMOTE-Tomek can be used to produce balanced data in terms of the balance ratio formula. Both algorithms also yield considerably improved accuracy, f1 score, precision and recall, given the results presented. However, compared with SMOTE-ENN, which produces a poorer balance ratio, SMOTE and SMOTE-Tomek have lower accuracy, f1 score, precision and recall. Cross-validation with several fold counts was performed, and SMOTE-ENN was found to have the best accuracy in general. In 10-fold cross-validation, SVM without SMOTE reached an accuracy of 0.84, SMOTE reached 0.9, SMOTE-Tomek reached 0.94, and SMOTE-ENN reached 0.95.
The SMOTE-ENN-SVM combination produces the best-quality model, as shown by its accuracy score, which is higher than that of the other algorithms in each experiment. In future work, because tracer study data has many columns and varied data types, it would be better to apply feature selection algorithms to choose the best features to analyze.