A Decision Support Tool for Inferring Further Education Desires of Youth in Sri Lanka

This paper presents the results of a study carried out to identify the factors that influence the further education desires of Sri Lankan youth. Statistical modeling was first used to infer the desires of the youth, and a decision support tool was then developed based on the resulting statistical model. Data collected as part of the National Youth Survey were used for the analysis and for the development of the model. The accuracy of the model and the decision support tool was tested using random data sets and was found to be well above 80 percent, which is considered sufficient for policy-related decision making.


I. INTRODUCTION
Sri Lanka has witnessed several incidents of youth unrest in the recent past. Of these insurgencies, two involved the youth of the south while the other involved the youth of the north. There have been many discussions and debates about youth unrest and the increasingly violent and intolerant nature of these conflicts. Since these discussions have been rather impressionistic, there has always been a need for systematic studies to obtain information on Sri Lankan youth, their background and their desires [1]. To collect up-to-date information on Sri Lankan youth and their perceptions, an island-wide national youth survey was carried out. It was conducted at the turn of the century as a joint undertaking involving the United Nations Development Programme (UNDP) and six other Sri Lankan and German institutions. The survey considered four main aspects of youth: politics, conflicts, employment and education. Of these, the further education desires of youth were selected for further study in this research. The relationship between the type of further education desire of youth in Sri Lanka and other social factors has been studied.
The education domain consists of many different areas, but at present only a few of them are catered for by the national educational institutes in Sri Lanka [1]. By finding out the educational desires of the youth, it will be possible to design and develop educational and professional programmes and institutes that are readily accepted by the youth and that give better results than can be achieved by pursuing only traditional programmes. Data mining, a powerful tool that can recognize and unearth significant facts, relationships, trends and patterns, can be employed to discover this information [2]. In this project, a data mining model has been developed to predict the educational desires of youth at an early stage from other social data.

II. THEORETICAL BACKGROUND
Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data [3]. Univariate analysis is the simplest form of quantitative (statistical) analysis.
The analysis is carried out by describing a single variable and its attributes for the applicable unit of analysis. Univariate analysis contrasts with bivariate analysis (the analysis of two variables simultaneously) and multivariate analysis (the analysis of multiple variables simultaneously). Univariate analysis is used primarily for descriptive purposes, while bivariate and multivariate analysis are geared more towards explanatory purposes [4]. Univariate analysis is commonly used in the first stages of research, in analyzing the data at hand, before being supplemented by more advanced, inferential bivariate or multivariate analysis [5]. Pearson's chi-square test is the best known of several chi-square tests, that is, statistical procedures whose results are evaluated by reference to the chi-square distribution [6].
With large samples, a chi-square test can be used. However, the significance value it provides is only an approximation, because the sampling distribution of the calculated test statistic is only approximately equal to the theoretical chi-squared distribution. The approximation is inadequate when sample sizes are small, or when the data are very unequally distributed among the cells of the table, resulting in the cell counts predicted under the null hypothesis (the "expected values") being low. The usual rule of thumb is that the chi-squared test is not suitable when the expected values in any of the cells of a contingency table are below 5, or below 10 when there is only one degree of freedom [7]. In contrast, the Fisher exact test is, as its name states, exact, and it can therefore be used regardless of the sample characteristics [8]. For hand calculations, the test is only feasible in the case of a 2×2 contingency table. However, the principle of the test can be extended to the general case of an m×n table [9].
29 | P a g e www.ijacsa.thesai.org
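The rule of thumb for expected values and the exact test for a 2×2 table described above can be illustrated with a minimal sketch. The function names and the example counts are assumptions made for illustration and are not taken from the survey data; the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_is_suitable(table):
    """Rule of thumb: every expected count is at least 5,
    or at least 10 when the table has only one degree of freedom."""
    df = (len(table) - 1) * (len(table[0]) - 1)
    threshold = 10 if df == 1 else 5
    return all(e >= threshold for row in expected_counts(table) for e in row)

def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical small-sample table: the approximation is not suitable here,
# so the exact test is used instead.
small = [[3, 1], [1, 3]]
if not chi_square_is_suitable(small):
    p_value = fisher_exact_2x2(small)
```

Only the standard library is used, so the sketch stays feasible for small tables; for larger m×n tables a statistical package would normally be preferred.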
Logistic regression is most frequently employed to model the relationship between a dichotomous (binary) outcome variable and a set of covariates, but with a few modifications it may also be used when the outcome variable is polytomous [10].
The extension of the model and the methods from a binary outcome variable to a polytomous outcome variable can be easily illustrated when the outcome variable has three categories. Further generalization to an outcome variable with more than three categories is more of a notation problem than a conceptual one [11]. Hence, only the situation in which the outcome variable has three categories is considered here.
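As a sketch of the three-category case, the two logit functions and the resulting category probabilities can be written as follows. The notation is an assumption made for illustration: the categories are coded 1, 2 and 3 with category 3 taken as the reference, $x = (x_1, \dots, x_p)$ is the covariate vector, and $\beta_{fj}$ are the parameters of logit $f$.

```latex
\begin{align*}
g_f(x) &= \ln\frac{P(Y = f \mid x)}{P(Y = 3 \mid x)}
        = \beta_{f0} + \beta_{f1} x_1 + \cdots + \beta_{fp} x_p, \qquad f = 1, 2,\\
P(Y = f \mid x) &= \frac{e^{g_f(x)}}{1 + e^{g_1(x)} + e^{g_2(x)}}, \qquad f = 1, 2,\\
P(Y = 3 \mid x) &= \frac{1}{1 + e^{g_1(x)} + e^{g_2(x)}}.
\end{align*}
```

The three probabilities sum to one, and setting $g_2(x)$ aside recovers the form of the usual binary logistic model for category 1 against the reference.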
The main objective of fitting this statistical model is to find the sequence in which variables become significant in the model, so that the sequence as a whole, or a subsequence starting from the first variable, can be used as necessary in constructing a decision tree. In this study the statistical model is used not for interpretation but only for comparison with the outcome of a data mining approach to decision making.

III. ANALYSIS
Univariate analysis is carried out to analyze each variable independently of the other variables. Each categorical variable measured as a factor is therefore cross-tabulated with the dependent variable "Type of Further Educational Desires", calculating the percentage of respondents belonging to each combination of levels, and the chi-square test is performed to assess the association between each factor and the response of interest. A significance level of 20 percent has been fixed for further analysis. Table 1 shows the results of the analysis; the variables with p-values less than 0.2 have been identified as significant.
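The screening step can be sketched as follows: each factor is cross-tabulated against the three response categories, a Pearson chi-square test is applied, and the factors with p-values below 0.2 are retained. The counts and the helper names are hypothetical and serve only as illustration; they are not the survey figures. The upper-tail probability is computed with the standard library from the closed forms for one and two degrees of freedom and a recurrence that raises the degrees of freedom by two.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed forms for df = 2 (even case) and df = 1 (odd case).
    a, q = (1.0, math.exp(-h)) if df % 2 == 0 else (0.5, math.erfc(math.sqrt(h)))
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi_square_test(table):
    """Pearson chi-square test of independence; rows are factor levels, columns are responses."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(cols) - 1)
    return stat, df, chi2_sf(stat, df)

def screen_factors(tables, alpha=0.2):
    """Names of the factors whose p-value is below the chosen significance level."""
    return [name for name, table in tables.items() if chi_square_test(table)[2] < alpha]

# Hypothetical cross-tabulations (columns: the three response categories).
tables = {
    "Type of activity": [[50, 20, 30], [20, 50, 30]],
    "Gender": [[30, 40, 30], [30, 40, 30]],
}
selected = screen_factors(tables)
```

In this made-up example only "Type of activity" passes the screen, because the two rows of the "Gender" table have identical distributions.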

A. Fitting a Statistical Model
The main purpose of this part of the analysis is to determine the factors that affect, or are associated with, the different types of Further Educational Desires in youth. Several variables have been identified as significant factors, each of which could independently have a significant effect on the educational wishes of youth. However, because these factors confound one another, it is not easy to draw conclusions about their combined influence on Further Education Desires.
This modeling approach is therefore useful in detecting the genuine effect of each factor when adjusted for the other factors.
Since the response variable is multinomial and its levels are on a nominal scale, it was decided to use a "logit" link in the regression modeling. A Generalized Logit Model is therefore fitted to accomplish the objective.

B. Fitting the Best Generalized Logit Model
The forward selection procedure is used to select variables for the model. In assessing the fit of a term, the difference in deviance between the two models being compared, which is distributed as chi-squared, is tested at the 5% significance level. Terms are selected for the model according to how well they represent all the data.
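The deviance comparison used at each step can be sketched as follows, under assumed notation: $D_0$ and $L_0$ are the deviance and likelihood of the current model, $D_1$ and $L_1$ those of the model with the candidate term added, and $p_0$ and $p_1$ are the respective numbers of parameters.

```latex
\begin{align*}
G &= D_0 - D_1 = -2\left(\ln L_0 - \ln L_1\right),\\
G &\sim \chi^2_{\nu} \quad \text{approximately, if the added term has no effect,}
\qquad \nu = p_1 - p_0.
\end{align*}
```

The candidate term is added when $P(\chi^2_{\nu} \ge G) < 0.05$. With three response categories two logits are modeled, so a main effect with $c$ levels contributes $\nu = 2(c - 1)$ parameters.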
The results obtained at each step of fitting the Generalized Logit Model with the CATMOD procedure in the SAS package are tabulated in the body of the analysis.
Let the Null Model be $\log(\pi_f / \pi_F) = \alpha_f$, where $\pi_f$ is the probability that a respondent has the type of Further Education Desire $f$, $\alpha_f$ is the intercept of the corresponding logit, and type $F$ is the Further Education Desire category "No Desire".

Fitting Main Effects to the Model
Step 1: Null Model vs One Variable Model (Model 1). Table 2 shows a sample of the data used in devising Model 1. In Table 2, the lowest p-value is associated with the variable "Type of activity". The selection procedure therefore requires that "Type of activity", as the most significant variable, be added to the Null Model as the first step in developing a model with "Type of Further Education Desire" as the response variable.
The explanatory variables in the model: Type of activity
Model 1: $\log(\pi_{if} / \pi_{iF}) = \alpha_f + x_i' \beta_f$, where $\pi_{if}$ is the probability that a respondent in "Type of activity" category $i$ has the type of Further Education Desire $f$, and type $F$ is the Further Education Desire category "No Desire".
Thus two logits are modeled for each activity type: the logit comparing Technical/Vocational Education to No Desire and the logit comparing University/Higher Education to No Desire.

Model 1, for Type of activity $i$:
logit(response 1 / response 3) models the probability of response category 1 relative to response category 3, and logit(response 2 / response 3) models the probability of response category 2 relative to response category 3.
There is a separate intercept parameter $\alpha_f$ and a separate set of regression parameters $\beta_f$ for each logit, and the matrix $x_i$ is the set of explanatory variables for the $i$th population.
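To make the two logits of Model 1 concrete, the following minimal sketch converts them into probabilities for the three response categories of one activity type. All parameter values and activity labels are hypothetical and chosen only for illustration; they are not the estimates obtained in the study.

```python
import math

# Hypothetical parameters (not estimates from the paper).
ALPHA = {"technical": -0.5, "university": 0.2}   # one intercept per logit
BETA = {                                         # one coefficient set per logit
    "technical": {"Student": 0.3, "Employed": 0.8},
    "university": {"Student": 1.5, "Employed": -0.4},
}

def desire_probabilities(activity):
    """Probabilities of the three response categories for one activity type.

    Activity levels missing from BETA are treated as the baseline level,
    whose coefficient is zero. "No Desire" is the reference category.
    """
    logits = {f: ALPHA[f] + BETA[f].get(activity, 0.0) for f in ALPHA}
    denom = 1.0 + sum(math.exp(g) for g in logits.values())
    probs = {f: math.exp(g) / denom for f, g in logits.items()}
    probs["no_desire"] = 1.0 / denom
    return probs

student = desire_probabilities("Student")
predicted = max(student, key=student.get)  # most probable category
```

The predicted category for a population is simply the one with the largest probability, which is also how a classification table can be derived from a fitted model.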
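Step 2: Model 1 vs Two Variable Model (Model 2). Table 3 shows a sample of the data used in devising Model 2. Since the variable "Educational Level" has the lowest p-value, it was brought into the model that had already been adjusted for "Type of activity".
The explanatory variables in the model: Type of activity, Educational Level
Model 2: $\log(\pi_{ijf} / \pi_{ijF}) = \alpha_f + x_{ij}' \beta_f$, where $\alpha_f$ and $\beta_f$ are the intercept and the regression parameters of logit $f$, and the matrix $x_{ij}$ is the set of explanatory variables for the $ij$th population.
Step 3: Model 2 vs Three Variable Model (Model 3). Table 4 shows a sample of the data used in devising Model 3.
Model 3: $\log(\pi_{ijkf} / \pi_{ijkF}) = \alpha_f + x_{ijk}' \beta_f$, where the matrix $x_{ijk}$ is the set of explanatory variables for the $ijk$th population.
Step 4: Model 3 vs Four Variable Model (Model 4). Table 5 shows a sample of the data used in devising Model 4.
Model 4: $\log(\pi_{ijklf} / \pi_{ijklF}) = \alpha_f + x_{ijkl}' \beta_f$, where the matrix $x_{ijkl}$ is the set of explanatory variables for the $ijkl$th population.
Step 5: Model 4 vs Five Variable Model (Model 5). Table 6 shows a sample of the data used in devising Model 5.
Model 5: $\log(\pi_{ijklmf} / \pi_{ijklmF}) = \alpha_f + x_{ijklm}' \beta_f$, where the matrix $x_{ijklm}$ is the set of explanatory variables for the $ijklm$th population.
Step 6: Model 5 vs Six Variable Model (Model 6). Table 7 shows a sample of the data used in devising Model 6.
Model 6: $\log(\pi_{ijklmnf} / \pi_{ijklmnF}) = \alpha_f + x_{ijklmn}' \beta_f$, where the matrix $x_{ijklmn}$ is the set of explanatory variables for the $ijklmn$th population.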
It was observed that adding the remaining variables did not improve the results. Hence Model 6 has been identified as the best main-effects model.

C. Improving the Model
It was further investigated whether the addition of two-way interaction terms improved the model. The importance of an interaction term was assessed from the difference in deviance between the models with and without that term.
Step 7: Model 6 vs Model 7. Table 8 shows a sample of the data used in devising Model 7. The interaction between "Educational Level" and "Province" has been found to have an effect on the model and has hence been added to it. In Model 7, the matrix $x_{ijklmn}$ is the set of explanatory variables for the $ijklmn$th population, and a further term represents the interaction effect of Edu*Pro.
Step 8: Model 7 vs Model 8. Table 9 shows a sample of the data used in devising Model 8. In this step, the interaction between "Financial Situation in Past (Finp)" and "Major Problems with Education (Mprob)" has been found to have an effect on the model and has hence been added to it.
The explanatory variables in the model: Type of activity, Educational Level, Province, Gender, Social class, Age group, Edu*Pro, Finp*Mprob
In Model 8, the matrix $x_{ijklmn}$ is the set of explanatory variables for the $ijklmn$th population, and two further terms represent the interaction effects of Edu*Pro and Finp*Mprob.
Step 9: Model 8 vs Model 9. Table 10 shows a sample of the data used in devising Model 9. In this step, the interaction between "Major Problems with Education (Mprob)" and "Province (Pro)" has been found to have an effect on the model and has hence been added to it. In Model 9, the matrix $x_{ijklmn}$ is the set of explanatory variables for the $ijklmn$th population, and three further terms represent the interaction effects of Edu*Pro, Finp*Mprob and Mprob*Pro.

After Step 9, no further significant two-way interaction terms were found.

D. Checking the Adequacy of the Best 2-Way Interaction Model
Goodness of Fit: Hypothesis Testing with Deviance Statistics
H0: No lack of fit
H1: There is some lack of fit
According to the p-value, the null hypothesis is not rejected. Hence the test gives no evidence of lack of fit; that is, the model developed fits the data well.
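As an illustrative sketch, the deviance statistic behind such a goodness-of-fit test can be computed from the observed cell counts and the counts fitted by the model. The function name and the counts below are assumptions for illustration only. The statistic would then be referred to a chi-squared distribution whose degrees of freedom are the number of independent response cells minus the number of fitted parameters.

```python
import math

def deviance(observed, fitted):
    """Deviance G^2 = 2 * sum(O * ln(O / E)) over all cells.

    Cells with an observed count of zero contribute nothing to the sum.
    """
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, fitted) if o > 0)

# Hypothetical observed and fitted counts for the cells of one small table.
observed = [12, 18, 30, 25, 15, 20]
fitted = [14.0, 17.0, 29.0, 23.0, 16.0, 21.0]
g2 = deviance(observed, fitted)
```

A small value of the statistic relative to its degrees of freedom corresponds to a large p-value, which is the situation in which the hypothesis of no lack of fit is not rejected.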

E. Classification Table
The classification table was used to evaluate the predictive accuracy of the regression model. From the results shown in Table 12, it can be seen that Model 9 has an accuracy of more than 70 percent.
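The way an accuracy figure is read off a classification table can be sketched as follows. The table is hypothetical (rows are the observed categories, columns the predicted ones) and does not reproduce Table 12.

```python
def overall_accuracy(table):
    """Share of respondents on the diagonal, i.e. classified into their observed category."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total

def per_class_accuracy(table):
    """For each observed category, the share of its respondents classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(table)]

# Hypothetical 3x3 classification table for the three response categories.
classification = [
    [40, 5, 5],    # observed: Technical/Vocational Education
    [10, 30, 10],  # observed: University/Higher Education
    [10, 10, 80],  # observed: No Desire
]
accuracy = overall_accuracy(classification)      # 150 correct out of 200
by_class = per_class_accuracy(classification)
```

Reporting the per-class figures alongside the overall accuracy shows whether a high overall value is driven mainly by the largest category.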

F. Implementation of Data Mining Techniques
Finally, the model (Model 9) developed from the univariate analysis was used to develop a data mining model. This data mining model can predict the "Further Educational Desire" of a youth from the other attributes discussed above.

Construction of Decision Tree
A decision tree was constructed using the attributes identified as significant in the univariate analysis. The ordering of the attributes in the decision tree was also as determined in the statistical analysis. Table 13 shows the attributes and the number of levels of each attribute; among them, Age group has 3 levels, Financial Situation in Past has 3 levels and Major Problems with Education has 8 levels. "Type of Further Education Desire", which has 3 levels, is the class attribute. Figure 1 shows a portion of the decision tree thus constructed.
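A tree whose attribute order is fixed in advance, as described above, can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the construction used in the study: the records, attribute names and the majority-class rule at the leaves are all hypothetical.

```python
from collections import Counter

def build_tree(records, attributes, target):
    """Split on the attributes in the given fixed order; leaves hold the majority class."""
    classes = Counter(r[target] for r in records)
    majority = classes.most_common(1)[0][0]
    if not attributes or len(classes) == 1:
        return majority
    first, rest = attributes[0], attributes[1:]
    groups = {}
    for r in records:
        groups.setdefault(r[first], []).append(r)
    return {
        "attribute": first,
        "default": majority,  # used for attribute values not seen when building
        "branches": {v: build_tree(g, rest, target) for v, g in groups.items()},
    }

def predict(tree, record):
    """Follow the branches matching the record until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(record[tree["attribute"]], tree["default"])
    return tree

# Hypothetical records; the attribute order mimics the order of entry into the model.
records = [
    {"activity": "Student", "education": "A/L", "desire": "University/Higher Education"},
    {"activity": "Student", "education": "O/L", "desire": "Technical/Vocational Education"},
    {"activity": "Employed", "education": "O/L", "desire": "No Desire"},
    {"activity": "Employed", "education": "A/L", "desire": "No Desire"},
]
tree = build_tree(records, ["activity", "education"], "desire")
label = predict(tree, {"activity": "Student", "education": "A/L"})
```

Each root-to-leaf path of such a tree corresponds to one if-then rule, which is the form in which a rule set can be carried over into an application.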

G. Implementation
A software tool was developed using Visual Basic to implement the rule set developed above. Figure 3 shows the interface of the software tool developed.

H. Evaluation
The system developed was tested using a random set of data containing 485 test records. This set was divided into four test sets, each containing around 125 records. Table 14 shows the classification table obtained by inputting the data into the application. Table 16 shows the resulting measures obtained from the tests. From Table 16, it can be seen that the overall accuracy of the system is above 80 percent. Figure 7 shows the Receiver Operating Characteristic (ROC) curve drawn from the data in Figure 4. From the ROC curve, it can be seen that the results concentrate towards the upper left-hand corner. This indicates that the accuracy of the data mining model is acceptable, since purely random guessing would lie along the diagonal line.
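The position of a result in ROC space can be sketched from a classification table by treating one category as "positive" and the other two as "negative". The table below is hypothetical (rows are observed categories, columns predicted ones) and is not taken from Table 14; the one-against-the-rest treatment is an assumption about how a three-category result may be placed on a ROC plot.

```python
def roc_point(table, k):
    """(false positive rate, true positive rate) for category k against the rest."""
    total = sum(sum(row) for row in table)
    tp = table[k][k]
    fn = sum(table[k]) - tp                  # category k predicted as something else
    fp = sum(row[k] for row in table) - tp   # other categories predicted as k
    tn = total - tp - fn - fp
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical classification table for one test set.
test_table = [
    [45, 3, 2],
    [6, 40, 4],
    [5, 5, 90],
]
points = [roc_point(test_table, k) for k in range(3)]
above_diagonal = all(tpr > fpr for fpr, tpr in points)
```

A point with a true positive rate above its false positive rate lies above the diagonal, and the closer it is to the upper left-hand corner the better the classification.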

IV. CONCLUSIONS
This paper presented the results of the research carried out to find the factors on which the Educational Desires of Sri Lankan youth depend. The research found that the Educational Desires of the youth could be predicted from a combination of several social factors. The findings of the research were finally used to design a data mining model for the prediction of the Educational Desires of the youth. This model can be used by decision makers in dealing with issues concerning youth, especially their further educational requirements.