Exploring Factors Associated with the Social Discrimination Experience of Children from Multicultural Families in South Korea by using Stacking with Non-linear Algorithm

The number of children from multicultural families is increasing rapidly as the number of multicultural families grows. However, surveys and basic research for understanding the characteristics of these children and issues such as social discrimination remain insufficient. This study identified the machine learning model with the best performance for predicting the social discrimination experience of children from multicultural families by comparing the prediction performance (accuracy) of individual prediction models and stacking ensemble models. This study analyzed 19,431 adolescents (between 19 and 24 years old: 9,835 males and 9,596 females) among the children of marriage immigrants. Random forest (RF), rotation forest, artificial neural network (ANN), and support vector machine (SVM) were used as base models, and a logistic regression algorithm was applied as the meta model. Each machine learning model was built through 5-fold cross-validation. Root-mean-square-error (RMSE), index of agreement (IA), and variance of errors (Ev) were used to evaluate the prediction performance of the developed models. The results indicated that the rotation forest-logistic regression model had the best prediction performance. Future studies need to explore stacking ensemble models with the best performance by combining base models and meta models built from various machine learning algorithms such as clustering and boosting.
Keywords: Stacking ensemble; meta model; root-mean-square-error; index of agreement; rotation forest


I. INTRODUCTION
The number of foreigners residing in South Korea exceeded 2 million as of 2019. This accounts for 3.69% of the South Korean population, which is not a high percentage. However, it is recognized as a noticeable phenomenon in Korean society because the number of immigrants has increased rapidly over the past decade and immigrants are easily distinguishable due to differences in appearance and language [1]. In particular, as immigration has become linked since 2002 to marriage with men living in rural areas or low-income men in urban areas, the number of multicultural families reached 900,000 as of 2016 [2]. The number of immigrants is expected to increase further as the population of South Korea decreases due to aging and the low birth rate [3]. This trend has drawn more attention because it will further diversify the composition of the population [3].
A multicultural family is a family formed by people of different nationalities or races through international marriage or other means. South Korea prepared the "Measures to Support the Social Integration for Female Marriage Immigrant Families, Multi-racial People, and Immigrants" in 2006 to help multicultural families settle stably in South Korea. With the enactment of the Multicultural Families Support Act in 2008, the government strengthened its support for multicultural families at the policy level. As a result, social security and legal status were guaranteed for marriage immigrants.
As the number of foreigners residing in South Korea rapidly increases, the number of children from multicultural families (e.g., international marriage families and foreign workers' families) is also increasing. Furthermore, as these children attend school, the possibility of conflict due to cultural differences has increased along with the growth in personal and cultural contact.
Nevertheless, social policies for multicultural families in South Korea have mainly focused on employment or welfare for marriage immigrants and foreign workers [4,5]. Moreover, previous studies on multicultural families [4,6] have examined only limited individual aspects such as socioeconomic characteristics, welfare level, human rights discrimination, employment status, and policy analysis. There are still not enough studies on the overall social discrimination experiences of children from multicultural families. Children from multicultural families (international marriage families) can be divided into those born in South Korea and those who entered South Korea after being born in other countries. Since it has been reported that these children may not adapt well to South Korea due to the unique characteristics of multicultural families and the various changes they experience during adolescence [7], it is necessary to expand the social foundation that helps them adapt to South Korea for the sake of social integration.
In summary, the number of children from multicultural families is increasing rapidly as the number of multicultural families grows. However, surveys and basic research for understanding the characteristics of multicultural children and issues such as social discrimination remain insufficient. Therefore, it is necessary to identify the characteristics of multicultural children and to seek new policies that reflect them, so that policies addressing various problems, including the social adaptation of multicultural children, are in place for a rapidly changing multicultural society.
125 | Page www.ijacsa.thesai.org
Previous studies [8,9,10,11,12] on adolescents from multicultural families in South Korea reported difficulties in peer relations, social support, family support, and language as factors related to discrimination experiences. Most of them used regression analysis as the prediction algorithm. Regression analysis is efficient for detecting individual risk factors, but it is limited in identifying multiple risk factors [13,14]. As a way to overcome this limitation, recent social science studies [15,16] have used predictive modeling based on machine learning with big data. However, since these prediction studies rely on individual prediction algorithms, the bias inherent in each algorithm may be reflected in the prediction results.
This study identified the predictors of social discrimination experiences of children from multicultural families in South Korea by using individual prediction models based on machine learning, and reduced the potential bias of these models by combining them into a stacking ensemble learning model. Moreover, this study identified the machine learning model with the best performance for predicting the social discrimination experience of children from multicultural families by comparing the prediction performance (accuracy) of individual prediction models and stacking ensemble models.

II. METHODS
A. Data Source
The data source of this study was the 2012 "Study on the National Survey of Multicultural Families" [17], a survey of multicultural families residing in South Korea conducted jointly by the Ministry of Health, Welfare and Family Affairs, the Ministry of Justice, and the Ministry of Gender Equality. The survey was conducted to develop policies customized for multicultural families by identifying their living conditions and welfare needs. The survey items covered the general characteristics, employment, economic level, marriage, health, and health care of multicultural families. The survey targeted all marriage immigrants, 154,333 families in total. In addition to marriage immigrants, the survey also evaluated their spouses and children separately. The target subjects were selected based on the status of alien residents living in 16 cities and provinces and the basic status of multicultural families collected by the Ministry of Public Administration and Security. Because the survey collected data from the entire target population, no sample design was needed. The survey was conducted from July 20 to October 31, 2012. The multicultural families used for this study were (1) families composed of marriage immigrants and South Koreans who obtained South Korean nationality by birth, acknowledgment, or naturalization and (2) families composed of foreigners who obtained South Korean nationality by acknowledgment or naturalization and South Koreans who obtained South Korean nationality by birth, acknowledgment, or naturalization, in accordance with the Multicultural Families Support Act. This study analyzed 19,431 adolescents (between 19 and 24 years old: 9,835 males and 9,596 females) among the children of marriage immigrants.

B. Measurements and Definitions of Variables
The target variable (label) was defined as social discrimination experience (yes or no). Features were gender, age, residence (countryside or city), highest level of education (elementary school graduation and below, middle school graduation, high school graduation, or college graduation or higher), Korean reading level (good, average, or poor), Korean speaking level (good, average, or poor), Korean writing level (good, average, or poor), Korean listening level (good, average, or poor), learning support experience (yes or no), economic activity (yes or no), the experience of using a support center for multi-cultural families (yes or no), learning Korean (yes or no), Korean society adaptation training (yes or no), career counseling (yes or no), and social welfare center use (yes or no).

C. Single Machine Learning Algorithm (Base Model): Support Vector Machine (SVM)
SVM is a machine learning algorithm that transforms the training data into a higher dimension through a nonlinear mapping and finds the optimal decision boundary there, that is, the hyperplane that best separates the classes [18]. For example, when A=[a,d] and B=[b,c] are not linearly separable in 2D, they can become linearly separable once they are mapped into 3D. Therefore, if an appropriate nonlinear mapping into a sufficiently high dimension is applied, data with two classes can always be separated by a hyperplane. The concept of SVM is presented in Fig. 1.
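The 2D-to-3D example can be made concrete with a minimal sketch. The coordinates of a, b, c, and d, the mapping `lift`, and the plane coefficients below are illustrative assumptions (an XOR-style layout chosen by hand), not values from this study, and the plane is hand-picked rather than fitted by an SVM solver.

```python
# XOR-style layout (assumed coordinates): class A = {a, d}, class B = {b, c}.
# No straight line separates A from B in 2D.
points = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 0.0), "d": (1.0, 1.0)}
labels = {"a": 1, "d": 1, "b": -1, "c": -1}


def lift(p):
    """Hypothetical nonlinear map from 2D to 3D: append the product x1 * x2."""
    x1, x2 = p
    return (x1, x2, x1 * x2)


# Hand-picked separating plane in 3D: w . z + bias = 0 (assumed, not learned).
w = (-1.0, -1.0, 2.0)
bias = 0.5


def decision(p):
    """Signed score of point p relative to the plane in the lifted space."""
    z = lift(p)
    return sum(wi * zi for wi, zi in zip(w, z)) + bias


predicted = {name: (1 if decision(p) > 0 else -1) for name, p in points.items()}
print(predicted)
```

In the lifted space a single plane assigns a and d to one side and b and c to the other, which is the effect a kernel SVM obtains implicitly.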

D. Random Forest
Random forest is an algorithm that trains a number of decision trees with randomization. It draws a number of bootstrap samples and generates a decision tree for each sample. When new data is input, the output value is the class predicted most frequently among the generated decision trees (majority vote) [20]. The concept of random forest is presented in Fig. 2.
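The bootstrap-and-vote idea can be sketched in a few lines. This is a toy illustration under stated assumptions: each "tree" is a one-split stump on a randomly chosen feature rather than a full decision tree, and the data set, `fit_stump`, and `fit_forest` are hypothetical names introduced only for this example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, rng):
    """Toy one-split 'tree': random feature, threshold with the fewest errors."""
    j = rng.randrange(len(rows[0][0]))
    best = None
    for x, _ in rows:
        t = x[j]
        for left in (0, 1):
            errors = sum((left if xi[j] <= t else 1 - left) != yi
                         for xi, yi in rows)
            if best is None or errors < best[0]:
                best = (errors, t, left)
    _, t, left = best
    return lambda x: left if x[j] <= t else 1 - left


def fit_forest(rows, n_trees=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng), rng) for _ in range(n_trees)]


def predict(forest, x):
    """Majority vote over the individual tree predictions."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Assumed toy data: (features, label); class 1 has larger values on both features.
data = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0), ((0.2, 0.4), 0),
        ((0.8, 0.9), 1), ((0.9, 0.7), 1), ((0.7, 0.8), 1), ((0.6, 0.9), 1)]
forest = fit_forest(data)
print(predict(forest, (0.05, 0.05)), predict(forest, (0.95, 0.95)))
```

A full random forest would grow deeper trees and also sample candidate features at every split; the sketch keeps only the two ingredients described above.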

E. Rotation Forest
Rotation forest is a tree-based ensemble algorithm, related to random forest, that performs learning while rotating the data axes by applying principal component analysis (PCA) to the training data. It generates classifier ensembles based on feature extraction after randomly excluding features from the previous feature set used for learning. PCA is performed on randomly divided feature subsets, and training is conducted on the data in the rotated dimensions [22]. Through this process, robust characteristics can be obtained for input data showing a complex distribution [23]. The procedure of the rotation forest is presented in Fig. 3.
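The rotation step can be illustrated with a simplified sketch. It assumes feature subsets of exactly two features so that PCA reduces to a closed-form rotation angle, and it omits the bootstrap sampling and the tree training that a full rotation forest performs for each rotation; `pca_angle`, `fit_rotation`, and `rotate` are hypothetical helper names.

```python
import math
import random


def pca_angle(xs, ys):
    """Angle of the first principal axis of a 2-D point cloud."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def fit_rotation(rows, rng):
    """Randomly pair up features and compute one PCA rotation per pair."""
    idx = list(range(len(rows[0])))
    rng.shuffle(idx)
    pairs = [(idx[k], idx[k + 1]) for k in range(0, len(idx) - 1, 2)]
    return [(i, j, pca_angle([r[i] for r in rows], [r[j] for r in rows]))
            for i, j in pairs]


def rotate(row, rotation):
    """Apply the block-wise rotation; an unpaired feature is left unchanged."""
    out = list(row)
    for i, j, theta in rotation:
        c, s = math.cos(theta), math.sin(theta)
        out[i] = c * row[i] + s * row[j]
        out[j] = -s * row[i] + c * row[j]
    return out


# Assumed toy data: 50 rows with 4 numeric features.
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(4)] for _ in range(50)]
rotation = fit_rotation(rows, random.Random(0))
rotated = [rotate(r, rotation) for r in rows]
```

Each base tree in the ensemble would be trained on data rotated with a different random feature partition, which is what gives the ensemble its diversity.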

F. ANN (Artificial Neural Network)
ANN is an algorithm modeled on the neural network structure of the human brain. It is composed of an input layer that receives the target data, one or more hidden layers that act as intermediate steps, and an output layer that produces the result. Every layer consists of a number of nodes, and only information that exceeds the threshold is passed to the next layer through the activation function. In this way, the network derives only the necessary information and predicts the result in the output layer. The concept of ANN is presented in Fig. 4.
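The layer-by-layer flow can be sketched as a single forward pass. The network size, the weights in `params`, and the use of a sigmoid (a smooth stand-in for a hard threshold) are assumptions made for illustration; the architecture used in this study is not specified here.

```python
import math


def sigmoid(z):
    """Smooth threshold: small inputs map near 0, large inputs near 1."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per node, then activation."""
    return [activation(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]


def forward(x, params):
    """Input layer -> one hidden layer -> single output node."""
    hidden = layer(x, params["W1"], params["b1"], sigmoid)
    output = layer(hidden, params["W2"], params["b2"], sigmoid)
    return output[0]


# Hand-picked (untrained) weights for a 2-2-1 network, purely illustrative.
params = {
    "W1": [[0.5, -0.4], [0.3, 0.8]],
    "b1": [0.0, -0.1],
    "W2": [[1.2, -0.7]],
    "b2": [0.05],
}
probability = forward([1.0, 0.0], params)
predicted_class = 1 if probability >= 0.5 else 0
```

Training would adjust `W1`, `b1`, `W2`, and `b2` by backpropagation; the sketch only shows how information moves from the input layer to the output layer.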

G. Stacking Ensemble (Meta Model)
This study predicted social discrimination experiences by using stacking ensemble techniques. Stacking ensembles are superior to single machine learning models in generalization and robustness and have been used for classification and prediction in various fields [26,27,28,29]. Stacking creates a new model by combining different models as if stacking them on top of each other [30]. It improves the performance of the final model by exploiting the strengths of each model and compensating for its weaknesses across two stages (base and meta) [30].
This study used random forest (RF), rotation forest, ANN, and SVM as base models, and a logistic regression algorithm was applied as the meta model. The regression algorithm is the simplest method to increase the reliability of the base model while maximizing the generality and stability of the model. Feng et al. (2020) [31] reported that it is less likely to overfit the training data. For this reason, the regression algorithm has been used as the meta model of stacking ensembles in many recent publications [31,32], and this study also used it as the meta model. The structure of the final stacking ensemble model is presented in Fig. 5.
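The two-stage structure can be sketched as follows. This is a minimal illustration under stated assumptions: the base learners are hypothetical one-feature scoring rules standing in for RF, rotation forest, ANN, and SVM; folds are assigned by index modulo k rather than at random; and the toy data are synthetic. Only the overall flow (out-of-fold base predictions feeding a logistic regression meta model) reflects the description above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_centroid_fitter(j, scale=4.0):
    """Hypothetical toy base learner that scores feature j only."""
    def fit(X, y):
        ones = [x[j] for x, t in zip(X, y) if t == 1]
        zeros = [x[j] for x, t in zip(X, y) if t == 0]
        m1, m0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
        mid, sign = (m1 + m0) / 2.0, (1.0 if m1 >= m0 else -1.0)
        return lambda x: sigmoid(sign * scale * (x[j] - mid))
    return fit


def fit_logistic(Z, y, lr=1.0, epochs=2000):
    """Meta model: logistic regression fitted by batch gradient descent."""
    w, bias, n = [0.0] * len(Z[0]), 0.0, len(y)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(bias + sum(wi * zi for wi, zi in zip(w, z))) - t
            gw = [g + err * zi for g, zi in zip(gw, z)]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        bias -= lr * gb / n
    return lambda z: sigmoid(bias + sum(wi * zi for wi, zi in zip(w, z)))


def fit_stack(X, y, base_fitters, k=5):
    """Stage 1: out-of-fold base predictions. Stage 2: meta model on them."""
    n = len(X)
    Z = [[0.0] * len(base_fitters) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held = [i for i in range(n) if i % k == fold]
        for m, fit in enumerate(base_fitters):
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in held:
                Z[i][m] = model(X[i])
    meta = fit_logistic(Z, y)
    bases = [fit(X, y) for fit in base_fitters]  # refit on all data
    return lambda x: meta([base(x) for base in bases])


# Synthetic toy data: class 1 has larger values on both features.
X = [(i % 2 + 0.3 * math.sin(i), i % 2 + 0.3 * math.cos(i)) for i in range(40)]
y = [i % 2 for i in range(40)]
stack = fit_stack(X, y, [make_centroid_fitter(0), make_centroid_fitter(1)])
print(stack((0.0, 0.0)), stack((1.0, 1.0)))
```

Training the meta model on out-of-fold predictions, rather than on predictions for data the base models have already seen, is what keeps the second stage from simply memorizing base-model overfitting.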

H. Validation of the Models
Each machine learning model was built through 5-fold cross-validation. This method randomly divides the entire sample into five equal-sized groups, uses one group as the validation dataset and the remaining groups as training datasets, and repeats this procedure five times so that each group serves once as the validation dataset. Root-mean-square-error (RMSE), index of agreement (IA), and variance of errors (Ev) were used to evaluate the prediction performance of the developed models. A lower RMSE indicates higher accuracy of a prediction model. A model is more stable when IA is closer to 1 and Ev is lower.
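The fold construction and the three measures can be sketched as below. The formulas are not given in the text, so the following are assumptions: IA is taken to be Willmott's index of agreement, and Ev is taken to be the population variance of the prediction errors. The function names are hypothetical.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Randomly split indices 0..n-1 into k near-equal folds.

    Returns a list of (train_indices, validation_indices) pairs.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


def rmse(obs, pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def index_of_agreement(obs, pred):
    """Willmott's index of agreement (assumed definition); 1 is perfect."""
    mean_o = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2
              for o, p in zip(obs, pred))
    return 1.0 - num / den


def error_variance(obs, pred):
    """Population variance of the errors p - o (assumed definition of Ev)."""
    errs = [p - o for o, p in zip(obs, pred)]
    mean_e = sum(errs) / len(errs)
    return sum((e - mean_e) ** 2 for e in errs) / len(errs)


observed = [0, 1, 1, 0]
predicted = [0.5, 0.5, 0.5, 0.5]  # an uninformative model
print(rmse(observed, predicted),
      index_of_agreement(observed, predicted),
      error_variance(observed, predicted))
```

In a full evaluation, each model would be fitted on the training indices of every split and scored on the matching validation indices, and the three measures would then be averaged over the five folds.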
III. RESULTS
Table I shows the general characteristics of adolescents from multicultural families in South Korea according to the presence of social discrimination experience. Among all subjects (19,431 adolescents), 15.6% (3,035 adolescents) experienced social discrimination. The chi-square test revealed that residence, gender, highest level of education, the experience of using a support center for multicultural families, Korean speaking level, Korean listening level, Korean reading level, Korean writing level, career counseling, learning Korean, and Korean society adaptation training differed significantly (p<0.05) between adolescents from multicultural families with social discrimination experience and those without.
The prediction performance (i.e., RMSE, IA, and Ev) of the eight machine learning models for predicting social discrimination experience is presented in Figs. 6, 7, and 8, respectively. The analysis results indicated that the rotation forest-logit regression model (RMSE = 0.15, IA = 0.72, and Ev = 0.41) had the best prediction performance.
The normalized importance of each variable in the rotation forest-logit regression model is presented in Fig. 9 (MFSC = multicultural family support center). The model confirmed that Korean society adaptation training, learning Korean, gender, the experience of using a multicultural family support center, and career counseling were major variables with high weight in the social discrimination experience of children from multicultural families in South Korea. In particular, Korean society adaptation training was the most important factor in the final model.

IV. CONCLUSION
This study compared the accuracy of models for predicting the social discrimination experience of children from multicultural families in South Korea by using eight machine learning algorithms, and confirmed that the rotation forest-logit regression model based on the stacking ensemble algorithm had the best prediction performance. In particular, the prediction models based on the stacking ensemble were more accurate (RMSE = 0.04-0.05) and more stable (IA = 0.02-0.03) than the other models. The results of this study support the possibility that the prediction performance of a meta model can be superior to that of a single prediction model not only for unstructured data such as videos and images but also for structured data such as social science data. However, Lee & Kim (2020) [33] reported that stacking ensemble algorithms had a longer execution time (runtime) than single machine learning algorithms, which is a limitation. Therefore, future studies using stacking ensembles need to evaluate prediction performance comprehensively by comparing execution time (runtime) as well as accuracy. Future studies also need to explore stacking ensemble models with the best performance by combining base models and meta models built from various machine learning algorithms such as clustering and boosting.