Model Development for Predicting the Occurrence of Benign Laryngeal Lesions using Support Vector Machine: Focusing on South Korean Adults Living in Local Communities

Disease results from interactions among many complex risk factors rather than from a single cause. Therefore, a disease prediction model should be built on multiple risk factors instead of a single risk factor. The objective of this study was to develop a model for predicting the occurrence of benign laryngeal lesions based on support vector machine (SVM) using ear, nose and throat (ENT) data from a national-level survey and to provide a basis for selecting high-risk groups and preventing voice disorders. This study targeted 16,938 adults (≥19 years) who participated in the ENT examination among the people who completed the Korea National Health and Nutrition Examination Survey from 2010 to 2012. This study compared the prediction power of the Gauss function, which was used for this study, with that of a linear algorithm, a polynomial algorithm, and a sigmoid algorithm. Moreover, the four kernels were each divided into C-SVM and Nu-SVM to compare the prediction accuracy of C-SVM with that of Nu-SVM. The ‘benign laryngeal lesion prediction model’ based on SVM could derive preventive factors and risk factors. The final prediction rate of this SVM, which used 479 support vectors, was 97.306%. The fitness results indicated that the difference between C-SVM and Nu-SVM was not large in the benign laryngeal lesion prediction model. In terms of kernel type, the prediction accuracy of the Gauss kernel was the highest and that of the sigmoid kernel was the lowest. The results of this study will provide an important basis for preventing and managing benign laryngeal lesions.
Keywords: Support vector machine; SVM; dysphonia; voice disorder; prediction model; risk factor; data mining


I. INTRODUCTION
Benign laryngeal lesions are distinct from laryngeal cancer; the term describes voice diseases caused by changes in the laryngeal structure, such as vocal fold nodules, vocal polyps, intracordal cysts, Reinke's edema, granuloma, sulcus vocalis, and laryngeal keratosis [1,2]. Because benign laryngeal lesions lead to structural changes in the larynx, including the vocal cords, they ultimately become a direct cause of voice disorder.
Roy et al. [3] reported that the prevalence of laryngeal diseases was 6.6% in the United States as of 2005 and that 10% of all Americans experienced a voice problem at least once in their lifetime. Voice is essential to daily life, and maintaining a healthy voice can greatly affect quality of life, particularly for those who use their voice professionally (e.g., teachers). For example, voice problems in teachers can cause unnecessary socioeconomic costs such as job loss or unemployment due to the loss of their ability to teach [4]. Therefore, it is very important to understand the risk factors accurately and to diagnose the condition early in order to provide appropriate rehabilitation based on the assessment.
It has been reported that voice abuse and physical stimulation of the vocal cords due to inappropriate speech habits are the most common causes of benign laryngeal lesions [5][6][7][8][9][10][11][12][13]. Moreover, various other factors such as smoking, drinking, viruses, upper airway infection, and laryngopharyngeal reflux have also been reported as risk factors [5][6][7][8][9][10][11][12][13]. The disease is a consequence of interactions among many complex risk factors rather than a single cause [7,14]. Therefore, it is necessary to develop a disease prediction model that uses multiple risk factors instead of a single risk factor [12]. Nevertheless, studies evaluating the risk factors for laryngeal disease have mostly focused on single risk factors [15].
Lifestyle heavily influences the occurrence and rehabilitation of benign laryngeal lesions [15]. Moreover, even if surgical treatment is successful, the lesion is likely to recur if the factor that negatively affects the voice is not removed [15]. Additionally, although lesions on the vocal fold may look similar, different treatments must be performed depending on the etiology. Therefore, it is important to identify the complex risk factors in order to fully understand the cause of the disease and to diagnose and treat it accurately.
Supervised machine learning methods such as the support vector machine (SVM) have been used to identify complex risk factors of diseases [16][17][18]. SVM is known to have better predictive power in classifying binary data, such as the presence of disease, than the decision tree method or the artificial neural network (ANN) method [19][20][21][22][23][24]. Nevertheless, SVM-based voice disorder prediction models, which mostly use acoustic analysis indices, have mainly been utilized to evaluate laryngeal diseases [25,26]. There are not enough studies examining prediction models that reflect health behaviors and sociodemographic characteristics.
The objective of this study was to develop a model for predicting the occurrence of benign laryngeal lesions based on SVM using ear, nose and throat (ENT) data from a national-level survey and to provide a basis for selecting high-risk groups and preventing voice disorders. This paper is organized as follows. Chapter II explains the study sample and the analyzed variables, and chapter III defines SVM and explains the procedure of model development. Lastly, chapter IV presents the discussion and directions for future research.

A. Data Source
This study targeted adults (≥19 years) who participated in the ENT examination among those who completed the Korea National Health and Nutrition Examination Survey (KNHANES) from 2010 to 2012. This study was approved by the Research Ethics Committee of Honam University (IRB Number: 1041223-201801-HR-40). The KNHANES draws samples from sampling plots using the proportional allocation systematic sampling method, which stratifies by the administrative districts and dwelling types of the national regional classes. Therefore, the samples were proportional to the population. This study selected 16,938 adults (7,703 males and 9,235 females) who were 19 years or older and completed the health questionnaire, the ENT examination questionnaire, and the laryngeal endoscopy.

B. Measurements
Benign laryngeal disease in this study was defined as vocal nodules, laryngeal polyps, intracordal cysts, Reinke's edema, laryngeal granuloma, glottic sulcus, and laryngeal keratosis [10]. The explanatory variables were age (19-39, 40-59, 60+), gender, occupation, educational level, income, smoking, high-risk drinking, and self-reported voice problems. Occupations were classified as economically inactive, non-manual, or manual. Levels of education were classified as elementary school graduate or lower, junior high school graduate, high school graduate, and college graduate or higher. Household income was classified into quartiles.
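The categorical variables above must be turned into numbers before they can be passed to an SVM. The paper does not state how this was done, so the following Python sketch is only an illustration under the assumption of dummy (indicator) coding with the first level as the reference category. The function names `age_group` and `dummy_code` are hypothetical, and Python is used here for illustration even though the study's analysis was run in R.

```python
# Assumed encoding, not taken from the paper: dummy coding with a reference level.
AGE_LEVELS = ["19-39", "40-59", "60+"]
OCCUPATION_LEVELS = ["economically inactive", "non-manual", "manual"]


def age_group(age):
    """Map an age in years to one of the three age bands used in the study."""
    if age < 19:
        raise ValueError("the study sample only includes adults aged 19 or older")
    if age <= 39:
        return "19-39"
    if age <= 59:
        return "40-59"
    return "60+"


def dummy_code(value, levels):
    """Return one 0/1 indicator per non-reference level (levels[0] is the reference)."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


# Example: a 45-year-old manual worker.
features = dummy_code(age_group(45), AGE_LEVELS) + dummy_code("manual", OCCUPATION_LEVELS)
print(features)
```

With a reference level, each indicator describes a contrast against that level, which matches the way the results describe some factors relative to a baseline category (for example, income quartiles relative to the first quartile).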

A. Development of Prediction Model using SVM
Differences between groups in the prevalence of benign laryngeal lesions were tested with the chi-square test. The prediction model for benign laryngeal lesions was developed using SVM. SVM is a machine learning algorithm that maps the training data to a higher dimension through nonlinear mapping and then finds the optimal decision boundary, a hyperplane that linearly separates the classes [27]. The concept of SVM is shown in Figure 1.
For example, A=[a, d] and B=[b, c] are not linearly separable in two-dimensional space. If they are mapped into three-dimensional space, they become linearly separable. Thus, data containing two classes can always be separated by a hyperplane when an appropriate nonlinear mapping to a sufficiently high dimension is used [28]. Consequently, SVM is very accurate because it can model complex nonlinear decision boundaries, and it has the advantage of tending to overfit less than other models [29].
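The mapping idea in this example can be sketched in a few lines of Python. Figure 1 is not reproduced here, so the coordinates below are hypothetical: an XOR-style layout is assumed in which a and d sit on one diagonal and b and c on the other. The mapping (x1, x2) to (x1, x2, x1·x2) is one possible choice for illustration, not necessarily the mapping shown in the figure.

```python
# Hypothetical coordinates (assumed XOR-style layout, not taken from Figure 1).
A = {"a": (-1.0, -1.0), "d": (1.0, 1.0)}   # class A
B = {"b": (-1.0, 1.0), "c": (1.0, -1.0)}   # class B


def lift(point):
    """Map a 2-D point (x1, x2) to the 3-D point (x1, x2, x1 * x2)."""
    x1, x2 = point
    return (x1, x2, x1 * x2)


def side_of_plane(point3d):
    """Return +1 or -1 depending on which side of the plane z = 0 the point lies."""
    return 1 if point3d[2] > 0 else -1


lifted_A = {name: lift(p) for name, p in A.items()}
lifted_B = {name: lift(p) for name, p in B.items()}

# After lifting, the plane z = 0 separates the two classes.
for name, p in {**lifted_A, **lifted_B}.items():
    print(name, p, side_of_plane(p))
```

In the original two dimensions no single straight line puts a and d on one side and b and c on the other, while in the lifted space the third coordinate alone distinguishes the classes.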
The objective of SVM is to find the decision boundary that maximizes the margin (Fig. 2). For convenience of calculation, maximizing the margin is restated as minimizing half of the reciprocal of the squared half margin:
$$\max \frac{2}{\lVert w \rVert_2} \;\rightarrow\; \min \frac{1}{2} \lVert w \rVert_2^2$$
The meaning of the above expression is as follows. Observations above the plus-plane have $y = 1$ and a value of $w^T x + b$ of at least 1. On the other hand, observations below the minus-plane have $y = -1$ and a value of $w^T x + b$ of at most $-1$. When these two conditions are combined, they can be converted to the following constraint:
$$y_i (w^T x_i + b) \ge 1$$
When new data are entered, the observation $x$ is substituted into $w^T x + b$. If the result is larger than 0, the observation is predicted as category 1. If it is smaller than 0, it is predicted as category $-1$.
The analysis was performed using R version 3.4.2. Figure 3 presents the R source code of the SVM algorithm. This study chose a radial basis function (Gauss function) that uses the parameter C (unit cost) in the SVM algorithm. This study compared the prediction power of the Gauss function, which was used for this study, with that of a linear algorithm, a polynomial algorithm, and a sigmoid algorithm. Moreover, the four kernels were each divided into C-SVM and Nu-SVM to compare the prediction accuracy of C-SVM with that of Nu-SVM.
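As a rough illustration of the four kernel types compared in this study, the following Python sketch writes each kernel in its common textbook form and uses it in a sign-based decision function. It is not the R code of Figure 3. The parameter names `gamma`, `coef0`, and `degree` follow a widespread convention and are assumptions here, as are the toy support vectors, coefficients, and bias.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def gaussian_kernel(u, v, gamma=1.0):
    """Radial basis function (Gauss) kernel: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def decision_value(x, support_vectors, coefs, b, kernel):
    """f(x) = sum_i coef_i * K(sv_i, x) + b, with coef_i = alpha_i * y_i."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coefs)) + b


def predict(x, support_vectors, coefs, b, kernel):
    """Predict category 1 if f(x) > 0, otherwise category -1."""
    return 1 if decision_value(x, support_vectors, coefs, b, kernel) > 0 else -1


# Toy model with made-up values (not estimated from the study data).
toy_svs = [(0.0, 0.0), (2.0, 2.0)]
toy_coefs = [-1.0, 1.0]
toy_b = 0.0

print(predict((1.9, 2.0), toy_svs, toy_coefs, toy_b, gaussian_kernel))
print(predict((0.1, 0.0), toy_svs, toy_coefs, toy_b, gaussian_kernel))
```

Swapping the `kernel` argument changes only how similarity between an observation and each support vector is measured; the decision rule itself stays the same. This is why the fit of a model can vary with the kernel type.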

A. The General Characteristics of Subjects
The general characteristics of the 16,938 subjects were analyzed using frequency analysis (Table 1). By age, 31.4% of the subjects were 19-39 years old, 36.9% were 40-59 years old, and 31.7% were 60 years or older. Women made up 54.5% of the subjects and men 45.5%. In terms of occupation, 44.1% were economically inactive, 31.9% were non-manual workers, and 23.9% were manual workers. By education level, 39.2% were elementary school graduates or below, 11.5% were middle school graduates, 26.2% were high school graduates, and 23.1% were college graduates or above. Nonsmokers accounted for 68.7% of the subjects, former smokers for 15.5%, and current smokers for 15.8%. High-risk drinking during the past year was reported by 32.4% of the subjects. The prevalence of subjective voice disorder was 7%, and the prevalence of benign laryngeal lesions was 1.8%.

B. The Potential Causes of Benign Laryngeal Lesions
The general characteristics of subjects by the prevalence of benign laryngeal lesions and the potential causes of benign laryngeal lesions are presented in Table 2. The chi-square test showed that subjects with and without benign laryngeal lesions did not differ significantly in any variable (i.e., age, gender, income level, occupation, education level, smoking, and high-risk drinking) except voice disorder. The prevalence of benign laryngeal lesions was significantly higher in subjects with subjective voice disorder (9.8%) than in those without subjective voice disorder (2.1%, p<0.05).
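For readers unfamiliar with the chi-square test used here, the following Python sketch computes the Pearson chi-square statistic and its p-value for a 2x2 table. It assumes no continuity correction, which the paper does not specify. The counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test for a 2x2 table, without continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > s) equals erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts for illustration only; these are NOT the study's data.
# Rows: voice problem yes / no. Columns: lesion present / absent.
example = [[20, 80], [30, 870]]
stat, p = chi_square_2x2(example)
print(round(stat, 2), p < 0.05)
```

A p-value below 0.05 would be read, as in the text above, as a significant difference in prevalence between the two groups.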

C. Major Predictors based on SVM
Table 3 shows the 'function weight' of the SVM based on the Gaussian kernel algorithm. The magnitudes of the SVM function weights cannot simply be compared across variables to rank their influence. However, the sign of a weight shows whether a major predictor variable has a positive or a negative relationship with the disease. The 'benign laryngeal lesion prediction model' based on SVM could therefore derive preventive factors and risk factors.
Risk factors were a subjectively perceived voice problem, age 40 to 59, age 60 or older, female gender, occupation (non-manual worker and manual worker), high school graduation as the highest level of education, and high-risk drinking. Preventive factors were the absence of a subjectively perceived voice problem; income in the second (medium-low), third (medium-high), or fourth (high) quartile relative to the first quartile (low); middle school or college graduation as the highest level of education; and being a former smoker or nonsmoker relative to being a current smoker. The final prediction rate of this SVM, which used 479 support vectors, was 97.306%.

E. The Accuracy of Benign Laryngeal Lesion Prediction According to the SVM Classification Algorithm
The accuracy of benign laryngeal lesion prediction according to the SVM classification algorithm is shown in Table 4. One of the disadvantages of SVM is that the fitness of the model varies by the type of kernel. Therefore, this study compared the Gaussian kernel (used for this study), linear, polynomial, and sigmoid algorithms in order to confirm the prediction accuracy of the models according to kernel type. Additionally, these four algorithms were each divided into C-SVM and Nu-SVM, and the prediction accuracies of the resulting algorithms were compared with each other. The fitness results indicated that the difference between C-SVM and Nu-SVM was not large in the benign laryngeal lesion prediction model. In terms of kernel type, the prediction accuracy of the Gauss kernel was the highest and that of the sigmoid kernel was the lowest.

V. DISCUSSION
This study developed a benign laryngeal lesion prediction model based on SVM by using highly reliable ENT examination data. The results of this study revealed that subjective voice problem perception, age 40 years and older, female gender, occupation, high school graduation, and high-risk drinking were risk factors of benign laryngeal lesions. The preventive factors of benign laryngeal lesions were not perceiving a subjective voice problem, income in the second quartile or above, middle school or college graduation (highest level of education), being a former smoker, and being a nonsmoker. Numerous previous studies that evaluated the risk factors of voice disorder reported that sociodemographic characteristics such as gender [13], age [6], and occupation [3,4,31,32], as well as drinking [2] and smoking [31,15], were risk factors of voice disorder. These results support the results of this study. One interesting finding of this study was that middle school graduation and college graduation were preventive factors of benign laryngeal lesions. It was speculated that middle school graduates had a low risk of benign laryngeal lesions because they were mostly manual workers engaging in simple labor. Byeon & Lee (2010) [33] analyzed the prevalence of benign laryngeal lesions, similar to this study, and reported that simple manual workers (e.g., maids, cleaners, and construction workers) had a low prevalence of voice disorder. Byeon & Lee (2010) [33] argued that these workers were unlikely to abuse their voice because they worked alone and did repetitive jobs. It will be necessary to conduct a longitudinal study that can demonstrate the causality among the highest level of education, occupation, and voice disorder.
Another finding of this study was that the Gaussian kernel with C-SVM had the highest prediction accuracy among the eight SVM classification algorithms obtained by dividing the Gaussian, linear, polynomial, and sigmoid kernels into C-SVM and Nu-SVM. The performance of a nonlinear SVM is largely determined by the kernel function and the parameters of that kernel [28]. Kuo et al. [34] also reported that the Gaussian kernel algorithm had high prediction accuracy. The Gaussian kernel maps the data into a feature space with infinite dimensions. It is believed that using the Gaussian kernel-based C-SVM will be effective in predicting binary variables.

VI. CONCLUSION
The results of this study will provide an important basis for preventing and managing benign laryngeal lesions. High-risk groups will need to be managed systematically in order to prevent benign laryngeal lesions.

TABLE I. GENERAL CHARACTERISTICS OF SUBJECTS, %

TABLE III. FUNCTION WEIGHT OF SVM BASED ON GAUSSIAN KERNEL

TABLE IV. THE ACCURACY OF BENIGN LARYNGEAL LESION PREDICTION ACCORDING TO THE SVM CLASSIFICATION ALGORITHM, %