A Feature Analysis of Risk Factors for Stroke in the Middle-Aged Adults Focused on Perception of Sudden Speech and Language Problem

In order to maintain health during middle age and achieve successful aging, it is important to identify and prevent the risk factors of middle-age stroke. This study investigated high-risk groups for stroke in the middle-aged population of Korea and provides basic material for establishing stroke prevention policy by analyzing sudden perception of speech/language problems and clusters of multiple risk factors. The study analyzed 2,751 persons (1,191 males and 1,560 females) aged 40–59 who participated in the 2009 Korea National Health and Nutrition Examination Survey. The outcome was defined as prevalence of stroke. The explanatory variables were age, gender, final education, income, marital status, at-risk drinking, smoking, occupation, subjective health status, moderate physical activity, hypertension, and sudden perception of speech and language problems. A prediction model was developed using the C4.5 algorithm, a data-mining approach. Sudden perception of speech and language problems, hypertension, and marital status were significantly associated with stroke in middle-aged Koreans. The highest-priority predictor was sudden perception of speech and language problems. In order to prevent middle-age stroke, high-risk groups need to be managed systematically and tailored programs need to be developed for them based on this prediction model.
Keywords: C4.5; stroke; decision tree; risk factor; speech problem


I. INTRODUCTION
Stroke is a generic term for both cerebral infarction, caused by the blockage of a blood vessel in the brain, and cerebral hemorrhage, caused by the rupture of a blood vessel in the brain. As of 2013, the death rate from cerebrovascular diseases was 50.2 persons per 100,000, the second highest after cancer [1]. This ranking has not changed over the last 10 years, and stroke is especially serious in that it is the second leading cause of death regardless of gender.
The incidence of stroke is high in old age. According to the 2013 Annual Report on the Cause of Death Statistics, the death rate from cerebrovascular disease was 10.1 persons per 100,000 for people in their 40s compared with 277.4 for those in their 70s, which is approximately 27 times higher [1]. In terms of the life cycle, however, the death rate from stroke rises sharply from the 40s onward, and over the last 20 years the rate of increase in stroke has been highest among people in their 40s and 50s [2]. In addition, it has been reported that health risk behaviors that cause stroke are most frequent in middle age [3]. Therefore, in order to maintain health during middle age and achieve successful aging, it is important to identify and prevent the risk factors of middle-age stroke.
In particular, even when surgery is performed successfully, stroke is highly likely to be accompanied by disabilities such as speech impairment during rehabilitation, and patients are also likely to lose their ability to work. Middle age is the period when people accomplish their life goals. Acute diseases such as stroke are not only a direct cause of job loss but also cause enormous economic loss [4]. As of 2011, the socio-economic loss from stroke in Korea (e.g., medical costs, transportation, nursing care, and loss of production) surpassed US$3.5 billion, and the social cost for middle-aged people aged 40 to 50 (45% of the total cost) was reported to be the greatest [5].
Although it is important to identify and systematically manage high-risk groups for middle-age stroke, the risk factors of middle-age stroke are less well known than those of old-age stroke, and studies on its risk groups are lacking. So far, chronic diseases such as diabetes, hyperlipidemia, and high blood pressure; lifestyle factors such as smoking, drinking, eating habits, and exercise; and social and economic status are known to be risk factors of middle-age stroke [6]. However, since preceding studies that investigated risk factors of stroke did not adjust for socio-economic factors such as occupation and level of income, it is difficult to identify the social factors of middle-age stroke [9] [10]. Moreover, as health risk behaviors tend to cluster together rather than exist separately from other factors [11], investigating individual risk factors is of limited use in identifying high-risk groups for cardiocerebrovascular diseases with various characteristics.
In particular, recent studies have reported that perception of sudden speech/language problems is a major warning sign of stroke. In a survey of Korean adults, 80% of stroke patients perceived speech/language problems as a warning sign of stroke, and 98% of stroke patients visited medical institutions because of speech/language problems as a warning sign, which indicates that perception of speech/language problems is a major warning sign of stroke [12]. If high-risk groups are identified and managed by considering the risk factors and warning signs of stroke, a significant portion of strokes can be prevented and the time required to respond to an emergency can be reduced.
Recently, data-mining analyses such as decision trees have been used as a method of exploring multiple risk factors of diseases [13]. Data mining can facilitate understanding of the attributes of diseases as well as their multiple risk factors.
Since the occurrence patterns and risk factors of stroke differ depending on ethnicity and culture, preventing stroke in Korea requires a stroke prediction model that reflects the demographic characteristics of the middle-aged population of Korea, together with systematic management of that population based on the model.
This study investigated high-risk groups for stroke in the middle-aged population of Korea and provides basic material for establishing stroke prevention policy by analyzing sudden perception of speech/language problems and clusters of multiple risk factors. The paper is organized as follows: Section II explains the data sources and the definition of variables, Section III explains the procedure for developing the prediction model, Section IV presents the results of the developed prediction model, and Section V presents conclusions and suggests directions for future studies.

A. Sources of data
Study subjects were adults aged 40–59 who participated in the 2009 Korea National Health and Nutrition Examination Survey (KNHANES), a nationwide representative survey of the non-institutionalized population in the Republic of Korea, and who then participated in a health survey [14].
The KNHANES is a nationwide cross-sectional survey conducted annually by the Korea Centers for Disease Control and Prevention. It employs a rolling sampling design that uses a complex, stratified multistage probability cluster survey of representative non-institutionalized civilians. The KNHANES sampling process is described in detail elsewhere [14]. Briefly, the survey was redesigned from a multi-year cycle to an annual survey in order to provide timely health statistics for monitoring changes in health risk factors and diseases and for developing associated public health policies and health programs. The 2009 KNHANES, conducted from January to December, was composed of three component surveys: a health interview, a health examination, and a nutrition survey. Trained medical staff and interviewers performed the health interview and health examination at a mobile examination center and at participants' households. The 2009 KNHANES was conducted on 12,722 persons from 4,000 households, with a participation rate of 82.8% (n=10,533). This study targeted 2,885 persons who completed both the health survey and the examination. Of these, 134 non-respondents were excluded from the research, and data from 2,751 persons (1,191 males and 1,560 females) were analyzed.
Drinking was classified as normal (less than 12 points) or high-risk (over 12 points) using the Alcohol Use Disorders Identification Test (AUDIT) [15]. Regular moderate physical activity was defined as practicing moderately breathless exercise for more than 30 minutes per session over 5 days a week. Occupations classified according to the Korean Standard Classification of Occupations (KSCO-06) [16] were reclassified into economically inactive (unemployed persons, homemakers), non-manual (managers and professionals, clerical support workers, service and sales workers), and manual (skilled agricultural, forestry, and fishery workers; craft workers and plant and machine operators and assemblers; and unskilled laborers) occupations.
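The recoding rules above can be sketched as small helper functions. This is only an illustrative sketch: the function names and occupation labels are hypothetical, and the handling of boundary values (an AUDIT score of exactly 12, exactly 30 minutes, exactly 5 days) is an assumption, since the text does not state which side of each threshold is inclusive.

```python
# Assumption: a score of exactly 12 counts as high-risk; the text only
# says "less than 12" (normal) and "over 12" (high-risk).
AUDIT_CUTOFF = 12

# Hypothetical short labels for the occupation groups named in the text.
OCCUPATION_GROUPS = {
    "unemployed": "economically inactive",
    "homemaker": "economically inactive",
    "manager_professional": "non-manual",
    "clerical": "non-manual",
    "service_sales": "non-manual",
    "agriculture_forestry_fishery": "manual",
    "craft_machine_operator": "manual",
    "unskilled": "manual",
}


def drinking_level(audit_score: int) -> str:
    """Classify drinking from an AUDIT total score."""
    return "high-risk" if audit_score >= AUDIT_CUTOFF else "normal"


def regular_moderate_activity(minutes_per_session: float, days_per_week: int) -> bool:
    """Assumption: both thresholds are inclusive (>= 30 min, >= 5 days)."""
    return minutes_per_session >= 30 and days_per_week >= 5


def occupation_group(label: str) -> str:
    """Map a (hypothetical) occupation label to one of the three groups."""
    return OCCUPATION_GROUPS[label]
```

For example, `drinking_level(15)` yields `"high-risk"` and `occupation_group("clerical")` yields `"non-manual"` under these assumptions.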

A. Exploration on factors related to the stroke
For general characteristics, means and percentages were presented, and differences between the groups with and without stroke were analyzed using the chi-square test.
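As a rough illustration of the chi-square test for a binary factor against stroke status, the following sketch computes the Pearson statistic for a 2x2 table using only the standard library. The function name is hypothetical and the counts in the usage note are made up for illustration; they are not the study's data. No continuity correction is applied, which is an assumption, since the text does not say whether one was used.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With the illustrative table `chi_square_2x2(10, 20, 20, 10)`, every expected count is 15 and the statistic is 100/15, about 6.67. When expected cell counts are small, as can happen with a rare outcome, an exact test may be more appropriate than this approximation.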

B. C4.5 algorithm
C4.5 is a decision tree algorithm developed by Quinlan [17], the purpose of which is to create a tree that can classify outcomes accurately even with a small number of tests. The algorithm constructs the simplest decision tree by using the concept of entropy from information theory [18] (Figure 1).
In general, entropy is a number that represents the degree of disorder. Because the source data are a mixture of proper and improper cases, their degree of disorder is very high. After the decision tree is learned, however, each terminal node is assigned a single class, so its degree of disorder becomes 0. Thus, the algorithm calculates the information gain of each factor while it classifies the data, driving entropy toward 0.
Then, once the attribute with the highest discriminating power is selected as the classification criterion, the algorithm creates as many branches as there are distinct values of that attribute. Cases are divided according to the value of each branch, and the same process is repeated in each branch. When the information can no longer be reduced, the division stops [19].
The C4.5 algorithm divides the tree as follows. First, the information of the root node is computed from the input variables, where the target variable is composed of p and n cases (1). Second, the gain is computed, that is, the decrease in the degree of disorder obtained when the root node is divided by attribute (variable) A (2). Third, among the various attributes, the node is divided by the attribute with the greatest gain. If a divided node is composed only of p cases or only of n cases, that node stops splitting.
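The three steps can be sketched in a few lines. This is a minimal sketch that assumes the standard entropy-based definitions from Quinlan's decision tree work (information of a node with p and n cases, and gain as the reduction in weighted information after a split); the function names are hypothetical. It follows the plain information gain described in the text, whereas published descriptions of C4.5 additionally normalize the gain by the split information (the gain ratio).

```python
import math


def information(p: int, n: int) -> float:
    """Entropy, in bits, of a node holding p cases of one class and n of the other."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # a class with no cases contributes 0
            fraction = count / total
            result -= fraction * math.log2(fraction)
    return result


def gain(p: int, n: int, branches: list[tuple[int, int]]) -> float:
    """Information gain from splitting a (p, n) node into the given
    branches, each described by its own (p_i, n_i) counts."""
    total = p + n
    remainder = sum((pi + ni) / total * information(pi, ni)
                    for pi, ni in branches)
    return information(p, n) - remainder


def best_attribute(p: int, n: int, splits: dict[str, list[tuple[int, int]]]) -> str:
    """Step three: pick the attribute whose split yields the greatest gain."""
    return max(splits, key=lambda name: gain(p, n, splits[name]))
```

A node with equal numbers of p and n cases has information 1 bit; a split that separates them perfectly has gain 1, and a split that leaves each branch as mixed as the parent has gain 0.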
When the rate of the outcome is low, as with the prevalence rate used in this study, problems may arise from the unbalanced data distribution [20]. To compensate for this unbalanced distribution, this study adjusted the data balance by setting asymmetric misclassification costs that take into account the prevalence rate of middle-age stroke in Korea [21]. The validity of the developed model was assessed with 10-fold cross-validation.
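One common way to apply asymmetric costs is to weight each rare-class case by the ratio of common to rare cases, so that both classes carry equal total weight when splits are evaluated. The sketch below illustrates that idea only; the cost values actually used in the study are not given in the text, and the function names are hypothetical.

```python
import math


def balancing_cost(n_cases: int, n_total: int) -> float:
    """Cost of misclassifying a rare case, relative to a cost of 1 for a
    common case, chosen so both classes have equal total weight."""
    return (n_total - n_cases) / n_cases


def weighted_information(p: int, n: int,
                         cost_p: float = 1.0, cost_n: float = 1.0) -> float:
    """Entropy in bits after weighting the p cases by cost_p and the n cases by cost_n."""
    weights = (p * cost_p, n * cost_n)
    total = sum(weights)
    result = 0.0
    for weight in weights:
        if weight:
            fraction = weight / total
            result -= fraction * math.log2(fraction)
    return result
```

With the sample sizes reported in this paper (33 stroke cases among 2,751 subjects), `balancing_cost(33, 2751)` is about 82.4. Unweighted, the root node has very low entropy because almost all cases are non-stroke; with the balancing cost applied, the weighted entropy of the root is 1 bit, so splits that isolate stroke cases are no longer swamped by the majority class.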

A. Characteristics of subjects and potential factors related to stroke
General characteristics of the subjects and factors related to stroke are presented in Table 1. Among the total of 2,751 subjects, 33 (1.2%) had had a stroke.

B. Prediction model for stroke using C4.5 algorithm
The prediction model for stroke built with the C4.5 algorithm is presented in Figure 2. When the statistical classification model was constructed with the C4.5 algorithm, using the variables identified as related to stroke by the chi-square test, the factors with a significant effect were sudden perception of speech and language problems, hypertension, and marital status. The highest-priority predictor was sudden perception of speech and language problems. Table 2 is a profit chart of the prediction model for stroke by the C4.5 algorithm, listing the paths in descending order of gain. When the profit indicator was derived for each node to find prediction paths for stroke, 3 nodes were confirmed as significant paths that effectively predict stroke.
The first path, with the largest profit indicator for predicting stroke, was "middle-aged persons from the age of 30 to 58 who currently perceive sudden language problems," and its profit indicator was 8336.4%.
The second path was "middle-aged persons from the age of 30 to 58 who currently do not have sudden language problems or high blood pressure and do not live with a spouse due to divorce or bereavement," and its profit indicator was 735.6%.
The third path was "middle-aged persons from the age of 30 to 58 who currently do not have sudden speech/language problems but have high blood pressure," and its profit indicator was 210.3%.
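The text does not define the profit indicator. A minimal sketch, assuming it is the usual gain index of a profit chart (the stroke rate within a node expressed as a percentage of the overall stroke rate), is shown below; the function name and the node counts in the usage note are hypothetical.

```python
def profit_index(node_cases: int, node_size: int,
                 total_cases: int, total_size: int) -> float:
    """Stroke rate in a node as a percentage of the overall stroke rate.
    Assumption: this matches the 'profit indicator' reported in the text."""
    node_rate = node_cases / node_size
    overall_rate = total_cases / total_size
    return node_rate / overall_rate * 100
```

Under this assumption, a node with the same stroke rate as the whole sample scores 100%. With the overall figures reported here (33 cases among 2,751 subjects), a hypothetical node in which every member has had a stroke, for instance `profit_index(5, 5, 33, 2751)`, rounds to 8336.4%. That value agrees with the figure reported for the first path, which is consistent with the assumed definition but does not confirm it, since the node counts are not given in this text.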
After the analysis of the prediction model by the C4.5 algorithm was completed, a 10-fold cross-validation test was conducted to assess the developed prediction model. In the 10-fold cross-validation test of the stability of the derived model, the cross-validated model had a risk index of 0.360 and a misclassification rate of 36%, close to the risk index of 0.352 and misclassification rate of 35% of the prediction model.
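The cross-validation procedure can be sketched as follows, assuming the risk index is the misclassification proportion, as the paired figures (0.360 and 36%) suggest. The helper names are hypothetical, and the majority-class learner is only a stand-in so the sketch runs on its own; it is not the decision tree used in the study.

```python
import random
from collections import Counter


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> list[list[int]]:
    """Shuffle the sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_error(X, y, fit, predict, k: int = 10, seed: int = 0) -> float:
    """Misclassification rate over all held-out cases in k-fold cross-validation."""
    errors = 0
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in held_out)
    return errors / len(y)


# Stand-in learner: always predicts the majority class of the training fold.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model
```

Any learner that exposes the same `fit` and `predict` shape can be passed in. Comparing the cross-validated error with the error measured on the full training data, as the paragraph above does, gives a rough check on how stable the model is.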

V. CONCLUSION
Early detection and management of high-risk groups for stroke enable healthy and happy aging. This study developed a prediction model for middle-age stroke using the C4.5 algorithm. In the stroke prediction model constructed by considering multiple risk factors, perception of sudden speech/language problems, high blood pressure, and marital status were significant predictors of middle-age stroke, and among them, perception of sudden speech/language problems was the highest-priority predictor. Numerous preceding studies have reported that perception of sudden speech/language problems is a major risk factor of stroke and that those who perceived speech problems had a higher rate of stroke [22] [23]. However, those studies were limited to exploring individual risk factors, whereas this study, by exploring multiple risk factors, confirmed that combinations of individual risk factors produce a synergy effect.
Another finding of this study was that marital status is a major predictor of stroke. This study found that middle-aged people from the age of 30 to 58 who do not live with a spouse due to divorce, bereavement, or separation are a high-risk group for stroke. It is supposed that middle-aged people who do not live with a spouse have a high risk of stroke because middle-aged people living alone not only engage more frequently in health risk behaviors such as smoking but are also more vulnerable in health management. According to studies on the relationship between marital status and health, married men who lived away from their families had a higher risk of accidents, alcohol and substance addiction, depression, death, and cardiocerebrovascular diseases than men with a stable married life, and had a 2.3 times higher suicide rate, a 4.7 times higher death rate from alcohol and alcohol addiction, and a 1.7 times higher death rate from cardiocerebrovascular diseases [24].
In particular, it has been reported that unstable marital states such as divorce, separation, and bereavement have a negative effect on the cardiocerebrovascular system by causing depression, which in turn increases the risk of death. Hence, in order to prevent middle-age stroke, it is necessary to develop health management programs for middle-aged people without a spouse. Furthermore, it is also necessary to establish guidelines for the prevention of middle-age stroke so that people will immediately visit medical institutions when they perceive sudden speech/language problems, even if they do not have stroke-related diseases such as high blood pressure and diabetes.
The results of this study are expected to provide an important basis for strategies to prevent and manage stroke. In order to prevent middle-age stroke, high-risk groups need to be managed systematically and tailored programs need to be developed for them based on this prediction model.

Fig. 2. Prediction model for stroke among Korean middle-aged people

TABLE I. GENERAL CHARACTERISTICS OF THE SUBJECTS BASED ON STROKE (UNIVARIATE ANALYSIS), N (%)