A Minimum Redundancy Maximum Relevance-Based Approach for Multivariate Causality Analysis

Causal analysis, a form of root cause analysis, explores causes rather than indications, so the methodology can be used to identify the direct influences of variables. This study focuses on observational data-based causal analysis for factor selection, in place of a correlation approach that does not imply causation. The study analyzes the causal relationship between a set of categorical response variables (binary and with more than two categories) and a set of explanatory dummy variables by using multivariate joint factor analysis. The paper uses the Minimum Redundancy Maximum Relevance (MRMR) algorithm to identify causation, using data obtained from the National Automotive Sampling System's Crashworthiness Data System (NASS-CDS) database.

Keywords: Causal analysis; dummy variable; Minimum Redundancy Maximum Relevance (MRMR); multivariate analysis


I. INTRODUCTION
Causality, or causal influence, governs the relationship between two events. For instance, the first event is determined to be the cause, and a second event (the effect) is a consequence of the first one. In this sense, all causalities are correlations, while not all correlations are necessarily causalities. Silverstein et al. (2000) suggested that isolated causal influences that only involve pairs or small sets of items are easier to interpret [14]. Causal analysis is applied to identify the direct influence of root cause factors. In contrast to correlation, which does not imply causation, causal analysis requires additional counterfactual dependence. Causality learned from an observational dataset is particularly suitable for predicting the consequence of a given action, facilitating counterfactual inference, and explaining the underlying mechanisms of the data [16].
The purpose of this paper is to explore the correspondence of causal inference on data analysis. As stated by Rubin (2004), "Causal inference is an area of rapid and exciting development and redevelopment in statistics. Fortunately, the days of 'statistics can only tell us about association, and association is not causation' seems to be permanently over." In the course of causal inference, multicollinearity is one of the major problems in multivariate data analysis [2]. The efficiency of multivariate analysis depends strongly on the correlation structure among explanatory variables. When the covariates in the model are not independent of one another, collinearity/multicollinearity problems arise in the analysis, which leads to biased estimation [5].
When two or more explanatory variables are highly correlated, it is difficult to obtain a reliable estimate of the mutual information coefficient for each explanatory variable while controlling for the others. This is a serious problem because the goal is accurate coefficient estimates. A local causal influence and/or causal structure discovery algorithm should provide further insight into the application of the observational data-based causal discovery approach to factor selection. Factor selection is one approach to reducing multicollinearity problems [6]; it requires selecting the subset of factors most significant to a targeted concept by removing redundant and irrelevant factors. Multicollinearity causes redundant information, and these redundant and irrelevant factors can be ignored because they give very little or no unique information for causal data analysis and modeling.
The primary motivation for reducing redundant and irrelevant data and keeping the number of factors as low as possible is to decrease the multicollinearity problem within causal factor analysis and prediction. The objective of this paper is to analyze the effect of multicollinearity on explanatory dummy variables in multivariate causality analysis. Ding and Peng employed a minimum redundancy maximum relevance (MRMR) approach to find the optimal subset of multiple factors. Individual factor selection is weak for the estimation of injury severity and may be dangerously inaccurate for complex decision problems [8]. Therefore, joint factor analysis for a multivariate approach obtains a comprehensive and objective result based on previous reviews. In this paper, first, we introduce the general concept of minimum redundancy maximum relevance (MRMR) and present some analytical and computational developments. Second, we show how this approach should be adapted to causal analysis and compare the analysis results with two other methods, maximum relevance (MaxRel) and minimum redundancy (MinRed). We illustrate this in the domain of accidentology by selecting the most relevant and informative factors that explain injury severity in a large dataset.
The rest of this paper is organized as follows. Section 2 summarizes mutual information (MI) and MRMR. Section 3 briefly reviews the creation of dummy variables and then explains how to select the group of causal factors by MRMR. Sections 4 and 5 describe the database, present the test results of causality measurement, and give discussion and concluding remarks.

A. Mutual Information
Mutual information (MI), introduced by [13], is a measure of statistical dependency that captures both linear and nonlinear dependence between variables, so it can determine complex relationships between them. Mutual information between two random variables is a measure of the information one random variable provides about the other. It takes a minimum value of zero when no dependence exists between the two variables and a positive value that grows with the strength of the dependence between them. Mutual information between two random variables $X$ and $Y$ can be quantified as shown in the following equation (Thomas M. Cover, 1991):

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{1}$$

where $x$ and $y$ represent realizations of $X$ and $Y$, $I(X;Y)$ is the mutual information between the two random variables $X$ and $Y$, $p(x,y)$ is their joint probability mass function, and $p(x)$ and $p(y)$ are the marginal probability mass functions of $X$ and $Y$. Mutual information between a set of input variables $S_m = \{X_1, \dots, X_m\}$ and an output variable $Y$ can be estimated by (2), where $S_m$ is the set of $m$ input variables:

$$I(S_m;Y) = \sum_{x_1}\cdots\sum_{x_m}\sum_{y} p(x_1,\dots,x_m,y)\,\log\frac{p(x_1,\dots,x_m,y)}{p(x_1,\dots,x_m)\,p(y)} \tag{2}$$

Francios D explained that the applicability of mutual information is reduced beyond the two-variable case, even though it has robust measurement ability for a set of random variables [3]. The challenge lies in the need to reliably estimate joint probabilities whose dimension equals the number of variables at stake. It is often hard to get an accurate estimate of a multivariate density because multivariate density estimation often involves computing the inverse of a high dimensional covariance matrix.
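As a minimal sketch of the pairwise definition above, the following Python function estimates mutual information for two discrete variables by replacing the probability mass functions with empirical frequencies. The function name `mutual_information` and the choice of a base-2 logarithm (so the result is in bits) are assumptions made for illustration; the paper does not state which estimator or logarithm base was used.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two equal-length
    sequences of discrete (hashable) values."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    total = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        total += (c / n) * log2(c * n / (px[x] * py[y]))
    return total


# Toy data (hypothetical): a variable fully determines itself,
# while the second pair is independent by construction.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

With plug-in frequencies like these, the number of joint cells grows quickly as more variables are combined, which is one way to see the estimation difficulty described above.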

B. Minimum Redundancy Maximum Relevance (MRMR)
The minimum redundancy maximum relevance approach [9] is based on the observation that combining individually good variables does not necessarily lead to good classification/prediction performance. Its authors therefore reduce the redundancies among the selected variables to a minimum when creating a subset of variables. Mohamad I et al. (2009) introduced two variants of MRMR as an input variable selection algorithm that approximates mutual information to pinpoint the set of inputs that contains the greatest amount of information about the uncertainty of a system [7]. In the literature, there are several new classification/prediction strategies that combine MRMR with other algorithms [1], [15], [17]. To maximize the joint dependency of top-ranking variables on the target variable, the redundancy among them needs to be minimized, which requires incrementally selecting the maximally relevant variables while avoiding the redundant ones.

In terms of mutual information, the purpose of causation factor selection is to find a factor set $S$ with $m$ factors $\{x_i\}$ that has the highest mutual information value with the target class $y$. Max-relevance searches for factors satisfying the following condition, which approximates the joint dependency $D(S, y)$ by the mean of the mutual information values between the individual factors $x_i$ and the class $y$:

$$\max D(S, y), \qquad D = \frac{1}{|S|}\sum_{x_i \in S} I(x_i; y)$$

It is likely that causation factors selected according to Max-relevance have rich redundancy, i.e., the dependency among these factors could be large. When two factors depend highly on each other, the respective class-discriminative power would not change much if one of them were removed. Therefore, the following minimal redundancy (Min-redundancy) condition can be added to select mutually exclusive factors:

$$\min R(S), \qquad R = \frac{1}{|S|^2}\sum_{x_i, x_j \in S} I(x_i; x_j)$$

The criterion combining the above two constraints is called "minimal-redundancy-maximal-relevance" (MRMR). We define the operator $\Phi(D, R)$ to combine $D$ and $R$ and consider the following simplest form to optimize $D$ and $R$ simultaneously:

$$\max \Phi(D, R), \qquad \Phi = D - R$$

They described that it works efficiently even for a relatively large set of inputs and contributed an analytical proof that a first order MRMR model collapses to a maximum dependency problem.
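The relevance, redundancy and combined terms described in this subsection can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `mi` and `mrmr_terms`, the plug-in frequency estimator, the base-2 logarithm, and the toy data are all assumptions, and the redundancy average is taken over all ordered pairs of selected factors (including each factor with itself).

```python
from collections import Counter
from math import log2


def mi(xs, ys):
    """Empirical mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mrmr_terms(columns, y):
    """Return (D, R, D - R) for a non-empty list of factor columns."""
    m = len(columns)
    if m == 0:
        raise ValueError("need at least one factor")
    # D: mean mutual information between each factor and the target
    relevance = sum(mi(col, y) for col in columns) / m
    # R: mean mutual information over all ordered pairs of factors
    redundancy = sum(mi(u, v) for u in columns for v in columns) / (m * m)
    return relevance, redundancy, relevance - redundancy


# Toy data (hypothetical): y is determined jointly by a and c; b duplicates a.
a = [0, 0, 1, 1]
b = [0, 0, 1, 1]
c = [0, 1, 0, 1]
y = [0, 1, 2, 3]

print(mrmr_terms([a, b], y))  # redundant pair
print(mrmr_terms([a, c], y))  # complementary pair
```

On this toy data both pairs have the same relevance, but the pair containing the duplicated factor has higher redundancy and therefore a lower combined value, which is the behaviour the combined criterion is meant to capture.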

III. METHODOLOGY
The objective of this study is to reduce the multicollinearity problem in causal factor analysis. This paper discusses a target list of potential causation factors for injury severity by using MRMR. Within the group factor selection analysis, we compare the causation strengths, measured by mutual information, between potential causation factors and injury severity by considering the minimum redundancy maximum relevance value. Fig. 1 describes the process of causal factor analysis and causality measurement of injury severity. The experiment contains two stages. In the first stage, this paper considers fundamental types of input variables, including nominal, ordinal and interval variables, from the injury severity database. Input explanatory categorical data are converted to dummy variables, and the causal relation between the dummy explanatory variables and the categorical response variables is analyzed by using minimum redundancy maximum relevance. These results are compared with two other methods: minimum redundancy (MinRed) and maximum relevance (MaxRel). The following section explains causal inference with a brief introduction to the dummy variable approach.

A. Dummy Variable Approach
Causal analysis assumes that the explanatory variables are numerical variables. Categorical variables (such as gender, light condition, etc.) can be used as predictors if they are first converted to "dummy" variables. A dummy variable is a numerical variable that usually represents a binary categorical variable [11]. For a categorical variable with n levels, n-1 dummy variables are required to represent it. Dummy variables are useful because they enable the use of a single regression equation for categorical variables. Dummy variables act as "switches" that turn various parameters on and off in an equation.
The dummy variable approach transforms each of the original explanatory variables into a set of binary variables, which are then used to analyze causal relationships between the injury severity level and the explanatory variables (factors). For example, light condition status, if originally labeled 1:daylight, 2:dark, 3:dark/lighted, 4:dawn and 5:dusk, could be redefined in terms of four variables as follows: var1; 1:daylight, 0:otherwise, var2; 1:dark, 0:otherwise, var3; 1:dark/lighted, 0:otherwise, and var4; 1:dawn, 0:otherwise. Dusk then serves as the reference category, coded as 0 on all four variables. These transformed variables can be used with any causality measure. In this paper, the dummy-transformed variables are used with Minimum Redundancy Maximum Relevance (MRMR), Maximum Relevance (MaxRel) and Minimum Redundancy (MinRed).
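The light condition example can be written out as a short Python sketch. The function name `dummy_encode`, its signature, and the choice to pass the reference level explicitly are assumptions for illustration, not the procedure used in the study.

```python
def dummy_encode(values, levels, reference):
    """Encode categorical values as n-1 binary indicators.

    Returns the list of non-reference levels (one per dummy variable)
    and one row of 0/1 indicators per input value.
    """
    if reference not in levels:
        raise ValueError("reference must be one of the levels")
    kept = [level for level in levels if level != reference]
    rows = []
    for value in values:
        if value not in levels:
            raise ValueError(f"unknown level: {value!r}")
        # A row of all zeros means the value is the reference level.
        rows.append([1 if value == level else 0 for level in kept])
    return kept, rows


LIGHT_LEVELS = ["daylight", "dark", "dark/lighted", "dawn", "dusk"]

names, encoded = dummy_encode(["dark", "dusk", "daylight"], LIGHT_LEVELS, "dusk")
print(names)    # the four levels that get their own dummy variable
print(encoded)  # one row of four indicators per observation
```

Dropping one level avoids a perfectly collinear set of indicators, since the omitted level is implied whenever all the remaining indicators are zero.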

B. Selection of Factor Group
Let us consider a specific injury severity indicator $Y$ and $p$ potential causation factors $X_i$ $(i = 1, \dots, p)$. Mutual information can be used to evaluate and statistically compare the strength of the causal relationship between $Y$ and the different factors $X_i$ by using MRMR, MinRed and MaxRel. Mutual information is first computed independently between $Y$ and all $X_i$. Each MI value lies between 0 and 1 and evaluates the information on the value of $Y$ that is provided by $X_i$ [10]. To estimate the high dimensional MI $I(S_m; Y)$, we need to estimate the joint probability $p(x_1, \dots, x_m, y)$. To compare the influence level of a given factor on a severity indicator $Y$, the MI values are ordered by minimum redundancy maximum relevance value, minimum redundancy and maximum relevance.
In the MRMR approach, the selected factors $X_i$ are required, individually, to have the largest mutual information $I(X_i;Y)$ with the target class $Y$ and minimum redundancy with the other selected factors. The factor with the highest predictive power is selected by ranking with MRMR-based mutual information. In a second step, the joint factor analysis of the multivariate approach, the previously selected best single factors are kept and used to calculate the mutual information based on conditional entropy contribution, to rigorously quantify the influence of causation factors on injury severity. These three approaches can also be computed for multivariate factor combinations. Let $G_k = (X_{i_1}, \dots, X_{i_k})$ be a multivariate variable regrouping $k$ factors. The selection of a group $G_k$ of $k$ factors, among $p$, that has the highest joint predictive power for $Y$ can be done using the above single factors, and hence selects the group of $k$ factors according to the minimum redundancy maximum relevance, minimum redundancy and maximum relevance values. Among all groups of $k$ factors, the selected group shows the highest predictive power and best explains the $Y$ values. Finding the best group of $k$ factors among $p$ factors by exhaustive search is generally not computationally feasible, because $\binom{p}{k}$ candidate groups would have to be evaluated.
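Because exhaustive search over all groups is impractical, group selection is usually done incrementally. The sketch below shows one possible greedy procedure for the three criteria compared in this paper. It is an assumption-laden illustration rather than the authors' implementation: the names `mi` and `greedy_group`, the plug-in MI estimator, the rule that every criterion starts from the single most relevant factor, and the definition of MinRed as "lowest mean MI with the already selected factors" are all hypothetical choices.

```python
from collections import Counter
from math import log2


def mi(xs, ys):
    """Empirical mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def greedy_group(factors, y, k, criterion="mrmr"):
    """Greedily pick up to k factor names from a dict of name -> column."""
    if criterion not in ("mrmr", "maxrel", "minred"):
        raise ValueError("criterion must be 'mrmr', 'maxrel' or 'minred'")
    relevance = {name: mi(col, y) for name, col in factors.items()}
    selected, candidates = [], list(factors)
    while candidates and len(selected) < k:
        best_name, best_score = None, None
        for name in candidates:
            if not selected or criterion == "maxrel":
                score = relevance[name]          # relevance only
            else:
                # mean MI with the factors chosen so far
                red = sum(mi(factors[name], factors[s])
                          for s in selected) / len(selected)
                score = -red if criterion == "minred" else relevance[name] - red
            if best_score is None or score > best_score:  # ties keep the first
                best_name, best_score = name, score
        selected.append(best_name)
        candidates.remove(best_name)
    return selected


# Toy data (hypothetical): b duplicates a; y depends on both a and c.
toy = {"a": [0, 0, 1, 1], "b": [0, 0, 1, 1], "c": [0, 1, 0, 1]}
target = [0, 1, 2, 3]
for crit in ("mrmr", "maxrel", "minred"):
    print(crit, greedy_group(toy, target, 2, crit))
```

On this toy data the relevance-only criterion picks the duplicated factor as its second choice, while the criteria that penalize redundancy pick the complementary factor instead.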

A. Input Factor
The National Automotive Sampling System (NASS) Crashworthiness Data System (CDS) (NHTSA, 2014) is a nationwide crash data collection program sponsored by the U.S. Department of Transportation [12]. The National Highway Traffic Safety Administration collects information on a sample of all motor vehicle crashes reported to police in the United States. The data within the NASS-CDS crash must be consistent with one of three conditions: 1) be reported by police; 2) involve a harmful event (property damage and/or personal injury) resulting from a crash; and 3) involve at least one towed passenger car or light truck or van in transport on a traffic way. There are three outcome descriptors, as follows:
• Maximum Accident Injuries Severity (MAIS).
• Accident Injury Severity to the body region neck (HWS).
• Accident Injury Severity to the lower extremities (AISBEIN).
This paper analyzes the causal relation between the different types of factors and the MAIS level, using information from the years 2011 and 2012, approximately 6,000 traffic accidents stored in the NASS-CDS database. Because our study addresses causal analysis, only driver presence data are selected from the two-year dataset of injury factors; samples with missing values are deleted, and all continuous factors are discretized. When there is multicollinearity among the explanatory variables, the estimation of model parameters may lead to invalid statistical inference. To reduce the multicollinearity problem, we convert categorical input variables to dummy variables. A dummy variable, or indicator variable, is an artificial variable created to represent an attribute with two or more distinct categories. The number of dummy variables necessary to represent a single attribute variable is equal to the number of categories in that variable minus one.

B. Injury Severity Indicator
In the NASS-CDS database, the maximum injury severity (MAIS) distribution is defined over 7 categories, with values in the set {0, 1, 2, 3, 4, 5, 6}, which correspond to different injury severities:
• 0 corresponds to non-injury, and 1 to 6 correspond to increasingly severe injuries.
• In our analysis, the MAIS outcome descriptor has been transformed into a variable with fewer classes (minor, moderate and major): 1 for light or null injuries (original label 0), 2 for middle injuries (original labels 1 and 2), 3 for more severe injuries (original label higher than 3).
In the database, a frequency of 42% is observed for "no injury" accidents and a frequency of 40% for "minor injury" accidents. This is shown in Fig. 2.

C. Impact Factor for Maximum Injury Severity
Multivariate analysis was conducted to determine which groups among a given number of factors have the highest mutual information value with MAIS and hence best explain maximum injury severity. The results are analyzed as follows: for a given outcome descriptor, the MI value computed for each factor is represented by a horizontal bar in Fig. 3, where results are calculated for the different MAIS levels. Mutual information values are between 0 and 1. DGROUP1 and BDEPLOY are associated with the highest MI values in single-factor analysis.
In a multiple-factor combination approach, the first variable to be chosen is, quite naturally, the one that by itself maximizes the mutual information with the MAIS descriptor over the p potential factors. Fig. 4 describes the multivariate analysis for the minor injury level, which indicates, for instance, that the group of seven factors (DGROUP1, SBUSE, SURTYPE2, DGROUP2, VTYPE, MYEAR and CWEIGHT) has a joint MRMR of nearly 30%. This group has the highest causality power among all seven-factor combinations. It is interesting to observe that the two-factor combination (DGROUP1, SBUSE) has a 2% relationship with the minor injury level, but the causality rate increased by 8% when the seat belt usage factor was added to this group. The two other methods (MinRed, MaxRel) also select the same group of factors, with the highest causality power of approximately 25%. In the previous single-factor analysis, BDEPLOY and SURCOND5 were in the 2nd and 3rd positions, respectively. In the multivariate analysis, these two factors do not appear in the seven-factor group and are replaced by the SBUSE and SURTYPE2 factors. In the minor injury analysis, the causality relations between explanatory variables and injury severity do not differ among the three approaches for the seven-factor groups, because minor injuries in NASS-CDS are related to all single factors in police-reported accidents.
In the multivariate analysis for moderate severity, a joint MRMR of nearly 30% emerges for the seven-factor group (DGROUP1, DRINK, SBUSE, LANE6, SURTYPE5, AGEGROUP5 and CWEIGHT), as shown in Fig. 5. The alcohol drinking factor has a causal relation with the moderate injury level, which also depends on the age group. Additionally, AGEGROUP3 (>56) has a counterfactual dependence on blood alcohol concentration. The AGEGROUP3 and SBUSE factors do not emerge in the other two approaches (MinRed and MaxRel), in which the seven-factor group has over 20% causality power to estimate the moderate injury level.
Driver alcohol presence and the number of lanes do not appear among the causation factors for the major injury level. The five-factor group (DGROUP1, MYEAR, SURTYPE1, SURTYPE2 and SBUSE) has a joint MRMR of over 80% in Fig. 6. This five-factor group has the highest joint causality power, which determines whether a particular explanatory variable really affects the response variable and can estimate the magnitude of that effect. Seatbelt usage is an important factor and is mostly relevant to major injury. This analysis examines the causal relationship between the age of the vehicle at the time of the crash, or the vehicle's model year, and the injury outcome. Seatbelt usage has a counterfactual dependence on the vehicle's model year, and the efficiency of the seatbelt type depends on that factor. The other two approaches estimate the causality power of injury levels less well, at approximately 20%. The seatbelt usage and surface type factors do not appear in the causal factor group of these two approaches.
Factor selection using multivariate MRMR yields groups of factors of minimal size, with minimal redundancy and maximal relevance, that best explain injury severity. The main advantage of this MRMR approach is that it handles multicollinear factors. Fig. 7 shows the smallest factor group with MRMR over 90% for estimating the causal relation with the original MAIS category levels. In this figure, we can see that the relevant surface conditions are ice/frost, dirt/mud/gravel and oil on an asphalt roadway surface type in slow-speed traffic accidents. Vehicle curb weight and whether the airbag deployed have a counterfactual dependence on travel speed.

V. DISCUSSION AND CONCLUSIONS
This paper focuses on minimizing the multicollinearity effect in multivariate causality analysis. The proposed methodology is highly relevant for causal factor organization, and this relevance is demonstrated by a better understanding and determination of traffic injury severity evaluation. Multivariate joint factor analysis is the approach used to investigate causal relations between risk factors and injury severity. Regardless of the type of dependent outcomes or data measured in a model for each subject, the analysis considers more than two risk factors in the analysis model. Other multivariate analysis methods, such as multiple linear regression, logistic regression and mixed effect models, are also commonly used in causality measurement. The multicollinearity problem can arise in these analyses when the association between categorical variables is strong. The paper reduces the multicollinearity problem by converting various types of categorical explanatory variables to dummy variables. With dummy variables, the problem can be minimized, and interpretations for probabilistic reasoning, information theory and set relations can be acquired. Dummy variables usually make the resulting application easier to implement, use and interpret in injury severity analysis.
In other respects, the theoretically strong advantage of MRMR analysis is that it does not require specifying a functional form of dependency such as correlation. In a classical regression analysis, the estimated relationship between the predictor and the factors can be erroneous if the model is misspecified. In the case of strong correlations between the factors, the estimation of the coefficients is less precise in a regression analysis, which can lead to wrong interpretations with regard to explanatory and response factors. The MRMR method subtracts the redundancy from the relevance, both of which are computed using Shannon's mutual information, for each candidate factor to be included in the minimal set. MRMR can be effectively combined with other factor selection methods, such as wrappers, to find a very compact subset of the candidate factors at lower expense.
In this paper, applying MRMR to factor selection yields the smallest group of factors with minimum redundancy and maximum relevance, so that it provides the best explanation of injury severity. The dummy explanatory variables are also considered in MRMR, preserving its major benefit of intrinsically handling multicollinear factors. In the deterministic process, causality measurement is applied by MRMR, which maximizes the joint dependency of top-ranking variables on the target variable while reducing the redundancy among them; this suggests incrementally selecting the maximally relevant variables while avoiding the redundant ones. The experimental results on causality measurement indicate that MRMR has higher estimated power than both MinRed and MaxRel in multivariate joint factor analysis. This study included only crashes found in the NASS-CDS database, limiting its conclusions to AIS motor vehicle crashes. In future work, this method can also be used to analyze other causal factors and to increase the sample of cases for occupant injury severity.

Fig. 3. MI values computed over a set of potential causation factors for different MAIS levels. www.ijacsa.thesai.org
Table 1 describes the three types of accident factors, which are categorized based on the Haddon matrix [4]. The matrix examines the factors related to personal, vehicle and environmental attributes.