The SMH Algorithm: A Heuristic for Structural Matrix Computation in Partial Least Squares Path Modeling

Structural equation models with latent variables (SEMLV) are a class of statistical methods for modeling the relationships between unobservable concepts called latent variables. In this type of model, each latent variable is described by a number of observable variables called manifest variables. The most widely used method in this class is partial least squares path modeling (PLS Path Modeling). In PLS Path Modeling, the specification of the relationships between the unobservable concepts, known as structural relationships, is the most important step for practical purposes. In general, this specification is done manually using a lower triangular binary matrix. To obtain this matrix, the modeler must put the latent variables in a very precise order, otherwise the matrix obtained will not be lower triangular. Indeed, such a matrix only reflects the cause-and-effect links between the latent variables, so each ordering of the latent variables corresponds to a specific matrix. The real problem is that the more concepts are studied, the more tedious it becomes to find an order of the latent variables that yields a lower triangular matrix. For five concepts, the modeler may have to test up to 5! = 120 orderings. In practice, it is common to study more than ten latent variables, so the manual search for an adequate order becomes extremely difficult work for the modeler. In this article, we propose a heuristic that makes possible the automatic computation of the structural matrix, in order to avoid the usual manual specification and the errors that come with it.

Keywords: Structural equation modeling; PLS algorithm; latent variables; structural matrix; R programming language


I. INTRODUCTION: PLS PATH MODELING IN R
PLS Path Modeling is a method of structural equation modeling with latent variables (SEMLV) in which the partial least squares (PLS) algorithm is used to estimate the model ([1], [2], [3]). Generally, structural equation models (SEM) are described graphically by specifying the latent (unobservable) variables. For each latent variable, the manifest (observable) variables that are related to it are also specified. Latent variables represent concepts such as loyalty, quality, poverty, abilities, etc. The manifest variables are indicators that describe these latent variables, and they are collected in a dataset. An example of such a model, the European Customer Satisfaction Index (ECSI) model, which can be found in [4], is given in Figure 1. Figure 1 shows an example of the structural relationships between latent variables. The ECSI model is often used in marketing studies. This article focuses on the specification of this kind of relationship in practice. When a computer is used to estimate the model, the graph is often specified as a binary lower triangular matrix. This operation may be time-consuming because one has to find the right order of the latent variables in order to get a lower triangular matrix. The goal of this paper is to give a method that automatically finds the right order and automatically computes the structural relationship matrix.

II. CONCEPTUALIZATION : MAIN IDEA BEHIND THE SMH ALGORITHM
The inner model (i.e. the path relationships between latent variables) is represented by a square boolean matrix, a matrix of zeros and ones that indicates the structural relationships between latent variables. This path matrix must be lower triangular, with a 1 when column j affects row i and a 0 otherwise.
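To make the matrix convention concrete, here is a minimal R sketch on a hypothetical three-variable model (the names X, Y, Z and the helper `is_lower` are illustrative assumptions, not part of the paper). It shows that the same relations give a lower triangular matrix in one order of the variables and not in another.

```r
# Hypothetical model: X affects Y and Z, and Y affects Z.
lv <- c("X", "Y", "Z")
path <- matrix(0, nrow = 3, ncol = 3, dimnames = list(lv, lv))
path["Y", "X"] <- 1  # column X affects row Y
path["Z", "X"] <- 1  # column X affects row Z
path["Z", "Y"] <- 1  # column Y affects row Z

# A valid inner matrix has nothing on or above the diagonal.
is_lower <- function(m) all(m[upper.tri(m, diag = TRUE)] == 0)

good_order <- is_lower(path)

# Same relations, variables listed in a different order.
perm <- c("Z", "X", "Y")
bad_order <- is_lower(path[perm, perm])
```

With the order X, Y, Z every 1 sits below the diagonal; listing Z first moves some of them above it, which is exactly the situation the ordering step has to avoid.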
The latent variables can be classified into three categories according to their roles in the structural equations in which they appear. The SMH algorithm is based on the following classification:
• The exogenous variables: the latent variables which have no other latent variables related to them.
• The endogenous variables: the latent variables which are not related to any other latent variable.
• The neutral variables: the latent variables that are related to others in both directions.
The main idea of this heuristic is to classify all the latent variables into these different groups (exogenous, endogenous, neutral) and to find a way to order them so as to obtain a lower triangular matrix. To find the right order of the latent variables, we can remark that the exogenous latent variables must be placed first (left side), followed by the neutral latent variables (middle), and finally the endogenous latent variables (right side). This order of the groups is found by analyzing some simple cases.
For formal purposes, let us consider the following mathematical notation:
• $N$: the number of latent variables
• $\xi_j$: the $j$-th latent variable
• $\Theta_j$: the endogenous status of the latent variable $\xi_j$
• $\Gamma_j$: the exogenous status of the latent variable $\xi_j$
• $E_j$: the number of latent variables that are related to $\xi_j$
• $F_j$: the number of latent variables that $\xi_j$ is related to
• $K_j$: the number of exogenous latent variables that are related to $\xi_j$
• $\mu_j$: the order score of the latent variable $\xi_j$
The variables $\Theta_j$ and $\Gamma_j$ can be expressed using the Kronecker notation:
$$\Theta_j = \delta_{0,F_j}, \qquad \Gamma_j = \delta_{0,E_j}$$
so that $\Theta_j = 1$ when $\xi_j$ is endogenous and $\Gamma_j = 1$ when $\xi_j$ is exogenous. This conceptualization will be used to define an order metric for each variable. The variables will be ordered according to the value of this metric: the higher the value of the metric for a variable, the higher its rank.
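The quantities in this notation can be counted directly from a list of relations. The sketch below does so in base R for the four-variable model used later in the paper. It takes $E_j$ as the number of variables related to $\xi_j$ and $F_j$ as the number of variables $\xi_j$ is related to, which is how Section III uses them, and it reads $K_j$ as the number of exogenous variables related to $\xi_j$. The object names (`edges`, `count_in`, `E_in`, `F_out`) are assumptions made for illustration.

```r
# One row per structural relation: 'from' affects 'to'.
edges <- data.frame(
  from = c("A3", "A1", "A3", "A4", "A4"),
  to   = c("A1", "A2", "A2", "A2", "A3"),
  stringsAsFactors = FALSE
)
lv <- paste0("A", 1:4)

# Count how many times each latent variable appears in a column of 'edges'.
count_in <- function(column, set) {
  vapply(set, function(v) sum(column == v), numeric(1))
}

E_in  <- count_in(edges$to, lv)    # E_j: variables related to xi_j
F_out <- count_in(edges$from, lv)  # F_j: variables xi_j is related to

Gamma <- setNames(as.numeric(E_in == 0), lv)   # exogenous status
Theta <- setNames(as.numeric(F_out == 0), lv)  # endogenous status

# K_j: exogenous variables among those related to xi_j (assumed reading).
K <- vapply(lv, function(v) sum(Gamma[edges$from[edges$to == v]]), numeric(1))

# Three-way classification used by the heuristic.
status <- ifelse(Gamma == 1, "exogenous",
                 ifelse(Theta == 1, "endogenous", "neutral"))
```

In this model A4 comes out as exogenous, A2 as endogenous, and A1 and A3 as neutral. A variable with no relation at all would have both statuses equal to 1; that case is not covered by the classification and is not handled here.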

III. COMPUTING : THE ORDER METRIC OF THE SMH ALGORITHM
The heuristic method rests on three general empirical principles, described below.

A. About the Exogenous Variables
The exogenous latent variables are the only ones with $\Gamma_j = 1$, and they must have the lowest values of $\mu_j$ to be in the first positions in the structural matrix. Different exogenous latent variables are distinguished according to the number of latent variables $F_j$ they are related to: the higher $F_j$ is, the lower the score $\mu_j$ has to be. Some of the variables that an exogenous latent variable is related to can be endogenous. Therefore, exogenous variables are also characterized by the number ($K_j$) of endogenous variables they are related to: the higher $K_j$ is, the higher the score $\mu_j$ has to be. To take these considerations into account, the order score of the exogenous latent variables is taken to be $-10^4 F_j + K_j$. In this case, the minimum score is obtained when all latent variables are exogenous except for one which is endogenous ($F_j = N - 1$, $K_j = 1$), and the maximum score is obtained when all the latent variables are endogenous except for one which is exogenous ($F_j = 1$, $K_j = N - 1$). The scores of the exogenous latent variables are therefore in the interval $[-10^4 (N - 1) + 1,\ -10^4 + (N - 1)]$.

B. About the Endogenous Variables
The endogenous latent variables are the only ones with $\Theta_j = 1$, and they must have the highest values of $\mu_j$ to be in the last positions in the structural matrix. Different endogenous latent variables are distinguished according to the number of latent variables ($E_j$) related to them: the higher $E_j$ is, the higher the score $\mu_j$ must be. To take this into account, the order score of the endogenous latent variables is taken to be $10^4 E_j$. In this case, the maximum score is obtained when all latent variables are endogenous except for one which is exogenous ($E_j = N - 1$), and the minimum score is obtained when all the latent variables are exogenous except for one which is endogenous ($E_j = 1$). The scores of the endogenous latent variables are in the interval $[10^4,\ 10^4 (N - 1)]$.

C. About the Neutral Latent Variables
The neutral latent variables are the ones with $\Theta_j + \Gamma_j = 0$. Their values of $\mu_j$ must be higher than the highest exogenous score and lower than the lowest endogenous score, so that they lie between the exogenous and endogenous latent variables in the structural matrix. Different neutral latent variables are distinguished according to the number of latent variables ($F_j$) they are related to: the higher $F_j$ is, the lower the score $\mu_j$ must be. Some of the variables that a neutral latent variable is related to can be endogenous. Therefore, neutral variables are also characterized by the number ($K_j$) of endogenous variables they are related to: the higher $K_j$ is, the higher the score $\mu_j$ has to be. Neutral variables are also distinguished according to the number of latent variables ($E_j$) related to them: the higher $E_j$ is, the higher the score $\mu_j$ has to be. To take all these considerations into account, the order score of the neutral latent variables is taken to be $10^{3/2} E_j - 10 F_j + K_j$. In this case, the maximum score is obtained when all latent variables are endogenous except for one which is exogenous ($E_j = N - 1$, $F_j = 1$, $K_j = 1$), and the minimum score is obtained when all latent variables are exogenous except for one which is endogenous ($E_j = 1$, $F_j = N - 1$, $K_j = N - 1$). The scores of the neutral latent variables are in the interval $[10^{3/2} - 9(N - 1),\ 10^{3/2}(N - 1) - 9]$.
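The three score ranges can be evaluated numerically to check that the groups stay separated. The following R sketch computes the bounds implied by the three formulas of this section for a given N; the function name `score_ranges` and the choice N = 20 are illustrative assumptions.

```r
# Bounds implied by the three score formulas, for N latent variables.
score_ranges <- function(N) {
  rbind(
    exogenous  = c(min = -1e4 * (N - 1) + 1,
                   max = -1e4 + (N - 1)),
    neutral    = c(min = 10^(3 / 2) - 9 * (N - 1),
                   max = 10^(3 / 2) * (N - 1) - 9),
    endogenous = c(min = 1e4,
                   max = 1e4 * (N - 1))
  )
}

r <- score_ranges(20)

# The groups are ordered correctly when the ranges do not overlap.
separated <- r["exogenous", "max"] < r["neutral", "min"] &&
  r["neutral", "max"] < r["endogenous", "min"]
```

For N = 20 the ranges do not overlap, so every exogenous score is below every neutral score, which is in turn below every endogenous score. The same function can be called with larger N to see where the neutral range starts approaching the endogenous one.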

D. Order Score Computation
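To compute the structural matrix, the latent variables must be ordered properly: the correct order gives a lower triangular matrix. As said before, the main objective of the heuristic is to find an ordering of the variables from which the correct structural matrix can be computed. This order is based on the score defined by
$$\mu_j = \begin{cases} -10^4 F_j + K_j & \text{if } \Gamma_j = 1 \\ 10^4 E_j & \text{if } \Theta_j = 1 \\ 10^{3/2} E_j - 10 F_j + K_j & \text{if } \Theta_j + \Gamma_j = 0 \end{cases}$$
Mathematically, these descriptions can be summarized in the single function defined as:
$$\mu_j = \Gamma_j \left(-10^4 F_j + K_j\right) + \Theta_j \, 10^4 E_j + \left(1 - \Theta_j - \Gamma_j\right)\left(10^{3/2} E_j - 10 F_j + K_j\right)$$
The latent variables are then ordered based on their $\mu$ scores. For two latent variables $\xi_i$ and $\xi_j$, the position of $\xi_i$ in the structural matrix is before $\xi_j$ if $\mu_i < \mu_j$.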
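As an illustration of the scoring rule, here is a minimal R sketch that combines the three formulas of subsections A to C into one function and sorts the variables by increasing score. The function name `smh_score` is hypothetical and this is not the code of the authors' package. The counts are those of the four-variable model used later in the paper, with $K_j$ counted as in the notation list; since the exact scores depend on how $K_j$ is counted, only the resulting order is meant to be compared with the paper.

```r
# Order score from the counts E_j, F_j and K_j (vectors of equal length).
smh_score <- function(E, F_out, K) {
  Gamma <- as.numeric(E == 0)      # exogenous: nothing is related to xi_j
  Theta <- as.numeric(F_out == 0)  # endogenous: xi_j is related to nothing
  mu <- Gamma * (-1e4 * F_out + K) +
    Theta * (1e4 * E) +
    (1 - Theta - Gamma) * (10^(3 / 2) * E - 10 * F_out + K)
  setNames(mu, names(E))
}

# Counts for the model A3 -> A1, {A1, A3, A4} -> A2, A4 -> A3.
E     <- c(A1 = 1, A2 = 3, A3 = 1, A4 = 0)
F_out <- c(A1 = 1, A2 = 0, A3 = 2, A4 = 2)
K     <- c(A1 = 0, A2 = 1, A3 = 1, A4 = 0)

mu  <- smh_score(E, F_out, K)
ord <- names(sort(mu))  # lowest score first
```

Sorting in increasing order places the exogenous variable A4 first and the endogenous variable A2 last, with the neutral variables A3 and A1 in between.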
The problem solved by our method is similar to the well-known traveling salesman problem in operations research ([5], [6]). However, the metaheuristics used in operations research, such as tabu search, simulated annealing, genetic algorithms, etc., have the disadvantage of requiring significant computational resources. In addition, the implementation of these algorithms is very complex and requires a good mastery of their operating principles. Compared to these methods, the approach developed in this article is very easy to use. The method is limited to a simple classification of latent variables and manifest variables, to their enumeration, and to the application of a simple arithmetic formula to obtain scores for ordering the latent variables. The computation time is more than one hundred times lower than that of conventional optimization metaheuristics. Our approach is therefore an optimization metaheuristic that applies to a very particular problem, namely the search for a structural matrix in PLS Path Modeling. This heuristic is the core method used in the R package plspm.formula ([7]), which we have already developed and which is available for free download on the mirror sites of the R software. Figure 2 shows the performance of the heuristic as the number of latent variables grows. According to the figure, the heuristic is able to give the correct answer with more than 100 latent variables. Based on this result, we can state that the heuristic is very robust, since the number of latent variables used in practice is generally less than twenty.
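For comparison, the exhaustive search that the heuristic avoids can be sketched in a few lines of base R. This is only a baseline for illustration, not one of the metaheuristics cited above; the helpers `permutations` and `is_valid_order` are hypothetical names, and the model is the four-variable example used later in the paper.

```r
lv <- paste0("A", 1:4)
edges <- data.frame(
  from = c("A3", "A1", "A3", "A4", "A4"),
  to   = c("A1", "A2", "A2", "A2", "A3"),
  stringsAsFactors = FALSE
)

# All permutations of a vector (n! results, so only usable for small n).
permutations <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (p in permutations(x[-i])) out[[length(out) + 1]] <- c(x[i], p)
  }
  out
}

# Does listing the variables in 'ord' give a lower triangular inner matrix?
is_valid_order <- function(ord, edges) {
  path <- matrix(0, length(ord), length(ord), dimnames = list(ord, ord))
  path[cbind(edges$to, edges$from)] <- 1  # row = affected, column = cause
  all(path[upper.tri(path, diag = TRUE)] == 0)
}

all_orders <- permutations(lv)
valid <- Filter(function(ord) is_valid_order(ord, edges), all_orders)
```

With four variables there are already 24 candidate orders to test, and `factorial(10)` exceeds three million, which is why a direct score-based ordering is attractive in practice.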

A. The plspm.shm R Function
This section presents the implementation of the SMH algorithm in the R language; the basis of this scientific computing language can be found in [8]. The function is based on the R package plspm ([9]). The SMH algorithm in R is as follows:

B. The Parameters and Results of the plspm.shm Function
The algorithm essentially takes two inputs:
• latent: a character vector containing the latent variable names
• latlist: a list specifying which latent variables explain which others
The parameter latlist is an R list and must contain two R objects:
1) a vector of the endogenous latent variables.
2) a list of vectors, one for each endogenous variable. For an endogenous variable, the vector contains the exogenous latent variables which are related to it. The order of the vectors in the internal list must correspond to the order of the endogenous variables.
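The two inputs described above can be unpacked into a flat table of relations with base R alone. The sketch below uses the same values as the paper's first example; the object names `endo`, `expl` and `edges` are illustrative assumptions, and the code does not call `plspm.shm` itself.

```r
# The two inputs, as described above.
latent <- paste("A", 1:4, sep = "")
latlist <- list(
  paste("A", 1:3, sep = ""),              # variables that are explained
  list("A3", c("A1", "A3", "A4"), "A4")   # their explanatory variables
)

endo <- latlist[[1]]
expl <- latlist[[2]]

# Both parts must have the same length and use only declared names.
stopifnot(length(endo) == length(expl))
stopifnot(all(c(endo, unlist(expl)) %in% latent))

# One row per relation: 'from' explains 'to'.
edges <- data.frame(
  from = unlist(expl),
  to   = rep(endo, times = lengths(expl)),
  stringsAsFactors = FALSE
)
```

Each row of `edges` is one arrow of the inner model, which is a convenient form for counting relations per variable or for filling the structural matrix.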
The main output of the plspm.shm function is an ordered vector of all the latent variables. This is the order to use in order to obtain a structural matrix in the form of the lower triangular binary matrix needed to estimate a PLS path model, for example with the plspm() function of the plspm R package. The function also has a logical parameter mat that controls whether the corresponding inner matrix is computed (mat=TRUE) or not (mat=FALSE). This avoids having to build the matrix manually from the ordered vector of latent variables. By default, the function computes that matrix. The function also has another logical parameter, named igraph, that specifies whether the relationship graph must be computed (igraph=TRUE) or not (igraph=FALSE).
The use of the function in R is illustrated by the code below:
R> lvect <- paste("A",1:4,sep="")
R> lvlist <- list( paste("A",1:3,sep=""), list("A3", c("A1","A3","A4"),"A4") )
R> res <- plspm.shm(lvect,lvlist,mat=TRUE,iplot=TRUE)
The results obtained in R concerning the latent variables vector, the latent variables list and the structural matrix are:
R> print(round(res,2))
$mu
[1] 21.63 30000.00 11.62 -20000.00
$ordre
[1] "A4" "A3" "A1" "A2"
$matrice
We can then see that the algorithm is capable of finding the correct order of the latent variables and of giving the correct structural matrix (lower triangular). The graph given by the algorithm is shown in Figure 3. This graph is the graphical version of the structural matrix, and its use makes the structural relationships easier to understand. Notice that in this example we have four latent variables. The next example uses six latent variables and concerns a real example, the ECSI model as presented in the plspm package on the satisfaction dataset.

B. Application on a More Complex Problem
In this second example, the latent variables are denoted: image ("IMAG"), expectations ("EXPE"), quality ("QUAL"), value ("VAL"), satisfaction ("SAT") and loyalty ("LOY"). We can again see that the algorithm is capable of finding the correct order of the latent variables and of giving the correct structural matrix (lower triangular). The graph given by the algorithm is shown in Figure 4.
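The ordering step for this model can be retraced with a short, self-contained R sketch. The relations below are the ones commonly used for the ECSI model in the plspm documentation and should be treated as an assumption, as should the reading of $K_j$ as the number of exogenous variables related to $\xi_j$. This is not the code of the authors' function, and the object names are illustrative.

```r
# Latent variables, deliberately listed in an unsuitable order.
lv <- c("LOY", "SAT", "VAL", "QUAL", "EXPE", "IMAG")

# Assumed ECSI relations: 'from' affects 'to'.
edges <- data.frame(
  from = c("IMAG", "IMAG", "IMAG", "EXPE", "EXPE", "EXPE",
           "QUAL", "QUAL", "VAL", "SAT"),
  to   = c("EXPE", "SAT", "LOY", "QUAL", "VAL", "SAT",
           "VAL", "SAT", "SAT", "LOY"),
  stringsAsFactors = FALSE
)

n_in  <- vapply(lv, function(v) sum(edges$to == v), numeric(1))    # E_j
n_out <- vapply(lv, function(v) sum(edges$from == v), numeric(1))  # F_j
exo   <- n_in == 0
endo  <- n_out == 0
# K_j: exogenous variables among those related to xi_j (assumed reading).
k <- vapply(lv, function(v) sum(exo[edges$from[edges$to == v]]), numeric(1))

# Order score: one formula per group of latent variables.
mu <- ifelse(exo, -1e4 * n_out + k,
             ifelse(endo, 1e4 * n_in,
                    10^(3 / 2) * n_in - 10 * n_out + k))
ord <- names(sort(mu))

# Inner matrix in the computed order: 1 when the column affects the row.
path <- matrix(0, length(ord), length(ord), dimnames = list(ord, ord))
path[cbind(edges$to, edges$from)] <- 1
is_lower <- all(path[upper.tri(path, diag = TRUE)] == 0)
```

Under these assumptions the variables come out in the order IMAG, EXPE, QUAL, VAL, SAT, LOY, and the resulting matrix is lower triangular.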

VI. CONCLUSION
In the field of PLS Path Modeling, the task of specifying structural matrices has always been tedious because of its purely manual nature. The method proposed in this article frees the modeler from this constraint by providing a means of automatically searching for the correct order in which the latent variables must be placed in order to obtain a lower triangular matrix. The algorithm even computes this matrix directly, which saves time and avoids errors related to the manual specification of such matrices. The heuristic described in this paper thus makes it easier to obtain the PLS Path Modeling specification automatically. The simulations carried out show that, theoretically, this heuristic can easily be used for models involving more than one hundred latent variables. This possibility increases the scope of PLS Path Modeling, which was, until now, used on a limited number of latent variables because of the difficulties related to the manual specification of the structural relationships. However, one must take care that the structural relationships are not circular, because in that case the matrix is not triangular, and that the problem may be misspecified in practice. The SMH heuristic also avoids the need to explore all of the possible orderings of the latent variables in order to find the one that gives the correct structural matrix. It is an elegant solution to this combinatorial problem. Future work will focus on the generalization of the principle of our method to the traveling salesman problem. Such a generalization will allow the algorithm to be applied to a much larger set of problems.

Fig. 3. The inner graph of the simple example

Fig. 4. The inner graph of the complex example

This graph is the graphical version of the structural matrix. It confirms that the heuristic is able to handle problems with a large number of variables.