Empirical Validation of WebQMDW Model for Quality-based External Web Data Source Incorporation in a Data Warehouse

In recent years, the World Wide Web has emerged as the most promising external data source for organizations’ Data Warehouses, providing valuable insights required in comprehensive decision making to gain a competitive edge. However, when the Data Warehouse uses external data sources from the Web without quality evaluation, it can adversely impact its quality. Quality models have been proposed in the research literature to evaluate and select Web Data sources for their integration in a Data Warehouse. However, these models are only conceptually proposed and not empirically validated. Therefore, in this paper, the authors present the empirical validation conducted on a set of 57 subjects to thoroughly validate the set of 22 quality factors and the initial structure of the multi-level, multi-dimensional WebQMDW quality model. The validated and restructured WebQMDW model thus obtained can significantly enhance the decision-making in the DW by selecting high-quality Web Data Sources.
Keywords: Data warehouse; external data sources; web data sources; quality evaluation model; quality model validation


I. INTRODUCTION
The importance of incorporating external data in the Data Warehouse to gain valuable insights into the market, competitors, products, or customers for comprehensive and unbiased decision making has long been recognized in the research literature [1], [2]. The use of the World Wide Web (WWW or Web) [3] as an external data source for the Data Warehouse (DW) [4] has grown considerably over the past few years [5]-[16]. The WWW provides a wide-angle lens for decision-making in organizations in a low-cost and highly accessible manner [4] (see Fig. 1). It is well known that the quality of the DW data sources hugely impacts the quality of the DW itself [1], [2]. This makes the quality-aware evaluation and selection of high-quality, credible, and compatible Web Data Sources (WDSs) a crucial task in the incorporation of Web data in the DW [4], [17]-[27]. There are, however, many challenges in this task, such as the availability of a massive amount of information, the heterogeneous structure and format of Web Data [4], the dynamic nature [17], and the poor reliability [18] of a significant share of Web Sources.
For the aforementioned task of quality-aware evaluation of Web Data Sources for a Data Warehouse, various quality models, frameworks, or sets of factors have been proposed in the research literature (see, for example, [4], [19]-[26]). However, these quality models are only conceptual in nature. To the best of the authors' knowledge, none of these quality evaluation models for evaluating WDSs as external data sources for a DW has been empirically validated to corroborate its applicability in this problem area. To fill this research gap, in this paper we present the empirical validation of the state-of-the-art multi-level, multi-dimensional WebQMDW (Web quality model for evaluating web sources for the DW) quality model [27] to enhance the decision making in a Data Warehouse. The WebQMDW model [27] is the first-of-its-kind model that segregates the quality factors so as to introduce automated quality evaluation as screening (at the first level) and to separate the expert evaluation of different expert areas into different dimensions (at the second level). The present work complements and extends the authors' previous work [27] on the quality-based evaluation of the websites of academic institutions for incorporation as WDSs in a University DW. That work proposed and used the novel WSEM QT (Web source evaluation with multi-criteria decision-making methods and web quality testing tools) process in conjunction with the underlying novel WebQMDW quality model. We believe that the empirical validation of the WebQMDW model will be an important milestone in the quality evaluation of WDSs for a DW, aiding DW professionals in providing advanced data analytics for decision-making in the organization.
Hence, the objective of this paper is the empirical evaluation of the WebQMDW model in order to:
• Validate the set of quality factors of the WebQMDW model; and eliminate or add new factors, if indicated by the validation results.
• Validate that the quality factors have been suitably placed in the level/dimension of the WebQMDW model; or if they should be placed in a different level/dimension according to the validation results.
The rest of the paper is organized as follows: Section II discusses the frame of reference of the current work, including the related work, the WebQMDW quality model, and the motivation. Section III presents the empirical validation process of the WebQMDW model, including the analysis of results and the restructuring of the model. Section IV discusses the various threats to the validity of the survey and how we dealt with them, followed by the conclusion and future work in Section V.

A. Related Work and the WebQMDW Quality Model
Quality is a critical and hence widely researched concept in the context of both software products and data. From the point of view of Data/Information Quality (DQ/IQ), there are established standards (like ISO/IEC 25010/25012 [28], [29]) as well as "de facto" standards (like the Wang and Strong model [30]) in the relevant literature. Because Web portals have peculiar characteristics compared with traditional software products, a plethora of research works have specifically addressed Website/Web portal quality [31]-[33]. The quality evaluation of WDSs for a DW, however, encompasses the data quality as well as the source Website quality, because due attention needs to be given to the quality requirements specific to the destination of the WDS incorporation, i.e., the Data Warehouse and the underlying business domain as well [27]. A few of the important works in the area of defining quality factors/models for using WDSs as external data sources (EDS) for the DW are summarized in Table I. Huang et al. [4] have suggested integrating Web Data in a DW by considering both Quality and Coverage aspects. Quality aspect evaluation is proposed by using the quality factors of Speed of loading, Accuracy, Currency, Presentation, Format, Content, and Source as put forward by Rieh [33]. Coverage aspect evaluation is proposed by determining the two factors of Scope and Variety. Lóscio et al. [20] have used the three quality parameters of Data Completeness, Schema Completeness, and Correctness for determining the relevance of a WDS for a particular application domain. For a quality-aware Web Warehouse, Marotta et al. have proposed managing Data and Service Quality [21]. In their work, Data quality is organized in six dimensions: Reliability, Consistency, Uniqueness, Freshness, Completeness, and Accuracy, whereas Service-Related quality is organized in six dimensions: Stability, Usability, Business Value, Security, Interoperability, and Service Level [21].
The WebQM quality model proposed by Zhu and Buchmann [19] has three classes, namely Web Source Stability, Web Application Specific Quality, and Web Information Quality, which group the twelve quality factors in this model. These quality factors are Timeliness, Presentation, Relevance, Metadata, Objectivity, Completeness, Correctness, Origin, Refresh Rate, Durability, Accessibility, and Availability [19]. For WDS quality, Mihaila et al. have used four quality factors: Granularity, Frequency of Updates, Recency, and Completeness [25]. Naumann et al. used the three quality factors of Availability, Extent, and Understandability in their work [26].
In a previous work, the authors proposed the WebQMDW quality model [27] with 22 quality factors classified in 2 levels (Fig. 2). Level-1 (Automated quality testing level) holds those quality factors on which the overall quality of the Website/Webpage can be assessed by using the available automated website quality testing tools. This level has seven quality factors: Performance, Accessibility, Domain Reliability, SEO (Search Engine Optimization), Security, Best Practices, and Web Search Engine Ranking. Level-2 (Expert evaluation level) holds the fifteen quality factors according to which the experts need to evaluate the WDSs. This level is further divided into three dimensions based on the expert area required for judging them.
In the current work, we choose to focus on this model because its selection and structuring of quality factors solves the two main issues of the quality evaluation process of Web data sources [27]. The first issue, an enormous evaluation load on experts, is tackled by screening the vast number of web sources down to a select few through automated Website quality testing tools at the 1st level of the model. The second issue in previous quality models was the bottleneck of finding experts with expertise in all the related domains of quality evaluation. This issue is resolved at the 2nd level of the model, as different experts evaluate the quality factors separated into different dimensions according to the required expertise, namely Web, Data Warehouse, and underlying business domain. The initial structure of the WebQMDW model is shown in Fig. 2. The details of these 22 quality factors in the context of the Web Source Quality evaluation and the detailed account of the model's application to a practical case study of a University DW can be found in [27].
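As a rough illustration of the two-level idea described above, the following sketch encodes the seven Level-1 factors and filters a pool of candidate sources before any expert is involved. The `WebSource` class, the normalized 0-1 scores, the plain average, the 0.7 cut-off, and the `assign_to_experts` helper are all assumptions made for this example; they are not taken from [27], which applies its own WSEM QT process.

```python
from dataclasses import dataclass
from statistics import mean

# Level-1 factors as listed in the text (initial model, before validation).
LEVEL1_FACTORS = (
    "Performance", "Accessibility", "Domain Reliability", "SEO",
    "Security", "Best Practices", "Web Search Engine Ranking",
)
# Level-2 dimensions, named after the expertise each one requires.
LEVEL2_DIMENSIONS = ("Web", "Data Warehouse", "Business domain")


@dataclass(frozen=True)
class WebSource:
    """Hypothetical container: a source name plus automated tool scores."""
    name: str
    level1_scores: dict  # factor name -> score normalized to [0, 1]


def level1_screen(sources, threshold=0.7):
    """Keep sources whose average Level-1 score reaches the threshold.

    The average and the threshold are illustrative assumptions only.
    """
    shortlisted = []
    for source in sources:
        missing = [f for f in LEVEL1_FACTORS if f not in source.level1_scores]
        if missing:
            raise ValueError(f"{source.name} lacks scores for: {missing}")
        average = mean(source.level1_scores[f] for f in LEVEL1_FACTORS)
        if average >= threshold:
            shortlisted.append(source)
    return shortlisted


def assign_to_experts(shortlisted):
    """Hand every shortlisted source to each Level-2 expert dimension."""
    return {dim: [s.name for s in shortlisted] for dim in LEVEL2_DIMENSIONS}
```

The point of the design is that experts in each dimension only ever see the shortlisted sources, which is how the model reduces the evaluation load.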

B. Motivation
For any area, after proposing the quality model, the next step is its validation from the perspective of the users of the respective domain. The validation is important for two reasons. Firstly, it is essential to consider users' choices of quality factors because of their profile, experience, and knowledge in the respective field [34]. Secondly, the robust statistical analysis from the empirical validation attaches confidence to the adequacy of the proposed quality model [35].
In the area of generic quality evaluation of Websites, there are several empirical validation works in the research literature (Table II). The work of Caro et al. [32] validates a quality model for evaluating the quality of Websites. The work of Moraga et al. [36] validates a quality model for website quality specifically from the point of view of "University-educated users." A quality model for evaluating the data quality of health websites is validated through experts in the work of Leite et al. [35]. Some authors [32], [36] have performed the validation using the survey method. Others have used the Delphi method [37] for the validation of quality models [35].
However, to the best of the authors' knowledge, no such validation of quality models specific to the evaluation of WDSs as EDS for a DW (see Table I) could be found in the research literature. Hence, the current work provides the validation of the WebQMDW model (described in the previous section), which is specific to this context [27]. It is performed using the survey method according to the guidelines of Pfleeger and Kitchenham [38]-[43], as described in the subsequent sections (see Sections III and IV).

As described in the previous section, the WebQMDW model has been obtained by bearing in mind the definitions of the quality factors and the defined categories (i.e., levels/dimensions) identified for structuring the factors. This section elaborates the validation process of both the set of quality factors and the initial hierarchical structure of the WebQMDW model.

A. Research Methodology
Several empirical validation methods [44] are described in the research literature, like case studies, controlled experiments, ethnographies, and surveys. The defining characteristic of the survey method [38]-[44] is that it studies the applicability of a phenomenon to the target population by administering a questionnaire to a representative subset of that population. Since we need to study the applicability of the WebQMDW model in the opinion of Web and Data Warehouse users (the target population), this was the most applicable method for our work. So, in this paper, we use the survey method to thoroughly validate the set of quality factors and the structure of the WebQMDW model, following the guidelines and principles of research proposed by Kitchenham and Pfleeger [38]-[43]. These guidelines describe the various activities for collecting information for describing, comparing, or explaining knowledge, behavior, and attitudes by using the survey instrument [43].

B. Setting of Objectives
Measurable and specific objectives are set in this step. We set the main objective of our survey as: "To acquire the viewpoint of Web and Data Warehouse users regarding the importance as well as the placement (in the levels/dimensions) of each of the quality factors in the WebQMDW model."

C. Selection of Subjects
Keeping the objective mentioned earlier in mind, the target subjects required were both Web and Data Warehouse users. For the purpose of empirical analysis, many researchers have pointed out the advantages of taking students as subjects [45], [46], as the students' knowledge tends to be homogeneous and a large number of students are conveniently available at the same time. In our view, students who have knowledge and practical hands-on experience of the Data Warehouse and the World Wide Web were well suited to be this survey's subjects. Additionally, students with knowledge of Data and Software Quality are better able to assess the importance of each quality factor. Hence, it was decided to use "Convenience sampling" and administer the survey to the 74 students of the Data Warehousing & Mining course in the final-year class of the Information Technology program at USIC&T, GGSIP University, New Delhi, India. These students not only had knowledge of the Web and Data Warehouse but had also previously studied an entire course on Software Engineering as part of their curriculum. The survey was conducted as a part of the mandatory practical laboratory session of the Data Warehousing & Mining course. Therefore, the students had enough motivation to take part in the survey.

D. Selection of the Design of the Survey
The descriptive design of the survey is considered most appropriate, where the objective requires a description of the phenomenon of interest. The objective of this survey requires a description of the opinion of the respondents regarding the importance and placement of quality factors in the WebQMDW model. Hence, the descriptive design [38] was considered appropriate and selected by us rather than the experimental design.

E. Preparation of the Survey Instrument
The guidelines for designing the survey instrument, i.e., the questionnaire [39], suggest that the survey questions should be chosen keeping in mind the objective of the survey. Hence, in accordance with the objective mentioned earlier, we constructed the questionnaire with 22 Likert-style closed questions divided into Sections I and II, asking the importance of the 22 quality factors of WebQMDW model Level 1 and Level 2, respectively (Fig. 3, Fig. 4). Merely naming the quality factors in the questions could have led to ambiguity in the respondents' minds about their meaning. So, we formulated the questions in plain English by adapting the definition of each factor from the research literature [27], [32]. The answers to the closed questions were to be marked on a 5-point Likert scale ranging from the lowest score '1' signifying 'Not Important' to the highest score '5' signifying 'Very Important.' Section III consisted of 2 open questions regarding the structural placement of quality factors in the levels/dimensions of the WebQMDW model (Fig. 5). The first open question focused on any suggested switching of the category (i.e., Level/Dimension) of the factors in the WebQMDW model. The second open question focused on any other quality aspect or factor to be added to the WebQMDW model.

Section-I (corresponding to importance of WebQMDW Level 1 quality factors)

F. Validation of the Survey Instrument
We pre-tested the questionnaire to validate the survey instrument. Ten respondents (5 pursuing a Ph.D. in the field of Data Warehousing and the other 5 pursuing a Ph.D. in the field of Web Engineering) answered the questionnaire before its actual administration. Following their feedback about the understanding of the questions, three questions (nos. 6, 9, and 10) were modified with examples and simpler language to improve the questionnaire.

G. Administration of Survey
The survey was administered to the subjects in an online session of a Data Warehousing & Mining laboratory class. The questionnaire was delivered in the form of a Google form whose link was shared with the subjects. Before the beginning of the session, the purpose and importance of the study were briefly explained to the respondents. The time limit of one hour for submitting the responses to the survey was also communicated to them.

H. Analysis of the Data
The survey was to be administered to an expected sample of 74 subjects. In practice, the survey session was attended by 59 subjects because the remaining subjects were absent during the session. The recorded response rate was, hence, 79.7%. However, during the session, two subjects could not complete the survey due to network issues. So, the remaining 57 responses were considered.
Section-II (corresponding to importance of WebQMDW Level 2 quality factors)
Level 2, Dimension 1:
Q. 8 The importance value of the Web Source data having the ability to be accessible over different platforms (operating systems or hardware architecture), should be:
Q. 9 The importance value of the media format (text/HTML/pdf/audio/video etc.) of the data from the Web Source fitting within the processing ability of the organization, should be:
Q. 10 The importance value of the degree to which the data from the Web source is worthy of the cost associated (if it requires access fees), should be:
Q. 11 The importance value of the amount or quantity of data provided by the Web Source being significant, should be:
Q. 12 The importance value of the Web Source providing the data within the time constraint specified by the need of organization, should be:
Level 2, Dimension 2:
Q. 13 The importance value of the description or metadata of the data from the Web Source being easy to interpret in accordance with the Data Warehouse schema, should be:
Q. 14 The importance value of the data from the Web Source corresponding to the required time period according to the usage of Web data in Data Warehouse, should be:
Q. 15 The importance value of the data from the Web Source being concise and free of superfluous elements that are not required for the right purpose in the Data Warehouse, should be:
Q. 16 The importance value of the data from the Web Source being consistently represented in same or compatible formats throughout the Web pages of the Web Source, should be:
Q. 17 The importance value of the data of the Web Source providing a complete coverage in terms of the depth, breadth and scope of the task at hand of the Data Warehouse, should be:
Level 2, Dimension 3:
Q. 18 The importance value of the degree to which the data from Web Source is beneficial and adds value to the business of the organization, should be:
Q. 19 The importance value of the data from the Web Source being correct and guaranteed to be error-free especially in the context of the application domain, should be:
Q. 20 The importance value of the data from the Web Source being impartial and free from bias, should be:
Q. 21 The importance value of the extent to which the data from the Web Source is believable, should be:

Section-III (corresponding to open questions for structure of WebQMDW model)
Q. 23 Do you suggest the switching of the category (i.e., level or dimension in the WebQMDW quality model) of any quality factor?
Q. 24 Do you suggest the addition of any new quality factor, not covered in the WebQMDW quality model?

First, we analyzed the internal consistency of our data from the closed questions with the help of Cronbach's alpha (Equation 1) [47]. We determined Cronbach's alpha for the data of Section I and Section II of the questionnaire, corresponding to the importance value responses for quality factors from Level 1 and Level 2 of the WebQMDW model, respectively (see Table III). Cronbach's alpha [47] is given by
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right) \tag{1}$$
where $k$ is the number of items (questions) in the section, $\sigma_{i}^{2}$ is the variance of the responses to item $i$, and $\sigma_{X}^{2}$ is the variance of the respondents' total scores. As a rule of thumb, a value of Cronbach's alpha above 0.7 is considered acceptable. For our data, the values obtained were 0.889 for Section I and 0.920 for Section II. Hence, the survey can be said to have good internal consistency and reliable results for further analysis.
In this work, we have calculated the mean (i.e., average value) and the percentage coefficient of variation (%CV) of the importance values, to be used as the indicators for including or excluding the quality factors. It was decided to eliminate the factors whose mean value was below 3.0 (the mid-point of the scale) because, in the view of the participants, they did not seem significant enough to be considered a quality factor for the evaluation of Web sources. We also decided to eliminate those quality factors for which the %CV was above 33%, because such a value indicates inconsistency in the participants' viewpoint about the importance of the factor.
Thus, considering the ranked values of mean importance of factors in Fig. 6, most of the 22 factors of the WebQMDW model had a mean importance value above 3. These values signified that the respondents considered most of the factors to have moderate or high importance. Among the highly important factors were Performance, Web Search Engine Ranking, Business Value Addition, and Uniqueness. However, the quality factor Best Practices was eliminated as its mean value fell below the chosen threshold of 3.0. None of the factors had a %CV above 33%, so no factor was eliminated under this constraint (Fig. 7).
The first open question (number 23), which focused on any suggested switching of the category (i.e., level or dimension) of any quality factor, was not answered by any of the respondents. The last open question (number 24), which focused on the addition of any new quality factor, was answered by four participants, who suggested including Reputation as one of the factors. A close review of the meanings of the factors in the literature showed that, in the context of Web Sources in particular, Reputation [30] is synonymous with the factor Web Search Engine Ranking, which was already included in the WebQMDW model [27]. The factor name Reputation was also used in the pioneering work of Wang and Strong, considered a de facto Data Quality standard [30]. So, instead of adding another factor, we decided to rename the factor Web Search Engine Ranking to the more general name Reputation.
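The reliability check and the two elimination rules described above can be sketched in a few lines. This is a minimal illustration and not the authors' analysis script: the function names are hypothetical, and the use of the sample (n−1) variance and standard deviation is an assumption, since the paper does not state which estimator was used. For Cronbach's alpha the choice does not affect the result as long as it is applied consistently to the item and total-score variances.

```python
from statistics import mean, stdev, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a response matrix.

    responses: list of rows, one row per respondent, one Likert score
    per questionnaire item (all rows must have the same length).
    """
    k = len(responses[0])                       # number of items
    item_variances = [variance(col) for col in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def percent_cv(scores):
    """Percentage coefficient of variation (sample std. dev. assumed)."""
    return 100.0 * stdev(scores) / mean(scores)


def screen_factors(ratings, min_mean=3.0, max_cv=33.0):
    """Split factors into kept and eliminated lists.

    ratings: dict mapping a factor name to its list of importance scores.
    A factor is eliminated if its mean is below min_mean or its %CV is
    above max_cv, mirroring the two rules stated in the text.
    """
    kept, eliminated = [], []
    for factor, scores in ratings.items():
        if mean(scores) < min_mean or percent_cv(scores) > max_cv:
            eliminated.append(factor)
        else:
            kept.append(factor)
    return kept, eliminated
```

Calling `cronbach_alpha` once on the Section I columns and once on the Section II columns would mirror the two values reported in Table III, and `screen_factors` applies both thresholds in a single pass over the per-factor ratings.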

I. Restructuring of the Quality Factors of the WebQMDW Model
The initial structure of the WebQMDW model is shown in Fig. 2. After completing the above-stated validation process, the WebQMDW model now consists of a set of 21 factors instead of 22, as the factor Best Practices was eliminated in the validation. As stated above, the factor Web Search Engine Ranking was renamed Reputation. Since none of the participants suggested switching the categories (i.e., level/dimension) of the factors, no other restructuring was done. The final structure of the WebQMDW model (with the above-stated changes) is shown in Fig. 8.

This section discusses the threats to the following kinds of validity and how they were minimized:

A. Construct Validity
The survey uses the 5-point Likert scale to gather the opinion of the participants about the importance of the factors, with the lowest numerical value '1' signifying 'Not Important' and the highest value '5' signifying 'Very Important.' The Likert scale is used in many previous similar studies [32], [35] to gather the opinion of participants. This scale is an efficient tool for observation and hence, can be considered as a valid construct.

B. Internal Validity
Internal validity concerns whether the results follow from the factors under study rather than from uncontrolled influences. For this aspect, we considered the following issues carefully:
• The students enrolled in the same class of Data Warehousing & Mining were taken as subjects. The subjects had adequate knowledge of Data and Software Quality as they had also studied an entire course on Software Engineering as part of their curriculum. Hence, it can be said that all the subjects had the same profile and level of experience both in Data Warehousing and Data Quality. Thus, the variability among subjects was reduced.
• Since the subjects had not taken part in any survey along the same lines as the present one, the persistence effect was nullified.
• Since the survey was filled in only once, no learning could have taken place. Thus, the threat of the learning effect was not present.
• The survey was administered in a one-hour session. This time was much less than even one practical laboratory session time of the students. Hence, the fatigue effect was not that relevant in this case.
• The survey was conducted as part and parcel of the ongoing practical laboratory sessions of the subjects' Data Warehousing & Mining course. The subjects were also motivated by telling them the importance of their contribution to the current research in the Data Warehousing field. Also, since subjects had already studied Web Warehousing as one of the advanced topics in the course, they showed sufficient interest in participating. Hence, we had achieved sufficient subject motivation for the survey.
• Since the survey was conducted in an online session with the subjects participating from their homes, their influence on each other was, if at all, very minimal. Further, to avoid plagiarism, it was ensured that the subjects kept their videos on during the entire one-hour session and were informed not to communicate with each other.

C. External Validity
External validity is the degree of generalizability of the research results to the population of interest and beyond in actual practice. External validity was addressed by mitigating the following two issues:
• Material and task used: A survey questionnaire structured as a Google form was the material used. This survey was independent, as no previous task needed to be done in order to fill it in.
• Subjects: Students were used as subjects of this survey for two major reasons. Firstly, the students clearly represented the population under study for the survey, as they had experience as Data Warehouse users as well as Web Portal users, along with knowledge of Data Quality. Secondly, many researchers have argued in favor of using students as subjects [45], [46] without impacting the external validity much. However, we do not rule out the possibility of conducting a replicated study with experts from the industry in the near future.

D. Conclusion Validity
Conclusion validity is the statistical validity of the conclusion of the research. For this concern, the size of the sample (57 subjects) could be the only issue. However, most of the quality factors identified from the research literature have been previously used, and mostly validated, in the sub-areas of the current problem domain, like Web Portal quality and Data Warehouse quality. Hence, the concern is mitigated. We will still consider conducting a replication study with a larger number of subjects from the industry.

V. CONCLUSION AND FUTURE WORK
Over the last few decades, Web Data Sources have established their position as good, viable, and highly accessible External Data Sources for a Data Warehouse. However, the assessment of the quality of Web Data Sources is critical before their incorporation in the DW. Some quality models have been conceptually proposed in the research literature. However, to the best of the authors' knowledge, none of the previously known models for Web Data Source evaluation for a Data Warehouse has been empirically validated. Hence, this paper presents the validation process of the multi-level, multi-dimensional WebQMDW model for quality evaluation of Web data sources for a Data Warehouse. The objective was to provide an empirically validated quality model that will guide DW professionals in providing enhanced decision making in the Data Warehouse through quality-based incorporation of external Web data sources. The empirical validation is carried out through a survey based on the guidelines of Pfleeger and Kitchenham, which are considered a de facto standard. A questionnaire with three sections was used as the instrument for the survey. Sections I and II correspond to the importance values of the quality factors from level 1 (automated quality evaluation) and level 2 (expert evaluation) of the WebQMDW model. Section III focuses on the structuring of the model into levels and dimensions. The statistical analysis of the results obtained from the validation survey revealed that 21 factors of the WebQMDW model are considered to have either high or moderate importance for Web Source quality evaluation. The restructured and validated WebQMDW model was obtained as suggested by the results of the empirical validation and supported by the research literature, which can be considered a significant contribution in this area. We plan to conduct a further study with a larger number of subjects, especially from the industry, in the near future.
Such a study could be beneficial to refine the model further. We also plan to work on the measures for each quality factor and the refining of the granularity of the model.