Empirical Validation of Web Metrics for Improving the Quality of Web Page

Web page metrics are one of the key elements in measuring various attributes of a web site. Metrics give concrete values to the attributes of web sites, which may be used to compare different web pages. Web pages can be compared on the basis of page size, information quality, screen coverage, content coverage, etc. The Internet and websites are emerging media and service avenues that require improvements in quality, for better customer service to a wider user base and for the betterment of humankind. E-business is emerging, and websites are not just a medium for communication; they are also products for providing services. Measurement is a key issue for the survival of any organization. Therefore, measuring and evaluating websites for quality, and understanding the key issues related to website engineering, is very important. In this paper we collect data from the webby awards (2007-2010) and classify the websites into good sites and bad sites on the basis of the assessed metrics. To achieve this aim we investigate 15 metrics proposed by various researchers. We present the findings of a quantitative analysis of web page attributes and describe how these attributes are calculated. The results of this paper can be used in quantitative studies of web site design. The metrics captured in the prediction model can be used to predict the goodness of website design.

Keywords: Metrics; Web page; Website; Web page quality; Internet; Page composition Metrics; Page formatting Metrics.


I. INTRODUCTION
A key element of any web site engineering process is metrics. Web metrics are used to better understand the attributes of the web pages we create. Most importantly, we use web metrics to assess the quality of the web engineered product or of the process used to build it. Since metrics are a crucial source of information for decision making, a large number of web metrics have been proposed in the last decade to compare the structural quality of web pages [1].
Software metrics are applicable to all phases of the software development life cycle: from the beginning, when cost must be estimated, through monitoring the reliability of the products and sub-products, to the end of the product's life, even after the product is operational.
The study of websites is relatively new compared to quality management, which makes the task of measuring web site quality very important.
Web site engineering metrics are mainly derived from software metrics, hypermedia and human-computer interaction. The intersection of these three areas gives the website engineering metrics [2]. Realizing the importance of web metrics, a number of metrics have been defined for web sites. These metrics try to capture different aspects of web sites. Some metrics also capture the same aspect of web sites; e.g., there are a number of metrics to measure the formatting of a page, and a number of metrics for page composition.
Web site engineers need to explicitly state the relation between the different metrics measuring the same aspect of software. In web site design, we need to identify the necessary metrics that provide useful information; otherwise website engineers will be lost in so many numbers and the purpose of the metrics will be lost.
As the number of web metrics available in the literature is large, it becomes a tedious process to understand the computation of these metrics and to draw conclusions and inferences from them. Properly defined metrics are used for predictions in various phases of the web development process. For proper design of websites, we need to understand the subset of metrics on which the goodness of website design depends. In this paper we present some attributes related to web page metrics and calculate their values with the help of an automated tool. This tool is developed in JSP and calculates about 15 web page metrics with great accuracy.
To meet the above objective, the following steps are taken:
• A set of 15 metrics is first identified and their values are computed for 514 different web sites from the 2007-2010 webby awards data.
• Interpretations are drawn to find the subset of attributes related to the goodness of website design. These attributes can then be used to classify the data into good sites and bad sites.
The goal of this paper is to find the subset of the 15 metrics that captures the criteria of goodness of web sites. The paper is organized as follows. Section II tabulates the web page metrics used in our research. Section III describes the research methodology, the data collection and the tool used for calculating the attributes of a web page. Section IV describes the methodology used to analyze the data. Section V presents the results.

The overall page quality metrics cannot be easily evaluated, as they require human intervention. In our study we therefore use only page formatting metrics and page composition metrics, which can be easily calculated. The parameters used in this study are described below:

1) Number of words
This attribute is calculated by counting the total number of words on the page. Special characters such as & and / are also counted as words.
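A minimal sketch of this count in Python, for illustration only: the tool described in this paper is written in JSP and its code is not reproduced here. The names `TextExtractor` and `count_words` are hypothetical, and the sketch assumes that a word is any whitespace-separated token of visible text, so that standalone characters such as & and / are counted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words(html):
    """Return the number of whitespace-separated tokens of visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Text from different elements is joined with a space so that words in
    # adjacent elements are not glued together (an assumption of this sketch).
    return len(" ".join(parser.chunks).split())


page = ("<html><body><h1>News &amp; Views</h1><p>Read / share now</p>"
        "<script>var x = 1;</script></body></html>")
print(count_words(page))  # 7: News, &, Views, Read, /, share, now
```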

2) Body text words
This metric counts the number of words in the body versus the display text (i.e. headers). The words that are part of the body and the words that are part of the display text, that is the headers, are counted separately, simply by counting the words falling in the body and the words falling in the headers.
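The separation of body words from display words can be sketched as follows. This is an illustrative Python sketch, not the JSP tool; it assumes that "display text" means the contents of the HTML heading elements h1 to h6, and the name `BodyDisplayCounter` is hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class BodyDisplayCounter(HTMLParser):
    """Counts words inside <body>, split into heading and non-heading text."""

    def __init__(self):
        super().__init__()
        self.body_words = 0
        self.display_words = 0
        self._in_body = False
        self._heading_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in HEADING_TAGS:
            self._heading_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in HEADING_TAGS and self._heading_depth:
            self._heading_depth -= 1
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._in_body or self._skip_depth:
            return
        n = len(data.split())
        if self._heading_depth:
            self.display_words += n
        else:
            self.body_words += n


counter = BodyDisplayCounter()
counter.feed("<html><head><title>Shop</title></head><body>"
             "<h1>Big Sale</h1><p>Save on all items today</p></body></html>")
counter.close()
print(counter.display_words, counter.body_words)  # 2 5
```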

3) Number of links
This is the total number of links on a web page, calculated by counting the links present on the page.

4) Embedded links
These are the links embedded in the running text of the web page.

5) Wrapped links
These are the links that span multiple lines. They can be calculated by counting the number of links whose text takes up more than one line.

6) Within page links
These are the links to other areas of the same page. They can be calculated by counting the number of links that point to another area of the same page. For example, some sites provide "top" and "bottom" links.
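Three of the link metrics above can be approximated from the HTML source alone, as in the following illustrative Python sketch. It rests on assumptions that are not stated in this paper: a link is an `<a>` element with an `href` attribute, a within-page link is one whose `href` starts with `#`, and an embedded link is one that occurs inside a `<p>` element. Wrapped links depend on how the browser lays out the page and are not covered. The name `LinkCounter` is hypothetical.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts total, within-page and (approximately) embedded links."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.within_page = 0
        self.embedded = 0
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is None:
                return  # named anchors without href are not links
            self.total += 1
            if href.startswith("#"):
                self.within_page += 1
            if self._in_paragraph:
                self.embedded += 1

    def handle_endtag(self, tag):
        # Limitation: a <p> that is never closed keeps this flag set.
        if tag == "p":
            self._in_paragraph = False


links = LinkCounter()
links.feed('<body><a href="/home">Home</a>'
           '<p>See the <a href="https://example.com/faq">FAQ</a> or go '
           '<a href="#top">to top</a>.</p></body>')
links.close()
print(links.total, links.within_page, links.embedded)  # 3 1 2
```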

7) Number of !'s
Exclamation points on a page can be calculated by counting the total number of ! marks on the page.

8) Page title length
This refers to the words in the page title and can be calculated by counting the total number of words in the page title.
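The exclamation point count and the page title length can be sketched together, since both are simple counts over parsed text. The Python sketch below is illustrative only; it assumes that exclamation points are counted in visible text outside the title (markup such as `<!DOCTYPE>` and comments is excluded), and the name `TitleAndExclamations` is hypothetical.

```python
from html.parser import HTMLParser


class TitleAndExclamations(HTMLParser):
    """Counts words in <title> and '!' characters in other visible text."""

    def __init__(self):
        super().__init__()
        self.title_words = 0
        self.exclamations = 0
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments and the doctype never reach handle_data, so their '!'
        # characters are not counted.
        if self._in_title:
            self.title_words += len(data.split())
        elif not self._skip_depth:
            self.exclamations += data.count("!")


stats = TitleAndExclamations()
stats.feed("<!DOCTYPE html><html><head><title>Best Deals Online!</title></head>"
           "<body><!-- note! --><p>Hurry! Sale ends today!</p></body></html>")
stats.close()
print(stats.title_words, stats.exclamations)  # 3 2
```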

9) Number of graphics
This refers to the total number of images on a page and can be calculated by counting the images present on the page.

10) Page size
This refers to the total size of the web page and can be found in the properties option of the web page.

11) Number of lists
This metric can be calculated by counting the total number of ordered and unordered lists present on a web page.

12) Number of tables
This metric answers the question: how many tables are used in making a web page?

13) Frames
This metric can be calculated by checking whether or not a web page contains frames.
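The counts of graphics, lists and tables, and the frames check, all reduce to counting start tags. The following Python sketch illustrates this under stated assumptions: graphics are `<img>` elements, lists are `<ul>` and `<ol>` elements, and a page "contains frames" if it has at least one `<frame>` or `<iframe>` element. The function name `structure_metrics` is hypothetical and is not part of the JSP tool.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img/> are also routed here by HTMLParser.
        self.counts[tag] += 1


def structure_metrics(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    c = parser.counts
    return {
        "graphics": c["img"],
        "lists": c["ul"] + c["ol"],
        "tables": c["table"],
        "has_frames": (c["frame"] + c["iframe"]) > 0,
    }


result = structure_metrics(
    '<body><img src="a.png"><img src="b.png"/>'
    "<ul><li>x</li></ul><ol><li>y</li></ol>"
    "<table><tr><td>1</td></tr></table></body>"
)
print(result)
```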

14) Text emphasis
This metric can be calculated by analyzing the web page and counting the total number of words that are in bold, in italics or in capitals.
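A rough Python sketch of the text emphasis count is given below. It is an approximation under assumptions not taken from this paper: bold and italic text is detected only through the `<b>`, `<strong>`, `<i>` and `<em>` elements (emphasis applied through style sheets is missed), a capitalised word is a token of at least two characters that is entirely upper case, and script and style contents are not filtered. The names used are hypothetical.

```python
from html.parser import HTMLParser

EMPHASIS_TAGS = {"b", "strong", "i", "em"}


class EmphasisCounter(HTMLParser):
    """Counts words that are bold, italic or written in capitals."""

    def __init__(self):
        super().__init__()
        self.emphasized_words = 0
        self._emphasis_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self._emphasis_depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self._emphasis_depth:
            self._emphasis_depth -= 1

    def handle_data(self, data):
        for word in data.split():
            is_capitalised = len(word) > 1 and word.isupper()
            # A word that is both bold and capitalised is counted once.
            if self._emphasis_depth or is_capitalised:
                self.emphasized_words += 1


def count_emphasized_words(html):
    parser = EmphasisCounter()
    parser.feed(html)
    parser.close()
    return parser.emphasized_words


print(count_emphasized_words(
    "<p>Buy <b>two items</b> and get <em>one</em> FREE today</p>"
))  # 4: two, items, one, FREE
```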

III. RESEARCH METHODOLOGY
This study calculates quantitative web page metrics, for example number of words, body text words, number of graphics, emphasized body text and number of links, from the web pages that were evaluated for the 2007-2010 webby awards.
The organizers of the webby awards place the sites in 70 categories, for example travel, sports, science, fashion, student, youth, education, school university, etc.
The Webby Awards is the leading international award honoring excellence on the Internet. Established in 1996 during the Web's infancy, the Webbys are presented by The International Academy of Digital Arts and Sciences, which includes an Executive 750-member body of leading Web experts, business figures, luminaries, visionaries and creative celebrities, and Associate Members who are former Webby Award Winners and Nominees and other Internet professionals.
The Webby Awards presents two honors in every category (The Webby Award and The People's Voice Award) in each of its four entry types: Websites, Interactive Advertising, Online Film & Video and Mobile Web. Members of The International Academy of Digital Arts and Sciences select the nominees for both awards in each category, as well as the winners of the Webby Awards. The online community, however, determines the winners of The People's Voice Award by voting for the nominated work it believes to be the best in each category [8].
For our study we take all 70 categories, for example travel, sports, science, fashion, student, youth, education, school university, etc. These categories contain about 514 sites. Our main aim is to determine the subset of metrics that can be used to classify the sites into good and bad sites.

A. Data Collection
The web sites are taken from the webby awards. We collected about 514 sites from various categories; only the home page of each site is collected for evaluation. A site has three levels of pages: level 1, level 2 and level 3. The level 1 page is the home page. Level 2 consists of pages that are accessible directly from the home page, and level 3 consists of pages that are accessible from level 2 but not from the home page. In this paper we consider only level 1 pages.
The data collection process is explained in the block diagram of Figure 1.
The data points for each year are tabulated in Table II. From that table we can conclude that the total number of data points is 514.

B. Description of tool
To automate the study of web page metrics we developed a tool for calculating 15 web page attributes. We use JSP for this purpose. JSP technology is one of the most powerful, easy to use and fundamental tools in a Web site developer's toolbox. JSP technology combines HTML and XML with Java servlet (server application extension) and JavaBeans technologies to create a highly productive environment for developing and deploying reliable, interactive, high-performance, platform-independent web sites. JSP technology facilitates the creation of dynamic content on the server. It is part of the Java platform's integrated solution for server-side programming, which provides a portable alternative to other server-side technologies, such as CGI. JSP technology integrates numerous Java application technologies, such as Java servlet, JavaBeans, JDBC, and Enterprise JavaBeans. It also separates information presentation from application logic and fosters a reusable-component model of programming [9].
With the above-mentioned tool we can calculate different web attributes. We can select all the attributes or only some of them from the list. We can also save the result for further use. The interface of the tool is shown in Figure 2. Each website's home page is preprocessed before it is input to the tool.

Logistic Regression: LR is a common technique that is widely used to analyze data. It is used to predict a dependent variable from a set of independent variables. In our study the dependent variable is good/bad and the independent variables are the web metrics. LR is of two types: (1) univariate LR and (2) multivariate LR.
Univariate LR is a statistical method that formulates a mathematical model depicting the relationship between the dependent variable and each independent variable. Multivariate LR is used to construct a prediction model for the goodness of design of web sites; it models the probability of the outcome as a logistic function of a linear combination of the independent variables.

In LR, two stepwise selection methods, forward selection and backward elimination, can be used [10]. In the forward stepwise procedure, variables are examined one at a time for entry into the model at each step. The backward elimination method starts with all independent variables in the model; variables are then deleted one at a time until the stopping criteria are fulfilled.
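For reference, the multivariate LR model mentioned above is commonly written in the following standard textbook form. The notation is an assumption of this note rather than taken from the paper: $X_1, \dots, X_n$ are the independent variables (here, the web metrics), $A_0, \dots, A_n$ are the estimated coefficients, and $\pi$ is the predicted probability of the outcome.

```latex
\pi(X_1, X_2, \dots, X_n) =
  \frac{e^{A_0 + A_1 X_1 + \dots + A_n X_n}}
       {1 + e^{A_0 + A_1 X_1 + \dots + A_n X_n}}
```

The univariate model is the special case $n = 1$.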
We used the forward selection method to analyze the 2007-2009 webby awards data and the backward elimination method for the 2010 webby awards data.
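The forward selection procedure can be illustrated with the following self-contained Python sketch. It is not the analysis used in this paper, which does not state the software or the entry criterion. The sketch assumes a likelihood-ratio style criterion: a variable enters only if it raises the log-likelihood by at least `min_gain`, where the default 1.92 is half of 3.84, the 5% critical value of the chi-square distribution with one degree of freedom. The model is fitted by plain gradient ascent, and the function names and the toy data are hypothetical.

```python
import math


def fit_log_likelihood(X, y, columns, lr=0.5, iterations=3000):
    """Fit a logistic regression on the given columns by gradient ascent
    and return the log-likelihood of the fitted model."""
    n = len(y)
    rows = [[1.0] + [row[c] for c in columns] for row in X]  # 1.0 = intercept
    weights = [0.0] * (len(columns) + 1)
    for _ in range(iterations):
        grad = [0.0] * len(weights)
        for features, label in zip(rows, y):
            z = sum(w * f for w, f in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, f in enumerate(features):
                grad[j] += (label - p) * f
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    log_likelihood = 0.0
    for features, label in zip(rows, y):
        z = sum(w * f for w, f in zip(weights, features))
        p = 1.0 / (1.0 + math.exp(-z))
        log_likelihood += label * math.log(p) + (1 - label) * math.log(1 - p)
    return log_likelihood


def forward_select(X, y, min_gain=1.92):
    """Add, one at a time, the column that most improves the log-likelihood,
    stopping when the best improvement is below min_gain."""
    selected = []
    remaining = list(range(len(X[0])))
    current = fit_log_likelihood(X, y, selected)  # intercept-only model
    while remaining:
        best, best_col = max(
            (fit_log_likelihood(X, y, selected + [c]), c) for c in remaining
        )
        if best - current < min_gain:
            break
        selected.append(best_col)
        remaining.remove(best_col)
        current = best
    return selected


# Toy data (hypothetical): column 0 is related to the label, column 1 is not.
col0 = [0, 0, 0, 0, 1, 1, 1, 1] * 2
col1 = [0, 1, 0, 1, 0, 1, 0, 1] * 2
labels = [0, 0, 0, 1, 1, 1, 1, 0] * 2
data = [[a, b] for a, b in zip(col0, col1)]
print(forward_select(data, labels))  # [0]
```

Backward elimination would run the same comparison in the opposite direction, starting from all columns and removing the one whose removal costs the least log-likelihood.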

V. ANALYSIS RESULTS
We employed statistical techniques to describe the nature of the 2007-2010 webby awards data. We also applied Logistic Regression to build prediction models and to examine the differences between good and bad design. This section presents the analysis results, following the procedure described in Section IV. Descriptive statistics for each year's data are presented in subsection A and model prediction is presented in subsection B.

A. Descriptive Statistics
Tables III-VI show the minimum, maximum, mean and standard deviation (SD) of all metrics considered in this study, one table per year.
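The four summary values reported per metric can be computed as in the short Python sketch below. The input values are made up for illustration and are not taken from the webby awards data, and the sketch assumes the sample standard deviation; the paper does not say whether the sample or the population form was used.

```python
import statistics


def describe(values):
    """Return min, max, mean and sample standard deviation of one metric."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (an assumption)
    }


# Hypothetical values of one metric across eight pages.
summary = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```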

B. Model Prediction
We used Logistic Regression to discriminate good pages from bad pages. This technique is suitable when there is one dependent variable and many independent variables. In our study the dependent variable is good/bad and the independent variables are the web metrics computed from the webby awards data. We built four predictive models for identifying good pages: Model 1 for the 2007 data, Model 2 for the 2008 data, Model 3 for the 2009 data and Model 4 for the 2010 data. These models predict the goodness of website design based on the subset of metrics selected by the Logistic Regression technique. Table VII summarizes the metrics selected in each model. If the selected metrics have higher values, then we say that these attributes contribute to a bad design of a web site. Each model is described below.

MODEL 1: This model is based on the 2007 webby awards. The attributes that contribute towards a bad design are total embedded links and number of lists. If these attributes have higher values, then we predict a bad design.

The results in Table VII show that we can create a profile of good pages, that is, a set of attributes that can be used to make a good design.

VI. CONCLUSION
The goal of this research is to capture the quality of web sites. E-business is emerging, and websites are not just a medium for communication but also a product for providing services. Therefore, imparting quality, security and reliability to web sites is very important. We empirically validate the relationship between web metrics and the quality of websites using the logistic regression technique; the results are based on webby awards data from 2007-2010.
The webby awards data set is possibly the largest human-rated corpus of web sites available. Any site that is submitted for the award is examined by three judges on six criteria. It is unknown how the experts rate the websites, but we hope to have presented a step towards a new methodology for creating empirically justified recommendations for designing good web sites. In this paper we present the attributes which, if they have higher values, can lead to a bad design. From these attributes we also derive a profile of good pages.
The types of metrics explored here are only one piece of the web site design puzzle; this work is part of a larger project whose goals are to develop techniques to empirically investigate all aspects of web site design and to develop tools that help web site designers improve the quality of web pages.

VII. FUTURE WORK
In future, we will replicate this work on a larger data set and explore the tools and methods in all dimensions; with the help of that work, web site engineers will be able to simplify and improve the quality of web sites. We will also consider level 1 and level 2 web pages, because the home page has different characteristics from pages at other levels. Finally, we will propose guidelines for making effective web sites that download easily and have good scannability.

Figure 1: Block Diagram of Data Collection Process

Figure 2: Tool Interface for calculating web metrics

IV. DATA ANALYSIS METHODOLOGY
In this section we describe the methodology used to analyze the metrics data computed for 514 web sites. We use Logistic Regression to analyze the data.

MODEL 2: This model is based on the 2008 webby awards. The only attribute that contributes to a bad design is words in page title; if this metric has a higher value, then we could have a bad design.

MODEL 3: This model is based on the 2009 webby awards. In this model we also get only one attribute, number of lists, which could lead to a bad design of a website.

MODEL 4: This model is based on the 2010 webby awards. In this model we get several metrics that lead to a bad design: body text words, number of !'s, page size, number of tables and within page links. If these metrics have higher values, we will get a bad design. For Model 4 we used the backward elimination method of Logistic Regression.

TABLE II: Description of data points used in the study

TABLE III: Descriptive Statistics of year 2007 webby awards data

TABLE IV: Descriptive Statistics of year 2008 webby awards data

TABLE V: Descriptive Statistics of year 2009 webby awards data

TABLE VI: Descriptive Statistics of year 2010 webby awards data

TABLE VII: Subset of Metrics selected in each model using Logistic Regression