Semantic Web Improved with the Weighted Idf Feature

Abstract: Search engines are developing at a very fast rate. Many algorithms have been tried and tested, but users are still not getting precise results. Social networking sites are growing at a tremendous rate, and their growth has given rise to interesting new problems. These sites use semantic data to enhance their results, which offers a new perspective on how to improve the quality of information retrieval. Many text classification techniques are based on the TFIDF algorithm, and term weighting plays a significant role in classifying a text document. In this paper, we first extend queries to "keyword+tags" instead of keywords only. Second, we develop a new ranking algorithm (the JEKS algorithm) based on semantic tags from user feedback, using CiteULike data. The algorithm enhances the existing semantic web approach by using the weighted IDF feature of the TFIDF algorithm. In our experiments, the suggested algorithm provides a better ranking than Google and can be viewed as a semantic web service in the academic domain.


I. INTRODUCTION
A lot of information is available on the Internet. Search engines remain the primary infrastructure for information retrieval. However, the relevance of the result sets is often not what the user desires. This calls for a good ranking algorithm that puts the best results first.
Many popular Web services such as Delicious, CiteULike and flickr.com rely on folksonomies (Gautam and Kumar, 2012). Websites that contain such tag information include CiteULike (research paper recommender), Delicious (online bookmarking), Flickr (online photo management and sharing application), Furl (File Uniform Resource Locators), Blinklist (links saver), Diigo (collect and organize anything, e.g. bookmarks, highlights, notes and screenshots), Otavo (collaborative web search), StumbleUpon (discovery engine), Blummy (tool for quick access to favorite web services) and Folkd (saves bookmarks and links online).
Various difficulties are encountered in research on folksonomies. In spite of this, growth in this area is tremendous. Research based on social bookmarking, which lets users specify their keywords of interest, or tags, on web resources, has become increasingly popular. Social tagging, also known as social annotation or collaborative tagging, is one of the major characteristics of Web 2.0. Social-tagging systems allow users to annotate resources with free-form tags. The resources can be of any type, such as Web pages (e.g., Delicious), videos (e.g., YouTube), photographs (e.g., Flickr), academic papers (e.g., CiteULike), and so on.
In this paper, we utilize the semantic tag information associated with web pages. This information is obtained from CiteULike (a research paper recommender and online tagging system). When users submit their query, they also submit some semantic description to disambiguate it. Then, by matching the semantic description between the query and the web page, the user's query intent can be well understood. A better understanding of the user's query leads to better ranking results in the academic domain.
In this paper, the following approach has been adopted. We use the metadata available in the form of user feedback and semantic tags from CiteULike.
a) A new ranking algorithm has been developed. The algorithm utilizes the weighted IDF feature of the TFIDF algorithm.
b) The query was expanded. The idea was to use "keyword + tags" instead of keywords only, so that the query carries some semantic description along with it.
c) The data was obtained through CiteULike.
d) The performance analysis was done by comparing the approach with Google using several evaluation methods.
The paper is organized as follows. It begins with an introduction to the existing ranking methods, then presents the new optimized JEKS algorithm, followed by the significance of the algorithm. Thereafter, the experiments and analysis are described, followed by the significance and relevance of the research work. Finally, the paper is concluded.

II. THE EXISTING RANKING METHODS
Tf-idf (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to a document in a corpus. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
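As a minimal sketch of the statistic described above, the following uses the common form tf × log(|D| / df). The natural logarithm, the function name `tf_idf` and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Return tf * log(|D| / df) for `term` in `doc`; 0.0 if no document contains the term."""
    tf = Counter(doc)[term]                      # occurrences of the term in this document
    df = sum(1 for d in corpus if term in d)     # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["semantic", "web", "search"],
    ["web", "ranking", "web"],
    ["tag", "ranking"],
]
print(tf_idf("web", corpus[1], corpus))  # tf = 2, df = 2, so 2 * log(3 / 2)
```

A term that occurs often in one document but in few documents overall receives a high value, which is the behaviour the paragraph above describes.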
Lu, Li, Bai and Wang (2000) provide an improved approach named tf.idf.IG, which uses Information Gain from information theory to remedy a defect of tf-idf.
The Lingo algorithm proposed by Osinski and Weiss (2005) combines common phrase discovery and latent semantic indexing techniques to separate search results into meaningful groups. It looks for meaningful phrases to use as cluster labels and then assigns documents to the labels to form groups. Wu, Zhang and Yu (2006) explored the technique of social annotations for the Semantic Web. These annotations are made manually by ordinary web users without a predefined formal ontology. The evaluation of the approach shows that the method can effectively discover semantically related web bookmarks that current social bookmark services cannot discover easily. Farooq, Kannampallil and Song (2007) use six tag metrics to understand the characteristics of a social bookmarking system. Possible design heuristics were suggested for implementing a social bookmarking system for CiteSeer using the metrics.
Cilibrasi and Vitanyi (2007) described a technique for calculating the Google similarity distance.
Jin, Lin and Lin (2008) proposed the architecture of a semantic search engine and an improved algorithm based on the TFIDF algorithm. The algorithm considers crawling of static web pages; it could also be considered for crawling dynamic web pages and for parallel crawling.
A personalized search framework that utilizes folksonomy was proposed by Shenliang, Shenghua and Fei (2008). Another method built on the basic TFIDF model uses a supervised term weighting approach (Jiang, Hu, Li and Wang, 2009). The model uses class information to compute the weighting of the terms. The approach is based on the assumption that low-frequency terms are important and high-frequency terms are unimportant, so it frequently assigns higher weights to rare terms.
Jomsri, Sanguansintukul and Choochaiwattana (2010) proposed a framework for a tag-based research paper recommender system. User self-defined tags were used to create a profile for each individual user, and cosine similarity was used to compare a user profile and a research paper index. The recommender system demonstrated an encouraging preliminary result, with overall accuracy of up to 91.66%, although the number of subjects in the experiment was small. Zhao and Zhang (2010) proposed a new viewpoint on how to improve the quality of information retrieval. The queries are extended to "keywords+tags" instead of keywords only. A new tag-based ranking algorithm (OSEARCH) was proposed, and the results obtained were compared with Google by several evaluation methods.
Leung and Lee (2010) focused on search engine personalization and developed several concept-based user profiling methods that are based on both positive and negative preferences. The proposed methods were evaluated against the previously proposed personalized query clustering method. Kaczmarek (2010) introduced a novel approach to interactive query expansion. When a user executes a query, the algorithm shows potential directions in which the search can be continued.
Another supervised term weighting method, proposed by Zhanguo, Jing, Liang, Xiangyi and Yanqin (2011), provides an improved tf-idf-ci model to compute the weighting of the terms. The method uses intra and inner class information.
Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. Many works in the literature have improved tf-idf. The proportion in which terms are distributed across a text collection is one of the most important factors in expressing the content of a text, but it is beyond the power of tf-idf (Zhanguo, Jing, Liang, Xiangyi and Yanqin, 2011).
Yoo (2011) suggests a hybrid query processing method for the effective retrieval of personalized information on the semantic web. When individual requirements change, the current method of query processing requires additional reasoning for knowledge to support personalization. Halpin and Lavrenko (2011) proposed a method of relevance feedback between hypertext and semantic web search. Their paper investigates the possibility of using semantic web data to improve hypertext web search.
Gracia and Mena (2012) presented the web's natural semantic heterogeneity problems, namely redundancy and ambiguity. The authors' ontology matching, clustering, and disambiguation techniques aim to bridge the gap between syntax and semantics for Semantic Web construction.
Zhong, Li and Wu (2012) proposed an effective pattern discovery method for text mining. The paper presents a pattern discovery technique that includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information.
Lee, Kim and Park (2012) propose a method for searching and ranking relevant resources by user intention on the semantic web. Information searching faces more limitations as the information on the Internet increases dramatically. To overcome these limitations, the Semantic Web must provide search methods based on the different relationships between resources. Gautam and Kumar (2012) propose a framework for a tag-based Academic Information Sharing and Recommender System, which shares information such as question papers, assignments, tutorials and quizzes on a specific area. Shaikh, Siddiqui and Shahzadi (2012) proposed the Semantic Web based Intelligent Search Engine (SWISE). SWISE requires domain knowledge to be included in the web pages in order to answer intelligent queries. The layered model of the Semantic Web provides a solution to this problem by providing tools and technologies to enable machine-readable semantics in current web content.
Lee, Kim and Park (2012) presented some proposals to improve and extend the semantic approach based on conceptual neighborhood graphs, in order to best preserve the proximity between the adapted and original documents and to deal with models that define delays and distances.

A. Metadata Information in the Web Pages and Expansion of the Query
Any discussion of the semantic web brings metadata into the picture. What does "semantic" mean here, and how is it related to metadata? The Semantic Web concerns the content, meaning or metadata related to the web. This metadata information is hidden in the web pages. Several websites have been working on it for a long time. Sites such as Delicious, CiteULike and Flickr allow users to create accounts. After creating an account, a user can add metadata for different websites. This metadata conveys the content of the website as interpreted by different users.
The method should try to capture the user's real query intent. The primary purpose of a search engine is to return optimal results, but before returning the results it should be able to analyze the query clearly. Simple keywords cannot express the user's real query intent. In order to analyze the query, some metadata information is added to it by expanding the query, i.e., using keyword+tags instead of keywords only.
So, the idea is to utilize the metadata that is available in the form of semantic tags on web pages. When users submit their query, they can also submit a simple semantic description to narrow it down.
Then, by matching the semantic information between the query and the web page metadata, we can understand the user's query intent better and return better results.
Here, we propose a new algorithm based on semantic tags and the weighted IDF feature of the TFIDF algorithm.

B. Storage of Semantic Tags on Web Pages
The semantic tags of a web page are object properties that reflect the content of the web page. For example, a page marked with "semantic web" contains information about the object "semantic web". There may be multiple tags on a page, because pages usually contain several kinds of information. These tags carry the metadata information along with them.
In our case, we store the tags from CiteULike, a popular website in academia (www.CiteULike.org). CiteULike is a free service for managing and discovering scholarly references. Additionally, it lets users:
• "Tag" papers into categories.
• Add their own comments on papers.
• Allow others to see their library.
The semantic tags are retrieved from CiteULike. The URLs along with their tags are stored in a local database. For the semantic tags, each URL is opened in CiteULike and the tags with their numeric values are stored in a MySQL database. The data was retrieved from CiteULike from April 2012 to June 2013 for the 50 queries. A total of 5000 URLs were opened in CiteULike to create the database.

A. Utilizing the Weighted Inverse Document Frequency
In this paper, we propose an enhanced semantic web algorithm based on semantic tags in web pages. The algorithm utilizes the metadata information available with the web pages and integrates some good features of the weighted IDF.
When α = 0, (1) becomes the classic TFIDF approach, and when α = 1, (1) becomes our newly improved approach. Using the balance factor α, we can get better classification results. Equation (1) is applicable to the terms of a document, and the same equation can be used for tags as well. As an example, suppose three tags tag1, tag2 and tag3 of the category Ci share the same tf-idf value but have different proportions of A and C. The tags with higher values of the weighting factor make a greater contribution to the category Ci. The plain tf-idf approach, unlike the weighted one, gives equal weights to the three tags.

B. A New Optimized Ranking Algorithm -JEKS (Jyoti and Ela Kumar Search) algorithm
When users submit a query, instead of giving it only in the form of keywords, they expand it by adding some metadata information. The algorithm then compares the tags entered in the query with the semantic information on the web pages in order to provide the user with better results.
Accordingly, the user query can be expressed as: Query = {keyword1, keyword2, …, tag1, tag2, …}. In this formulation, keyword1, keyword2, … are the main query keywords, and tag1, tag2, … are the semantic information added to expand the query. For example, Query = {research papers, web mining} represents that the user wants to find research papers on web mining.
Similarly, Query = {resources, information retrieval} represents that the user wants to find resources in the field of information retrieval.
Once the query is submitted, the system creates a vector of all the user tags: V_usrt = {user_tag1, user_tag2, …}. The search engine returns an initial list of result pages, and the vector of all the tags on the result pages is recorded: V_rest = {r_tag1, r_tag2, …}, where r_tag1, r_tag2, … represent semantic tags on the result pages.
The similarity is calculated between the two tag vectors, and recorded as a Tg_score.
Here, google_score represents the original Google result score when the query is applied. Consistent with the numerical example that follows, it is computed from the position of the result:

$$\text{google\_score} = \frac{p - q + 1}{p}$$

Here, p represents the total number of documents, which is 100 in the experiment, and q represents the position of the document in the search engine's result list. So, google_score for the 6th result is (100 - 6 + 1) / 100 = 0.95. In (3), Tg_score is calculated by matching the tags of the user with the tags of the result page. The match between the two vectors is based on the following factors.
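The position-based score described above can be sketched as follows. The function name and the range check are illustrative assumptions; the formula itself follows the worked value for the 6th result.

```python
def google_score(position, total=100):
    """Position-based score (p - q + 1) / p, where q is the 1-based rank and p the list length."""
    if not 1 <= position <= total:
        raise ValueError("position must lie between 1 and total")
    return (total - position + 1) / total

print(google_score(6))   # 6th of 100 results
print(google_score(10))  # 10th of 100 results
```

The first result receives a score of 1 and the score decreases linearly by 1/p with each position.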
1) The similarity between the user tag vector and the web page tag vector. A high similarity between the two vectors yields a high value.
2) The weight of the tags on the result pages. Weight refers to the frequency of the tags on the result pages that match the tags of the user.
Based on these factors, and consistent with the worked example for the query {artificial intelligence, pdf}, Tg_score is the frequency-weighted average of the tag similarities (the similarity term is written here as sim):

$$\text{Tg\_score} = \frac{\sum_{i} \text{freq}(tag_i) \cdot \text{sim}(tag_i)}{\sum_{i} \text{freq}(tag_i)}$$

In the above equation, freq(tag) represents the frequency or weight of the particular tag on the result page.
The similarity term represents the similarity between the user tag vector and the result page tag vector, and is defined in (7). For example, in Query = {resources, information retrieval}, resources is the keyword and information retrieval is the tag; if only information or only retrieval appears among the tags of a result page, we take the similarity score as 0.5.
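The similarity definition (7) is not reproduced in the text, so the following is only a guess at a rule that matches the two examples given in the paper: a partial word match scores 0.5, and "pdfs" is treated as a full match for "pdf". The function names and the naive plural stripping are assumptions, not the authors' Java implementation.

```python
def _normalize(tag):
    """Lower-case a tag and strip a trailing 's' from words longer than three letters."""
    words = tag.lower().split()
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]

def tag_similarity(user_tag, page_tag):
    """Assumed rule: 1.0 for a full match, 0.5 if any word is shared, 0.0 otherwise."""
    user_words, page_words = _normalize(user_tag), _normalize(page_tag)
    if not user_words or not page_words:
        return 0.0
    if user_words == page_words:
        return 1.0
    if set(user_words) & set(page_words):
        return 0.5
    return 0.0

print(tag_similarity("information retrieval", "retrieval"))  # partial match
print(tag_similarity("pdf", "pdfs"))                         # plural treated as full match
print(tag_similarity("pdf", "research"))                     # no shared word
```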
Next, consider submitting the query Query = {artificial intelligence, pdf} to Google. The tenth result has the tags "pdf", "pdfs" and "research", with frequencies 10, 9 and 4 respectively. Then the value of Tg_score = (10*1 + 9*1 + 4*0) / (10 + 9 + 4) = 19/23. If (8) is analyzed properly, we see that if we replace words with tags, (8) can be used in the context of the semantic web. So, f_{w,d} has already been considered as the Tg_score. What remains is log(|D|/f_{w,D}), which is the IDF score. Here, for each query, we have taken the first 100 Google results. So, for a particular query, |D| is 100 and f_{w,D} equals the number of documents in which the particular tag of the query appears. Why have we included this IDF score? Suppose that Tg_score is large and f_{w,D} is small. Then log(|D|/f_{w,D}) will be rather large, and so the score in (3) will be large. This is the case we are most interested in, since a high score implies that the tag is important for the document d but not common in D. Such a tag has large discriminatory power. Therefore, when a query contains this tag, returning a document d whose score is large will very likely satisfy the user.
We then multiply this IDF score by the weighting factor, whose significance has already been mentioned. As an example, let us replace the terms with tags in (8). If the value of (Tg_score * IDF score) is similar for different tags, the weighting factor is used to differentiate the results. Tags with a higher weighting are preferred, as they have higher discriminating power for the category Ci than tags with a lower weighting factor. Tags with a low weighting may be rare tags in the category Ci. Now, calculating (IDF score * weighting factor) for Query = {books, artificial intelligence}, suppose the number of documents in which the tag artificial intelligence appears is 30 and the value of |D| is 100. Then the IDF score is log(100/30) and the weighting factor is 30/70.
In (6), we use Java functions to calculate the similarity between user tags and result tags. The database is created using MySQL.
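The worked numbers in this section can be reproduced with a short sketch. It is written in Python rather than the Java used by the authors, the function names are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math

def tg_score(tags):
    """Frequency-weighted average similarity; `tags` is a list of (frequency, similarity) pairs."""
    total = sum(freq for freq, _ in tags)
    if total == 0:
        return 0.0
    return sum(freq * sim for freq, sim in tags) / total

def idf_score(num_docs, docs_with_tag):
    """log(|D| / f_{w,D}); natural log assumed."""
    return math.log(num_docs / docs_with_tag)

def weighting_factor(docs_with_tag, docs_without_tag):
    """Ratio of documents with the tag to documents without it (30/70 in the text's example)."""
    return docs_with_tag / docs_without_tag

# Tenth result for {artificial intelligence, pdf}: tags pdf, pdfs, research.
tenth_result = [(10, 1.0), (9, 1.0), (4, 0.0)]
print(tg_score(tenth_result))                           # equals 19/23
print(idf_score(100, 30) * weighting_factor(30, 70))    # log(100/30) * (30/70)
```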

V. SIGNIFICANCE OF THE JEKS ALGORITHM
The JEKS algorithm developed above is effective when the (Tg_score * IDF) score is similar for different tags in a category Ci. From the proportions of Ai and Ci, it can easily be seen that the three tags show different discriminating power to TC (see TABLE II). The weighting factor can be used to differentiate the results. Tags with a higher weighting are preferred, as they have higher discriminating power for the category Ci than tags with a lower weighting factor; tags with a low weighting may be rare tags in the category Ci. For example, take the class Ci to be research papers and three different tags: mobile computing, data mining and semantic web. The three corresponding queries are {research papers, mobile computing}, {research papers, data mining} and {research papers, semantic web}. In the particular case where Tg_score and the IDF score are similar for the three tags of the class Ci, Ai/Ci is used to produce three different TotalScore values (see (3)), and hence different rankings. TABLE II shows that tag3 gives higher power to discriminate the category Ci from other categories than tag1 and tag2. Tag1 may be a rare tag in the category Ci and makes little contribution to it. So, the TotalScore will be highest for tag3, lowest for tag1, and in between for tag2.
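To illustrate the tie-breaking role of the weighting factor, the sketch below multiplies an identical (Tg_score * IDF) value by different A/C ratios. The numeric values are hypothetical and are not taken from TABLE II; only the ordering they produce mirrors the description above.

```python
def weighted_tag_term(tg_idf, a, c):
    """(Tg_score * IDF) multiplied by the weighting factor A / C."""
    return tg_idf * (a / c)

# Hypothetical values: the same Tg_score * IDF for all three tags, different A and C.
tags = {
    "tag1": (0.5, 10, 90),
    "tag2": (0.5, 30, 70),
    "tag3": (0.5, 60, 40),
}
ranked = sorted(tags, key=lambda name: weighted_tag_term(*tags[name]), reverse=True)
print(ranked)  # ['tag3', 'tag2', 'tag1']
```

With plain tf-idf the three tags would tie; the A/C ratio separates them.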

VI. EXPERIMENTS AND ANALYSIS
The experiments are performed as follows:
1) Submit the query to Google and obtain the original Google search results.
2) Submit the Google search results to CiteULike to obtain the relevant tags.
3) Re-rank the search results according to our algorithm.
4) Compare the Google results with the results of our algorithm.
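The re-ranking step can be sketched as below. It assumes the total score is the position-based Google score plus a tag-based term, an additive form suggested by the partially printed worked example ("total score = (0.97) + ...") but not confirmed by a surviving equation. The URLs and tag terms are hypothetical.

```python
def rerank(results, tag_terms, total=100):
    """Re-rank `results` (URLs in original Google order) by google_score + tag-based term.

    `tag_terms` maps a URL to its tag-based term; URLs without tags contribute 0.0.
    The additive combination is an assumption, not the paper's confirmed Equation (3).
    """
    scored = []
    for position, url in enumerate(results, start=1):
        base = (total - position + 1) / total
        scored.append((base + tag_terms.get(url, 0.0), url))
    # Python's sort is stable, so ties keep their original Google order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]

results = ["u1", "u2", "u3"]   # hypothetical URLs in Google order
tag_terms = {"u3": 0.5}        # hypothetical tag-based term for one URL
print(rerank(results, tag_terms))  # ['u3', 'u1', 'u2']
```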

A. Data Set
Query Set: First, we determine the queries to be input to the search engine. We use a total of fifty queries, each a combination of keywords and tags. The queries are from the academic domain, as CiteULike provides tags for academic content. These queries are submitted to Google.
Result Set: We submit each query to Google and record the first 100 results. The result set for the 50 queries thus contains 5000 results.
Results Tag Set: We submit the 5000 results to CiteULike and record the resulting tag vector. A single result typically yields many tag values.
For example, suppose the user submits the query "resources, genetic algorithm" to Google. The 4th Google result has the tag values genetic algorithm = 37, genetic = 35, algorithm = 27, pedestrian navigation = 23 and navigation = 12, and the tag genetic algorithm appears in 40 URLs. So, according to the above algorithm, the total score = (0.97) + (0.

B. Experimental Results
First, we determine the relevance between each query intent and each result page. Each result is assigned a relevance score from 0 to 3 (0 = totally irrelevant, 1 = basically irrelevant, 2 = basically relevant, 3 = totally relevant).
We obtained normalized DCG values for the 50 queries for our algorithm as well as for Google, as given in Table 3. Fig. 2 shows the normalized DCG values of the 50 queries and compares our algorithm with Google. It can be seen that our algorithm acquires higher DCG values than Google for 40 queries.
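For reference, normalized DCG over graded relevance scores can be computed as in the sketch below. The paper does not state which DCG variant was used; the common form rel_i / log2(i + 1) is assumed here, and the sample relevance list is hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the rel / log2(rank + 1) discount (assumed variant)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance scores (0 to 3) of a ranked result list.
print(ndcg([3, 2, 3, 0, 1]))
```

A ranking that places the most relevant results first reaches a value of 1, so a higher value indicates a better ordering of the same result set.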
Next, we use Precision@k curves for various relevance levels.
The following conclusions can be drawn from Fig. 3 to Fig. 5. Our algorithm acquires higher precision than Google across the varying levels of K for the 50 queries, with one exception: when Rel>=3, the Google results are initially better, as can be seen from Fig. 5. As expected, the results obtained for Rel>=1 are the best: the precision for Rel>=1 is better than for Rel>=2, which in turn is better than for Rel>=3.
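Precision@k at a given relevance level can be sketched as follows; the function name and the sample relevance list are illustrative assumptions.

```python
def precision_at_k(relevances, k, threshold):
    """Fraction of the top-k results whose relevance score is at least `threshold`."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for rel in relevances[:k] if rel >= threshold)
    return hits / k

ranked_relevances = [3, 2, 0, 1, 2]   # hypothetical scores of the top five results
for level in (1, 2, 3):
    print(level, precision_at_k(ranked_relevances, 5, level))
```

Because every result with Rel>=3 also satisfies Rel>=2 and Rel>=1, precision can only decrease as the threshold rises, which is why the Rel>=1 curves are the highest.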
We computed precision, recall and F1-score for our algorithm and Google (Table 4). These values are calculated over the corresponding top 50 results for Rel>=2 for all 50 queries. The recall for both our algorithm and Google remains at 1, as we have re-ranked the top 100 Google results for each query. The precision and F1-score are better for our algorithm.
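Since recall is reported as 1 for both rankings, the F1-score depends only on precision. Using the standard definition, with P and R denoting precision and recall (symbols introduced here for illustration):

$$F_1 = \frac{2PR}{P + R}, \qquad R = 1 \;\Rightarrow\; F_1 = \frac{2P}{1 + P}$$

Because 2P/(1 + P) increases with P, any gain in precision carries over directly to the F1-score in this setting.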

VII. SIGNIFICANCE OF THE RESEARCH WORK
Since the authors work in academia, the academic domain was chosen for this work. We selected 50 queries applicable to the academic domain. The queries focus on retrieving books in different fields of computer science, research papers in different fields of electronics and computers, resources in the respective fields, and PDFs in various fields of computers. For each query, we retrieved the first 100 Google results and submitted those 100 URLs to CiteULike to retrieve metadata (i.e., tags). In total, we retrieved 5000 URLs and the corresponding tags with their weights. The Google results for those queries were then re-ranked using our algorithm.
The JEKS algorithm was applied to the 5000 URLs (corresponding to the 50 queries). Its normalized DCG was higher than Google's for 40 of the 50 queries, and it acquired higher precision than Google across the varying levels of K.
Precision, recall and F1-score were computed for our algorithm and Google over the top 50 results for Rel>=2 for all 50 queries. Recall remains at 1 for both, as we re-ranked the top 100 Google results for each query, while precision and F1-score are better for our algorithm. So, the significance of this research work is that a better ranking system has been developed for retrieving results in the academic domain. The results can be extended to include more queries.

VIII. RELEVANCE OF MY RESEARCH WORK
The relevance of the research work is that the entire work has been done using semantic tags from CiteULike, which provides tags in a fully uncontrolled environment. The algorithm is entirely based on tags, which are the essence of the semantic web. So, it can be taken as an application or a web service in the academic domain using the semantic web. The algorithm can be extended to more queries.

IX. CONCLUSION
In this paper, we have analyzed some existing ranking methods and proposed a new algorithm based on the previous methods. The semantic tags of a web page are the metadata associated with it and convey a lot about its content. The degree of match between the user's real query intent and the web page content is determined by calculating the similarity between the query tags and the web page tags.
We have proposed the new algorithm using the existing semantic web algorithm, which calculates the weighted score of the tags. We have utilized the IDF feature of the TFIDF algorithm to improve the tag-based semantic web approach and, in addition, we have used a weighting score. In the experiments, we collected data from CiteULike and implemented the above algorithm. The relevance scores for the different web links were given by a group of users. Comparing with the Google search results, we find that the JEKS algorithm acquires better ranking results and puts more relevant results first. Our algorithm acquires higher DCG values than Google for 40 queries, and higher precision than Google across the varying levels of K for the 50 queries.
In future work, we will further improve the algorithm.


• Easily store references you find online
• Discover new articles and resources
• Automated article recommendations
• Share references with your peers
• Find out who's reading what you are reading
• Store and search your PDFs
CiteULike has a filing system based on tags. Tags provide an open, quick and user-defined classification model that can produce interesting new categorizations.

Tg_score = (10*1 + 9*1 + 4*0) / (10 + 9 + 4) = 19/23 and google_score = (100 - 10 + 1)/100 = 0.91. Next in (3) is the IDF score multiplied by the weighting. From the TFIDF algorithm, given a document collection D, a word w, and an individual document d ∈ D, we calculate

$$w_d = f_{w,d} \cdot \log\left(\frac{|D|}{f_{w,D}}\right) \tag{8}$$

where f_{w,d} equals the number of times w appears in d, |D| is the size of the corpus, and f_{w,D} equals the number of documents in D in which w appears. Words with high w_d imply that w is an important word in d but not common in D.

Fig. 1. The number distribution of specific tags versus different tags in a result set

TABLE I. RELATION OF TERM tk AND CATEGORY Ci
A indicates the number of documents belonging to category Ci where the term tk occurs at least once; B indicates the number of documents not belonging to category Ci where the term tk occurs at least once; C denotes the number of documents belonging to category Ci where the term tk does not occur; D denotes the number of documents not belonging to category Ci where the term tk does not occur.

TABLE II. THREE TAGS WHICH SHARE THE SAME (TG_SCORE*IDF SCORE) BUT HAVE DIFFERENT PROPORTIONS OF AI AND CI IN A CATEGORY CI

TABLE IV. PRECISION AND F1-SCORE FOR OUR ALGORITHM AND GOOGLE