Efficient Proposed Framework for Semantic Search Engine using New Semantic Ranking Algorithm

The amount of information held in databases grows by billions of items every year, and there is an urgent need to search that information with a specialized tool called a search engine. Many search engines are available today, but the main challenge is that most of them cannot retrieve meaningful information intelligently. Semantic web technology is a solution that keeps data in a machine-readable format, helping machines match this data with related information based on meaning. In this paper, we introduce a proposed semantic framework that includes four phases: crawling, indexing, ranking, and retrieval. This semantic framework operates over sorted RDF data by using an efficient proposed ranking algorithm and an enhanced crawling algorithm. The enhanced crawling algorithm crawls relevant forum content from the web with minimal overhead. The proposed ranking algorithm orders and evaluates similar meaningful data so that the retrieval process becomes faster, easier, and more accurate. We applied our work to a standard database and achieved 99 percent effectiveness on semantic performance in minimum time, with an error rate of less than 1 percent, compared with other semantic systems.
Keywords—Semantic Search Engine; Ontology; Semantic Ranker; Crawler; RDF; SPARQL


I. INTRODUCTION
There is a huge amount of data stored on the Internet that is only useful and helpful if accessed as information, not as pure data. To access information from the Internet, we need a 'smart or intelligent' search facility. Search engines are the tools that help users find data in the huge warehouse of web pages. To extract data, most search engines use syntax-based search or full-text search methods. Full-text searching is a technique whereby a computer program matches terms in a search query with terms embedded within individual documents in a database [1]. An important issue in the full-text searching technique is that "because full-text searching relies on linguistic matching - matching a word or phrase in a search query with the same word or phrase in a document in the database being searched - it is subject to failure when a variant term exists and is not matched" [2].
Recent syntax-based search engines use various techniques, such as page ranking and content score, to overcome the limitations of a syntax-based search [3]. To get web pages ranked by search engines, website developers use a method called Search Engine Optimization (SEO). Keywords and meta-tags are the main tools used for SEO. These methods improve user friendliness and increase the chances of more accurate results, but they are not the ultimate solution. Data searched by a syntax-based search engine has some limitations, including high recall with low precision (e.g., thousands of results in response to one or a few keywords).
The semantic web is an optimized solution to these challenges. The Semantic Web can be defined as a set of documents linked in such a way that the data becomes readable and understandable in a meaningful way [4]. One way of viewing the semantic web is as a concept of utilizing the Internet so that searching the World Wide Web returns results relevant to the meaning of the search query. On the Semantic Web, information is represented via a W3C model called the Resource Description Framework (RDF). A semantic search system is a search system for the Semantic Web. Existing Web sites can be utilized by both individuals and computers to locate and gather exactly the information available on the Semantic Web. Ontology is the most significant concept used in the semantic web infrastructure, and RDF(S) (Resource Description Framework/Schema) and the Web Ontology Language (OWL) are used to represent ontologies.
In recent years, the Resource Description Framework (RDF) has become a popular protocol for storing web-based data with well-defined meanings, and its use for linking data to improve semantic meaning has widened the scope of this protocol. While RDF data is routinely used by many organizations (e.g., gov.uk and bbc.co.uk), its potential to improve semantic searches is now of interest to the database and Internet research communities [5]. The Semantic Web will support more professional discovery, automation, and reuse of data and offer some support for combinational problems that cannot be solved with existing web techniques. Research on semantic web search systems is no longer in its beginning stage, yet traditional search systems such as Google, Yahoo, and Bing (MSN) still lead the present search market.
In this manuscript, we propose a new semantic ranking algorithm based on HTML parsing and a crawling algorithm to handle the search engine challenges discussed in Section 3. Experimental results show that the proposed technique is more efficient at retrieving and sorting large amounts of data from huge datasets or data warehouses. The manuscript is organized as follows. Section 2 discusses related work on semantic systems, and Section 3 discusses the global challenges that face search engines. Section 4 describes the components of the proposed framework and the proposed ranking algorithm. In Section 5, experimental results and analysis are presented, and the last section gives conclusions and references.
www.ijacsa.thesai.org

II. RELATED WORKS
Information retrieval by searching the web is not a new idea, but it poses different problems when compared with general information retrieval. Different search systems return different search results due to differences in their indexing and searching processes. Google, Yahoo, and Bing are established systems that handle queries after processing the keywords.

The Swoogle system described in W3C and DuckDuckGo have some limitations, especially regarding user experience, query response time, and storage capacity. As shown in figure 1, Swoogle's architecture can be broken into four major components: SWD discovery, metadata creation, data analysis, and interface. The Swoogle architecture is data centric and extensible: different components work on different tasks independently. Swoogle offers services such as searching SW terms and documents (i.e., URIs that have been defined as classes and properties), providing metadata of SW documents, and supporting browsing of the Semantic Web. However, Swoogle has some limitations, such as poor indexing of documents and long query response time [6].

Fig. 1. Swoogle architecture

Another example of a semantic search engine is the Semantic Web Search Engine (SWSE). Following traditional search engine structural design, SWSE consists of crawling, data enhancement, indexing, and a search interface for retrieving information; unlike traditional search engines, SWSE works over RDF Web data, also known as Linked Data, which implies unique challenges for the system design, architecture, algorithms, implementation, and user interface [7].
The high-level system design of SWSE loosely follows that of conventional HTML search engines. Figure 2 details the pre-runtime architecture of the SWSE system, showing the components involved in building a local index of RDF Web data amenable to search. Like conventional search systems, SWSE includes modules for crawling, ranking, and indexing data; in addition, there are components specially designed for handling RDF data, namely the consolidation module and the reasoning module. The high-level index building process is as follows:
• The crawler takes a set of seed URIs and retrieves a large set of RDF data from the Web;
• The consolidation module tries to find identical (i.e., equivalent) identifiers in the data;
• The ranking module performs links-based analysis over the crawled data and derives scores indicating the significance of individual elements in the data (the ranking module also considers URI redirections encountered by the crawler when performing the links-based analysis);
• The reasoning module materializes new data that is implied by the inherent semantics of the input data (the reasoning module also requires URI redirection information to assess the trustworthiness of sources of data);
• The indexing module builds an index that supports the information retrieval tasks required by the user interface.

Fig. 2. Results analysis for keyword query Bill Clinton; focus analysis for entity Bill Clinton
However, SWSE has some limitations, such as poor ranking of documents, because the ranking stage comes before the indexing stage: the ranking technique operates independently of the data indexed in the dataset.
Another semantic search engine model is Falcons Object Search [8], which is first of all a keyword-based object search engine. For each discovered object, the system constructs an extensive virtual document consisting of textual descriptions extracted from its concise RDF description. Then an inverted index is built from the terms in the virtual documents to the objects, supporting basic keyword-based search. That is, when a keyword query arrives, the system uses the inverted index to match the terms in the query against the virtual documents of objects and generate a result set. Unfortunately, this model does not rank these objects according to the query.
This paper investigates how the semantic web might be queried in the context of semantic search engines and proposes a framework that facilitates an effective search over the semantic web. Firstly, the various factors that influence the search experience over the Internet are reviewed. Secondly, the core semantic technologies necessary to perform a basic search over the Internet are described, namely RDF and the RDF query language SPARQL. Thirdly, the academic and social impact of this work is clarified. Finally, a proposed framework for a complete search experience for a semantic search engine is presented.
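The inverted-index lookup described for Falcons Object Search can be illustrated with a minimal sketch. The object IDs, the sample virtual documents, and the function names below are hypothetical, and requiring every query term to match (AND semantics) is an assumption of this sketch, not a detail taken from [8].

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(virtual_docs):
    """Map each term to the set of object IDs whose virtual document contains it."""
    index = defaultdict(set)
    for object_id, text in virtual_docs.items():
        for term in tokenize(text):
            index[term].add(object_id)
    return index


def keyword_search(index, query):
    """Return the objects whose virtual documents contain every query term.

    AND semantics is an assumption made for this illustration.
    """
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical virtual documents built from the textual descriptions of objects.
virtual_docs = {
    "obj:1": "Bill Clinton, 42nd President of the United States",
    "obj:2": "Hillary Clinton, former Secretary of State",
    "obj:3": "Bill Gates, co-founder of Microsoft",
}
index = build_inverted_index(virtual_docs)
print(sorted(keyword_search(index, "bill clinton")))
```

The result is an unordered set, which mirrors the limitation noted above: matching alone says nothing about which object should be listed first.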

III. CHALLENGES FOR SEMANTIC WEB SEARCH ENGINE
A semantic web search engine should be able to search data over the Internet with maximum precision and accuracy and should be able to link related data. A semantic search engine should consider the following criteria: user experience, efficiency (performance and associated time), ranking process, scalability, and cost effectiveness.

A. User Experience
A friendly user interface is the most significant feature for improving the user experience. Search engines such as Yahoo, Bing, and especially Google have all been through a number of enhancements in order to give end-users the best possible user experience. Even if the results are incomplete or sometimes inaccurate due to syntax-only search algorithms, end-users still stay with these search engines because of the good user experience. The end-user interface of a semantic query search engine needs significant development so that a poorly formed query automatically triggers suggested corrections for spelling mistakes and poor grammar and, of course, returns the best-matched results with high accuracy.

B. Efficiency
The performance of an efficient semantic search engine depends upon the size of the data to be matched, the request time to the server or database, and the associated response time. For a semantic web query, the completion time also depends on factors such as delays caused by looking up URLs (Uniform Resource Locators) [9], indexing large-scale data [5], and dealing with query termination and broken-link problems [10]. Some smart semantic search systems cannot demonstrate significant performance in improving precision and lowering recall. In Ding's semantic flash system, the source of the search system is the top-50 results returned by Google, which is not a semantic search engine, so the results could have low precision and high recall [11].

C. Ranking Process
The main idea of the semantic web search engine is to retrieve the most relevant (most precise) and accurate results in response to a query. A ranking process such as the page rank algorithm is a method of rating web pages so that the web pages with the highest ranking are presented at the top of a list of search results [12]. This is a challenging task given that there are "more than 12.3 billion web pages in the World Wide Web" at the time of writing [13], and a single user query on a search engine may return millions of results. It is consequently critical that the search engine can sort and rank the retrieved documents effectively in order of either relevancy or authenticity. There are a number of techniques used by search engines to rank page results. Page ranking techniques can organize results in order of relevance, significance, and content score [13].
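To make the idea of link-based page ranking concrete, the following is a minimal sketch of the textbook PageRank power iteration. It is not the ranking algorithm proposed in this paper; the four-page link graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank evenly over all pages.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" is linked from three other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered)
```

Pages with more (and better-ranked) incoming links end up higher in `ordered`, which is the sense in which the highest-ranked pages are presented at the top of a result list.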

D. Scalability
Scalability in the context of a search engine is the ability of a system to handle a rapidly growing amount of data. Relational database management systems have frequently been shown to be very efficient with structured relational data but not to scale well [1]. However, scalability for data in a semantic web presents additional challenges because of the open nature of the RDF protocol.

E. Cost Effectiveness
A high-quality search system must provide a solution that is cost effective. Due to the open nature of RDF, queries can be quite expensive when dealing with large data sets. For efficient data retrieval, search engines use indexing techniques at the cost of additional storage. As a semantic search engine processes an open structure like RDF, complex queries can be very expensive to process. Only a few solutions have been proposed to solve this issue, such as caching or alternative indexing techniques. Some efforts have been made to introduce cost-effective search algorithms, such as using SPARQL as the search technique [5]. Section 4 discusses the proposed framework, which addresses these challenges with a suitable user interface, similar to that of the Google search engine, to address user experience, together with a standard crawler model and an indexing algorithm to address scalability and cost effectiveness. Finally, we introduce the proposed ranking algorithm, the main contribution of this paper, to address the ranking and efficiency problems.

IV. PROPOSED FRAMEWORK
The proposed framework is designed in a modular fashion and is logically composed of two separate phases: the online phase (the retrieving phase, in which the user deals directly with the server) and the offline phase (the crawling, indexing, and ranking phases, in which the server does not deal directly with any user). This section describes in detail the two phases, which are depicted in figure 3 and figure 4.

A. Offline Phase (Crawling Phase)
In this module, three steps are used. The first step is the crawling process, which contains two sub-steps: the URL discovery process (pre-crawling) and the crawling process proper. The second step is the indexing process, which also contains two sub-steps: parsing the HTML document to extract useful information, and the indexing process proper. The third step is the semantic ranking process, which uses the proposed semantic ranking algorithm discussed in algorithm #2.

1) Crawling Process
Since our architecture is currently implemented to index RDF/XML, we would like to maximize the ratio of HTTP lookups that result in RDF/XML content; i.e., given the total number of HTTP lookups L and the total number of downloaded RDF/XML pages R, we would like to maximize the ratio R/L. In order to reduce the number of HTTP lookups wasted on content other than HTML or RDF/XML, we implement the following heuristics:
a) firstly, we blacklist non-HTTP protocol URIs;
b) secondly, we screen URIs by file extension: extensions that are highly unlikely to return RDF/XML, HTML, or PDF content, such as image and video extensions (jpg, gif, AVI, MKV, etc.), are blacklisted;
c) thirdly, we check the returned HTTP header and only retrieve the content of URIs reporting a Content-type of HTML (text/html) or application/rdf+xml.

2) Indexing Process
The first sub-step in this process is parsing the HTML document to extract useful information such as metadata, title, time stamp, author, keywords, and related URLs to be crawled in turn. The second sub-step is indexing the results of the crawling process from the first step into a database. However, huge volumes of data cannot be indexed into a relational database because of scalability, storage, sorting, semantic, and retrieval issues. For semantic reasons, we use RDF instead of a relational database, but then we face the problem of unstructured data. We therefore use a hybrid dataset that combines relational and RDF (XML-based) storage. The relational dataset stores hash tables (such as the tables shown in figure 5.a and figure 5.b) that contain keywords and the related RDF IDs. RDF stores all data extracted by the parsing process.
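The three crawling heuristics (a to c) and the R/L ratio from the crawling step can be sketched as follows. This is an illustrative sketch rather than the authors' crawler: the function names are hypothetical, the extension blacklist contains only the examples named in the text, and accepting `https` alongside `http` is an assumption of this sketch.

```python
import posixpath
from urllib.parse import urlparse

# Only the extensions named in the text; a real blacklist would be longer.
BLACKLISTED_EXTENSIONS = {".jpg", ".gif", ".avi", ".mkv"}
# Assumption: "HTML" in heuristic c) is read as the media type text/html.
ACCEPTED_CONTENT_TYPES = {"text/html", "application/rdf+xml"}
# Assumption: https is treated like http.
ALLOWED_SCHEMES = {"http", "https"}


def should_queue(uri):
    """Heuristics a) and b): reject non-HTTP URIs and blacklisted extensions."""
    parsed = urlparse(uri)
    if parsed.scheme.lower() not in ALLOWED_SCHEMES:
        return False
    extension = posixpath.splitext(parsed.path)[1].lower()
    return extension not in BLACKLISTED_EXTENSIONS


def should_download(content_type_header):
    """Heuristic c): keep only HTML or RDF/XML responses.

    Parameters such as '; charset=utf-8' are ignored.
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ACCEPTED_CONTENT_TYPES


def rdf_yield(lookups, rdf_pages):
    """Ratio R/L of downloaded RDF/XML pages to total HTTP lookups."""
    return rdf_pages / lookups if lookups else 0.0


print(should_queue("http://example.org/data.rdf"))   # accepted
print(should_queue("http://example.org/photo.JPG"))  # rejected by extension
print(should_download("application/rdf+xml"))        # accepted
```

Applying `should_queue` before any request is issued is what saves lookups; `should_download` can only be evaluated after the response header has arrived.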

3) Semantic Ranking Process
The idea of developing ontology-based annotations for information is not new; a semantic search system would consider keyword impression and would return a page only if the keywords (or synonyms, homonyms, etc.) are found within the page and linked to an associated concept. Success is measured by the "predictability" with which the user would have guessed that such a relationship exists.
The ranking strategy assumes that, given a query Q and a page p, it is possible to build a query sub-graph G_{Q,p} by exploiting the information available in the page annotation, according to the ranking ID from the data stored in RDF. The ranking algorithm uses a global ID that consists of the Semantic ID, Child Data, Parent Data, and page rank, as shown in figure 6.
• SR (Semantic Rank) in figure 5 is used as a header (4 bits, 1 hexadecimal digit), as discussed in algorithm #2 (Create_Semantic_ID_of_Child_URL). This ID is the global semantic ID that represents all these IDs and can be retrieved faster than the others.
• PR (Google Page Rank) is the well-known ranking algorithm used by Google, applied to the page crawled in the first phase. This ID takes values from 0 to 10 (4 bits, 1 hexadecimal digit).
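One possible way to combine these fields into a single sortable integer is sketched below. Only the 4-bit SR header, the 8-bit Child Data (per Table I), and the 4-bit PR with values 0 to 10 come from the text; the 8-bit width for Parent Data, the field order, and the function names are assumptions made for illustration and may differ from the layout in figure 6.

```python
SR_BITS = 4       # Semantic Rank header: 1 hexadecimal digit (from the text)
CHILD_BITS = 8    # Child Data: 8 bits (from the title of Table I)
PARENT_BITS = 8   # Parent Data: width is an assumption of this sketch
PR_BITS = 4       # Page rank: 1 hexadecimal digit, values 0 to 10 (from the text)


def pack_global_id(sr, child, parent, pr):
    """Pack the four fields into one integer, SR first (assumed field order)."""
    if not 0 <= sr < (1 << SR_BITS):
        raise ValueError("SR must fit in 4 bits")
    if not 0 <= child < (1 << CHILD_BITS):
        raise ValueError("Child Data must fit in 8 bits")
    if not 0 <= parent < (1 << PARENT_BITS):
        raise ValueError("Parent Data must fit in 8 bits")
    if not 0 <= pr <= 10:
        raise ValueError("PR must be between 0 and 10")
    value = sr
    value = (value << CHILD_BITS) | child
    value = (value << PARENT_BITS) | parent
    value = (value << PR_BITS) | pr
    return value


def unpack_global_id(value):
    """Recover (sr, child, parent, pr) from a packed global ID."""
    pr = value & ((1 << PR_BITS) - 1)
    value >>= PR_BITS
    parent = value & ((1 << PARENT_BITS) - 1)
    value >>= PARENT_BITS
    child = value & ((1 << CHILD_BITS) - 1)
    value >>= CHILD_BITS
    return value, child, parent, pr


global_id = pack_global_id(0xA, 0x3C, 0x5F, 7)
print(f"{global_id:06X}")
print(unpack_global_id(global_id))
```

Because the SR header occupies the most significant bits in this layout, sorting packed IDs numerically orders entries by semantic rank first and uses the page rank only as the final tie-breaker.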

B. Online Phase (Retrieving Phase)
In this module, as shown in figure 7, three steps are used. The first step is the keyword generator process, which splits the query request into distinct words. The second step is the ontology analyzer process, which helps the system recommend tags for the query keywords, where the tags have associated ontologies. Retrieving ontologies from an online store or library in order to tag a word is a lengthy process that has a cost in terms of efficiency. The third step is the matching process, which matches the query tags against the keywords stored in the hash table. After the matching process, the system can identify the related RDF together with the related URL resources.
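The three online steps can be sketched end to end as follows. The toy ontology, the contents of the two hash tables, the example URLs, and all function names are hypothetical; ordering results by the number of matching tags is an assumption of this sketch and stands in for the proposed semantic ranking, which is not reproduced here.

```python
def generate_keywords(query):
    """Step 1: split the query into distinct lowercase words, keeping order."""
    seen, keywords = set(), []
    for word in query.lower().split():
        word = word.strip(".,;:!?\"'()")
        if word and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


def recommend_tags(keywords, ontology):
    """Step 2: extend each keyword with the tags its ontology entry suggests."""
    tags = []
    for keyword in keywords:
        for tag in [keyword] + ontology.get(keyword, []):
            if tag not in tags:
                tags.append(tag)
    return tags


def match(tags, keyword_table, url_table):
    """Step 3: look tags up in the keyword hash table and resolve RDF IDs to URLs.

    Results are ordered by number of matching tags, then by RDF ID.
    """
    hits = {}
    for tag in tags:
        for rdf_id in keyword_table.get(tag, []):
            hits[rdf_id] = hits.get(rdf_id, 0) + 1
    ordered = sorted(hits, key=lambda rdf_id: (-hits[rdf_id], rdf_id))
    return [(rdf_id, url_table[rdf_id]) for rdf_id in ordered if rdf_id in url_table]


# Hypothetical stand-ins for the ontology store and the two hash tables.
TOY_ONTOLOGY = {"car": ["automobile", "vehicle"], "price": ["cost"]}
KEYWORD_TABLE = {"car": ["rdf:7"], "automobile": ["rdf:12"], "cost": ["rdf:12", "rdf:40"]}
URL_TABLE = {
    "rdf:7": "http://example.org/cars",
    "rdf:12": "http://example.org/automobile-costs",
    "rdf:40": "http://example.org/cost-of-living",
}

keywords = generate_keywords("Car price, car")
tags = recommend_tags(keywords, TOY_ONTOLOGY)
results = match(tags, KEYWORD_TABLE, URL_TABLE)
print(results)
```

Keeping the ontology in a local dictionary sidesteps the lookup cost mentioned above; fetching ontologies from an online store at query time would add network latency to step 2.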

B. Data Collection
A standard evaluation data collection should not be biased towards any particular system or towards a specific domain, as our objective is to evaluate general-purpose entity search over RDF data. Therefore, we needed a collection of documents that would be a realistically large approximation of the amount of RDF data available 'live' on the Web and that contained relevant information for the queries, while at the same time being of a size manageable with the resources of a research group. We chose the 'Billion Triples Challenge' (BTC) 2011 data set, created for the Semantic Web Challenge in 2011, as displayed in table 3.

C. Query sets
The Semantic Search Challenge comprised two tracks. The Entity Search track is identical in nature to the 2010 challenge. However, we created a new set of queries for the entity search task based on the Yahoo! Search Query Tiny Sample v1.0 dataset. We selected 10 queries that name an entity explicitly and may also provide some additional context about it.

D. Proposed System Evaluation
Table 4 is a datasheet that describes the results of the retrieval process after the crawling process for 10 sample queries from the proposed system. The datasheet in table 4 shows the total number of results for each query, including relevant results (performance) and irrelevant results (error). Figure 8 shows the performance and error chart of the retrieval process.
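The performance and error percentages can be derived from relevant and irrelevant result counts as in the sketch below. The counts are made-up placeholders for illustration only and are not the values reported in table 4; the function names are hypothetical.

```python
def retrieval_rates(relevant, irrelevant):
    """Return (performance %, error %) for one query's result counts."""
    total = relevant + irrelevant
    if total == 0:
        raise ValueError("a query with no results has no defined rates")
    return 100.0 * relevant / total, 100.0 * irrelevant / total


def overall_rates(per_query_counts):
    """Aggregate (relevant, irrelevant) pairs over all queries, then compute rates."""
    relevant = sum(r for r, _ in per_query_counts)
    irrelevant = sum(i for _, i in per_query_counts)
    return retrieval_rates(relevant, irrelevant)


# Placeholder counts (relevant, irrelevant) for three hypothetical queries.
sample_counts = [(45, 5), (30, 0), (15, 5)]
for relevant, irrelevant in sample_counts:
    print(retrieval_rates(relevant, irrelevant))
print(overall_rates(sample_counts))
```

Aggregating counts before dividing weights each query by its number of results; averaging the per-query percentages instead would weight every query equally, and the two can differ.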
Table 5 is a datasheet that describes the total time of the retrieval process, in seconds, for 10 samples from the proposed system. Figure 9 shows the time chart of the retrieval process.

VI. CONCLUSION
The topic of semantic search engines has attracted great interest from both industry and research, resulting in a variety of solutions for different tasks. There is no standardized framework that helps to monitor and stimulate progress in this field. In this paper, four standard tasks of a semantic search engine are discussed: crawling, indexing, ranking, and retrieval.
We focus on the ranking phase, which is the main contribution of this paper. A new ranking algorithm is introduced to rank similar meaningful data after the indexing phase. In addition, the data retrieval process becomes faster, easier, and more accurate. The system achieved 99 percent relevant results and only 1 percent irrelevant results, in a maximum time of 60 ms. The proposed framework and ranking algorithm can be further developed for future use in detecting more accurate semantic information from social networks in a short time.
They only search information given on the web page. Recently, some research groups have started distributing results from their semantics-based search engines. Many novel search engines have been developed for the data Web. Most of these systems are focused on RDF document search (d'Aquin, Baldassarre, Gridinoc, Sabou, Angeletou, & Motta, 2007; Oren, Delbru, Catasta, Cyganiak, Stenzhorn, & Tummarello, 2008) or ontology search (Ding, Pan, Finin, Joshi, Peng, & Kolari, 2005). Recall that an RDF document serializes an RDF graph; an ontology, as a schema on the data Web, defines classes and properties for describing objects. Although both RDF document search and ontology search are essential for application developers, they can hardly serve ordinary Web users directly. Instead, object-level search is in demand and dominates all other Web queries (Pound, Mika, & Zaragoza, 2010).

Fig. 7. Flow Diagram of Online Retrieval Phase

V. DATA SET AND EXPERIMENT RESULTS

A. Machine Specifications Used in Testing
The machine specifications are a Core i7 CPU, 2 GB RAM, a 500 GB hard disk, and Windows 7. The software specifications are Apache Server (localhost) with PHP version 5.3 and MySQL Database version 5.5.

Fig. 8. Performance and Error Chart for Retrieval Process

TABLE I. DESCRIPTION OF 8-BITS CHILD DATA OF SEMANTIC ID
Fig. 5.a. Hash table of Query Keyword
Fig. 5.b. Hash table of URL Resources

TABLE III. BILLION TRIPLE CHALLENGE 2011 DATASET

TABLE IV. RELEVANT AND IRRELEVANT RESULTS OF RETRIEVAL PROCESS AFTER CRAWLING PROCESS ON 10 QUERIES AS SAMPLES