An Improvement of FA Terms Dictionary using Power Link and Co-Word Analysis

Information retrieval involves finding wanted information in a database. In this paper, we use the Power Link to improve the field association (FA) terms extracted from a corpus by the proposed algorithm, helping the machine make the right decision and place candidate words in their proper positions in the FA terms dictionary. Power Link is a quantitative tool that computes the co-citation relation between two words from their co-frequency and the distances between their instances. The Power Link concept, together with modified rules, is used to classify scientific papers into their proper fields. Instead of treating a document as a whole, we divide each document into three parts: title, abstract and body. Each term receives a weight that depends on where it occurs in the document: the greatest weight is given to the title, then the abstract, then the body. Results show an improvement in precision, recall and F measure.

Keywords: Information retrieval; FA terms; co-word analysis; power link; precision; recall


I. INTRODUCTION
Information retrieval (IR) is defined as the activity of finding information resources relevant to an information need from a collection of information resources. Searches can be based on the full document text or on other content-based indexing. Automatic information retrieval systems can use several retrieval techniques based on Field Association (FA) terms; this paper concentrates on FA terms combined with co-word analysis [3].
Humans can recognize the field of a scientific paper by detecting particular terms, called FA terms. The field of a document can be classified as a super field, a sub field or a terminal field, and the scheme that represents document fields is called a field tree [12]. For example, the path <Science & Technology/COMPUTER/Programming> expresses the super field <Science & Technology>, which has the sub field <COMPUTER> and the terminal field <Programming>; the field code of this path can be defined as K.12.5.
FA terms are collected according to how well they refer to a particular field. For example, "Communication network" and "compiler" are FA terms of the sub field <COMPUTER>. As an FA term may relate to more than one field, five levels are used to rank FA terms, as in [12]:
Level 1: Terms that are specific to only one sub field, called Perfect FA terms.
Level 2: Terms that are specific to more than one sub field within only one super field, called Imperfect FA terms.
Level 3: Terms that are specific to one super field, called Super FA terms.
Level 4: Terms that are specific to more than one sub field of more than one super field, called Cross FA terms.
Level 5: Terms that are not assigned to any sub field or super field, called Non FA terms.
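The five levels can be read as a small decision procedure. The following is a minimal sketch, not the procedure of [12]: the function name `fa_level` and the representation of a term's fields as (super field, sub field) pairs, with `None` standing for an association with the super field as a whole, are assumptions made for illustration. Treating every term that spans more than one super field as Level 4 is also a simplifying assumption.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical representation: (super_field, sub_field); sub_field is None
# when the term is associated with the super field as a whole.
Field = Tuple[str, Optional[str]]


def fa_level(fields: Iterable[Field]) -> int:
    """Return the FA level (1 to 5) of a term from the fields it refers to."""
    fields = set(fields)
    if not fields:
        return 5  # Non FA term: no sub field or super field
    supers = {sup for sup, _ in fields}
    subs = {(sup, sub) for sup, sub in fields if sub is not None}
    if len(supers) > 1:
        return 4  # Cross FA term (assumption: any spread over super fields)
    if not subs:
        return 3  # Super FA term: one super field, no specific sub field
    if len(subs) == 1:
        return 1  # Perfect FA term: exactly one sub field
    return 2      # Imperfect FA term: several sub fields, one super field


print(fa_level([("Science & Technology", "COMPUTER")]))
```

Under these assumptions, a term such as "compiler" that refers only to the sub field <COMPUTER> is classified as Level 1.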
Choosing helpful FA terms requires considering the relations among simple and compound FA terms as well as field ranking. We therefore use co-word analysis and the Power Link concept [18].
Co-word analysis is a quantitative study of the relations between elements (i.e., terms, noun phrases, topics or fields). Inclusion and proximity indexes, which depend on the co-occurrence frequency of the elements, are used to compute the strength of these relations. Co-word analysis focuses on the dynamics of science as an outcome of actors' strategies: changes in the content of a topic area are the combined effect of a great number of individual strategies. In principle, this method should allow us to identify the actors and describe the global dynamics, as in [11].
In [6], the authors presented an approach that uses passage retrieval to improve the construction of the FA terms dictionary. They suggested locating FA terms in passages (parts of a document's text) rather than in full documents. In [10], the author provided an algorithm based on the Power Link concept, which explains and computes the relation between two words from their co-frequency and the relative locations of their successive instances. The nearer the relative locations of two words, the larger their Power Link.
In [13], the author presented a method based on the Power Link concept to improve the classification of search engine results. This method depends on ranking the terms in a given field.
Absolute frequencies reflect document length rather than the weight of words, so recent works rely on normalized frequencies instead [10], [13], [19], [20]. Recent works have also used co-occurrence frequencies to reflect the relation between terms [4]. The Power Link method uses normalized frequencies and co-occurrence frequencies, and also considers the relative distances between terms.
Whereas the Power Link algorithm considers whole documents and gives the same weight to all parts of a scientific paper, we give different weights to different parts of a paper. In this work, the Power Link algorithm is implemented together with another algorithm, presented in [7], that detects pre-defined errors in the text pre-processing step, in order to improve the quality of the results and purge the files of the resulting errors.
After the corpus is collected, in the pre-processing phase every scientific paper is divided into three parts: title, abstract and body. Each part is given a different weight based on its importance. The title contains the terms most related to the topic and reflects the field of the document more than the other parts. The abstract contains terms that are related to each other and reflects the field of the document more than the remaining body. We therefore propose giving terms that occur in the title the highest weight, then those in the abstract, and giving the body the least weight. In the processing phase, the Power Link is used to improve the FA terms dictionary. As a result, the proposed idea improved the Perfect FA terms (Level 1) but did not improve the results for Imperfect and Super FA terms (Levels 2 and 3), so Level 1 is sufficient for our data. This idea can be used in many applications in the information retrieval field.
The precision, recall and F measure values show that the presented algorithm achieved, on average, 0.90, 0.85 and 0.87, respectively, which indicates that the algorithm performs effectively. The F value reflects the strength of the algorithm.
The rest of the article proceeds as follows. Section 2 presents a summary of some definitions and the modified algorithm. Section 3 provides the modified algorithm for determining the Perfect FA terms (Level 1). Section 4 includes the results and discussion, followed by the conclusion in Section 5.

A. Power Link Analysis
Power Link is a quantitative tool that determines the co-citation relationship between two terms from the frequency of the terms and the distances between their instances [21]. In this paper, we use the Power Link to improve the field association terms extracted from the corpus by the proposed algorithm.
The Power Link algorithm presented in [10] calculates how strongly two terms tend to occur together in a specific corpus. The Power Link value between two terms is high if the terms are strongly related.
The link between any two terms t1 and t2 in a document D can be calculated by the Power Link function LT(t1, t2) defined in Section 3.

B. Continuity and Transition Theme
The continuity and transition theme is a method for determining the field of each part of a given document. The features of a subject are given based on continuity and transition. The theme field is defined as the field that a sentence presents [14]. The theme field is preserved by continuity or changed by transition across sentences [9].

Let S be a sentence that includes FA terms. The theme field of S is determined from the Power Link between S and the candidate fields, where the Power Link between S and a field F is expressed as a sum over each FA term in F. The current sentence is attached to the same passage if it has the same theme field as the previous sentence, or if no theme field or no field can be assigned to it. S is delimited and a new passage starts if the current sentence S has a theme field different from that of the previous sentence. For more details see [5], [8] and [10].
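The passage rule described above (attach on continuity, start a new passage on transition) can be sketched as follows. This is a minimal illustration, assuming the theme field of each sentence has already been computed and is supplied as a string, or `None` when the sentence has no field; the function name `split_passages` is hypothetical. One further assumption: a passage that so far has no field adopts the field of the first sentence that has one.

```python
from typing import List, Optional, Sequence


def split_passages(theme_fields: Sequence[Optional[str]]) -> List[List[int]]:
    """Group sentence indexes into passages by theme continuity and transition."""
    passages: List[List[int]] = []
    current: Optional[str] = None  # theme field of the passage being built
    for i, field in enumerate(theme_fields):
        if not passages:
            passages.append([i])
            current = field
        elif field is None or current is None or field == current:
            # Continuity: same theme field, or no field to compare against.
            passages[-1].append(i)
            if current is None:
                current = field
        else:
            # Transition: the theme field changed, so a new passage starts.
            passages.append([i])
            current = field
    return passages


print(split_passages(["COMPUTER", "COMPUTER", None, "PHYSICS", "PHYSICS"]))
```

In this example the third sentence has no field, so it stays in the first passage, and the change to a different field at the fourth sentence opens a new passage.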
Here, we detect the three parts (title, abstract and body) by locating the head word of each part (i.e., "abstract" and "introduction"). If the head words are absent or repeated, we apply the continuity and transition theme instead. The first sentence of a document is always the title, which contains the most closely related words and indicates the field of the paper; the second paragraph is usually the abstract, which summarizes all the important information about the paper. The abstract therefore contains the most important FA terms that indicate the field of the paper, and the Power Link between these terms should be high. Following these rules, we can detect and extract the abstract from the document.

C. Real Word Spell Checker
Many words with multiple meanings exist in the English language. Technically, almost every word has multiple meanings. How often do you look up a word in the dictionary and find only one meaning listed next to it? Practically never. Many words have slightly varying meanings, or they can be used as different parts of speech.
For example, "right" (You were right. / Make a right turn at the light.) and "type" (He can type over 100 words per minute. / That dress is really not her type.) each have more than one meaning, and some words are easily confused with others that sound alike (ate/eight, blew/blue, fair/fare, no/know).
To solve these problems, several algorithms have been proposed to automatically detect such errors in syntax or meaning. In this work, we avoid these problems by using the Real Word Spell Checker algorithm in the text pre-processing step. This method depends on the automatic building of sets of errors, called confusion sets, for a specific terms dictionary and its corresponding corpus. For more details see [7].
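To make the idea of confusion sets concrete, the sketch below only flags tokens that belong to a confusion set so that they can be checked. It is not the algorithm of [7], which builds the confusion sets automatically; here the sets are supplied by hand, and the names `build_index` and `flag_candidates` are hypothetical.

```python
import re
from typing import Dict, FrozenSet, Iterable, List, Tuple


def build_index(confusion_sets: Iterable[Iterable[str]]) -> Dict[str, FrozenSet[str]]:
    """Map every word to the confusion set that contains it."""
    index: Dict[str, FrozenSet[str]] = {}
    for group in confusion_sets:
        members = frozenset(word.lower() for word in group)
        for word in members:
            index[word] = members
    return index


def flag_candidates(text: str, index: Dict[str, FrozenSet[str]]) -> List[Tuple[int, str, List[str]]]:
    """Return (token position, token, alternatives) for tokens worth checking."""
    flagged = []
    for pos, token in enumerate(re.findall(r"[a-z']+", text.lower())):
        if token in index:
            flagged.append((pos, token, sorted(index[token] - {token})))
    return flagged


index = build_index([("ate", "eight"), ("fair", "fare")])
print(flag_candidates("He ate the fare", index))
```

Deciding which member of a confusion set is correct in a given context is the harder part and is left to the method of [7].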
1) System Design

Inputs:
a) Documents in any specified sub field and its super field, after indexing into new candidate terms (ranked by their occurrence in the document after stemming and stop-word removal), from which the new field association terms are extracted, as in Fig. 2 and Table I.
b) The FA terms dictionary (built by the traditional algorithm of [12]) of the sub field and super field, used in the Power Link calculations between its terms and the candidate terms in each document. The super-field data is also used to calculate the concentration ratio.

Output:
A new set of improved FA terms.
The system design and the proposed algorithm are shown in Fig. 1 and Fig. 3. The algorithm has four main steps: Power Link calculation, computing the candidate term frequency, computing the concentration ratio, and computing the precision and recall values. These steps are discussed in the following subsections.

2) Power Link Calculations
For each candidate term t in each document, compute the following Power Link calculations:
a) Compute the Power Link between the term t and the sub field <S> as a sum over the documents D, where D is a document that includes at least one FA term belonging to <S>. The Power Link between t and <S> within D is computed in (b).
b) Compute the Power Link between t and <S> within D from the co-occurrence of the term t and <S>. The quantities involved are the number of FA terms that identify <S> and appear in the document D in which t appears, the number of documents that include both t and FA terms that identify <S>, and the link between t and each such FA term, which is computed in (c).

c) Compute the Power Link between two terms based on dividing the document. Because the corpus consists of scientific papers, every document contains two constant terms (stems): "abstract" and "introduct". We record the index of each of these two stems in the document, together with the index of the term under consideration.
There are three cases for computing the link, according to the term position: the term lies in the title, in the abstract, or in the body of the paper. Each case uses its own weight (the title, abstract and remaining-body weights, respectively). The link is then given by (5), which depends on the number of different terms in the document D, the co-occurrence frequency of the two terms in D, and the distance between any two successive instances of the two terms, such that no other instance of either term lies between them in D. Note that extreme values are neglected.
The three weights reflect how strongly terms are related in each part of a document (i.e., title, abstract and body, respectively). The title weight is larger than the abstract weight, which is larger than the body weight, because terms are usually more closely related in the title than in the abstract, and in the abstract than in the body, of a scientific paper. The weight values are determined by experiments. We also use the continuity and transition theme to determine the abstract when this part cannot be detected directly in a document.
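The ingredients of step (c) can be sketched without committing to the exact form of (5). The code below locates each pair of successive instances of two terms, measures their distance, and attaches the weight of the part (title, abstract or body) in which the pair starts. Several points are assumptions made for illustration: the weight values in `PART_WEIGHTS` are placeholders (the paper determines its values by experiment), a pair is assigned to the part containing its earlier instance, and the function names are hypothetical. How these values are combined into the final link is not shown.

```python
from typing import Dict, List, Tuple

# Placeholder weights (assumption): title > abstract > body.
PART_WEIGHTS: Dict[str, float] = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def part_of(index: int, abstract_idx: int, intro_idx: int) -> str:
    """Name the document part that contains the token at the given index."""
    if index < abstract_idx:
        return "title"
    if index < intro_idx:
        return "abstract"
    return "body"


def successive_pairs(tokens: List[str], t1: str, t2: str) -> List[Tuple[int, int]]:
    """Index pairs of neighbouring t1/t2 instances with no other instance between."""
    pairs: List[Tuple[int, int]] = []
    last = None  # (index, term) of the most recent instance of t1 or t2
    for i, token in enumerate(tokens):
        if token != t1 and token != t2:
            continue
        if last is not None and last[1] != token:
            pairs.append((last[0], i))
        last = (i, token)
    return pairs


def weighted_distances(tokens: List[str], t1: str, t2: str) -> List[Tuple[str, int, float]]:
    """Return (part, distance, weight) for each successive pair of t1 and t2."""
    # The stems "abstract" and "introduct" mark the part boundaries;
    # list.index raises ValueError when a stem is missing.
    abstract_idx = tokens.index("abstract")
    intro_idx = tokens.index("introduct")
    result = []
    for i, j in successive_pairs(tokens, t1, t2):
        part = part_of(i, abstract_idx, intro_idx)
        result.append((part, j - i, PART_WEIGHTS[part]))
    return result


doc = ["power", "link", "abstract", "power", "x", "link",
       "introduct", "link", "y", "z", "power"]
print(weighted_distances(doc, "power", "link"))
```

When a boundary stem is missing, the sketch simply fails; the text above handles that case with the continuity and transition theme.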

3) Compute the Candidate Terms Frequency
The frequency of a term t in a sub field <S> is computed as a sum over the documents D, where D is a document that includes FA terms that identify <S>. The per-document frequency is defined from the number of times that the term t occurs in D, normalized by a sum taken over all the terms in D. The local information and the normalization factor are the two parts of this formula, respectively [2]; the normalization also involves the number of unique terms in D. This formula is derived from the classic Term Frequency-Inverse Document Frequency (TF-IDF) formula of Salton and is used in this algorithm instead of the traditional methods [12], [15], [16], [6], which use the absolute frequency. The absolute frequency depends only on the number of repetitions of a term in the document and is not effective enough [1].
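The difference between absolute and normalized frequency can be illustrated with a simplified stand-in for the formula above. The sketch divides the count of a term in each document by the total number of term occurrences in that document and sums the result over the documents of the sub field. This is an assumption made for illustration, not the paper's exact formula, and the function name `normalized_frequency` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence


def normalized_frequency(term: str, documents: Iterable[Sequence[str]]) -> float:
    """Sum over documents of count(term) / total term occurrences in the document."""
    total = 0.0
    for tokens in documents:
        counts = Counter(tokens)
        size = sum(counts.values())
        if size:  # skip empty documents to avoid division by zero
            total += counts[term] / size
    return total


docs = [["a", "b", "a", "c"], ["a", "b"]]
print(normalized_frequency("a", docs))
```

In the example, the term "a" occurs twice in the first document and once in the second, yet both documents contribute the same normalized value because the second document is half as long.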

4) Compute the Concentration Ratio
The concentration ratio, which is based on the frequency and Power Link calculations, can be used to judge whether or not the term t is a Perfect FA term. It is defined in (7) from the frequency and Power Link values computed in the previous steps for the sub field <S> and for the super field of this sub field. A threshold α is used to judge the levels of FA terms: if the concentration ratio of t is less than α, then t is not a Perfect FA term; otherwise t is a Perfect FA term.
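The threshold test can be sketched as below. The sketch assumes that the frequency and Power Link values have already been combined into one score for the sub field and one for its super field, and that the concentration ratio is the quotient of the two; the exact combination used in (7) is not reproduced here. The function name and parameters are hypothetical, and the default α = 0.9 is the threshold reported in the experiments.

```python
def is_perfect_fa_term(sub_score: float, super_score: float, alpha: float = 0.9) -> bool:
    """Judge a term as a Perfect FA term when its concentration ratio reaches alpha."""
    if super_score <= 0:
        return False  # no evidence in the super field, so the ratio is undefined
    ratio = sub_score / super_score  # assumed form of the concentration ratio
    return ratio >= alpha


print(is_perfect_fa_term(19.0, 20.0))  # ratio 0.95
print(is_perfect_fa_term(1.0, 2.0))    # ratio 0.5
```

A term whose score is concentrated almost entirely in one sub field passes the test, while a term spread evenly over the super field does not.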

5) Compute the Precision and Recall Values
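To test the efficiency of the system, we use precision and recall to reach the best result of FA terms. In their standard form, precision is the number of correctly extracted FA terms divided by the total number of extracted terms, and recall is the number of correctly extracted FA terms divided by the total number of relevant FA terms.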
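Precision and recall, together with the F measure used later in the results, can be computed with the standard set-based definitions. The sketch below assumes that the extracted terms and the reference (relevant) terms are available as collections of strings; the function name `precision_recall_f` and the example terms are illustrative only.

```python
from typing import Iterable, Tuple


def precision_recall_f(extracted: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float, float]:
    """Standard precision, recall and F measure for a set of extracted terms."""
    extracted_set, relevant_set = set(extracted), set(relevant)
    hits = len(extracted_set & relevant_set)  # correctly extracted terms
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


p, r, f = precision_recall_f(
    ["compiler", "kernel", "data", "system"],
    ["compiler", "kernel", "parser"],
)
print(round(p, 2), round(r, 2), round(f, 2))
```

With this definition of F, the reported averages are consistent: a precision of 0.90 and a recall of 0.85 give F = 2(0.90)(0.85)/(0.90 + 0.85) ≈ 0.87.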

III. EXPERIMENTS AND RESULTS
The experiments were designed to validate the advantage of the new approach. Furthermore, we chose the most efficient weights over a series of trials to obtain good algorithm performance. We implemented our system in Python, which is well suited to text processing, but pre-processing the text files posed many challenges; for example, converting PDF files to text files can lose some data, and Python's standard library has no function for reading PDF files.
In this paper, we focus on the super field <Science & Technology> and its sub field <Computer>, with a corpus of size 12.2 MB from which about 4741 candidate terms were extracted.
Using the Real Word Spell Checker algorithm in the pre-processing step led to the discovery and correction of errors in 5% of the terms. We also detected the three parts in 100 documents by using the continuity and transition theme. A comparative analysis of the Power Link algorithm presented in [10], the proposed algorithm and some research information systems for scientific papers showed that giving different weights to each part improves the selection of Perfect FA terms (Level 1) but does not improve Levels 2 and 3 in our data. Table II shows samples of perfect and non-perfect FA terms resulting from the proposed algorithm (PFAT) and the traditional algorithm [10]. Note that the terms "data", "keyword" and "system" are detected as perfect by the old method, but they are not Perfect FA terms in the <Science & Technology/Computer> field.
We use the experimentally chosen part weights and the threshold value α = 0.9, which was shown to be the best one for the concentration values in [17]. This threshold is therefore used as a fixed threshold for the concentration values in all loops, and the average values of precision and recall are 0.90 and 0.85, respectively, as in Fig. 4. The results show that the weighted Power Link does better than the random approach, which produced average precision and recall values of 0.80 and 0.70, respectively, as in Fig. 5. This means that, on this random data, the algorithm has an efficiency of 100%. To confirm the strength of the results, F is also calculated using the formula
$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (10)
The new average value of F is 0.87, compared with 0.74 using the traditional method, which indicates a high performance of the system.

IV. CONCLUSION
In this work we proposed an approach that produces an improved FA terms dictionary by using the Power Link concept and giving different weights to terms according to their position in the document. The precision achieved using the new method is 0.90, compared with 0.80 for the traditional approach. Hence, the algorithm improved precision by 0.10 (10 percentage points) over the traditional approach.
Future work could consider the differences between languages and cultures, for example between English and the Arabic of Middle Eastern countries. Different languages can be addressed through natural language processing and speech recognition research on English, Japanese and Arabic. This method can also be used to build a comprehensive FA terms dictionary and applied in many applications, especially text summarization, text classification, extraction, filtering and machine translation.
Furthermore, the Power Link analysis with different weights can be applied not only to scientific papers but also to any type of unstructured document.

Fig. 4. Precision, recall and F measure by new approach.

TABLE II. COMPARISON OF NEW AND TRADITIONAL APPROACHES.