Identifying and Extracting Named Entities from Wikipedia Database Using Entity Infoboxes

This paper describes an approach to named entity classification based on Wikipedia article infoboxes. It identifies the three fundamental named entity types, namely Person, Location and Organization. An entity is classified by matching the attributes extracted from the infobox of its article against core entity attributes built from Wikipedia Infobox Templates. Experimental results showed that the classifier can achieve accuracy and F-measure scores of 97%. Based on this approach, a database of around 1.6 million named entities of the three types is created from the 20140203 Wikipedia dump. Experiments on the CoNLL-2003 shared task named entity recognition (NER) dataset showed that the system performs strongly in comparison to three different state-of-the-art systems.
Keywords: named entity identification; Wikipedia infobox; infobox templates; Named Entity Classification (NEC)


I. INTRODUCTION
The term named entity (NE), as used today in text mining and Natural Language Processing (NLP), was introduced in the Sixth Message Understanding Conference [1]. Named entities represent a major part of all textual data, covering proper names of persons, locations, organisations and corporate entities, e.g., University of Birmingham, UK, Mount Everest, Mogadishu and David Beckham, among others. Named entity classification (NEC) is the process of categorizing named entities into their corresponding classes (e.g., Person, Location, Organization). This is usually a supplementary step to the wider area of named entity recognition (NER). Although NEs represent core components of natural language texts, they are still poorly covered in state-of-the-art language dictionaries. This might be due either to their ever-changing and dynamic nature, in which some named entities disappear while new ones emerge on a regular basis, or to the fact that many NEs may genuinely belong to more than one class; for instance, several place names are also person names and/or corporate names. For example, some of the world's largest corporations, such as Microsoft and Apple, can hardly be found in state-of-the-art knowledge networks such as WordNet. Named entity coverage is now being improved in lexical semantic networks such as ConceptNet 5 [2]. More importantly, constantly updated live online repositories like Wikipedia [3] and the Open Directory Project [4] possess higher named entity coverage than the aforementioned resources, holding almost all object names. Therefore, in order to handle NER or NEC tasks automatically, the use of such repositories is inevitable.
Challenges hindering accurate NEC are not limited to the low coverage of named entities in well-established language resources, but also include the ambiguity pervading the meaning of these entities [5] and entity linking [6], which have been subjected to intensive study in recent years. This study focuses instead on improving NEC by addressing the coverage problem. To this end, the current work advocates the use of Wikipedia for entity classification.
With the emergence of diverse natural language processing tools and the increasing need for automated text analysis, substantial research has been conducted on named entity classification in the past few years. In [7], the authors used a bootstrapping method based on Wikipedia categories to classify the named entities in the Heidelberg Named Entity Resource (HeiNER) [8]. Nevertheless, such classification might be undermined by inconsistency in how contributing authors assign articles to the most appropriate category. In a closely related study, Tkachenko et al. [9] carried out a fine-grained classification of Wikipedia named entities. Although their method is related to this study, they extracted many features for the classification, including the first paragraph of the article text, categories, template names, and other structured content tokens. This demands considerable processing time when classifying large datasets. The closest work to ours is explored in [10], where the researchers used structured information from infoboxes and category trees for the classification task. Despite this relatedness, their work differs from this study in terms of the overall classification methodology as well as the employed dataset, where Portuguese Wikipedia was used in [8].
Finally, one should also mention some seminal works on Wikipedia entity classification built on machine learning algorithms. Dakka et al. [11] used a bag-of-words representation of Wikipedia articles with a support vector machine (SVM) algorithm, achieving a high F-score of 90%. Watanabe et al. [12] employed Conditional Random Fields to classify Japanese Wikipedia articles, while Bhole et al. [13] combined heuristics with a linear SVM for the same purpose. The main drawback of machine learning approaches, however, lies in the requirement for manually annotated training data, whose creation is a rather costly and complex task.
The main contribution of this paper consists in designing and testing a new, simple named entity classification algorithm that makes use only of some structured information available in Wikipedia articles. In particular, unlike the aforementioned methods, the proposed NEC approach relies on the content information of a single structured table, the infobox, but achieves high accuracy and F-measure scores. The classification algorithm put forward in this study matches predefined core entity attributes built from Wikipedia Infobox Templates (WIT) against entity-specific attributes extracted from the Wikipedia article of the related named entity.
www.ijacsa.thesai.org
The rest of the paper is structured as follows. Section 2 covers the structure of Wikipedia and the named entities it contains. Section 3 presents the proposed named entity classification approach using Wikipedia. Section 4 details the system experiments, highlighting the dataset used, the results, and a comparison with relevant state-of-the-art systems. Finally, conclusions are drawn in Section 5.

A. Overview
Wikipedia is a freely available encyclopaedia with a collective intelligence contributed by the entire world community [14]. Since its foundation in 2001, the site has grown in both popularity and size. At the time of this study's experiments (April 2014), Wikipedia contained over 32 million articles [15] in 260 languages [16], of which the English version had more than 4.5 million articles¹. Its open, collaborative contribution model arguably makes it the world's largest information repository. Wikipedia contains 30 namespaces, of which 14 are subject namespaces and two are virtual namespaces. Besides, each namespace has a corresponding talk namespace². A namespace is a criterion employed by the MediaWiki software for classifying Wikipedia pages, as indicated in the page titles. Structurally, Wikipedia is organized in the form of interlinked pages. Depending on their information content, Wikipedia pages are loosely categorized as Named Entity Pages, Concept Pages, Category Pages and Meta Pages [8].
In recent years, there has been growing interest among the NLP and IR research communities in using this encyclopaedia as a semantic lexical resource for tasks such as word semantic relatedness [17], word disambiguation [18], text classification [19], ontology construction [20] and named entity recognition/classification [21], among others.

B. Named Entities in Wikipedia
Research has found that around 74% of Wikipedia pages describe named entities [22], a clear indication of Wikipedia's high coverage of named entities. Each Wikipedia article associated with a named entity is identified by the entity's name. Most Wikipedia articles on named entities offer useful unique properties, starting with a brief informational text that describes the entity, followed by a list of subtitles which provide further information specific to that entity. For example, one may find information related to main activities, demography and environment for Location named entities, and to education, career, personal life and so on for Person named entities. Concepts related to the named entity are linked to the entity article by outgoing hyperlinks. Moreover, a semi-structured table, called an infobox, summarizing essential attributes of the entity appears in the top right-hand corner of each article [23]. It is the core attributes of the article infobox that this study relies on for the classification of named entities, without any other prior knowledge. The snapshot in Figure 1 illustrates the Wikipedia article infobox related to "Google", which corresponds to a named entity of type Organization (http://en.wikipedia.org/wiki). The table summarizes very important unique properties of the entity in the form of attribute-value pairs. Consequently, such tables are extracted, stored and analysed for the purpose of NE classification.

III. THE CLASSIFIER
Using predefined core attributes extracted from Wikipedia Infobox Templates, a semi-supervised binary algorithm is developed. As the main classifier, it predicts whether a particular named entity belongs to a given type. In other words, the classifier is designed to match named entities against this set of core class attributes (cf. Section A) and consequently identify these entities based on the outcomes of the matching process. The classification is achieved according to the following definition.
Definition: Let ne be a named entity in Wikipedia (WP) belonging to any of the three types, Person (P), Location (L) and Organization (O). If XITA denotes the infobox template attributes⁴ of type X and IA(ne) denotes the infobox attributes extracted from the WP article associated with ne, then the classifier identifies the type of ne according to expression (1).

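\[
T_{ne} =
\begin{cases}
P & \text{if } \mathit{PITA} == \mathit{IA}(ne) \\
L & \text{if } \mathit{LITA} == \mathit{IA}(ne) \\
O & \text{if } \mathit{OITA} == \mathit{IA}(ne)
\end{cases}
\tag{1}
\]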
where $T_{ne}$ stands for the type of named entity ne as identified by the classifier, while the operator "==" corresponds to array matching.
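The definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation (which is written in Perl), and it rests on two assumptions: the "==" array matching is read as a non-empty overlap between the core attributes of a type and the labels found in the infobox, and the attribute names below are illustrative. Only coordinates, birth date and birth place are suggested by the text; `founded` and `headquarters` are hypothetical.

```python
# Illustrative core attributes per type. Only "coordinates", "birth_date" and
# "birth_place" are suggested by the text; the ORG entries are assumptions.
CORE_ATTRIBUTES = {
    "PER": {"birth_date", "birth_place"},
    "LOC": {"coordinates"},
    "ORG": {"founded", "headquarters"},
}


def normalize(label):
    """Lower-case a label and unify spaces/underscores so variants compare equal."""
    return "_".join(label.strip().lower().replace("_", " ").split())


def classify(infobox_attributes, core_attributes=CORE_ATTRIBUTES):
    """Return the first type whose core attributes overlap the infobox labels.

    Assumption: "==" (array matching) is read as a non-empty intersection.
    Returns None when no type matches, e.g. when the article has no infobox.
    """
    labels = {normalize(a) for a in infobox_attributes}
    for ne_type, core in core_attributes.items():
        if labels & core:
            return ne_type
    return None


print(classify(["name", "birth date", "Birth_place", "occupation"]))  # PER
print(classify(["name", "coordinates", "population_total"]))          # LOC
print(classify([]))                                                   # None
```

Under this reading, an infobox that carries core attributes of more than one type is assigned to whichever type is checked first, so the order of the dictionary acts as a tie-breaking rule; a stricter reading (all core attributes must be present) would only require replacing the intersection test with a subset test.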

A. Defining Core Attributes
The MediaWiki team has developed infobox templates designed to guide contributing authors. The infobox templates contain the attribute labels that authors fill in with values when writing their Wikipedia articles on named entities. These attributes describe properties particular to each named entity type. For example, all location-based named entities should bear coordinate information. Similarly, infobox attributes for Person named entities include birth date and place. Table 1 lists a selected sample of these attributes for demonstration purposes. Attributes essential to each class, usually identified through manual investigation, are referred to as Core Attributes. The latter are used in the experiments to identify Wikipedia articles corresponding to named entities by matching the core attributes against the attributes extracted from entity infoboxes. The core attributes used in the experiments are marked with stars in Table 1.

B. Accessing Wikipedia Database
To use Wikipedia as an external knowledge repository for named entity classification, a mechanism for accessing its database should be in place. The designed system's access to the encyclopaedia is summarized in Figure 2. There are primarily two methods for accomplishing such data access: querying through the web interface, or accessing a downloaded local Wikipedia dump.
For this study, the query access method is used for the system evaluation. However, for the actual named entity extraction, local access is made to a downloaded Wikipedia XML dump of February 2014. In implementing the query access method, this study partially adapts the Wikipedia Automated Interface [24], while the local access to the Wikipedia dump is built on a MediaWiki Dump Files Processing Tool [25]. Query access is preferred over local access for the evaluation because the dump files are primarily designed for sequential access and are therefore unsuitable for random access.
⁴ These are the core attributes used for matching.
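The sequential nature of dump access can be sketched as follows. This is a minimal Python illustration using the standard library, not the tool cited in [25]; the element names `page`, `title` and `text` follow the general layout of MediaWiki XML exports, while the namespace URI and the sample content are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, wikitext) pairs one page at a time, in file order."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the page's subtree once it has been consumed


# Tiny stand-in for a dump file (illustrative content only).
sample_dump = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Google</title>
    <revision><text>{{Infobox company | name = Google}}</text></revision>
  </page>
  <page>
    <title>Mogadishu</title>
    <revision><text>{{Infobox settlement | coordinates = {{coord}}}}</text></revision>
  </page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(sample_dump)):
    print(title, len(text))
```

Because pages are only reachable by reading the stream from the start, looking up one specific article this way costs a pass over the file, which is consistent with the preference for query access during evaluation. A compressed dump could be passed in the same way through a file object such as the one returned by `bz2.open`.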

IV. EXPERIMENTAL SETUP
The proposed classifier system is implemented with Perl scripts in a Linux environment. Entity core attributes derived from Wikipedia Infobox Templates represent the heart of the developed classification method. An illustration of the implementation scheme is given in Figure 4 (cf. the algorithm in Fig. 3). Each named entity goes through three processing stages before it is classified. In stage 1, the Wikipedia article associated with the entity is retrieved. In stage 2, the infobox of that article is extracted, which narrows the scope of the text to be processed to the infobox. In stage 3, this semi-structured table is parsed and tuples of attribute labels and values are built from it. Once the tuples have been organized in Perl hashes, they are matched against the core attributes and the classification decision is made. The same process is repeated for every named entity to be identified.
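Stages 2 and 3 can be sketched in Python as below. This is a simplified illustration rather than the authors' Perl code: it assumes the infobox is written as a `{{Infobox ...}}` template in the article wikitext, finds its end by counting nested braces, and stores the top-level `label = value` fields in a dictionary (the counterpart of a Perl hash). The function names and the sample article are hypothetical.

```python
import re


def extract_infobox(wikitext):
    """Stage 2: return the raw '{{Infobox ...}}' block, or None if absent."""
    match = re.search(r"\{\{\s*infobox", wikitext, re.IGNORECASE)
    if match is None:
        return None
    start = i = match.start()
    depth = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # unbalanced braces: treat as no usable infobox


def parse_infobox(block):
    """Stage 3: map attribute labels to raw values for top-level fields."""
    body = block[2:-2]  # strip the outer '{{' and '}}'
    fields, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            fields.append("".join(current))  # a top-level '|' ends a field
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    fields.append("".join(current))
    attributes = {}
    for field in fields[1:]:  # fields[0] is the template name
        label, sep, value = field.partition("=")
        if sep:
            attributes[label.strip().lower()] = value.strip()
    return attributes


sample = """'''Google''' is a company.
{{Infobox company
| name = Google
| founded = {{Start date|1998|9|4}}
| headquarters = [[Mountain View, California|Mountain View]]
}}
More text."""

print(parse_infobox(extract_infobox(sample)))
```

Tracking nesting depth keeps pipes inside nested templates and links from being mistaken for field separators. Articles whose infobox does not follow this regular pattern would return `None` or an incomplete dictionary, which corresponds to the extraction difficulties reported in the error analysis.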

A. Dataset
Experiments were conducted on two datasets. The first test dataset comprises 3600 named entities with different proportions of the three considered entity types (PER, LOC, ORG), and was created from two data sources, namely Forbes and GeoWordNet. Specifically, all person and organization names were excerpted from the Forbes400 and Forbes2000 lists of the richest American businessmen and the world's leading public companies, respectively⁵. Location named entities, on the other hand, were sourced from the GeoWordNet database. The second test dataset uses the CoNLL-2003 shared task named entity data⁶. The latter, a standard publicly available dataset, has been selected for proper evaluation and comparison with state-of-the-art techniques for Wikipedia NEC. The coverage and availability in Wikipedia of all names, with their surface forms, were checked for all datasets prior to the experiments.

B. Results and Discussion
The system tests were made in two rounds. In the first round, the test dataset is divided into four smaller parts containing 100, 500, 1000 and 2000 NEs, all with different proportions of the entity types. This splitting has been performed for two reasons. First, it helps to scrutinize the effect of data size on the observed parameters. Second, it reduces the load that large data places on Wikipedia's servers, since all the testing and evaluation experiments used query-based access to the online version of the encyclopaedia.
There are four possible outcomes of the binary predictive classifier. First, an entity that belongs to type x may be classified as type x, which is referred to as a True Positive (TP). Second, a False Negative (FN) occurs when a named entity of type x is incorrectly identified as not belonging to that type. Third, a named entity that does not belong to type x may be classified as type x, a situation known as a False Positive (FP). Lastly, when a named entity that does not belong to type x is correctly predicted as not belonging to it, the outcome is referred to as a True Negative (TN). The metrics for evaluating the classifier's performance are based on these outcomes.
Results of round 1 experiments are reported in Table 2, where the accuracy level is determined as:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2} \]
The trend of the scores shown in Table 2 indicates that varying the data size has little effect on the accuracy for the Person and Organization entity types. However, a slight decline is observable in the case of Location names. Overall, round 1 experiments on the test data reveal that the classifier can achieve an average accuracy above 93% irrespective of the data size.
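As a small worked illustration of the four outcomes, the Python sketch below counts them for one target type and derives the accuracy as the share of correct decisions, (TP + TN) over all decisions, which is the standard definition assumed here. The label lists are made-up toy data, not results from this study, and `None` stands for an entity the classifier could not assign to any type.

```python
def outcome_counts(gold, predicted, target):
    """Count TP, FN, FP and TN for one target type, treated as a binary task."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for g, p in zip(gold, predicted):
        if g == target:
            counts["TP" if p == target else "FN"] += 1
        else:
            counts["FP" if p == target else "TN"] += 1
    return counts


def accuracy(c):
    """Share of correct decisions among all decisions (0.0 for empty input)."""
    total = c["TP"] + c["TN"] + c["FP"] + c["FN"]
    return (c["TP"] + c["TN"]) / total if total else 0.0


# Toy example: five entities, evaluated for the Person type.
gold = ["PER", "PER", "LOC", "ORG", "LOC"]
predicted = ["PER", None, "LOC", "PER", "LOC"]

counts = outcome_counts(gold, predicted, "PER")
print(counts)            # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 2}
print(accuracy(counts))  # 0.6
```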
An examination of the misclassified proportion of the test data showed that a number of factors contribute to the classifier's failure to identify some named entities. The most prominent factors were found to be the ambiguity of named entities and the absence of infoboxes from Wikipedia articles. Although there are machine-learning-like solutions to the ambiguity issue, such as disambiguation methods, little can be done when infobox information is absent. Possibly the only sensible way to handle this matter is to remove the affected case(s) from the evaluation dataset in the validation and state-of-the-art comparison stages. Disambiguation is the process of normalizing named entities that have multiple surface forms and identifying their referents. For instance, Birmingham may refer to the largest city in Alabama, USA, or to the second largest city in the United Kingdom. An error analysis of the system's misclassifications, highlighting the factors leading to these errors, is presented in Table 3. The error analysis shows that ambiguity in Organization and Person names severely undermines the system's performance. This is perhaps due to the use of abbreviations for longer organization names and the presence of common cultural names, e.g., John and Mohamed, shared by thousands of people in the Wikipedia database. Disambiguating named entities in Wikipedia has been studied [5] and is still an active research problem.
The results in Table 3 also show that a high proportion of Wikipedia named entity articles lack infobox information. Experimental results revealed that 50% of the unclassified Person entity articles have no infobox in Wikipedia. The figure is slightly lower for the other two entity types considered. As the system relies on information in the infobox, the absence of an infobox from an entity article makes the system unable to identify the related named entity. Because of the importance of infoboxes, [26] proposed an author assistant tool that automatically suggests infoboxes to contributing authors.
In Table 3, the column labelled Others combines other factors, including redirected pages and the technical difficulty of extracting the infobox from Wikipedia articles whose structure lacks regular patterns. Sometimes the availability of an infobox in an article does not guarantee the presence of the core attributes. The fact that some Wikipedia article infoboxes do not contain core attributes such as coordinates made this factor responsible for the largest percentage (77.8%) of unclassified Location named entities. This again precluded the classification of these entities on the basis of their core attributes.
Following the error analysis and prior to the second round of evaluative experiments, Wikipedia-assisted disambiguation is used to exclude all ambiguous names. Similarly, all named entities whose Wikipedia articles lack infobox tables have been iteratively removed from the evaluation dataset.
In the second round, experiments were conducted using named entities constructed from the CoNLL-2003 shared task data for named entity recognition, in order to observe three of the traditional information retrieval metrics, namely precision, recall and F-measure. Precision is the proportion of classified named entities that belong to the target type. It is defined by the relationship in expression 3.

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{3} \]
Likewise, recall (expression 4) measures the proportion of named entities of a given type that have been correctly classified.

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{4} \]
Due to the trade-off between precision and recall, the F-measure has been developed as a single measure that combines the effect of both metrics, as formulated in equation 5.

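\[ F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5} \]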
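The three metrics can be computed from the outcome counts as in the following Python sketch. It assumes the standard definitions of precision and recall and the balanced F-measure (the harmonic mean of the two); the counts in the example are invented for illustration and are not taken from Table 4.

```python
def precision(tp, fp):
    """Fraction of entities assigned to a type that truly belong to it."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of entities of a type that were assigned to it."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Invented counts: 90 true positives, 10 false positives, 30 false negatives.
p = precision(90, 10)
r = recall(90, 30)
print(p, r, round(f_measure(p, r), 3))  # 0.9 0.75 0.818
```

The example shows why a combined measure is useful: the F-measure (0.818) lies between the two values but is pulled towards the lower one, so a high precision cannot hide a weaker recall.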
The overall classifier results in terms of these three metrics are summarized in Table 4. The F-measure scores for locations and organizations indicate that the selected core attributes represent good classification criteria for identifying Wikipedia entities. Again, this study's results confirmed that these attributes are mainly added by article contributors when they author Wikipedia articles by adapting infobox templates. Person names achieved the highest F-score, as their ambiguity has been accounted for.

C. State-of-the-Art Comparison
Comparing the study's infobox-based matching approach with related state-of-the-art schemes for named entity classification and extraction is not trivial. Major discrepancies arise from the peculiarity of each approach in terms of the Wikipedia features (article text, links, categories, infoboxes) used for entity identification. In addition, there might be significant differences in the evaluation data and, for language-dependent schemes, in the Wikipedia language. Nevertheless, an approximate comparison of the system with three baselines is provided in Table 5. The baselines were chosen for their closeness to the system in terms of their use of infobox information and related features. Table 5 compares the outcomes of the overall classification system, in terms of F-score for each type of named entity, with three state-of-the-art classification approaches (baselines). The baselines use infobox data as one of their classification features, whereas this system is built entirely on infobox attribute matching. Despite that, it is evident that it outperforms all baselines except [27], where a high F-score is reported for location-based named entities. However, there is still room for improvement by extending the work to identify Miscellaneous named entities and by further dividing the main entity types into subcategories, which have been considered by many state-of-the-art systems.

D. NE Extraction from Wikipedia
If any named entity with an entry in Wikipedia can be identified, it is reasonable to hypothesize that all Wikipedia articles on such entities can be recognized. Therefore, the proposed classification algorithm is applied to the English Wikipedia dump dated 3 February 2014. Table 6 shows the number of named entities of each type extracted from the Wikipedia database. The number of named entities obtained through this approach (1575966) exceeds the number of Wikipedia articles on named entities (1547586) derived from the same database in [8]. One may argue that [8] is an earlier study and that Wikipedia is constantly growing in size. This is true to an extent; however, this study has considered only three types of named entities, while [8] contains Miscellaneous named entities in addition to the three considered in this work. The generated database of named entities can be used as training data for supervised classification strategies.
V. CONCLUSION
A Wikipedia-based approach for predicting three types of named entities, namely Person, Location and Organization, using article infoboxes is presented. Unlike common state-of-the-art approaches, which employ a set of multiple features such as article text, categories and links, among others, this study relies on a single feature consisting of the structured information in the infobox table. This has significantly reduced the classifier's processing time, which would be useful for delay-sensitive applications requiring identification of designated names. Despite the use of a single feature, the proposed approach achieves a classification accuracy above 97%, with 3600 named entities and the CoNLL-2003 shared task NER dataset used to validate the classifier's performance. Applying the same algorithm to the Wikipedia database has resulted in the extraction of around 1.6 million named entities belonging to these three types. As future work, the ongoing study aims to extend the infobox-based entity identification to generate fine-grained entity classes in which each of the main types can be further subdivided into multiple subtypes.

TABLE I. CORE ATTRIBUTES EXTRACTED FROM INFOBOX TEMPLATES

TABLE II. RESULTS: ACCURACY WITH VARYING DATA SIZES

TABLE III. ERROR CAUSING FACTORS AND SYSTEM MISIDENTIFICATIONS

TABLE IV. OVERALL CLASSIFIER RESULTS

TABLE VI. SUMMARY OF EXTRACTED WIKIPEDIA NES