A Disaster Document Classification Technique Using Domain Specific Ontologies

Manual data collection and entry is one of the bottlenecks in conventional disaster management information systems. Time is a critical factor in emergency situations, and timely data collection and processing may help save lives. An effective disaster management system needs to collect data from the World Wide Web automatically. A prerequisite for the data collection process is a document classification mechanism that assigns a document to one of several categories. Ontologies are formal bodies of knowledge used to capture machine-understandable semantics of a domain of interest and have been used successfully to support document classification in various domains. This paper presents an ontology-based document classification technique for automatic data collection in a disaster management system. A general ontology of disasters is used that contains the description of several natural and manmade disasters. The proposed technique augments the conventional classification measures with ontological knowledge to improve the precision of classification. A preliminary implementation of the proposed technique shows promising results, with up to 10% overall improvement in precision when compared with conventional classification methods.
Keywords—Disaster Management; Document Classification; Ontology; Supervised Learning; Information Retrieval

semantic disaster management system to support disaster management. The proposed system comprises the following components:
• A knowledge base is used to formally capture knowledge about disasters and disaster management in the form of disaster ontologies. A base level disaster ontology is developed by Afzal et al. [2].
• A data collection component collects disaster-related information from various resources on the World Wide Web such as blogs, social networks, wiki sites, news sites, and government and non-government organizations [3]. The ontology developed during the previous phase may also be used to support data collection.
• A reasoner is used to perform reasoning on ontologies and the instance data collected by the data collection component. This process produces useful information to support disaster management, such as location of disaster, intensity of disaster, information about inaccessible routes in the affected area, services required in affected areas, infrastructure damage, number of casualties, livestock loss, and services available and required in nearby hospitals.
• An alert management sub-system sends alerts to various stakeholders such as hospitals, government organizations, non-government organizations and volunteers to support decision making for effective disaster management.
This paper presents a document classification technique that can be used in the data collection phase of SAHARA. The first step during data collection is to label a newly found document according to specified categories. A supervised learning approach is used because the categorization information is already available in the form of an ontology. These categories are formed by various concepts and properties in the domain of disaster management. A set of measures usually used in conventional classification techniques is supported with ontological knowledge to improve the precision of the classification process. The conventional measures include URL of a link, anchor text, inbound links, position & frequency of the target category and URL depth of the document being processed. Ontology computations involve ontology concepts, properties, relationships, annotations and instances.
The rest of the paper is organized as follows: Section 2 presents a review of the use of ontologies in disaster management systems. Section 3 gives details of the proposed technique. Results are presented in section 4, followed by the conclusion and future directions in section 5.
www.ijacsa.thesai.org

II. RELATED WORK
To find the relevance of a document to the target concept in a distributed environment like the Internet, the traditional approaches in document classification focus on processing links in the document, popularity of the document through inbound links, and frequency and position of the term in the document. More recently, researchers have also used ontologies to support the classification process. As ontologies are used to capture domain knowledge in a formal and explicit way, they are a natural choice in the document classification process. Ontologies have been used in a diverse range of domains from cultural heritage [4] to 3D modeling [5], e-commerce [6] to health services [7], human anatomy [8] to fraud detection [9] and cyber warfare [10] to agriculture [11]. Punitha et al. argue that ontology augmentation can improve the document classification process significantly [12].
Disaster management systems can also benefit significantly from ontologies in various phases and tasks of disaster management. Hristidis et al. have identified five phases in disaster management that need data analysis and management, namely information extraction, information retrieval, information filtering, data mining and decision support [13]. Each of these phases has its own unique challenges, and researchers have explored the use of ontologies in all of them. [15]. The proposed method achieved up to 93% accuracy and 64.5% recall for some concepts.
Fan and Zlatanova have used ontologies for semantic interoperability in disaster management [16]. The proposed methodology comprises two phases. In the first phase, ontologies are developed and evaluated for actors, static & dynamic data models, processes and tasks. In the second phase, several ontologies are matched together to identify and match common concepts in these ontologies. Ontologies are also updated if required. The authors have used a primitive case study to validate the proposed methodology.

Haghighi et al. have proposed the Domain Ontology for Mass Gatherings (DO4MG), an ontology for intelligent decision support in medical emergency management for mass gatherings [17]. The top level concepts in the ontology include Environmental Factors, Mass Gathering Plan, Gathering Type, Crowd Features and Event Venue. Two evaluation approaches, namely criteria-based evaluation and application-based evaluation, are used to evaluate the developed ontology. A prototype system is developed for application-based evaluation. The results are encouraging and indicate that the DO4MG ontology can be used effectively to support the decision making process in mass gatherings. Amailef and Lu have proposed a similar system and shown its effectiveness in supporting case-based reasoning in m-government emergency response services [18].

Chen et al. have proposed an ontology-based decision support system for disaster management in typhoons [19]. The proposed system comprises three phases including feature extraction, damage prediction and risk analysis. An ontology is used to support these phases. The authors argue that the performance of the system depends on accuracy and completeness of the knowledge captured by ontologies.
Cabacas et al. have proposed an ontology-based messaging system to utilize social relations as a service [20]. The user query is analyzed by the system to "understand" the user's social and physical environment. A service matching component finds the most suitable service based on several criteria such as location, time and situation. Finally, a service messenger component broadcasts the message to the concerned stakeholders.
Hristoskova et al. have used a set of generic as well as domain specific ontologies to support the reasoning process in disaster management [21]. A data aggregator component collects data from various devices and sensors. This data is passed on to a context engine, which updates/queries a semantic model composed of ontologies. The context engine also interacts with a decision engine for updating, querying and evaluating the rules. The proposed approach is validated through implementation in two scenarios.
A critical analysis of the related work strengthens the case for developing an ontology-based document classification method for disaster management systems that can be used to categorize various kinds of documents from the World Wide Web.

III. PROPOSED METHODOLOGY
The proposed approach attempts to categorize a document with a target concept in the domain of disaster management. The process is divided into three phases, namely link relevance, page relevance and ontology relevance. Each phase produces a relevance score, and these scores are finally combined into an overall document relevance score. The details of these three phases are as follows.
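
The combination step can be pictured with a minimal sketch. The paper does not state how the three scores are combined, so the weighted sum, the function name `document_relevance` and the default weights below are illustrative assumptions only.

```python
def document_relevance(link_rel, page_rel, ontology_rel,
                       weights=(0.2, 0.3, 0.5)):
    """Combine the three phase scores into one document relevance score.

    Assumption: a weighted sum with hypothetical weights; the combination
    rule is not specified in the text. Ontology relevance gets the largest
    weight here only to mirror the emphasis placed on ontological knowledge.
    """
    w_link, w_page, w_onto = weights
    return w_link * link_rel + w_page * page_rel + w_onto * ontology_rel


if __name__ == "__main__":
    # Scores are assumed to be normalized to the range [0, 1].
    print(document_relevance(0.4, 0.6, 0.9))
```

With weights that sum to one and inputs in [0, 1], the combined score also stays in [0, 1], which keeps scores comparable across documents.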

A. Link relevance computations
Link relevance is based on the measures commonly used in classical clustering methods. These include anchor text, URL text, and link popularity. A page is assigned a higher relevance score if the target concept appears in the anchor text and URL text. The relevance score is also higher for a popular page, i.e., a page with a larger number of inbound links from external documents.
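
As a rough illustration of these three link measures, the sketch below scores a link from its anchor text, URL text and inbound link count. The weights, the logarithmic scaling of popularity and the `max_inbound` cap are assumptions made for the example; the paper does not give these details.

```python
import math
from urllib.parse import urlparse


def link_relevance(concept, anchor_text, url, inbound_links,
                   w_anchor=0.4, w_url=0.3, w_pop=0.3, max_inbound=1000):
    """Score a link against a target concept (all weights are hypothetical)."""
    concept = concept.lower()

    # 1.0 if the concept appears in the anchor text, else 0.0.
    anchor_hit = 1.0 if concept in anchor_text.lower() else 0.0

    # 1.0 if the concept appears in the host or path of the URL, else 0.0.
    parsed = urlparse(url)
    url_hit = 1.0 if concept in (parsed.netloc + parsed.path).lower() else 0.0

    # Popularity: log-scaled inbound link count, capped at 1.0 (assumed scaling).
    popularity = min(1.0, math.log1p(inbound_links) / math.log1p(max_inbound))

    return w_anchor * anchor_hit + w_url * url_hit + w_pop * popularity


if __name__ == "__main__":
    print(link_relevance("flood", "Flood warnings",
                         "http://example.org/flood/alerts", 250))
```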

B. Page relevance computations
The structure and content of a document/webpage play an important role in computing its relevance to a particular concept. Page relevance computation is further divided into the following measures:
1) Term frequency-inverse document frequency (TF-IDF)
The TF-IDF score is a classical method of assigning more weight to a term that is frequent in a document and a lower weight to terms that are unimportant across the entire document collection. Several variations exist [22]; the term-frequency score $P_{tf}$ is computed from $f_{t,d}$, the frequency of term $t$ in document $d$.
A commonly used formula for calculating inverse document frequency is:
$$P_{idf} = \log \frac{N}{N_t} \qquad (2)$$
where $N$ is the total number of documents in the collection and $N_t$ is the number of documents in which term $t$ appears.
Finally, $P_{tf\text{-}idf}$ can be calculated by simply multiplying $P_{tf}$ and $P_{idf}$:
$$P_{tf\text{-}idf} = P_{tf} \cdot P_{idf} \qquad (3)$$
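
The TF-IDF computation can be sketched as follows. Two assumptions are made for illustration: the term-frequency score is taken to be the raw count of the term in the document, which is only one of the possible variations, and inverse document frequency is taken as the natural logarithm of N divided by N_t. Function names are hypothetical.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: raw count of the term in the document (assumed variant)."""
    return Counter(doc_tokens)[term]


def idf(term, collection):
    """Inverse document frequency: log(N / N_t); 0.0 if the term never occurs."""
    n_t = sum(1 for doc in collection if term in doc)
    if n_t == 0:
        return 0.0
    return math.log(len(collection) / n_t)


def tf_idf(term, doc_tokens, collection):
    """TF-IDF as the product of the two scores."""
    return tf(term, doc_tokens) * idf(term, collection)


if __name__ == "__main__":
    docs = [
        ["flood", "river", "flood"],
        ["earthquake", "damage"],
        ["flood", "relief"],
        ["storm"],
    ]
    # "flood" occurs twice in docs[0] and in 2 of the 4 documents.
    print(tf_idf("flood", docs[0], docs))
```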

2) Attribute relevance
The position of a term appearing in a document plays an important role in classifying a document. If a term appears in the title or in a first or second level heading, then the document is more relevant to that term than another document in which the same term appears only in a paragraph.
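
A minimal sketch of position-based scoring is given below. The position labels and their weights are hypothetical values chosen for the example; the only property taken from the text is that a title or heading occurrence counts for more than a paragraph occurrence.

```python
# Hypothetical weights: title and headings count more than body paragraphs.
POSITION_WEIGHTS = {"title": 1.0, "h1": 0.8, "h2": 0.6, "paragraph": 0.2}


def attribute_relevance(term, sections):
    """Return the weight of the most prominent position containing the term.

    `sections` is a list of (position, text) pairs, e.g. ("title", "Flood news").
    """
    term = term.lower()
    best = 0.0
    for position, text in sections:
        if term in text.lower():
            best = max(best, POSITION_WEIGHTS.get(position, 0.0))
    return best


if __name__ == "__main__":
    page = [("title", "Flood relief update"),
            ("paragraph", "Relief camps have been opened.")]
    print(attribute_relevance("flood", page))
```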
3) URL depth
URL depth refers to how deep a web page lies in a website. The closer a webpage is to the site root, the more relevant it is considered to be to the target concept. A webpage located deeper in a site hierarchy is considered to be less important.
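
URL depth can be computed by counting path segments, as in the sketch below. The mapping from depth to a score, 1 / (1 + depth), is an assumption chosen so that pages nearer the site root score higher; the paper does not give the exact function.

```python
from urllib.parse import urlparse


def url_depth(url):
    """Number of non-empty path segments below the site root."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def url_depth_relevance(url):
    """Assumed scoring: 1.0 at the site root, decreasing with depth."""
    return 1.0 / (1 + url_depth(url))


if __name__ == "__main__":
    print(url_depth("http://example.org/news/2014/flood.html"))
    print(url_depth_relevance("http://example.org/news/2014/flood.html"))
```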

C. Ontology relevance computation
As mentioned above, ontologies are an excellent source of knowledge for document classification because they are formal bodies of knowledge developed for specific domains. In this work, the base level disaster ontology developed by Afzal et al. is used [2]. The top level concepts in the ontology include Disaster, Disaster Location, Disaster Date, Losses, Services, Service Providers, and Relief Items. A partial hierarchy of the Services concept in the ontology is given in Fig. 1. Fig. 2 shows a detailed description of the Transportation Hazard concept in the ontology. The details of ontology relevance computations are given below:

1) Ontology concepts
A positive match between the concepts in a document and the ontological concept to be classified may serve as an important document classification measure. This measure is given the highest weight in our classification process because of the formal semantics captured in an ontology.

2) Ontology properties
Ontology properties relate concepts to literals only. For example, OccurredOn is a property of the Disaster concept that describes the date and time of occurrence of a disaster. Ontology properties can play an important role in document classification as they are used to define the concept unambiguously. Two cases may arise. First, if an ontology concept is matched in a document and the properties are also similar, then the confidence of relevance is very high. Second, if the concepts are different but there is high similarity between the properties, then the document is still likely to be relevant, and it is assumed that different synonyms are used for the same concept.
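
The two cases described for ontology properties can be expressed as a small decision function. This is a sketch under stated assumptions: the `threshold` that separates "high" from "low" property similarity and the returned labels are hypothetical and do not come from the paper.

```python
def property_match_outcome(concept_matched, property_similarity, threshold=0.8):
    """Interpret a concept match together with a property similarity score.

    `property_similarity` is assumed to lie in [0, 1]; `threshold` is an
    assumed cut-off for treating the similarity as high.
    """
    high_similarity = property_similarity >= threshold
    if concept_matched and high_similarity:
        # Case 1: concept and properties both agree.
        return "high-confidence match"
    if not concept_matched and high_similarity:
        # Case 2: properties agree, so a synonym of the concept is assumed.
        return "probable synonym"
    if concept_matched:
        return "concept match only"
    return "no match"


if __name__ == "__main__":
    print(property_match_outcome(False, 0.9))
```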

3) Ontology relationships
While ontology properties establish a link between ontology concepts and literals, ontology relationships are used to relate concepts with other concepts. Ontology relationships can give contextual and domain information. For example, hasLocation relates the Disaster concept to the Location concept. Relationships are an important measure for document classification as they can help reduce ambiguity by providing contextual information.

4) Ontology annotations
An ontology may have a number of annotation properties. For example, SeeAlso can be used to point to another source describing the same concept. Other examples include Label, Comment and IsDefinedBy. These annotations may be used to give synonyms of a term, refer to other resources for further description or give human-readable labels.

5) Ontology instances
Instances relate concrete things to a general class of concepts, e.g., Katrina is an instance of the Hurricane disaster. A document containing an instance of the target concept is assigned a higher weight.

D. Proposed Algorithms
Algorithms are defined for the three computational components described above, i.e., link relevance, page relevance and ontology relevance. The algorithm for ontology relevance computation is given below.

Algorithm OntologyToConceptRelevance
Inputs: Word vector of document; word vectors of ontology concepts, properties, annotations and instances; weights assigned to concepts, properties, relations, annotations, instances and cosine similarity
Output: Ontology relevance score
Let
Concept = Target concept in disaster domain
$S_c, S_p, S_r, S_a, S_i, S_0$ = Temporary variables to store relevance scores for ontology concepts, properties, relations, annotations, instances and ontology respectively
$Rel_c, Rel_p, Rel_r, Rel_a, Rel_i, Rel_0$ = Relevance of ontology concepts, properties, relations, annotations, instances and ontology with the target concept respectively
$CS$ = Cosine similarity measure of the document
$CS_c, CS_p, CS_r, CS_a, CS_i$ = Cosine similarity measures for concepts, properties, relations, annotations and instances respectively
$W_c, W_p, W_r, W_a, W_i$ = Cosine similarity measure weights for concepts, properties, relations, annotations and instances respectively
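
Since only the inputs, output and variable definitions of the algorithm are listed, the sketch below shows one plausible way to turn them into a score: the cosine similarity between the document's word vector and each ontology word vector is multiplied by the corresponding weight, and the weighted values are summed. The summation rule, the function names and the example weights are assumptions, not the authors' algorithm.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (dict-like)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ontology_relevance(doc_vector, ontology_vectors, weights):
    """Weighted sum of cosine similarities, one per ontology component.

    `ontology_vectors` maps a component name ("concepts", "properties",
    "relations", "annotations", "instances") to its word vector; `weights`
    maps the same names to hypothetical weights.
    """
    score = 0.0
    for component, vector in ontology_vectors.items():
        similarity = cosine_similarity(doc_vector, vector)
        score += weights.get(component, 0.0) * similarity
    return score


if __name__ == "__main__":
    doc = Counter(["hurricane", "katrina", "flood"])
    vectors = {"concepts": Counter(["hurricane"]),
               "instances": Counter(["katrina"])}
    print(ontology_relevance(doc, vectors, {"concepts": 0.6, "instances": 0.4}))
```

Giving the concepts component the largest weight is consistent with the statement that concept matches receive the highest weight, but the actual values would need to be tuned on labeled documents.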

IV. RESULTS AND DISCUSSION
The proposed algorithm is tested on eighteen sets of documents related to various concepts in the disaster management domain. These documents are categorized by human reviewers for their relevance to the target concepts. The results of conventional and ontology-based classification are then compared. Fig. 3 shows the results of the proposed algorithm on the 18 sets of documents; each set consists of 20 documents and the results are averaged for each set. The first six sets of documents (Set1-Set6) were highly relevant to the target concept. The next six document sets (Set7-Set12) were moderately related to the target concept. The last six sets (Set13-Set18) were unrelated to the target concept. The results show that the ontology-based classification performed better for both highly relevant and irrelevant documents. The proposed algorithm ranked relevant documents higher than the conventional technique. The overall average gain achieved was 11%. For moderately relevant documents, the difference between the proposed and traditional algorithms was marginal, i.e., 3%. In the case of unrelated documents, the proposed algorithm ranked the documents lower than the traditional algorithms. In this case, the average difference was 9%. Hence, the proposed algorithm achieved an overall improvement of about 10% because of the use of ontologies.

V. CONCLUSION AND FUTURE WORK
The proposed ontology-based document classification technique outperforms the conventional methods because of the formal semantics provided by the ontology. The initial evaluation on a selected set of documents showed up to 10% overall improvement in the precision of classification. However, the proposed technique has some limitations. First, it depends on the availability of ontologies. As there are no standard disaster ontologies available, the performance of a typical system depends on the quality and accuracy of the ontologies used. Another limitation is the lack of instance data. Also, the ontological processing is computationally expensive compared to traditional approaches.
Future work involves evaluation of the proposed technique in a distributed environment like the World Wide Web. A real-life implementation in a particular disaster situation is also required to evaluate the proposed methodology. Moreover, in this work, a general ontology of disaster management is used that covers several kinds of disasters. One may also consider using a specific ontology targeted at a particular kind of disaster to improve the effectiveness of the proposed approach, e.g., an earthquake ontology for classifying earthquake-related documents and a tsunami ontology for tsunami-related documents. More specific ontologies may also have the added advantage of improved efficiency because of their narrower coverage of the domain. Another future direction may focus on the selection of ontologies in real time. In this case, the system is not given an initial ontology as input; instead, the most suitable ontology is selected based on the first few documents. A system may also be designed to use different ontologies for different sets of documents. The criteria might include the level of granularity or specificity of the concepts in the documents being processed.

Fig. 1. A subconcept hierarchy of Service concept in the disaster management ontology
Fig. 2.
Fig. 3. A comparison of precision of conventional and ontology-based classification approaches

$S_{tf} * S_{idf}$
Normalize $S_t$ and $S_h$ by length of document
$Rel_t \leftarrow S_t * W_t$
$Rel_{tf\text{-}idf} \leftarrow S_{tf\text{-}idf} * W_{tf\text{-}idf}$
$Rel_h \leftarrow S_h * W_h$
$Rel_p \leftarrow Rel_t + Rel_{tf\text{-}idf} + Rel_h$

vector of concepts, properties, relations, annotations, and instances in the ontology respectively
$W_c, W_p, W_r, W_a, W_i, W_{CS}$ = Weight assigned to concepts, properties, relations, annotations, instances and cosine similarity respectively
$S_c, S_p, S_r, S_a, S_i, S_0 \leftarrow 0$
$Rel_c, Rel_p, Rel_r, Rel_a, Rel_i, Rel_0 \leftarrow 0$