Data Citation Service for Wikipedia Articles

The citation of big scientific data is crucial not only for scientific activity but also for scientific discovery and dissemination within scientists' networks. The main objective of this research is to develop a service-oriented data citation system using data mining techniques for scientists in the Middle East and North Africa. A novel service-oriented framework is proposed to prototype the development of the system; it includes query formalization, service discovery, service composition design, service selection, search space, and service optimization. In this research, scientific-related Wikipedia articles are connected with more than 35 petabytes of Pangaea datasets. The output of this work is a web service that takes Wikipedia article information as input and returns the relevant datasets (if any exist) related to the article. The evaluation is based on a quantitative assessment of web service quality metrics, such as the number of accesses and bandwidth utilization, which shows that the framework is robust enough to handle both big data access and its citation.

Keywords: Scientific dataset; web services; Wikipedia; Pangaea; big data


I. INTRODUCTION
Scientific data residing in datasets are usually considered a shared resource, so that the worldwide scientific community can access these datasets for their own purposes. Like journal and conference papers, which are cited in scientific papers, datasets are also significant candidates for citation in scientific papers and articles. A data citation is a reference or link from one piece of content to other content in the form of data or a document. Such datasets reside in many big data centers. Pangaea is one of the publishers of scientific data, mostly from earth and environmental science. It is open to any scientist or project willing to archive and publish data. This online scientific repository gives access to a huge number of (mostly free) datasets, more than 35 PB in size. Wikipedia, on the other hand, is a gigantic collaborative encyclopedia consisting of a huge number of articles, including a scientific category. The scientific-related articles of Wikipedia, just like any other scientific articles, have a dedicated citation section. However, Wikipedia does not provide a data citation section, because finding the relationship between Wikipedia articles and big scientific data is not an easy task.
A data mining technique is proposed in this project to find the connection between these two huge data resources, i.e., Wikipedia and Pangaea. Applying data mining to these big data yields the relationship between a Wikipedia article and a Pangaea dataset. Whether Wikipedia articles can easily be connected to Pangaea datasets is debatable, but doing so certainly requires additional layers of work. This means that some intelligent services must be applied so that a Wikipedia article can be linked to a Pangaea dataset.
It is believed that the research approach adopted in this study constitutes an appropriate and important way to reach an international audience with the help of a web service, particularly one used by scientists in the Middle East and North Africa region. A user inputs information about a Wikipedia article, and the web service outputs the relevant Pangaea datasets, if any exist, together with some statistical measures of its performance. This service is helpful for anyone who wants to enhance a Wikipedia article by editing it and adding data citations for relevant datasets. In addition, it helps scientists collect Wikipedia articles related to their research interests and download the cited scientific data according to their research purposes.
The main objective of this research is to develop an automatic service-oriented data citation system that utilizes the model produced by data mining techniques. In detail, the objectives are as follows:
1) To develop a robust mechanism to connect scientific-related articles with scientific datasets.
2) To provide a service-oriented framework for citing large-scale data.
3) To give attribution to the contributors of scientific data and written web documents.

II. RELATED WORKS
Scientific data citation has been attracting many researchers, not only in the field of computer science [1]. There have been enormous efforts in dealing with big data in e-science [2]. However, some challenges remain [3]. One of them is dealing with data too big to analyze, as highlighted in [4]. To date, two technological breakthroughs are available to overcome this issue: grid computing [5] and MapReduce [6]. Because MapReduce is the more recent technology and has proven robust in archiving scientific datasets, it is necessary to utilize MapReduce to store the big scientific datasets before analyzing them.
An automatic data citation idea was previously proposed as a poster in [7]. Although it is still a conceptual idea with no implementation yet, the authors of [7] give a rough indication that automatic data citation is very important. In this research, the same idea is implemented for scientific-related Wikipedia articles by utilizing a robust data mining technique, i.e., association rule discovery, applied to find the relationships between the text attributes provided by Wikipedia articles and the data attributes provided by scientific datasets.
Another similar effort, dealing with scientific workflows, has been proposed in [8]. There, a novel rule-based model and infrastructure is considered the best approach for handling the huge explosion in scientific data. This is due to the internal requirement of processing streams of scientific data, especially from social media and sensors. This approach is suitable for real-time and emergency data, but it is not necessarily required for data citation. Data citation is mostly processed through metadata, without mining the big scientific data itself.
An effort to utilize Wikipedia articles for a better understanding of science has been made by Mietchen and others [9]. The authors created a bot that automatically searches for multimedia files in the medical domain and uploads them to the Wikipedia media repository. The bot has attached more than ten thousand files to hundreds of Wikipedia articles. It works by exploiting the XML tag values of the files. The bot exploits neither semantic capabilities nor data mining techniques.
The first definition of data service was coined in [10], which aims at identifying the technological trends in simple, united, and cloud-based data services. The authors define data services as descendants of the stored procedure in relational database management systems, in which programmers and database administrators can code both SQL queries and programming control logic in one place, providing both query and function optimization for getting at the data, as is normally achieved in the compilation of procedural programming languages. The stored procedure is used as an analogy for a data service because, in some cases, only stored procedures are able to access important and sensitive data. The authors also characterize a data service as a service based on a proprietary model. The advantage of this definition is that it gives an enhanced and more data-oriented perspective on the data. In addition, the architecture is extended in this research to fit the characteristics of a data citation service.
Perhaps the work most similar to this research is the one proposed in [11], which provides application-oriented search. That work exploits the scientific metadata generated by scientific experiments. With the power of indexing, the new search relies not only on keyword-based matching but also on the semantic web and its ontologies to increase search accuracy. In this research, it is argued that data mining techniques are more appropriate for analyzing big scientific data, because Wikipedia articles need to be related to the scientific data. Indexing scientific metadata is not the focus of this research; the scope is to analyze the relationship between big scientific data and Wikipedia articles.

III. SCIENTIFIC DATASET
"As research on so many fronts is becoming increasingly dependent on computation, all science, it seems, is becoming computer science," as the New York Times announced in a famous 2001 article [1]. Recently, many organizations have accumulated data from various sources, such as the Web and network sensors, and constructed large-scale data collections. Some organizations publish their data to facilitate the activities of other organizations. For example, science data in a wide range of domains have been registered in WDS (World Data System). Another example is Pangaea, a system that provides the part of the science data on WDS related to the earth and environment; it contains 0.6 million datasets and 35 PB of data in total. These 0.6 million datasets can be referenced by their DOIs for direct access and citation.
Pangaea is chosen in this research because it is considered the biggest repository for scientific data focusing on earth and environmental science. This enables a contribution to the local scientific society, since earth and environmental data are usually specific to a geographical area. Another advantage is that Pangaea provides a comprehensive set of metadata that makes it easier to identify and analyze datasets. For example, one of the datasets residing at Pangaea has the following title: "Contents of rare earth elements and some rock-forming chemical elements in bottom sediments from some deeps of the Red Sea." Fig. 1 depicts what type of information is available for a particular dataset. The attributes of the dataset are also present on the webpage from which the snapshot is taken, but they are not shown here in order to save space.
Wikipedia is a collaborative encyclopedia consisting of millions of articles. Some of them are scientific-related articles that can be expanded by connecting them to relevant datasets residing at Pangaea, for instance by connecting the dataset discussed above with the relevant Wikipedia article. This makes the Wikipedia articles more informative and useful. Since the dataset concerns Red Sea minerals, it must be validated whether Wikipedia has such information in any of its articles. Further search reveals that Wikipedia has an article on the Red Sea with a section that discusses the minerals found in the depths of the Red Sea, as illustrated in Fig. 2.
Another section of the Wikipedia article that discusses the minerals is illustrated in Fig. 3. Hence, the relation between Wikipedia articles and scientific datasets can be one to one, one to many, or many to many, which requires the system to be flexible enough to cite on the fly.
After manually checking the two resources, it can be inferred that this Pangaea dataset can be cited in the above Wikipedia article. However, there is an urgent requirement for automatic citation that relates millions of Wikipedia articles to big scientific datasets, based on a heterogeneous, loosely coupled, and platform-independent application that utilizes web service technology [12].

IV. METHODOLOGY
To achieve the research objective, the following steps are envisioned.
• Development of Corpus.
In this step, full Wikipedia articles are fetched as dumps into the database. Moreover, the full metadata for every dataset from Pangaea is also downloaded. This corpus is expected to bring the work into the realm of big data.
• Preprocessing of Data.
Since the articles are in the form of text, many text processing techniques are applied to make them suitable for data mining. The Wikipedia articles require special preprocessing in order to produce output that is used as input for the next step.
• Application of different Data Mining Techniques.
Since there is no training data, association rule mining algorithms as well as different similarity measurement techniques are applied to find the related articles.
• Web Service for World Wide Audience.
A web service is dedicated to users who are interested in finding the relevant scientific datasets for a particular Wikipedia article. It accepts Wikipedia article information as input. The output of the service is the relevant and related Pangaea datasets.
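The preprocessing and similarity steps listed above can be sketched as follows. This is a minimal illustration, not the system's actual pipeline: the stop-word list, the `tokenize` and `cosine_similarity` helpers, and the article sentence are assumptions made for the example; only the dataset title is taken from the text.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"the", "of", "in", "and", "a", "an", "from", "some", "is", "are", "on", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative article sentence (an assumption, not quoted from Wikipedia).
article = "Minerals found in the deeps of the Red Sea include sediments rich in rare elements."
# Dataset title as given in the text.
dataset_title = ("Contents of rare earth elements and some rock-forming chemical "
                 "elements in bottom sediments from some deeps of the Red Sea.")
# Illustrative unrelated text (an assumption).
unrelated = "Population statistics of European cities."

related_score = cosine_similarity(article, dataset_title)
unrelated_score = cosine_similarity(article, unrelated)
```

Under these assumptions, the Red Sea dataset title scores higher against the article sentence than the unrelated text does, which is the behavior a similarity measure needs in order to propose candidate datasets without training data.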
To realize the expected results, the following mechanism is proposed.

1) A properly archived corpus of Wikipedia articles and Pangaea metadata. To obtain the corpus, an open-source, MapReduce-based Hadoop archive system is employed.
2) Ordered association and relationship information between scientific-related Wikipedia articles and Pangaea big datasets. To obtain this information, robust association rule data mining techniques are employed.
3) A Scientific Data Citation Service for Wikipedia scientific articles, Pangaea big datasets, and the relationship information, supporting a platform-independent, autonomous, dynamic, reusable, heterogeneous, self-contained, and loosely coupled system. To develop the service, XaaS (Everything as a Service) techniques are employed, including DaaS (Data as a Service), TaaS (Text as a Service), CaaS (Citation as a Service), and SaaS (Software as a Service).
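To make the association rule step more concrete, the sketch below computes support and confidence for single-item rules of the form "Wikipedia term implies Pangaea attribute". It is a simplified illustration rather than the algorithm used in this research: the transactions, the `wiki:` and `pangaea:` item prefixes, and the thresholds are all hypothetical.

```python
from itertools import product

# Hypothetical transactions: each set joins the terms of one Wikipedia article
# with the metadata attributes of a dataset it co-occurs with.
TRANSACTIONS = [
    {"wiki:red_sea", "wiki:minerals", "pangaea:sediment", "pangaea:rare_earth"},
    {"wiki:red_sea", "wiki:salinity", "pangaea:sediment"},
    {"wiki:arctic", "wiki:ice", "pangaea:ice_core"},
    {"wiki:red_sea", "wiki:minerals", "pangaea:rare_earth"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    base = support(antecedent, transactions)
    return support(antecedent | consequent, transactions) / base if base else 0.0


def mine_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (wiki_term, pangaea_attribute, support, confidence) tuples."""
    items = set().union(*transactions)
    wiki_terms = sorted(i for i in items if i.startswith("wiki:"))
    attributes = sorted(i for i in items if i.startswith("pangaea:"))
    rules = []
    for term, attr in product(wiki_terms, attributes):
        s = support({term, attr}, transactions)
        c = confidence({term}, {attr}, transactions)
        if s >= min_support and c >= min_confidence:
            rules.append((term, attr, s, c))
    return rules


rules = mine_rules(TRANSACTIONS)
```

A full implementation would typically use an Apriori-style or FP-growth search over larger itemsets; the exhaustive pairwise scan here is only practical for toy inputs.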
The mechanism of XaaS is implemented in a service-oriented framework to ease the prototyping of the data citation system, as proposed in Fig. 4. The framework starts with query formalization and decomposition to ease the search for scientific datasets. Service discovery follows, providing more semantic capability during the search process in a multi-ontological environment [13]. The design of service composition turns a simple search pattern into fully automatic composition. Service selection continues the process until it combines Wikipedia and Pangaea datasets. The search space is then improved beyond simple search by utilizing the previous approach of service atomization [14]. Finally, the simulation is conducted in several stages during service optimization of the first algorithm.
With this research, it is expected that in the long term the productivity of scientists in the Middle East and North Africa region will increase in terms of redoing experiments and of fast and easy access to scientific-related Wikipedia articles and the corresponding Pangaea scientific datasets. It is also expected that the management of information services in the Kingdom will improve due to the integral administration of data citation as an information asset. This can be achieved through the expected results of this project, which provide useful relationship information between Wikipedia articles and scientific data for use by scientists.
The data are analyzed to mine the relationship between two different objects in the properly archived corpus: Wikipedia articles and Pangaea metadata. Scientists can utilize the corpus for their research. For example, the Wikipedia scientific article corpus serves as a layman-friendly guide to research topics, and the Pangaea metadata corpus is used for redoing experiments.
Ordered association and relationship information between scientific-related Wikipedia articles and Pangaea big datasets is provided to scientists so that they can choose the right Pangaea datasets for particular scientific-related Wikipedia articles. The information is ordered by the degree of relatedness between the articles and the datasets, so that scientists have a variety of options when choosing appropriate datasets for particular Wikipedia articles. The Scientific Data Citation Service for Wikipedia scientific articles, Pangaea big datasets, and the relationship information is provided as web services to make the information publicly available and seamlessly integrable with any legacy system. The relationship service between scientific-related Wikipedia articles and Pangaea big datasets is utilized by scientists to cite the datasets most related to the articles. The archive service for Wikipedia scientific articles allows scientists to search and access the required articles based on the datasets they own or have accessed. The archive service for Pangaea big datasets allows scientists to search and access the required datasets based on the Wikipedia scientific articles they have written or accessed.
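The ordering by degree of relatedness described above could look like the following sketch. The `Candidate` type, the identifiers, the scores, and the `min_relatedness` and `top_k` parameters are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    dataset_id: str     # hypothetical identifier, not a real Pangaea DOI
    relatedness: float  # assumed to lie in [0, 1], higher means more related


def rank_candidates(candidates, min_relatedness=0.0, top_k=None):
    """Order candidates from most to least related, breaking ties by identifier."""
    kept = [c for c in candidates if c.relatedness >= min_relatedness]
    ranked = sorted(kept, key=lambda c: (-c.relatedness, c.dataset_id))
    return ranked if top_k is None else ranked[:top_k]


# Illustrative scores only; they are not results from this research.
candidates = [
    Candidate("hyp-b", 0.4),
    Candidate("hyp-a", 0.9),
    Candidate("hyp-c", 0.1),
]
ranked = rank_candidates(candidates, min_relatedness=0.2)
```

Returning an ordered list rather than a single best match keeps the final choice with the scientist, as the text intends.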

V. DATA MINING AND ITS ARCHITECTURE
This research proposes to utilize the data mining concept, due to the nature of big data. Data mining is able to identify what data are passed between services. Since this research involves several services that collaborate within one composition, identifying these data is non-trivial. The data citation service also requires data mining to know what services are available and what results are generated for particular sets of input values. More specifically, data mining enables the user to trace the process that led to the aggregation of services producing a particular output.
Data mining processes are combined in the form of a graph, which is implemented later as a rule or workflow. The graph contains both data and services. No single data mining technique is the focus, since all available data and services are grouped together depending on the aim of the model. For example, Fig. 5 shows a data citation service utilizing three data mining services and seven datasets. Based on the objective of the workflow, the data mining technique determines that there should be two service groups, S2 and S3, before composition with service S1. The group of service S2 is related to datasets D1 and D3, whereas the group of service S3 is correlated with datasets D3, D2, and D2' (a complement of dataset D2).
A different objective leads to a different composition. Fig. 6 illustrates that a similar workflow may have different data mining tools. All services and datasets are in the same group, except for the complementary datasets D0' and D2'.
To implement the data mining tools in the data citation service, a data citation service architecture is designed to accommodate the nature of scientific services and datasets, as illustrated in Fig. 7. This architecture is based on the basic data service architecture proposed in [10]. The extension includes an enhancement of the traditional service, in which the operations of the service, i.e., inputs and outputs, are semantically unplugged from the clients and which has no data model inside. In the data citation service architecture, both data and metadata are requested and published through various methods, such as XML, JSON, or the Atom Publishing Protocol. The architecture builds on previous experience in combining two different architectures, namely workflow-based services and pipelining-based components [15]. Since this research does not use only traditional datasets, the architecture aims at accommodating both relational databases and various other data sources, such as the big data in Pangaea, which has spatial and temporal information. Hence, integration between conventional database management systems and other sources is enabled through this architecture. The everything-as-a-service components are treated as functional datasets so that they can be combined with relational datasets.
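The Fig. 5 grouping can be written down as a small graph, as in the sketch below. It encodes only the services and datasets named in the text; the dictionary layout and the helper names are assumptions, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Datasets attached to each service group, as named in the text for Fig. 5.
SERVICE_GROUPS = {
    "S2": ["D1", "D3"],
    "S3": ["D3", "D2", "D2'"],
}

# Composition dependencies: S1 is composed only after groups S2 and S3.
COMPOSITION = {"S1": {"S2", "S3"}, "S2": set(), "S3": set()}


def execution_order(dependencies):
    """Return the services in an order that respects the dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def shared_datasets(groups, first, second):
    """Datasets that appear in both service groups."""
    return set(groups[first]) & set(groups[second])


order = execution_order(COMPOSITION)
shared = shared_datasets(SERVICE_GROUPS, "S2", "S3")
```

In this encoding S1 always comes last in `order`, and D3 is the only dataset shared by the two groups.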
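As an illustration of publishing data and metadata through JSON, the sketch below assembles one possible response of the citation service. Every field name and value is a hypothetical placeholder (including the identifier, coordinates, and dates); only the dataset title is taken from the text, and the service's real schema is not specified there.

```python
import json


def build_citation_response(article_title, datasets):
    """Serialize an article and its candidate datasets as a JSON document."""
    payload = {
        "article": article_title,
        "dataset_count": len(datasets),
        "datasets": datasets,
    }
    return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder metadata with spatial and temporal fields; values are invented.
example_dataset = {
    "title": ("Contents of rare earth elements and some rock-forming chemical "
              "elements in bottom sediments from some deeps of the Red Sea."),
    "identifier": "hypothetical-id-001",
    "relatedness": 0.82,
    "spatial": {"latitude": 20.0, "longitude": 38.0},
    "temporal": {"start": "1970-01-01", "end": "1970-12-31"},
}

response_text = build_citation_response("Red Sea", [example_dataset])
decoded = json.loads(response_text)
```

The same payload could be rendered as XML or an Atom entry; JSON is used here only because it needs nothing beyond the standard library.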

VI. DISCUSSION
Data citation services are openly published to earth and environment scientists in North African and Middle Eastern countries. The right configuration is required to access the service. An example configuration of preprocessing filters and a user-defined index is illustrated in Fig. 8. This configuration burdens the data center hardware because dedicated resources are requested for each search index. The structure includes HTML documents, structure analysis, parsing results, caller information, page ranking, evaluation of expressions, and a coordinate system that includes latitude, longitude, and altitude.
Fig. 9 represents the performance evaluation of the data citation service. Of the total URL accesses, HTTP 200 responses account for the largest share at 62.7%, while updating the data citation accounts for 50.7%. It is interesting to note that scientists have written new articles related to scientific data through this citation enablement, although these make up only 4.69%. In addition, the total access might not fully represent the total percentage, since several services are executed in one workflow. Access to the service is bottlenecked by the limited capability of the data center hardware. The internal university proxy also restricts access, which motivates future work on including a dedicated proxy to improve access.
Full bandwidth is assigned to the data citation service; however, not all accesses fully utilize the bandwidth. Fig. 10 represents the overall bandwidth utilization, mainly with respect to processing time. It is interesting to note that access to the scientific data in North African and Middle Eastern countries peaks during working hours. This pattern recurs every day, although the peak is lower during weekends. This implies that there is still scientific activity outside normal working hours, although not as much as during working hours.
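Shares such as those reported for Fig. 9 can be derived from an access log by counting response codes. The sketch below shows the arithmetic on synthetic status codes; the sample is invented for illustration and does not reproduce the measurements of this research.

```python
from collections import Counter


def status_shares(status_codes):
    """Return the percentage of requests per HTTP status code."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100.0 * n / total for code, n in counts.items()}


# Synthetic sample: five successful, three not-modified, two not-found requests.
sample_codes = [200] * 5 + [304] * 3 + [404] * 2
shares = status_shares(sample_codes)
```

Because one workflow may execute several services, as noted above, a real analysis would also need to decide whether to count each service call or each workflow as one access.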

VII. CONCLUSION
This research developed a service-oriented framework and architecture for a data citation service. Through this framework and architecture, a robust mechanism is available for scientists, especially in the field of earth and environment, to relate scientific articles to scientific datasets as big data. The framework enables scientists, especially in the Middle East and North Africa region, to cite worldwide big scientific data in their articles. The evaluation shows that the framework is robust and reliable enough to handle both big data access and its citation. Future work remains on how to increase the relevance of the data citation, especially to attract more scientists in this region by utilizing social media to collaborate on citation.

Fig. 3. Another part of the Wikipedia article that discusses the scientific dataset in detail.

Fig. 9. Performance evaluation based on the number of accesses.