Mining Scientific Data from the PubMed Database

The continuous, rapid growth of the scientific literature, together with the increasing diversification of interdisciplinary fields and their contributions to unsolved problems in medicine and allied sciences, presents a major problem to scientists and librarians. As many as 4800 scientific journals are now available on the internet, some of which are published online only. The list of journals covered by the subject citation indexes of Thomson Reuters can be obtained from its website. From the researcher's point of view, the problem is amplified by today's competition, in which researchers cannot afford to spend time on experimental work that has already been published. Considering these facts and the sheer volume of serials, a study was initiated to evaluate the scientific literature published in various journal sources. The scope of the study does not permit inclusion of all periodicals in the extensive fields of biology; hence a text mining routine was employed to extract data based on keywords such as bioinformatics, algorithms, genomics and proteomics. The wide availability of genome sequence data has created abundant opportunities, most notably in functional genomics and proteomics. This quiet revolution in the biological sciences has been enabled by our ability to collect, manage, analyze, and integrate large quantities of data.


I. INTRODUCTION
Scientific discovery in genomics and related biomedical disciplines has increased the amount of data and information [3], and text mining provides useful tools to assist the curation process [4] by extracting relevant information using automatic text-mining and information-extraction approaches [5]. Text literature is playing an increasingly important role in biomedical discovery.
Most text mining applications require the ability to identify and classify the words, or multi-word terms, that authors use in an article. Several strategies have been tried to recognize biological entity names in articles. Some methods rely on protein and gene databases to assemble dictionaries of protein names. Most of these methods were developed for abstracts, because abstracts are readily available for millions of articles (e.g., PubMed) [6]. To support data interpretation, bioinformatics tools have been used to identify relevant information in literature databases. In addition, success has been achieved in developing biomedical literature mining software that uses semantic analysis to extract information automatically [7]. This method uses a pattern discovery algorithm to identify relevant keywords in abstracts.
In this paper, we present segregated information on journals that publish data on bioinformatics, proteomics and genomics. Keyword searches in the PubMed database, together with a list of countries and their involvement in research publications, are also presented. Articles in bioinformatics journals are often technology centred, focusing on rapidly evolving techniques for the analysis of sequences, structures and phylogenies [8]. Some articles emphasize data integration and analysis, with data-driven data management for integrative bioinformatics systems [9]. For the purposes of this investigation, the evaluation was confined to the scientific journals hosted in PubMed only [1]. In compiling information on the volume of data published in journals, even the most careful check cannot exclude the possibility of errors; however, the influence of such errors is expected to be minimal given the huge volume of information in the PubMed database.

II. MATERIALS AND METHODS
The NCBI PubMed literature database was selected for the study. Initially, a generalized search without any limits was used to retrieve articles related to bioinformatics and computational biology. Because this search returned articles containing the keyword anywhere in the record (title, abstract, address, keywords and text), a more stringent search criterion was then applied: the keywords, individually or in combination, were searched with the limit set to Title and Abstract, and the number of articles retrieved was recorded. Only Title and Abstract searches were considered in this study because the Title field generally carries the keywords most important to the subject of an article. Therefore, the disparity between the two searches reflects information retrieved through text mining limited to Title and Abstract terms only.
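The two search modes described above can be sketched programmatically. The example below is a minimal illustration and not the procedure used in the study, which is not specified beyond the PubMed search limits: it assumes the public NCBI E-utilities ESearch endpoint and PubMed's `[Title/Abstract]` field tag, and the helper names (`tiab_query`, `esearch_count_url`, `fetch_count`) are hypothetical. Counts retrieved today would differ from those recorded in 2012.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities ESearch endpoint (public service; rate limits apply).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tiab_query(keyword: str) -> str:
    """Limit a keyword to the Title and Abstract fields via PubMed's field tag."""
    return f'"{keyword}"[Title/Abstract]'


def esearch_count_url(term: str) -> str:
    """Build an ESearch URL that requests only the number of matching records."""
    params = {"db": "pubmed", "term": term, "rettype": "count", "retmode": "json"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_count(term: str) -> int:
    """Query PubMed and return the hit count (requires network access)."""
    with urllib.request.urlopen(esearch_count_url(term), timeout=30) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])


# Print the request URLs for both search modes without contacting the server.
for keyword in ("bioinformatics", "computational biology"):
    print("generalized:", esearch_count_url(keyword))
    print("restricted: ", esearch_count_url(tiab_query(keyword)))
```

Calling `fetch_count("bioinformatics")` and `fetch_count(tiab_query("bioinformatics"))` would return the generalized and Title/Abstract-limited counts respectively; the calls are left out of the sketch so that it runs offline.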
Articles belonging to bioinformatics and computational biology are reported in journals in different ways: some carry the term in the Title/Abstract, while others are representative of the field without containing the keywords. Therefore, although a myriad of pertinent articles can be located, preference was given to the two search fields, Title and Abstract. Title/Abstract was selected as the search limit in order to avoid false hits and to identify true positives. An article was accordingly considered a true positive only if the keyword was explicitly identified in the Title/Abstract.
Records without abstracts were counted as true positives only if the title contained the keywords [10]. Finally, the year-wise growth in the number of articles in each field was determined to gauge the amount of data deposited in PubMed.
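The counting rule and the year-wise tally described above can be expressed as a short sketch. It assumes that each record is available as a dictionary with `title`, `abstract` and `year` fields and that a keyword is matched as a case-insensitive whole phrase; both are assumptions made for illustration, and the sample records are hypothetical, not data from the study.

```python
import re
from collections import Counter


def mentions(keyword: str, text: str) -> bool:
    """Case-insensitive whole-phrase match of the keyword in the text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def is_true_positive(record: dict, keyword: str) -> bool:
    """The keyword must appear explicitly in the title or the abstract.

    A record without an abstract can therefore qualify only through its title.
    """
    title = record.get("title") or ""
    abstract = record.get("abstract") or ""
    return mentions(keyword, title) or mentions(keyword, abstract)


def yearly_counts(records: list, keyword: str) -> dict:
    """Tally true positives per publication year, in ascending year order."""
    counts = Counter(r["year"] for r in records if is_true_positive(r, keyword))
    return dict(sorted(counts.items()))


# Hypothetical records, for illustration only.
records = [
    {"title": "Bioinformatics tools for proteomics", "abstract": None, "year": 2010},
    {"title": "A sequence alignment method", "abstract": "A bioinformatics approach.", "year": 2011},
    {"title": "Protein folding kinetics", "abstract": None, "year": 2011},
    {"title": "Genome assembly", "abstract": "We assemble short reads.", "year": 2011},
]

print(yearly_counts(records, "bioinformatics"))  # {2010: 1, 2011: 1}
```

The third and fourth records are excluded because the keyword appears in neither field, which mirrors how articles representative of the field but lacking the keyword fall outside the count.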

III. RESULTS AND DISCUSSION
A generalized search of the NCBI PubMed literature database on March 28, 2012, using bioinformatics as the keyword returned 97618 articles, of which 42.9% are free full text and 15.9% are review articles. A search for computational biology articles returned 79965 articles, of which 39.7% are free full text and 17.9% are review articles (see Table I). However, a more stringent search with the keywords limited to Title/Abstract revealed 11728 bioinformatics articles (11.9% of the generalized search given in Table I) and 2608 computational biology articles (3.6%) (see Table II). The Boolean operator search available in PubMed was used to retrieve articles for combined keywords (see Table III). This shows the impact of these two ever-growing areas in sharing information and influencing research publications. From the report, it can be emphasized that text mining is a useful approach given the enormous amount of data present in literature databases such as PubMed. From an informatics perspective, an integrated literature database like PubMed provides new insights for research in areas such as bioinformatics and computational biology. Although many research and review papers address these two fields and the keywords were limited to Title/Abstract only, the data suggest a phenomenal rise in the number of papers in both fields. Therefore, from the work reported here, it can be suggested that scientific literature and text mining approaches have a considerable impact on data integration, which supports research for potential gains in the life sciences and helps in understanding applications of literature databases.
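Combined-keyword searches of the kind mentioned above can be composed with PubMed's Boolean operators (AND, OR, NOT, written in upper case). The following sketch only builds query strings; the keyword list and the choice of pairwise AND combinations are assumptions made for illustration and are not necessarily the combinations reported in Table III. The helper names are hypothetical.

```python
from itertools import combinations

# Keywords named in the paper; the pairing scheme below is an assumption.
KEYWORDS = ["bioinformatics", "computational biology", "genomics", "proteomics"]


def tiab(keyword: str) -> str:
    """Limit a keyword to the Title and Abstract fields."""
    return f'"{keyword}"[Title/Abstract]'


def boolean_query(keywords, operator: str = "AND") -> str:
    """Join field-limited keywords with a PubMed Boolean operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(tiab(k) for k in keywords)


# One query per unordered keyword pair.
pair_queries = [boolean_query(pair) for pair in combinations(KEYWORDS, 2)]
for query in pair_queries:
    print(query)
```

Each resulting string can be pasted into the PubMed search box or submitted through a programmatic interface to obtain the number of articles for that combination.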
Fig. 1 PubMed database search with bioinformatics and computational biology as keywords.
Fig. 1 illustrates the results when the PubMed database was searched with bioinformatics and computational biology as keywords. The numbers over each bar represent the total number of articles from each field.

Fig. 2 PubMed database search with Title/Abstract limit for the two fields.
Fig. 2 illustrates the results when the PubMed database was searched with the Title/Abstract limit for the two fields. The numbers over each bar represent the total number of articles from each field.

Fig. 3 Annual data on bioinformatics articles published in PubMed.

IV. CONCLUSION

TABLE II:
DISTRIBUTION OF MAXIMUM NUMBER OF ARTICLES IN PUBMED WITH TITLE/ABSTRACT AS LIMIT

TABLE III:
NUMBER OF ARTICLES RETRIEVED IN A BOOLEAN SEARCH FROM PUBMED DATABASE

Table IV and Fig. 3 illustrate the annual data of the articles published on bioinformatics in the PubMed database.

TABLE IV:
THE ANNUAL DATA OF THE ARTICLES PUBLISHED ON BIOINFORMATICS IN THE PUBMED DATABASE