Geo-visual Approach for Spatial Scan Statistics: an Analysis of Dengue Fever Outbreaks in Delhi

Abstract: There are very few surveillance systems being used to detect disease outbreaks at present. In a disease surveillance system, data related to cases and various risk factors are collected, and the collected data are then transformed into meaningful information for effective disease control using statistical analysis tools. Disease outbreaks can be detected, but for effective disease control a visualization approach is required. Without appropriate visualization, it is very difficult to interpret the results of the analysis. In this work, a method has been developed for geographical representation of the disease surveillance and response system for early detection of disease outbreaks using SaTScan and open source Geographic Information System software. Maps that combine the geographical location of diseases and clusters, to enhance the understanding of the results of the statistical analysis tool, are developed using the QGIS library, which provides many spatial algorithms and native GIS functions. This library is accessed through PyQGIS and PyQt using Python.


I. INTRODUCTION
Disease surveillance is a continuous process of collecting information, as well as organizing, analyzing and interpreting the information collected, so that disease outbreaks can be identified and effective actions in disease control can be taken. Disease outbreaks can be discovered by monitoring space-time trends of disease occurrences, which can highlight changing patterns in risk and help to identify new risk factors. However, surveillance datasets are mostly large in size. Therefore, the availability and performance of software capable of analyzing space-time disease surveillance data on a continuous basis is essential for practical surveillance. In various places, databases are maintained for diseases such as cancer, dengue fever, malaria and many more. New cases are typically added to a disease database daily, monthly or yearly, with the interval depending on the type of disease and the limits of the database system.
A traditional disease surveillance mechanism involves reporting of lab-confirmed diseases to local or national health organizations. This does not normally allow for early detection of new outbreaks. In newer surveillance systems, it is easier to find out where outbreaks will occur and what the geographical sizes of these outbreaks will be. In these systems, clustering is performed, which indicates the number of disease cases within the study area.
Visualization of the spatial distribution of a disease over a defined area helps users to analyze clusters easily and efficiently, and users can also detect unusual patterns of disease outbreaks. To study geographical variations of disease risk, the locations of cases are mostly proxies for residential addresses, such as pin codes. If individual addresses are available, it is easy to plot the locations of cases on the map. The most common type of map for visualization is the choropleth map, which shows the spatial distribution of disease in a well-defined geographical area using health indicators such as occurrence, incidence rate and mortality rate. Choropleth maps usually use color or pattern combinations to show different levels of disease risk associated with each geographical area. These geographical areas are small areas usually defined for administrative purposes, such as counties, zones, wards, colonies, villages, towns and cities. For visualization, various geographical information systems can be used, such as QGIS, which is a cross-platform, free and open source desktop geographic information system application.
In this paper, a method is developed to visualize the results obtained by SaTScan. The method provides an easier way to run SaTScan multiple times and adds graphical output for analyzing the results obtained by the developed application. This standalone package takes datasets or files of population, case information, geographical coordinates of each location and optional controls as input in a simple prescribed format, generates text files in SaTScan format, and allows the user to choose SaTScan analysis options. It reads the results from SaTScan and creates geographical outputs based on a separate map boundary file. The front end was developed using PyQt4, Qt4, PyQGIS and QGIS libraries, with Python as the interfacing programming language. The effectiveness of the user interface is demonstrated by a case study of dengue fever in Delhi, India from 2010 to 2012. This application is particularly useful to health officials who do not have knowledge of GIS software in disease surveillance and disease control.

A. Related work
Disease clustering can be classified as temporal clustering, spatial clustering or space-time clustering. Temporal clustering observes whether cases are located close to each other in time, spatial clustering observes whether cases are located close to each other in space, and space-time clustering observes whether cases are close in space as well as in time. Various statistical methods have also been developed for the detection of disease clusters. GAM [19] is a cluster detection approach which examines a large number of overlapping circles at a variety of scales and assesses the statistical probability of the number of events occurring by chance. The drawbacks of this method are that it has a multiple testing problem and is heavily computer intensive. FleXScan [23] is free software which was developed to analyze spatial count data using the flexible spatial scan statistic and the circular spatial scan statistic. The current version of FleXScan is still restricted to spatial analyses, ignoring the temporal component. Another software package, Splancs, was developed by Rowlingson and Diggle [21] for spatial and space-time point pattern analysis. Kulldorff [13], together with Information Management Services Inc., developed the SaTScan software, which can perform geographical surveillance of a disease, detect clusters and test whether these clusters are statistically significant. It can also perform time-periodic disease surveillance for early detection of disease outbreaks. Unlike techniques such as Openshaw's GAM, SaTScan takes into account the problem of multiple testing and reports the significance of each detected cluster. In cluster detection analysis, there can be issues concerning the data, the scale of analysis, correction for covariates and the underlying background population. Covariates in the data are the most important problem affecting a cluster study.
Any spatial or temporal variation of covariates such as gender, age, ethnicity, diet, smoking behavior or population density can distort the real disease patterns. For example, people of similar ethnic origin traditionally tend to live close together, although at present this is decreasing due to increased population migration. Some diseases can be inherited, and if spatial clusters of genetic diseases need to be observed, then examining clusters for such diseases requires evidence of clustering over the background population after adjustments are made for the genetic covariates. Methods like SaTScan can make such adjustments for covariates. Correction for spatial variations in the population at risk is an important part of spatial epidemiological research, because any observed pattern of health events needs to be adjusted by the background population distribution.
There are two issues related to SaTScan software itself. Those are: (1) Lack of cartographic support for interpreting the detected geographical clusters. (2) Outcomes being sensitive to parameter choices related to cluster scaling.
The software does not directly provide any visualization support. Chen et al. [9] suggested that a geovisual analytics method makes it more efficient for users to understand the SaTScan results. The authors illustrated the geovisual analytics approach in a case study of cervical cancer mortality in the U.S. between 2000 and 2004. Standardized Mortality Ratios and reliability scores are visualized for all counties to identify stable and homogeneous clusters. The proposed geovisual analytics approach is implemented in the Java-based Visual Inquiry Toolkit.

II. METHODOLOGY
In the field of public health, spatial clustering analysis and subsequent geoprocessing of the clustering results is the most efficient yet technically comprehensible approach. Kulldorff [13] described a statistical method for the detection of clusters in a multidimensional point process using spatial scan statistics. It uses a variable window size and a baseline process that is an inhomogeneous Poisson process or a Bernoulli process. The scanning window can be any predefined shape and is modelled on a geographical space. A regular or irregular grid of centroids covering the whole study region is created, and then an infinite number of circles around each centroid are created. The actual and expected numbers of cases inside and outside each circle are obtained and the likelihood function is calculated. Using Monte Carlo simulation, random replicas of the data set are generated under the null hypothesis of no cluster. The likelihood function value of each candidate cluster is ranked against the maximum likelihood ratios from the Monte Carlo replications, and the p-value is obtained from this rank. For a cluster to be statistically significant, its p-value should be less than the chosen significance level (commonly 0.05). The cluster with the maximum likelihood ratio is the most likely cluster, i.e. the one least likely to have occurred by chance. The SaTScan software [8] offers many advantages. It is robust, computationally efficient, has flexibility of options, corrects for multiple comparisons, adjusts for heterogeneous population densities among the different areas in the study, detects and identifies the location of clusters without prior specification of their suspected location or size (thereby overcoming pre-selection bias), and allows for adjustment for covariates. However, one of the drawbacks of SaTScan is that it does not have a visualization system for presenting the results. In this regard, GIS has sophisticated mechanisms to visualize data.
In addition to being able to assess disease cases with a general categorical definition of "place", GIS systems provide the means to analyse spatial-temporal relationships between sets of variables, allow users to identify spatial patterns in data, and provide the means to integrate databases on the basis of geography. However, this is a time-consuming process and one has to learn how to work with GIS packages. The relationships between the software components of the standalone application developed in Python, termed "Visual Interpretation of Statistical Analysis" (VISA), are shown in Fig. 1. SaTScan is embedded in this application for statistical analysis. The portion in the dotted line can be added to this application for web enablement. This application is a general one and is also applicable to other diseases.
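As a rough illustration of the ranking step described above, the following Python sketch computes a Poisson log likelihood ratio for a single scanning window and converts the rank of the observed statistic among Monte Carlo replicas into a p-value. The function names and the example numbers are hypothetical; SaTScan performs these computations internally, and this sketch is not its implementation.

```python
import math


def poisson_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio of one window under a Poisson model (illustrative)."""
    c, e, total = cases_in, expected_in, total_cases
    if e <= 0 or c <= e:
        return 0.0  # only windows with more cases than expected are candidates
    inside = c * math.log(c / e)
    # When every case falls inside the window the outside term vanishes.
    outside = 0.0 if c == total else (total - c) * math.log((total - c) / (total - e))
    return inside + outside


def monte_carlo_p_value(observed_llr, replica_max_llrs):
    """Rank the observed statistic among the maxima of the random replicas."""
    rank = 1 + sum(1 for value in replica_max_llrs if value >= observed_llr)
    return rank / (len(replica_max_llrs) + 1)


# Hypothetical example: one replica out of four reaches the observed value.
print(monte_carlo_p_value(5.0, [1.0, 2.0, 3.0, 6.0]))  # 0.4
```

With 999 replicas, the smallest p-value this ranking can produce is 1/1000, which is why the number of replications bounds the attainable significance.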
The standalone package provides maps that combine the geographical location of diseases and clusters to enhance the understanding of SaTScan results. A user does not require knowledge of GIS packages. GIS support is provided by QGIS library which provides many spatial algorithms and native GIS functions.
This library is accessed through PyQGIS, the Python bindings of QGIS, which provide a simpler programming environment. For developing the GUI, PyQt4, Qt4-devel, Qt4-doc and Qt4-libs have been used. PyQt is a Python binding of the cross-platform GUI toolkit Qt. The QGIS (Quantum GIS) libraries are a collection of C++ classes which can be used for accessing and manipulating spatial objects. The libraries used are Core, containing GIS functions, and GUI, containing controls and user interface elements such as the map canvas, which displays and manipulates the maps. ftools, a Python plug-in of QGIS, is used for spatial analysis, i.e. for drawing clusters of a specified radius. The user interface is developed using Qt Designer, and the shapefile viewer is developed in Python using the QGIS libraries. Data regarding the number of years of study is read from a text file, yeardata.txt, which is modified according to the study period. Fig. 2 shows the first screen of the developed application, which is generated from yeardata.txt. It shows the buttons for running SaTScan and for generating disease cases maps and disease density maps for the study period.
[...] are repeated for cases occurring in the years 2011 and 2012, as shown in Fig. 8 and Fig. 9 respectively.

1) When the Cluster button is clicked, the application reads the SaTScan result file dengue_cases.col.txt and creates a file cluster.csv which contains all clusters having a p-value less than a specified value. This CSV file is read and a layer is created.
Using the ftools plug-in, buffers of a specified radius are created. The created layers are displayed as shown in Fig. 10.
2) On clicking Close, the application is terminated.
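The cluster step above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the column names (`cluster`, `latitude`, `longitude`, `radius_km`, `p_value`) are hypothetical and do not reproduce the actual layout of the SaTScan output file, and the circular buffer is a simple stand-in for the buffers that the application creates with the ftools plug-in.

```python
import csv
import io
import math

# Hypothetical column names; the real SaTScan result file uses its own layout.
FIELDS = ["cluster", "latitude", "longitude", "radius_km", "p_value"]


def filter_clusters(rows, max_p_value):
    """Keep cluster rows whose p-value is below the chosen cutoff."""
    return [row for row in rows if float(row["p_value"]) < max_p_value]


def write_cluster_csv(rows, handle):
    """Write the retained clusters as CSV so they can be loaded as a layer."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


def circle_buffer(lat, lon, radius_km, segments=36):
    """Approximate a circular buffer as a closed ring of (lat, lon) vertices."""
    km_per_deg_lat = 111.32  # approximate length of one degree of latitude
    km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(lat))
    ring = []
    for i in range(segments):
        angle = 2.0 * math.pi * i / segments
        ring.append((lat + radius_km * math.sin(angle) / km_per_deg_lat,
                     lon + radius_km * math.cos(angle) / km_per_deg_lon))
    ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring
```

The degree-to-kilometre conversion is only a local approximation, which is reasonable for buffers of a few kilometres but not for large radii.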
III. RESULTS AND DISCUSSION
In this section, an analysis of dengue fever outbreaks in Delhi for the past three years is presented as a sample case. In India, 28,000 dengue cases were reported in 2010, and there was a significant rise in the incidence of dengue from 18,860 cases in the year 2011 to 49,606 cases in the year 2012; 1,700 dengue cases were reported from Delhi last year. If disease surveillance had been done nation-wide for early detection of disease outbreaks, effective actions could have been taken and outbreaks could have been controlled. To perform disease surveillance, statistical methods to detect disease clusters are required. It is equally important to have an effective visualization approach. The presented work covers both of these aspects of disease surveillance, along with the surveillance results.

A. Data collection
To perform any type of analysis, the most important requirement is data. Data for analysis was collected from DHO Civil Lines Zone, Municipal Corporation of Delhi, Health Department. The data covering details of dengue cases in Delhi are available for the years 2010, 2011 and 2012 only. The details were taken from the Epidemiological Investigation form for the above-mentioned years.

B. Data Pre-processing
The collected data is transformed into the format that is required to perform statistical analysis. The data is transformed into a comma-separated file which contains, for the observed cases, the location name, case count, date of the reported case, and the age and gender of the case. In this file, age and gender are the covariates. Another comma-separated file is created for the coordinate information about the geographical location where each case occurred. This file stores the location name and the latitude and longitude of the location of the case.
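The pre-processing described above could be sketched as follows, assuming the raw records are available as Python dictionaries. The record keys and the column order are assumptions made for illustration; the source only lists which fields each file contains.

```python
import csv
import io
from collections import Counter


def build_case_rows(records):
    """Aggregate records into (location, count, date, age, gender) rows."""
    counts = Counter(
        (r["location"], r["date"], r["age"], r["gender"]) for r in records
    )
    return [
        (location, count, date, age, gender)
        for (location, date, age, gender), count in sorted(counts.items())
    ]


def build_coordinate_rows(records):
    """Produce one (location, latitude, longitude) row per distinct location."""
    coords = {r["location"]: (r["latitude"], r["longitude"]) for r in records}
    return [(name, lat, lon) for name, (lat, lon) in sorted(coords.items())]


def to_csv(rows):
    """Serialize rows as comma-separated text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Aggregating identical (location, date, age, gender) combinations into a single row with a count keeps the case file compact while preserving the covariates.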

C. Map Digitization
Georeferencing is used to assign real-world coordinates to each pixel of the raster using QGIS. In the presented work, a scanned map of Delhi is digitized by obtaining coordinates from the markings on the map image itself. Using these coordinates as GCPs (Ground Control Points), the image is warped and made to fit within the chosen coordinate system [15].
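To make the role of the GCPs concrete, the sketch below fits a first-order (affine) pixel-to-world transformation from exactly three control points. This is an assumption for illustration: the source does not state which transformation type was chosen in QGIS, and the control point values in the usage note are placeholders rather than the markings used in the study.

```python
def solve3(matrix, vector):
    """Solve a 3x3 linear system with Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det(matrix)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear")
    solution = []
    for k in range(3):
        # Replace column k with the right-hand side vector.
        replaced = [line[:k] + [vector[i]] + line[k + 1:]
                    for i, line in enumerate(matrix)]
        solution.append(det(replaced) / d)
    return solution


def fit_affine(gcps):
    """gcps: three ((col, row), (x, y)) pairs. Returns a pixel -> world function."""
    matrix = [[float(col), float(row), 1.0] for (col, row), _ in gcps]
    ax = solve3(matrix, [world[0] for _, world in gcps])
    ay = solve3(matrix, [world[1] for _, world in gcps])

    def pixel_to_world(col, row):
        return (ax[0] * col + ax[1] * row + ax[2],
                ay[0] * col + ay[1] * row + ay[2])

    return pixel_to_world
```

For example, with the placeholder control points `((0, 0), (77.0, 28.9))`, `((1000, 0), (77.4, 28.9))` and `((0, 1000), (77.0, 28.5))`, the fitted function maps the pixel in the middle of the image to the midpoint of the covered longitude and latitude ranges. With more than three GCPs a least-squares fit would be used instead of an exact solution.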

D. Statistical Analysis
To detect clusters of disease outbreaks, statistical analysis is performed using the SaTScan software, which is embedded in the standalone VISA application. To begin the analysis, click the "Run SaTScan" button in the GUI of the presented application. On clicking the button, the SaTScan window opens. In the input tab, under the case file option, import the comma-separated file which contains the details of the observed cases. This file is saved as dengue_cases.cas; the extension supported by SaTScan for the case file is .cas. Under the coordinate file option, import the other comma-separated file, which contains the latitude and longitude of each location of the observed cases. This file is saved as dengue_cases.geo; the extension supported by SaTScan for the coordinate file is .geo. The study period is from 2010/1/1 to 2012/12/31, with the time precision set to year and the coordinates set to latitude/longitude. In the analysis tab, space-time analysis and the space-time permutation model are selected. The space-time permutation model is used because dengue fever is related to environmental variables, as many cases are observed in months with warm and humid weather. Hence, along with geographical location, time is also an important parameter in the analysis. The maximum spatial cluster size is set to 2 kilometers. In the output tab, the result file dengue_cases.txt is saved and the SaTScan software is executed.

E. Visualization results generated by VISA package
The Delhi region is divided into 12 zones, which are Narela, Rohini, Civil Lines, Shahdara North, Shahdara South, New Delhi City, Karol Bagh, Nazafgarh, Central, West, Sadar Paharganj and South. On clicking the Delhi Map button, the zone boundary map is generated as shown in Fig. 3. On clicking the Disease Cases Map 2010 button, a map of the spread of dengue fever cases in the year 2010 is generated as shown in Fig. 4. A high number of cases occurred in Sant Nagar (64), Jahangirpuri (47), Timarpuri (24) and Burari Village (22). On clicking the Disease Cases Map 2011 button, a map of the dengue fever cases reported in the year 2011 is generated as shown in Fig. 5. The majority of cases occurred in Jahangirpuri (12), Sant Nagar (5) and Malkaganj (4). On clicking the Disease Cases Map 2012 button, a map of the spread of dengue fever in the year 2012 is generated as shown in Fig. 6. A high number of cases occurred in Rajnagar-II (34), Mandawali (26) and Sangam Vihar (23). There are 8 locations in Delhi where cases were reported in 2010, 2011 as well as 2012, as listed in TABLE 1. On clicking the "No. of cases Map" button, a choropleth map is generated to show the density of dengue cases with graduated colors. On the basis of zone, most cases are reported from Civil Lines Zone, Shahdara North Zone and Shahdara South Zone, as shown in Fig. 6. On clicking the Clusters button, the clusters and buffer map are generated as shown in Fig. 7. There are seven detected clusters, of which one is the most likely cluster and six are secondary clusters, as shown in TABLE 2.
www.ijacsa.thesai.org, Vol. 4, No. 10, 2013
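The graduated colors of a choropleth map rest on assigning each area's case count to a class. The sketch below uses equal-interval classes, which is an assumption, since the source does not state which classification the application uses. For illustration it is applied to the 2010 location counts quoted above, although the map in the paper is drawn per zone.

```python
def equal_interval_breaks(values, n_classes):
    """Upper bounds of n_classes equal-width classes spanning the data range."""
    low, high = min(values), max(values)
    width = (high - low) / n_classes
    return [low + width * (i + 1) for i in range(n_classes)]


def classify(value, breaks):
    """Index of the first class whose upper bound is at least the value."""
    for index, upper in enumerate(breaks):
        if value <= upper:
            return index
    return len(breaks) - 1  # guard against rounding at the top of the range


# Case counts for 2010 as reported in the text above.
cases_2010 = {"Sant Nagar": 64, "Jahangirpuri": 47,
              "Timarpuri": 24, "Burari Village": 22}
breaks = equal_interval_breaks(list(cases_2010.values()), 3)
classes = {name: classify(count, breaks) for name, count in cases_2010.items()}
print(breaks)   # [36.0, 50.0, 64.0]
print(classes)
```

Each class index would then be mapped to one shade of the color ramp, with higher indices drawn in darker colors.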

F. Interpretation of detected clusters
The total number of locations where cases were reported is 617 and the total number of cases reported is 1976. The shapefile of the clusters generated and saved by the package is easily imported as a KML file into Google Earth so that interpretations can be made easily. The following are the details of the detected clusters:
• The Bhalswa region has Bhalswa Lake, and there is a high probability that the water present in the lake is stagnant.
• There are massive construction sites, some of which are unregulated.
• The demographic in this cluster is lower middle class and migrant population.
• During rainfall the Yamuna bank is flooded and there is no place to drain water from these places.
In the months of July and August, breeding of mosquitoes takes place, and in the months of September and October they become adults; thus a large number of cases are observed in these two months, as shown in Fig. 11.
On the basis of gender, the infected male population was very high in 2010 and 2011 in comparison to the female population, but in 2012 there was a large increase in the number of infected females, as shown in Fig. 12.
On the basis of age group, in the year 2010 most cases occurred in the 16-20 years age group; in the year 2011 most cases occurred in the 11-15 years age group; and in the year 2012 most cases occurred in the 11-15 years and 20-25 years age groups, as shown in Fig. 13.
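Breakdowns by gender and age group such as those above amount to simple tallies over the case records. The following sketch assumes hypothetical record keys (`year`, `age`, `gender`) and five-year age groups starting at 1-5; the records in the usage example are placeholders, not data from the study.

```python
from collections import Counter


def age_group(age, width=5):
    """Label of the age group containing the given age, e.g. 13 -> '11-15'."""
    # Ages below 1 are placed in the first group.
    start = ((max(age, 1) - 1) // width) * width + 1
    return f"{start}-{start + width - 1}"


def tally(records, key):
    """Count records per (year, key(record)) pair."""
    return Counter((r["year"], key(r)) for r in records)


# Placeholder records for illustration only.
records = [
    {"year": 2010, "age": 17, "gender": "M"},
    {"year": 2010, "age": 19, "gender": "M"},
    {"year": 2011, "age": 13, "gender": "F"},
]
by_age = tally(records, lambda r: age_group(r["age"]))
by_gender = tally(records, lambda r: r["gender"])
print(by_age.most_common(1))  # [((2010, '16-20'), 2)]
```

The same `tally` helper could be keyed on the month of the reporting date to reproduce a monthly distribution like the one referred to in Fig. 11.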

IV. CONCLUSION
The geographical visualization approach developed in this paper facilitates space-time cluster detection methods by providing an efficient representation of the results of statistical analysis in geographical space. Space-time analysis is performed using the space-time permutation model of SaTScan. The disease clusters are detected with a cluster radius of 2 kilometers. With the proposed visualization method, maps are generated. These maps show the spread of cases, the density of cases within each district or county, and statistically significant clusters. With the help of the presented work, proactive actions can be taken to prevent disease outbreaks. It is also helpful in identifying the hot zone of an epidemic. Therefore, on the basis of information gathered from statistical analysis and visualization, the overall quality of health of the nation can be improved. This application is specifically useful to health officials who do not have knowledge of GIS software. The standalone package is developed using Python, PyQGIS, QGIS, PyQt4, Qt Designer and SaTScan. Use of the proposed method is demonstrated by analyzing the results of statistical analysis of dengue fever. The team of doctors in the Municipal Corporation of Delhi found these results very informative, efficient and accurate, and the interpretations related to the clusters were very similar to their own interpretations.

FUTURE WORK
The present work will be extended to include a statistical interface for space-time analysis instead of integrating the developed application with SaTScan. This application will be made web-based, and a database containing disease information will be integrated with it. With the facility of a database, the user can save data whenever required and can perform analysis from a remote location at chosen time intervals, such as weekly, monthly or yearly.