Data Analysis of Coronavirus CoVID-19: Study of Spread and Vaccination in European Countries

Humanity has lived through several pandemics, such as the Spanish flu in 1918 and H1N1 in 2009. In December 2019, the health authorities of China detected unexplained cases of pneumonia. The World Health Organization (WHO) then declared the emergence of CoVID-19 (novel Coronavirus), which caused a global pandemic in 2020. In data analysis, multiple approaches and diverse techniques are used to extract useful information from heterogeneous sources and to discover new knowledge for decision-making in different business and science domains. In this context, we propose to use multidimensional analysis techniques based on two concepts: facts (subjects of analysis) and dimensions (axes of analysis). This technique allows decision makers to observe data from various heterogeneous sources and analyze them according to several viewpoints or perspectives. More precisely, we propose a multidimensional model for analyzing Coronavirus CoVID-19 data (spread and vaccination in European countries). This model is based on a constellation schema that contains several facts surrounded by common dimensions.
Keywords—Multidimensional model; constellation schema; coronavirus covid-19; vaccination; European countries


I. INTRODUCTION
In December 2019, new cases of pneumonia were detected in Wuhan City (Hubei Province, China). The novel virus causes a new infectious respiratory disease, named Covid-19 by the World Health Organization (WHO) [19], which became a pandemic in 2020 with millions of deaths around the world. The fight against this global pandemic has led to the cancellation of sporting and cultural events, the implementation of containment measures and the closure of borders by many countries. It has also caused social and economic instability [14].
In order to slow the contagion of this new virus, several studies have been proposed in the literature [12][15][16][17], especially about the spread of the Coronavirus [13]. Statistics are announced every day by the countries, and databases have been established to store these data. In this paper, we propose to use multidimensional analysis techniques in order to analyze the spread of Coronavirus Covid-19 and the evolution of vaccination in European countries. This technique allows decision makers to observe data from various sources and analyze them according to several viewpoints. A multidimensional model is built on two concepts: Dimension and Fact. A dimension contains a set of unique values that categorize a particular theme (Countries, Dates, etc.). A fact is a subject of analysis and is described by a set of measures.
This paper presents a new approach that applies multidimensional techniques to Coronavirus Covid-19 data and uses user-defined, color-based constraints to highlight relevant information.
This paper is organized as follows. Section II presents the literature review on the spread of Coronavirus Covid-19 (works about data analysis). Section III presents the data preparation phase (extraction, cleaning, transformation and loading of data). In Section IV, we propose a data warehouse schema for storing the prepared data. Section V describes the multidimensional model we propose for analyzing the spread and the vaccination data of Coronavirus Covid-19. Finally, we present the implementation phase for European countries, followed by the conclusion.

II. LITERATURE REVIEW
Since the appearance of the Coronavirus Covid-19, several studies have focused on the spread of the virus (Medical [13] or Data Analysis aspects [18]).
The objective of [1] is to examine the correlation between pollution and climate data and the Covid-19 pandemic. The authors propose a data warehouse and data cubes built on data from the regions of Lombardy and Puglia (Italy). Their results show that the Covid-19 pandemic spreads significantly in regions characterized by the absence of rain and wind.
In [2], the authors study the relationship between new cases of Coronavirus Covid-19 and the Multidimensional Poverty Index (MPI) in the city of Manizales (Colombia). The results of the exploration indicate that the density of Covid-19 cases is greater in the communes with greater poverty, so a relation exists between these two parameters.
The Internet of Things (IoT) is an interconnection of the Internet and physical devices that record, monitor and respond. The use of IoT with smart sensors to measure and record the body temperature of individuals can help to identify infected people and to maintain social distance. The authors of [3] propose an IoT architecture in order to minimize the spread of Covid-19.
In [4], the authors study the evolution of Covid-19 cases and deaths relative to the population of Brazilian cities. The results show that, in the short term, small towns are proportionately more affected by Covid-19 during the initial spread of the disease. In the long term, large cities begin to have a higher incidence of cases and deaths.
732 | P a g e www.ijacsa.thesai.org
The authors of [5] propose an interactive visualization using the Tableau tool [6] for analyzing Covid-19 data. Tableau is used to show personalized views of the most important data (dashboards and worksheets). They consider that data analysis can be very fast with Tableau and its visualizations (several visualizations in a single view).
The author of [7] presents a data analysis of Covid-19 in cities of China, by using datasets. He uses a correlation matrix in the data preparation phase (to summarize data) and the Python libraries Matplotlib and Seaborn for visualizing data.
In this paper, we propose a multidimensional model based on the constellation model in order to study the spread of Coronavirus Covid-19 in European countries and the evolution of vaccination, according to several dimensions. The first stage concerns data preparation (presented in the next section).

III. DATA PREPARATION
Data preparation is a process of several steps (gathering, combining and structuring data) carried out so that the data can be analyzed in business intelligence and data visualization applications. Fig. 1 presents the process we propose for data preparation: Data extraction, Data cleaning, Data transformation and Data loading.

A. First Step: Data Collection
In this paper, the data used were extracted from [20]. The period of analysis is between 01/01/2021 and 30/09/2021. We mainly use the following files:
The first file concerns data on testing for Covid-19 by week and country and contains the following data: Country name and code, Week of year, Level (national or sub-national), Region code and name, Number of new confirmed cases, Population, Testing rate per 100000 population, Positivity rate and Source.
The second file concerns data on Covid-19 vaccination and contains the following data: Week of year, Country code, Population denominators for target groups, Number of doses received, Number of first dose vaccine, Number of individuals refusing the first vaccine dose, Number of second dose vaccine, Number of doses where the type of dose was not specified, Region, Target group, Name of vaccine and Population.

B. Second Step: Data Cleaning
In this step, we removed unnecessary data:
• Source and Level from the first file.
• Population denominators for target groups and Target group from the second file.
We also added the following data in order to perform analyses at several levels of granularity:
• Month, Trimester and Year for the week of year.
• Zone for countries: we distinguish four zones: Eastern Europe, Western Europe, Northern Europe and Southern Europe (cf. Table I).
• Continent: In this study, we focus on Europe.
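
The cleaning step above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the column names, the "2021-W05" week label format, the rule that a week belongs to the month of its Monday, and the small zone lookup are all assumptions made for the example (the actual zone assignment is the one given in Table I).

```python
from datetime import date

# Hypothetical zone lookup for illustration only; Table I holds the real one.
ZONE_BY_COUNTRY = {
    "FR": "Western Europe",
    "IT": "Southern Europe",
    "SE": "Northern Europe",
    "PL": "Eastern Europe",
}

# Columns removed from the testing file (assumed column names).
DROPPED_COLUMNS = {"source", "level"}


def date_levels(year_week: str) -> dict:
    """Derive Month, Trimester and Year from an ISO week label like '2021-W05'.

    Assumption: a week is assigned to the month of its Monday.
    """
    year_text, week_text = year_week.split("-W")
    monday = date.fromisocalendar(int(year_text), int(week_text), 1)
    return {
        "month": monday.month,
        "trimester": (monday.month - 1) // 3 + 1,
        "year": monday.year,
    }


def clean_testing_row(row: dict) -> dict:
    """Drop unused columns and add the coarser levels of granularity."""
    cleaned = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
    cleaned.update(date_levels(row["year_week"]))
    cleaned["zone"] = ZONE_BY_COUNTRY.get(row["country_code"], "Unknown")
    cleaned["continent"] = "Europe"  # the study focuses on Europe
    return cleaned


raw = {
    "country_code": "FR",
    "year_week": "2021-W05",
    "level": "national",
    "new_cases": 100,
    "source": "example",
}
print(clean_testing_row(raw))
```

Assigning a week to the month of its Monday is only one possible convention; weeks that straddle two months could equally be assigned by their Thursday or split proportionally.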

C. Third Step: Data Transformation
In this phase, we merged the two files on the data they share: Country code, Week of year, Region and Population.
The result after cleaning and merging data is a new file that contains:
• Week of year, Month, Trimester and Year.
• Country Name and Code, Population, Region code and Name.
• Number of new confirmed cases, Testing rate per 100000 population and Positivity rate.
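
As a rough sketch of the merge described above, the two files can be joined on their shared key columns. The column names and the sample rows are hypothetical, and the example assumes that the vaccination file has already been aggregated to one row per (country, week, region); it is not the authors' procedure.

```python
# Shared columns used as the join key (assumed column names).
MERGE_KEYS = ("country_code", "year_week", "region")


def merge_on_keys(testing_rows, vaccination_rows, keys=MERGE_KEYS):
    """Inner-join two lists of dicts on the shared key columns.

    Assumption: at most one vaccination row exists per key.
    """
    index = {tuple(row[k] for k in keys): row for row in vaccination_rows}
    merged = []
    for row in testing_rows:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            # Testing values win if both rows carry the same column.
            merged.append({**match, **row})
    return merged


testing = [
    {"country_code": "FR", "year_week": "2021-W05", "region": "FR",
     "population": 1000, "new_cases": 10},
    {"country_code": "FR", "year_week": "2021-W06", "region": "FR",
     "population": 1000, "new_cases": 12},
]
vaccination = [
    {"country_code": "FR", "year_week": "2021-W05", "region": "FR",
     "population": 1000, "first_doses": 5},
]
print(merge_on_keys(testing, vaccination))
```

An inner join drops weeks that appear in only one file; an outer join would be the alternative if weeks without vaccination data must be kept.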

D. Fourth Step: Data Loading
After the data has been extracted, cleaned and transformed, it is loaded into a storage system (a data warehouse). Loading involves sorting, checking integrity, and building indices and partitions.
After the initial load, the data warehouse needs to be updated with the incremental changes in the data sources.
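
The loading step, including integrity checks, an index and an incremental update, could look like the sketch below. SQLite is used here only as a convenient stand-in for a data warehouse; the table name, columns and sample values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE weekly_testing (
        country_code TEXT NOT NULL,
        year_week    TEXT NOT NULL,
        new_cases    INTEGER NOT NULL CHECK (new_cases >= 0),
        PRIMARY KEY (country_code, year_week)
    )
    """
)
# Index to speed up queries that filter or group by week.
conn.execute("CREATE INDEX idx_testing_week ON weekly_testing (year_week)")


def load(rows):
    """Insert rows in sorted order; rows with an existing key are replaced.

    The whole batch runs in one transaction, so a constraint violation
    rolls back every row of the batch.
    """
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO weekly_testing VALUES (?, ?, ?)",
            sorted(rows),
        )


# Initial load.
load([("FR", "2021-W05", 100), ("PL", "2021-W05", 80)])
# Incremental update: one revised figure and one new week.
load([("FR", "2021-W05", 110), ("FR", "2021-W06", 90)])

print(conn.execute("SELECT * FROM weekly_testing ORDER BY 1, 2").fetchall())
```

The primary key and the CHECK constraint play the role of the integrity checks: a negative case count is rejected, and reloading a week overwrites the earlier figure instead of duplicating it.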

Fig. 1. Data preparation process: Excel Files → Step 1: Data Extraction → Step 2: Data Cleaning → Step 3: Data Transformation → Step 4: Data Loading → Data Warehouse.

IV. DATA WAREHOUSING
Data storage means keeping data in a secure location that the user can easily access. An operational database handles the frequent daily changes caused by a company's transactions, whereas a data warehouse provides consolidated data in multidimensional form [8].
A data warehouse is constructed by integrating heterogeneous data from multiple sources in order to support analytical reporting and decision making [11]. Indeed, it focuses on the modeling and analysis of data to help decision-makers. The data in the warehouse must be subject-oriented, integrated and nonvolatile.
The data warehouse holds consolidated historical data so that these data can be organized, used and analyzed to make strategic decisions. The main objective of data warehouses is to transform heterogeneous data into a form suitable for analysis.
In this step, we propose a schema of data warehouse (cf.

V. DATA ANALYSIS: MULTIDIMENSIONAL SCHEMA
A data warehouse provides Online Analytical Processing (OLAP) tools that present an interactive analysis of data in a multidimensional view. The results of these OLAP tools are generally Data Cubes, defined by dimensions (described by attributes and hierarchies) and facts (described by measures) [9].
• A dimension is a structure that describes a subject in order to help decision-makers answer business questions (Example: product, store, and date).
• An attribute describes a summary level or characteristic of a dimension (Example: Year).
• A hierarchy classifies a dimension into several levels of granularity (Example: Date can be decomposed into Date→Month→Year).
• A fact represents a subject that models a set of events (Examples: sales, purchases); it has dynamic properties (numeric attributes).
• A measure is a numerical, quantitative property that is relevant to the analysis (Example: quantity, number_of_customers).
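
To make these definitions concrete, the sketch below models a dimension, a hierarchy and a fact as small Python classes and aggregates a measure at two levels of the Date hierarchy. The class names, the `roll_up` helper and the sample sales rows are hypothetical and serve only to illustrate the concepts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hierarchy:
    name: str
    levels: tuple  # ordered from the finest to the coarsest level


@dataclass(frozen=True)
class Dimension:
    name: str
    attributes: tuple
    hierarchies: tuple


@dataclass(frozen=True)
class Fact:
    name: str
    measures: tuple


date_dim = Dimension(
    name="Date",
    attributes=("date", "month", "year"),
    hierarchies=(Hierarchy("H_Date", ("date", "month", "year")),),
)
sales = Fact(name="Sales", measures=("quantity",))


def roll_up(rows, level, measure):
    """Sum a measure over all rows that share the same value at `level`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row[measure]
    return dict(totals)


# Hypothetical fact rows, already joined with their Date attributes.
rows = [
    {"date": "2021-01-04", "month": "2021-01", "year": 2021, "quantity": 3},
    {"date": "2021-01-18", "month": "2021-01", "year": 2021, "quantity": 2},
    {"date": "2021-02-01", "month": "2021-02", "year": 2021, "quantity": 5},
]
print(roll_up(rows, "month", "quantity"))  # {'2021-01': 5, '2021-02': 5}
print(roll_up(rows, "year", "quantity"))   # {2021: 10}
```

Moving from "month" to "year" climbs the hierarchy (roll-up); moving in the opposite direction gives finer detail.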
A schema is a logical description of a database, data warehouse, XML document [10], etc. While a database generally uses the relational model, a data warehouse can use a Star, Snowflake or Constellation schema.
• Star Schema: Each dimension is represented by a single dimension table, and the fact table at the center contains the keys of all dimensions.
• Snowflake Schema: Some dimension tables are normalized.
• Fact Constellation Schema: It contains multiple fact tables connected by common dimensions.
Table II presents the components of the multidimensional schema we propose.

Constellation C: C = (F; D_i)
F is a set of facts. D_i is a set of dimensions.

Fact F: F = (NameFct; M_i)
NameFct is the fact name of F. M_i is a list of measures.

Dimension D_i: D_i = (NameDim_i; Att_j; Hierar_k)
NameDim_i is the dimension name. Att_j is the list of attributes. Hierar_k is the list of hierarchies.

Fig. 3 presents the proposed multidimensional model, which contains two facts (Testing and Vaccine) surrounded by two dimensions (D_Date, D_Region). D_Date is decomposed into the hierarchy H1 (Week → Month → Trimester → Year). D_Region is decomposed into the hierarchy H2 (Region → Country → Zone → Continent).

We note that the positivity rate is very high for the zone of Eastern Europe during the first two trimesters of 2021. In order to analyze this multidimensional table further, we propose to apply the Drill-down operator, which breaks data down into smaller parts by descending from one level of the hierarchy to the next. Example: For Dimension 1, we can visualize data by passing from Zone to Country (cf.

We note that the number of vaccines is very low for the first trimester. To improve the visibility of this observation, we propose the following multidimensional query. Table VIII presents the result of this query.
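
The constellation described above (two facts sharing D_Date and D_Region) and the drill-down from Zone to Country can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not the query used in the paper: the table and column names, the measures `new_cases` and `first_doses`, and every inserted value are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE d_date (
        week TEXT PRIMARY KEY, month INTEGER, trimester INTEGER, year INTEGER
    );
    CREATE TABLE d_region (
        region TEXT PRIMARY KEY, country TEXT, zone TEXT, continent TEXT
    );
    -- Two fact tables that share both dimensions: a constellation.
    CREATE TABLE f_testing (
        week TEXT REFERENCES d_date(week),
        region TEXT REFERENCES d_region(region),
        new_cases INTEGER
    );
    CREATE TABLE f_vaccine (
        week TEXT REFERENCES d_date(week),
        region TEXT REFERENCES d_region(region),
        first_doses INTEGER
    );
    """
)
# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO d_date VALUES (?, ?, ?, ?)",
    [("2021-W05", 2, 1, 2021), ("2021-W20", 5, 2, 2021)],
)
conn.executemany(
    "INSERT INTO d_region VALUES (?, ?, ?, ?)",
    [
        ("PL", "Poland", "Eastern Europe", "Europe"),
        ("CZ", "Czechia", "Eastern Europe", "Europe"),
        ("FR", "France", "Western Europe", "Europe"),
    ],
)
conn.executemany(
    "INSERT INTO f_testing VALUES (?, ?, ?)",
    [("2021-W05", "PL", 100), ("2021-W05", "CZ", 80), ("2021-W05", "FR", 60),
     ("2021-W20", "PL", 40), ("2021-W20", "FR", 30)],
)
conn.executemany(
    "INSERT INTO f_vaccine VALUES (?, ?, ?)",
    [("2021-W05", "PL", 10), ("2021-W20", "PL", 50), ("2021-W20", "FR", 70)],
)

REGION_LEVELS = ("region", "country", "zone", "continent")  # hierarchy H2
MEASURES = {("f_testing", "new_cases"), ("f_vaccine", "first_doses")}


def aggregate(fact, measure, region_level):
    """Sum a measure per trimester at the chosen level of hierarchy H2."""
    if (fact, measure) not in MEASURES or region_level not in REGION_LEVELS:
        raise ValueError("unknown fact, measure or level")
    # Identifiers are checked against the whitelists above before use.
    query = f"""
        SELECT r.{region_level}, d.trimester, SUM(f.{measure})
        FROM {fact} AS f
        JOIN d_region AS r ON r.region = f.region
        JOIN d_date AS d ON d.week = f.week
        GROUP BY r.{region_level}, d.trimester
        ORDER BY r.{region_level}, d.trimester
    """
    return conn.execute(query).fetchall()


print(aggregate("f_testing", "new_cases", "zone"))     # view by Zone
print(aggregate("f_testing", "new_cases", "country"))  # drill-down to Country
print(aggregate("f_vaccine", "first_doses", "zone"))   # second fact, same dimensions
```

Because both fact tables reference the same dimension tables, the same grouping levels apply to testing and vaccination data, which is what makes side-by-side analysis of the two facts possible.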

VII. CONCLUSION
Multidimensional analysis techniques are used to visualize data from several perspectives, in order to help decision-makers explore data at several levels of granularity and so make appropriate decisions.
In this paper, we propose a data warehouse for storing data about the spread of Coronavirus Covid-19 and vaccination in European countries. We present a multidimensional model based on a constellation schema in order to deduce new knowledge. The user can add color-based constraints or criteria on multidimensional tables in order to highlight the most important values. The novelty of the approach lies in combining multidimensional techniques on Coronavirus Covid-19 data with these user-defined, color-based constraints to highlight relevant information.
For future work, we plan to study the impact of vaccination on the spread of the Coronavirus Covid-19 by integrating statistical tools into multidimensional tables.