A Graph-oriented Framework for Online Analytical Processing

Abstract—OLAP (Online Analytical Processing) is a tried-and-tested technology and a core concept in Business Intelligence. With data flowing from countless and varied sources, exploring data in order to deliver actionable insights has become a daunting task for current OLAP tools, despite the cycles of improvement they have gone through. In the last decade, with the emergence of the big data phenomenon, NoSQL databases have seen a spike in popularity and have become more widely used in industry and academia, as their value in handling huge and varied amounts of data becomes increasingly evident. The graph-oriented database is one of the four chief types of NoSQL databases and represents a promising candidate technology for big data analytics. In this paper we bring forward our contribution to graph-oriented analytical processing, which is twofold. First, we provide a novel approach for modeling a graph-oriented data warehouse. Second, we propose a data cube materialization through the precomputation of aggregated nodes. We show how typical OLAP queries can be performed against data warehouses stored in NoSQL graph-oriented database management systems. An implementation is conducted on a fictional data warehouse using Neo4j and the Cypher declarative language. The same dataset is stored in a relational data warehouse in order to compare storage space and query performance. The obtained results show that the graph OLAP implementation clearly outperforms the relational alternative in terms of query response time.


I. INTRODUCTION
OLAP (Online Analytical Processing) is a software technology dedicated to decision-making purposes. It is designed to locate meaningful intersections between multiple axes of analysis. Dimensional modelling is an integral part of OLAP systems; at the conceptual level it defines the fact concept, which holds measurements or metrics regarding a business process event, and the dimension concept, which provides a context describing the fact. Data conversion from a two-dimensional OLTP (Online Transaction Processing) database to the multidimensional model is done by an ETL (Extract, Transform, Load) tool. OLAP servers have historically been implemented mainly using four approaches: Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), Hybrid OLAP (HOLAP) and Desktop OLAP (DOLAP) [1], [2]. Each implementation has its strengths and limitations and must be evaluated based on the business requirements.
With the IT revolution, and aware of the potential of information, organizations around the globe have moved from the industrial economy into a new era characterized by a data-driven economy. This race for technology in order to gain competitive advantages has contributed to the generation of large volumes of data. As a consequence, data analytics has become a huge challenge for traditional OLAP systems, due to their reliance on vertical scalability and their limited computation ability. Indeed, earlier generations of OLAP implementations offer poor storage and computational capacity, because they are built upon old architectures and cannot match the requirements of big data analytics, especially data storage and data retrieval requirements. Another common problem is OLAP cube building over big data, which can reach a critical complexity due to the increasing number of dimensions and the unstructured nature of big data sets [3], [4].
To overcome the challenges of scale and complexity associated with today's data, OLAP research has moved in a new direction, namely the use of NoSQL databases in OLAP solutions, which are considered a promising alternative to traditional data storage tools [5]-[8], [9]. This technology offers several interesting features that cannot be achieved with classical database management systems, such as cluster computing and the ability to process both semi-structured and unstructured data. In this paper, we focus particularly on graph databases, a class of NoSQL databases that uses a graph model composed of nodes and edges instead of the relational model [10], [11], and we claim that the graph data structure is suitable for data warehousing and online analysis.
Implementing an OLAP cube using a graph database is not a straightforward process. The multidimensional model used to instantiate the data cube must be converted to a logical model suitable for a graph-oriented database. Furthermore, typical OLAP queries must be translated to a specific language supported by this technology. The aim of this work is to illustrate the potential of graph databases to handle OLAP structures designed for reporting. In this context, we define a set of mapping rules in order to migrate dimensionally modelled data into the graph database, and we demonstrate how typical OLAP operations can be performed against a graph database. In Fig. 1, we position our proposal with regard to the literature.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022
The key contributions of this work can be summarized as follows:
• We propose an implementation of OLAP engines under a graph database using two different logical models that are equivalent to the ROLAP and MOLAP models. We define a set of rules used for the mapping from the multidimensional model to these models. An experiment is conducted to highlight the differences between the two meta-models using a case study.
• We propose an effective aggregation technique to build the lattice of cuboids from a data warehouse built upon a graph database management system.
• We provide an extension of the declarative Cypher language to basic OLAP queries. In this work we consider Neo4j as the graph database engine.
The remainder of this paper is structured as follows. In the next section we present the background of our work and provide an overview of the state of the art related to Graph-OLAP. In Section III we present our modeling approach for graph OLAP. In Section IV we give an implementation of the proposed approach using the Cypher language. In Section V, we discuss experimental results. The last section concludes this work and suggests future research directions.

II. BACKGROUND AND RELATED WORK
A. The Multidimensional Schema
The multidimensional schema is the starting point for designing and implementing data warehouse systems. It defines four major concepts: facts, measures, dimensions and hierarchies [12]. In addition, a function Weak may associate parameters with a set of weak attributes.

B. The Graph Model
NoSQL graph-oriented databases are based on the concepts of the graph model, which organizes data into collections of nodes and edges. Once data is loaded, graph theory algorithms make it easy to handle semantic queries, for instance by calculating the shortest path between nodes. Graph databases specify connections at insert time and thereby avoid the problem of join index lookup performance, as querying data becomes a matter of graph traversal. This makes graph engines optimal when the meta-model of the data being stored has many overlapping relationships. This contrasts with relational databases, which store the links between tables at the logical level and rely on relational algebra operations to manipulate the stored data in a relevant logical format.
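As an illustration of traversal-based querying, the following Cypher sketch retrieves facts through stored edges rather than a join computed at query time; the labels, relationship type and property names here are hypothetical:

```cypher
// Hypothetical example: find all sales recorded in stores located in Asia.
// The relationship is traversed directly, since edges are stored physically;
// no join index lookup is needed.
MATCH (s:Sales)-[:SOLD_IN]->(st:Store)
WHERE st.region = 'Asia'
RETURN s
```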
Formally, a graph database G can be defined as a tuple whose components include, among others, N, a set of nodes (also called vertices), and E, a set of edges, where each edge e ∈ E connects a starting node x ∈ N to an ending node y ∈ N.
Although graph databases are widely used in OLTP systems, especially when the need to model multiple connections is self-evident, to the best of our knowledge no OLAP solution on the market uses a graph database at the physical level. However, the graph OLAP concept has been around for years, and some interesting works have attempted to implement OLAP systems using graph technology. A decade ago, Chen et al. [20], [21] studied the possibility of performing multidimensional analysis on graph data; the authors developed a graph OLAP framework having two major subcases, Informational OLAP and Topological OLAP, and proposed basic definitions of OLAP operations under this framework.
Many recent research works have been interested in implementing OLAP engines under property graph databases. In [22], the authors introduce a new data warehousing concept called Graph Cube, an OLAP infrastructure that supports analytical queries over a multidimensional network. In [23], the authors define the concept of GOLAP, an extension of Online Analytical Processing (OLAP) to graph databases, and list features such as semantic queries and structural analytics. They address the challenges of speed and storage related to GOLAP and propose possible solutions, such as graph data reduction and query result approximation when the execution time is too long; however, the authors did not provide an implementation of the proposed framework and focused rather on its formalization. In [24], the authors propose a novel graph cube framework called Two-Step Multi-dimensional Heterogeneous (TSMH), which consists of an Entity Hyper Cube and a Dimension Cube. In the Entity Hyper Cube, an n-meta-path relation algorithm is used to guide the aggregation of the network and to extend drill-down/roll-up operations. In the Dimension Cube, the efficiency of dimension operations is improved by using a hierarchical coding for entity types and dimensions.
In the same vein, in [25] the author proposed an OLAP data structure that relies on typed nodes to store facts and dimensions, and introduced an extension of the Cypher language to basic OLAP queries. The authors did not provide any experimental campaign to validate their proposal, focusing rather on demonstrating its feasibility. In [26], [27], the authors proposed a formal multidimensional data model for graph analysis based on node- and edge-labeled graphs called graphoids, and presented a proof-of-concept implementation using a Neo4j graph database.
Regarding the instantiation of data warehouses using property graph database, in [28] the authors define a set of transformation rules for mapping between the multidimensional conceptual model and NoSQL graph model.
All the cited works present an interesting background for graph-based online analytical processing. Most of them addressed the adaptation of the graph structure to OLAP needs. Although they share some similarities with ours, the contribution of this work is quite different, as we propose a novel approach for implementing both a data warehouse and an OLAP engine based on efficient data cube materialization over a graph database.

III. GRAPH OLAP MODEL
OLAP engines have traditionally been categorized according to whether or not they pre-compute OLAP cuboids. Following this taxonomy, OLAP systems where all or part of the cube is pre-computed and stored in memory or on disk are called multidimensional OLAP (MOLAP) systems, while systems where OLAP cuboids are computed on demand directly from the data warehouse are considered relational OLAP (ROLAP) systems.
In this section we define the logical graph model for data warehousing. We consider two approaches, by analogy with the ROLAP and MOLAP models; each one differs in terms of structure and content when the mapping from the conceptual model is performed. In the first approach, the fact, the dimensions and the links between them are materialized by nodes and edges following several mapping rules, while in the second approach the model is rather an aggregate lattice expressed using the graph paradigm. In what follows, we will use a fictional electronics company as a running example. The star schema of our cube is depicted in Fig. 2.
Fig. 2. The Star Schema.

A. First Approach
This approach corresponds to a lightly summarized data model. It defines a meta-model in which each component (the fact and its associated dimensions) is transformed into a node, and the relations between nodes are materialized by edges following a set of mapping rules. For the star schema represented in Fig. 2, the application of these rules yields the meta-model shown in Fig. 3.
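A minimal Cypher sketch of a graph instance produced by this first approach; the labels, relationship types and property names are assumptions consistent with the running example, not the paper's actual mapping rules:

```cypher
// Sketch of the first approach: one node per dimension member,
// one node per fact instance, and edges materializing the links.
CREATE (p:Product {productKey: 1, name: 'Laptop X', brand: 'BrandA'})
CREATE (st:Store {storeKey: 7, city: 'Tokyo', country: 'Japan', region: 'Asia'})
CREATE (d:Date {dateKey: 20220115, quarter: 'Q1', year: 2022})
CREATE (f:Sales {amount: 1200.0, quantity: 2})
CREATE (f)-[:SOLD_PRODUCT]->(p)
CREATE (f)-[:SOLD_IN]->(st)
CREATE (f)-[:SOLD_ON]->(d)
```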

B. Second Approach
When we want to perform an aggregation on a graph OLAP cube built according to the first approach, the query is served on demand: fact nodes are retrieved and then aggregated using an aggregation function. This technique achieves the required result, but it is not optimized for large data volumes. Moreover, it runs counter to the OLAP philosophy, in which data aggregations are pre-computed and stored.
The second approach corresponds to a highly summarized data model where measure aggregations are pre-calculated and directly available for the sake of query performance. The set of pre-computed aggregations is called an aggregate lattice. Concretely, fact measures are aggregated according to different combinations of dimensions and stored as a node with two labels. Referring to our running example and considering only high levels of granularity, let us assume by convention that the order of the position levels is: Product.Brand-Product.Product-Store.Region-Store.Country-Date.Year-Date.Quarter. Table I and Fig. 4 display such a representation.
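Under these conventions, a pre-computed aggregate node might be sketched in Cypher as follows. The first label marks the node as an aggregate, and the second encodes the cuboid's position in the lattice (our reading of the naming in Table I); the property names and values are assumptions:

```cypher
// Sketch: an aggregate node carrying two labels, e.g. the cuboid
// aggregated by product brand only (lattice position S10x0x0).
CREATE (a:Aggregate:S10x0x0 {brand: 'BrandA',
                             totalSales: 54000.0,
                             totalQuantity: 120})
```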

IV. IMPLEMENTATION
A. Answering Typical Analytical Operations using Cypher
OLAP operations help users view data from different perspectives, providing a convenient environment for real-time data visualization and analysis. OLAP defines several basic operations; the most popular ones are roll-up, dicing and slicing. In this section we present how these operators can be expressed over a data cube designed according to the first approach.
Queries are written in Cypher, a declarative query language intended to be executed on a database engine built on the graph model. Cypher relies on the concept of pattern matching for querying and updating graphs [12]. A detailed description of the Cypher syntax is beyond the scope of this paper.
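For readers unfamiliar with the language, a minimal illustration of Cypher pattern matching, using the running example (the relationship type is an assumption):

```cypher
// Nodes appear in parentheses, relationships in square brackets,
// and ASCII arrows give the direction of traversal.
MATCH (f:Sales)-[:SOLD_PRODUCT]->(p:Product)
RETURN p.name AS product, count(f) AS numberOfSales
```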

1) Roll-up:
The roll-up operation (also called consolidation or aggregation) performs aggregation on a data cube in two ways: either by reducing the number of dimensions or by climbing up a concept hierarchy of a dimension. It is akin to zooming out, from the most detailed granularity level to a less detailed one.
In the query given in Listing 1, the roll-up operation is performed by climbing up the concept hierarchy of the Product dimension (Product → Brand) and of the Store dimension (Store → City). The execution of the query results in the creation of a node containing the aggregated measures and two new relations linking the created node to its associated dimension hierarchies.
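This roll-up (Listing 1) can be sketched in Cypher as follows; the labels, relationship types and property names are assumptions drawn from the running example, not the paper's exact listing:

```cypher
// Aggregate sales and quantities by product brand and store city,
// store the result as an aggregate node, and link it to the
// corresponding hierarchy members.
MATCH (f:Sales)-[:SOLD_PRODUCT]->(p:Product),
      (f)-[:SOLD_IN]->(st:Store)
WITH p.brand AS brand, st.city AS city,
     sum(f.amount) AS totalSales, sum(f.quantity) AS totalQuantity
CREATE (a:Aggregate {totalSales: totalSales, totalQuantity: totalQuantity})
MERGE (b:Brand {name: brand})
MERGE (c:City {name: city})
CREATE (a)-[:BY_BRAND]->(b)
CREATE (a)-[:BY_CITY]->(c)
```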
Listing. 1. Roll up - Aggregation of sales and quantities by product brand and store city.
3) Slicing: Slicing is similar to dicing, with a small difference: it emphasizes one specific dimension and provides a new sub-cube by filtering on a particular attribute. It can be considered a specialized filter on a specific dimension parameter value.
In Listing 3, a slice is carried out for the dimension Region using the criterion Region = 'Asia'.
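A plausible Cypher sketch of such a slice, assuming the Region attribute is stored on Store nodes (labels and property names are assumptions):

```cypher
// Keep only the facts whose store lies in the Asia region.
MATCH (f:Sales)-[:SOLD_IN]->(st:Store)
WHERE st.region = 'Asia'
RETURN f.amount AS amount, f.quantity AS quantity
```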

B. Aggregates Creation
We refer to the property graph in Fig. 3 and the set of aggregates in Table I, and show how the pre-calculation of our sample cuboids can be performed.

1) Aggregate by product brand:
Cypher evaluates queries through its core concept, pattern matching. By means of patterns, the user describes the shape of the requested data, and the Cypher engine is responsible for retrieving the data being looked for. For example, to build the aggregate value Aggregate:S10x0x0, a join is implemented by matching Sales → Brand against the OLAP graph. It is worth noting that the label of the edge linking the fact and the dimension nodes is not required, as it is inferred from the node types.
In SQL, this is equivalent to a join between the fact table Sales and the dimension table Brand, followed by the aggregation function SUM and a GROUP BY clause over the Brand attributes.
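Putting this together, the creation of the aggregate (Listing 4) might be sketched as follows. The unlabeled edges in the MATCH pattern echo the remark above that edge labels can be inferred from node types; the relationship and property names are assumptions:

```cypher
// Pre-compute the cuboid aggregated by product brand and store it
// as a node with two labels: Aggregate and the lattice position S10x0x0.
MATCH (f:Sales)-->(:Product)-->(b:Brand)
WITH b, sum(f.amount) AS totalSales, sum(f.quantity) AS totalQuantity
CREATE (a:Aggregate:S10x0x0 {totalSales: totalSales,
                             totalQuantity: totalQuantity})
CREATE (a)-[:BY_BRAND]->(b)
```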
Listing. 4. Creation of the aggregate Aggregate:S10x0x0.
Fig. 5 shows how the aggregate Aggregate:S10x0x0 (by product brand) fits in the property graph (colored in grey). It is a one-level aggregate, as it is calculated against one hierarchical level (colored in red).
Increasing the materialization of the aggregates can considerably improve query performance, but it can also drastically affect storage space, since aggregate nodes are stored on disk. The pre-calculation of all possible aggregate values is often not needed. Generally, OLAP engines choose the percentage of pre-computed values based on business needs; the remaining aggregates are calculated in response to queries. We can imagine a scenario in which potentially requested aggregates are inferred from log files that contain previously executed queries.

V. RESULTS AND DISCUSSION
We conducted experiments to evaluate two aspects of the OLAP implementation under a graph database: storage space and query performance. To this end, the solution we propose is compared with a ROLAP implementation under an Oracle relational database containing the same dataset. The experiment is carried out on a Unix machine (macOS) with a Core i7 CPU, 16 GB of RAM and 1 TB of storage, running Neo4j Community Edition v4.3.

A. Data Generation
The dataset used in the experiment is generated using a novel NoSQL star schema benchmark named KoalaBench [29], [30]. This tool is developed in Java and is derived from the reference benchmark TPC-H. For clarity, and to fit the meta-model of our running example, the Supplier dimension is replaced with the Store dimension, LineItem is renamed Sales, and, for the equivalent graph model, only a few dimension parameters are kept. Datasets can be generated in different configurations (different file formats including tab, csv, json, xml, and multiple models). The size of the generated data by scale factor is detailed in Table II.

B. Experiment 1: Memory Consumption Per Scale Factor
In this experiment we use a global flat CSV file representing data in a flat meta-model. In the appendix (Listing 8), we attach the Cypher script for loading data from a CSV file into the Neo4j database according to our modeling approach. A fragment of the generated graph is represented in Fig. 6. The number of nodes and edges of the corresponding graph is given in Table III.
From Table III, we can see that a snowflake schema on a graph database requires more storage space than on a relational one (more than 3 times for SF=1). This is easily explained: property graph databases store relationships physically on disk using edges, whereas relational databases use the concept of foreign keys instead. Furthermore, metadata is stored individually for each record in a graph database, unlike the relational model, which defines the structure of the data at a higher level (the table itself); this means that property names are repeated for each item. Indeed, graph databases are very storage intensive, which is traded for higher query performance. Since hard disks are nowadays inexpensive, buying more storage space is a worthwhile trade-off compared to keeping users waiting.
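For illustration, a data load in the spirit of the appendix script (Listing 8) might look as follows; the file name, CSV headers, labels and relationship types are assumptions, not the script itself:

```cypher
// Load a flat CSV file and build the graph model: dimension nodes are
// merged (created once per member), fact nodes are created per row.
LOAD CSV WITH HEADERS FROM 'file:///sales_flat.csv' AS row
MERGE (p:Product {productKey: toInteger(row.productKey)})
  ON CREATE SET p.name = row.productName, p.brand = row.brand
MERGE (st:Store {storeKey: toInteger(row.storeKey)})
  ON CREATE SET st.city = row.city, st.country = row.country,
                st.region = row.region
CREATE (f:Sales {amount: toFloat(row.amount),
                 quantity: toInteger(row.quantity)})
CREATE (f)-[:SOLD_PRODUCT]->(p)
CREATE (f)-[:SOLD_IN]->(st)
```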

C. Experiment 2: Query Performance
The purpose of this experiment is to measure empirically the performance of graph OLAP in processing analytical queries when scaling up, in comparison with the ROLAP implementation under an Oracle database. We exposed the system to a scale factor equal to 10, which generates 11.6 GB of random data in CSV file format. The query configuration includes queries involving a gradually increasing number of dimensions, as depicted in Table IV. Each query was executed three times, and the average elapsed time is presented in Fig. 7.
The experimental results show that the relational implementation beats the graph alternative when the query involves one dimension, but when the query dimensionality increases, the graph alternative shows better performance, ranging from 1.82 to 2.29 times faster. Indeed, in relational databases, the deeper we go in joining tables, the slower query processing becomes, because it requires scanning all the tables involved in the query, which has a considerable cost. Unlike relational databases, which suffer the pain of joining tables, graph databases express relationships at the physical level: the links between nodes exist physically on disk and are named and directed, which makes graph traversal easier.

VI. CONCLUSION
The ability of graph technology to handle highly interconnected data makes it suitable for interactive analysis and more relevant for businesses today. In this paper, we addressed the topic of extending NoSQL graph-oriented databases to OLAP. We proposed a modeling approach for implementing graph-based data warehouses using labeled nodes and edges. We also showed how materialized aggregates can be pre-computed across different levels to speed up query processing. At the physical level, the Neo4j engine is used as the graph-oriented database management system, and typical OLAP queries are rewritten using its declarative query language, Cypher.
The graph OLAP implementation is compared to the ROLAP one in terms of query performance and storage space; the results clearly show that the graph implementation of OLAP offers better query response times than the relational alternative when facing a huge data volume.
In forthcoming extended work, we look forward to extending Cypher to support OLAP features by writing user-defined aggregation functions using the low-level API provided by the Neo4j engine.
Without any doubt, using NoSQL technology to support OLAP features is a promising research direction. We therefore claim that implementing OLAP engines under column-oriented and document-oriented databases using novel frameworks would be an interesting research issue to address.