Pentaho and Jaspersoft : A Comparative Study of Business Intelligence Open Source Tools Processing Big Data to Evaluate Performances

Regardless of the recent growth in the use of “Big Data” and “Business Intelligence” (BI) tools, little research has been undertaken about the implications involved. Analytical tools affect the development and sustainability of a company, as evaluating clientele needs to advance in the competitive market is critical. With the advancement of the population, processing large amounts of data has become too cumbersome for companies. At some stage in a company’s lifecycle, all companies need to create new and better data processing systems that improve their decision-making processes. Companies use BI Results to collect data that is drawn from interpretations grouped from cues in the data set BI information system that helps organisations with activities that give them the advantage in a competitive market. However, many organizations establish such systems, without conducting a preliminary analysis of the needs and wants of a company, or without determining the benefits and targets that they aim to achieve with the implementation. They rarely measure the large costs associated with the implementation blowout of such applications, which results in these impulsive solutions that are unfinished or too complex and unfeasible, in other words unsustainable even if implemented. BI open source tools are specific tools that solve this issue for organizations in need, with data storage and management. This paper compares two of the best positioned BI open source tools in the market: Pentaho and Jaspersoft, processing big data through six different sized databases, especially focussing on their Extract Transform and Load (ETL) and Reporting processes by measuring their performances using Computer Algebra Systems (CAS). The ETL experimental analysis results clearly show that Jaspersoft BI has an increment of CPU time in the process of data over Pentaho BI, which is represented by an average of 42.28% in performance metrics over the six databases. Meanwhile, Pentaho BI had a marked increment of the CPU time in the process of data over Jaspersoft evidenced by the reporting analysis outcomes with an average of 43.12% over six databases that prove the point of this study. This study is a guiding reference for many researchers and those IT professionals who support the conveniences of Big Data processing, and the implementation of BI open source tool based on their needs. Keywords—Big Data; BI; Business Intelligence; CAS; Computer Algebra System; ETL; Data Mining; OLAP


INTRODUCTION
Business Intelligence software converts stored data of a company's clientele profile and turns it into information that forms the pool of knowledge to create a competitive value and advantage in the market it is in [1].Additionally, Business Intelligence is used to back up and improve the business with reasonable data and use the analysis of this data, to continuously improve an organisation's competitiveness.Part of this analysis is to provide timely reports, for management's to make the decision based on factual information, so their decision-making is based on concrete evidence.Howard Dresner, from Gartner Group [2], was the first to coin the term Business Intelligence (BI), as a term to define a collection of notions and procedures to support the decision-making, by using information found upon facts.
BI system gives enough data to use and evaluate the needs and desires of customers and in addition it allows to: [3]: i) Design reports for departments or global areas in a company, ii) Build a database for customers, iii) Create scenarios for decision-making, iv) Share information between areas or departments of a company, v) Sandbox studies of multidimensional designs, vi) Extract, transform and process data, vii) Give a new approach to decision-making and viii) Improve the quality of customer service.
The benefits of systemizing BI include the amalgamation of information from several sources.[4], creating user profiles for information management, reducing the dependence on the systems department, the reduction in the time of obtaining information, improves the analysis, and also improves the availability to access real-time information according to specific current business criteria.
The recent publication of Gartner Magic Quadrant for Business Intelligence Platforms 2015 [5] has highlighted the changes being taken by the BI sector to rapidly deploy www.ijacsa.thesai.orgplatforms that can be used by both business users and analysts to extract information from collected data.Traditionally, business intelligence has been understood as a set of methodologies, applications and technologies used to transform data into information and then information into a personal profile of clients that is generated into structured data to serve different areas of business enterprise [6].
Therefore, Big Data will aid to develop better procedures that allow (BI) tools to be used to gather information, such as [7]: i) Process and analyse volumes of information; ii) Increase the universe of data to consider when decisionmaking: and inherent historical data of the company, to incorporate data from external sources ; iii) Provide an immediate response to the continued provision of real-time data of the devices and the possibilities of interconnections between devices; iv) Working with structures of complex and heterogeneous data: logs, emails, conversations, locations, voice, etc.; v) and lastly, to Isolate from the physical constraints of storage and process by making use of scalable solutions and high availability at a competitive prices.This paper presents an experimental analysis of the comparison of two of the best positioned open source BI systems in the market: Pentaho and Jaspersoft, processing Big data and focussing on their Extract Transform and Load (ETL) and reporting processes by measuring their performance using Computer Algebra Systems.The aim of this paper is to analyse and evaluate these tools and outline how they improve the quality of data, and inadvertently helps us understand the market conditions to make future predictions base on trends.
Section II describes the capabilities and components of both Pentaho and Jaspersoft BI Open Sources.Section III introduces the computer algebra systems SageMath and Matlab.This is followed by the materials and methods (Section IV) used in the analysis and experimentation, especially the ETL and Reporting measurements and how they were implemented.In Section V, the results of the study for CPUtime as a function of the "size" from the input data for the ETL and Reporting processes from both Pentaho and Jaspersoft Business Intelligence Open Sources, applying two different Computer Algebra Systems.Section VI contains the discussion of the experimentation.Section VII, the conclusion of the study.

II. PENTAHO AND JASPERSOFT BUSINESS INTELLIGENCE OPEN SOURCES
A. Pentaho Pentaho, created in 2004 is the current leader of Business Solutions Intelligence Open Source.It offers its own solutions across the spectrum of resources to develop and maintain the operations of BI projects from the ETL with data integration to the dashboards with Dashboard Designer [8].Pentaho has built its solution Business Intelligence integrating different existing and recognized solvency projects.Data Integration was previously known as Kettle; indeed, it retains its old name as a colloquial name.Mondrian is another component of Pentaho that retaining its own entity.
Pentaho has the following components: a) ETL: Pentaho Data Integration (previously Kettle) is one of the most widely used ETL solutions and better valued in the market [9].It has a long history, solidity, and robustness that make it a highly recommended tool.It allows transformations and works in a very simple and intuitive way, as it is shown in Fig. 1.Likewise, the Data Integration projects are very easy to maintain.The BI Pentaho Server is a 100% Java2EE allows us to manage all BI resources [10].It has a BI user interface available where reports are stored, OLAP views and dashboards as it is illustrated in Fig. 2. In addition, it offers access to a management support that allows managing and monitoring both application and usage.Fig. 2. Pentaho Server User Interface to manage BI resources where all the reports are founded, OLAP views, and dashboards.Also the access to a management supports that allows managing and monitoring both application and usage

c) Pentaho
Reporting: Pentaho provides a comprehensive reporting solution.Covering all aspects needed in any reporting environment, as shown in Fig. 3.The Pentaho reporting tool is the old form of JFreeReport [11]: i) It provides a tool for reporting (Pentaho Reporting), ii) Provides an execution engine, iii) Provides Metadata tool for conducting reports Ad-hoc, and iv) Provides a user interface that allows ad-hoc reports (WAQR).Fig. 3. Pentaho Reporting Interface with a comprehensive reporting solution, covering all aspects needed in any reporting environment d) OLAP Mondrian: Online Analytical Processing is the technology that allows us to organize information in a dimensional structure that will allow us to move information by scrolling through its dimensions [12].Mondrian is the Pentaho OLAP engine.Although it can be integrated independently on any other platform, and indeed it is the component.Data Integration that is used independently.Mondrian is a Hybrid OLAP engine that combines the flexibility of ROLAP engines with a cache that provides speed.
• Viewer OLAP: Pentaho Analyser: OLAP Viewer that comes with the Enterprise version [13].Modern and easier to use than JPivot as it is illustrated in Fig. 4.
AJAX provides an interface that allows great flexibility when creating the OLAP views.e) Dashboards: Pentaho provides the possibility of making dashboards [13] through the web interface using the dashboard designer as it is shown in Fig. 5.

B. Jaspersoft
Jaspersoft is the company behind the famous and extended Jaster Reports.Open Source reporting solution preferred by most developers to embed in any Java application that requires a reporting system.Jaspersoft has built its solution B.I. around its reporting engine [14].This has been done differently from Pentaho.Jasper has integrated its projects that also solves existing and consolidates projects nonetheless, has not absorbed it.This strategy makes it "depend" on Talend solution regarding ETL and Mondrian -Pentaho for the OLAP engine.Jasper has access to the code Mondrian that can adapt and continue its developments with Mondrian.
Jaspersoft has the following components: a) ETL -JasperETL is actually Talend Studio.Talend, unlike Kettle, it has not been absorbed by Jasper and remains an independent company that offers its products independently [15].Working with Talend is also quite user-interface intuitive and proprietary although the approach is completely different.Talend is a code generator that is the result of an ETL exercise and it is native Java or Perl code.It can also compile and generate Java procedures or instructions.Talend is more oriented to a type of programmer used with a higher level of technical expertise than it requires by Kettle as it is illustrated in Fig. 6.To sum up, the flexibility is much better with this approach.Fig. 6.Jaspersoft ETL Interface is actually Talend Studio, it is also quite user-interface intuitive and a code generator b) Web Application-JasperServer: JasperServer is a 100% Java2EE that allows us to manage all BI resources [16].The overall look of the web application is a bit minimalist without sacrificing the power as shown in Fig. 7.However, having all resources available on the top button bar makes it a 100% functional application and has all the necessary resources for BI.Fig. 7. Jaspersoft Server Interface is a 100% Java2EE to manage business intelligence resources c) Reports: As described, the report engine is the solution of Jaspersoft as it is illustrated in Fig. 8.The component provides features such as i) Report development environment: Ireport as a system based on environment NetBeans.What makes it challenging to machine resources?In return it offers great flexibility, ii) System of metadata (Domains) the web.These, along with ad-hoc reports, are the strengths of this solution, iii) Web Interface for ad-hoc reports really well resolved, iv) The runtime JasperReports widely was known and used in many projects where a solvent reporting engine is needed, and v) The reports can be exported into PDF, HTML, XML, CSV, RTF, XLS and TXT.
• Predefined -Ireport: IReport is a working environment that allows a large number of features [17].Here something like that Talend is a working environment with larger demands as a result of offering a number of possibilities occurs.
• Ad hoc: This is the real strength of Jasper solutions.
The editor of ad-hoc reports is the best structured and best featured tool for analysing [17].It offers: i) Selection of different types of templates and formats, ii) Selection of different data sources, iii) Validation consultation on the fly, iv) Creation of reports by dragging fields to the desired location: i) Tables, ii) Graphics, iii) Crosstable (Pivot), and iv) Edition of all aspects of the reports.The OLAP engine that uses JasperServer is Mondrian and uses a Viewfinder-JasperAnalysis [18], which is no longer JPivot but with a layer of makeup as shown in Fig. 9. Already mentioned in Pentaho paragraph.Fig. 9.The OLAP engine that uses JasperServer is Mondrian and uses a Viewfinder-JasperAnalysis which is no longer JPivot but with a layer of makeup e) Dashboards: Dashboard Designer.Illustrated in Fig. 10.
• Predefined: They do not make much sense, given the designer panels [19].In any case, to be a Java platform it can always include proper developments.
• Ad-hoc: Dashboard Designer: It is back to a really easy and simple use of the web editor.

A. Sagemath
SageMath is a computer algebra system (CAS) that is built on mathematical packages and contrasted as NumPy, Sympy, PARI / GP or Maxima.It accesses the combined power of the same through a common language based on Python.The interaction code combines cells with graphics, texts or formulas enriched with LaTeX rendered.Additionally, SageMath is divided into a core that performs calculations and an interface that displays and interacts with the user.Even, a command line-based text is also available using Python that allows interactive control calculations [20].The Python programming language supports object-oriented expressions and functional programming.Internally, SageMath is written in Python and a modified version of Pyrex called Cython.It allows parallel processing [21] using both multi-core processors and symmetric multiprocessors.It also provides interfaces to other non-free software as Mathematica, Magma, and Maple (undistributed with SageMath) that allows users to combine software and compare results and performances.
All the packages cover most features such as i) Libraries of elementary and special functions, ii) 2D and 3D graphs of both functions and data, iii) Data manipulation tools and duties, iv) A toolkit for adding user interfaces to calculations and apply, v) Tools for image processing using Python and Pylab, vi) Tools to visualize and analyze graphs, vii) Filters for importing and exporting data, images, video, sound, CAD, and GIS, viii) Sage embedded in documents LaTeX6 [22].

B. Matlab
Matlab is a computer algebra system (CAS) that provides an integrated environment that develops and offers representative characteristics such as the implementation of algorithms, data representation and functions.Also, communication with programs in other languages and other hardware devices [23], among others are advanced.The Matlab package has two extra tools that extend those functionalities: Simulink is a platform for multi-domain simulation and GUIDE that is a graphic user interface -GUI.Additionally, its potential could be expanded using Matlab toolboxes; and Simulink blocks with block sets.
The language of Matlab is interpreted, and can run in both interactive environments through a script file (*.m files).This language allows vector and matrix operations to function, lambda calculus, and object-oriented programming.An additional tool called Matlab Builder has been launched that contains an "Application Deployment" which allows using Matlab functions, as library files, that provides the ability to be used with environments such as .NET or Java.Matlab Component Runtime (MCR) should be used on the same machine where the main application is set for the Matlab function properly [24].One of the versatilities of this CAS is that it is quite useful to carry out measurements and it provides an interface to interact with other programming languages.Thus, Matlab can call functions or subroutines written in C or FORTRAN [25].As the process is accomplished by, creating a wrapper function that allows them to be passed and returned by their data types Matlab.

IV. MATERIALS AND METHODS
Accurately measuring the processing times is not a trivial task, and the results may vary significantly from one computer to another.The number of factors that influence the execution times has used an algorithm, operating system, processor speed, the number of processors and instruction sets that understands the amount of RAM, and cache, and speed of each, math coprocessor, GPU Among each other.Even on the same machine, the same algorithm sometimes takes much longer to give results, due to factors such as using more time than other applications, or if there is enough RAM when running the program.
The objective is to compare only the ETL and Reporting processes, trying to draw independent conclusions from one machine to another.The same algorithm can be called with different input data.
The goal of this study is to measure the run-time as a function of the "size" of the input data.For this, two techniques are used: -Measure run time of programs with different input data sizes and -Count the number of operations performed by the program.

A. ELT Measurement
With Sage was measured the run time and efficiency of ETL processes in both BI tools mentioned previously as it is illustrated in Fig. 11.For the CPU time, Sage uses the concepts of CPU time and Wall time [12], which are the times that the computer is dedicated solely to the program.
The following flow chart shows the ETL process measurement.www.ijacsa.thesai.orgThe CPU time is dedicated to our calculations, and Wall time clock time between the beginning and the end of the calculations.Both measurements are susceptible to unpredictable variations.The easiest way to get the run time of a command is to put the word time to the command as it is shown in Fig. 12. Fig. 12. Sage code to measure CPU time of ETL processes in both BI tools for small and higher data sizes The time command is not flexible enough and needs the CPUtime functions and Walltime.CPUtime is a kind of meter: a meter progresses as the calculations are done, and moves many seconds as the CPU dedicated to Sage.The Walltime is a conventional clock (the clock UNIX).For the time spent on the program, also the before and after times of the execution were recorded and calculated and the differences are illustrated in Fig. 13.For doing this activity, a C function was used that implemented a High-Resolution Performance Counter for measurement the Reporting processes on both BI tools as it is illustrated in Fig. 16.

C. Databases Analysis
Six different Excel databases with different sizes have been used to perform the analysis.Those databases were acquired from UCI Machine Learning Repository [26] and their main features are described in Table 1:

D. Computer system
In order to perform the experiment and examination, the Business Intelligence Tools, Computer Algebra Systems and Databases are set on and customised in a PC with the following features: i) Operating system: x64-based PC, ii) Operating system version: 10.0.10240N/D iii) Compilation 10240, iv) Number of processors: 1, v) Processor: Intel (R) Core (TM) i5-3317U vi) Processor speed: 1.7 GHz, vii) Instructions: CISC, viii) RAM: 12 Gb, ix) RAM speed: 1600 MHz, x) Cache: SSD express 24 Gb, xi) Math coprocessor: 80387, xii) GPU: HD 400 on board.

V. RESULTS
In this study, results for CPUtime as a function of the "size" was obtained from the data input for the ETL and Reporting processes from both Pentaho and Jaspersoft Business Intelligence Open Sources, applying two different Computer Algebra Systems.
The measurements of the computational times might fluctuate considerably based on many factors such as the used algorithm, operating system, processor speed, number of processors and instruction set that understand the amount of RAM, and cache, of each speed, along with math coprocessor, GPU among others.Even on the same machine, the same algorithm sometimes takes much longer to give the result of others, due to factors that it is more time-consuming than other applications that are running or if it has enough RAM when running the program.
Tables 2 and 3 shows the results of the CPU time (in minutes) of the ETL and Reporting processes and they present the times it took per tool in processing the different sized databases.Additionally, the increment of processing data can be considered as a difference between those BI tools in process.As a result of the first examination (Table 2), it is clear that the computational times for Pentaho ETL process measured by Sage were: 8 min; 12.01 min; 21 min; 32.01 min; 39.06 min and 48.01 min.Conversely, the computational times for Jaspersoft, were 9.The results of the CPU time of the ETL process is shown in Table 2 and it presents the times that it took per tool in different databases.The Graphical comparison results of the CPU times for the ETL and Reporting processes performed by the BI tools, accessing six different sized databases which is illustrated below.In Fig. 17, we observe that Jaspersoft has significantly increased the results of the CPU time of the ETL process represented by 19.22%, 60.85% 51.79%, 39.75%, 40.77 and 40.99% processing DB1, DB2, DB3, DB4, DB5 and DB6, respectively.This means that in the ETL process, Pentaho had a better performance than Jaspersoft.In Fig. 18, it is evident that Pentaho had a considerable rise in the outcomes of the Reporting process denoted by 25%, 32.99%, 40%, 48%, 53% and 59.75% processing DB1, DB2, DB3, DB4, DB5 and DB6, correspondingly.In this case, Jaspersoft had a better performance than Pentaho in the Reporting process.The outcomes also showed that both results of CPU time of the ETL and Reporting process are directly related to the sizes of the databases.What is more, is that the study could identify that Pentaho had a superior performance for ETL process and Jaspersoft an improved performance for Reporting process.

VI. DISCUSSION
The ETL experimental analysis results clearly shows that Jaspersoft BI had an increment of CPU time in the process of data over Pentaho BI, represented in an average of 42.28% of performance metrics over six databases.Pentaho provided its data integration and ETL capabilities as shown in [9], having a better performance.Evidently, for this part of the experimentation, our study has demonstrated that Pentaho had higher performance ETL capabilities with the aim of covering the whole data integration requirements, simultaneously by big data as well.That high performance is provided by its parallel processing engine and these features are shown in [11].
Pentaho BI had a marked increment of the CPU time in the process of data over Jaspersoft evidenced by the Reporting analysis outcomes with an average 43.12% over six databases.Clearly, in this part of the examination, the analysis has confirmed that Jaspersoft has had a higher performance Reporting capability with the objective of generating reports.This particular feature is aligned with other studies, which argue that Jaspersoft extends the range of its BI requirements including reporting based on its operational production, interactive end-user query, data integration and analysis as shown in [11].On top of this, investigating various security features [27][28][29] could be an interesting avenue to explore in the future to protect BigData.

VII. CONCLUSION
This study has tested two of the best positioned open source Business Intelligence (BI) systems in the market: Pentaho and Jaspersoft.Both BI systems present notable features on their components.Pentaho on one side along with ETL component with great usability, maintainability and flexibility in making the transformations: Web Application with Java j2EE application 100% extensible, adaptable and configurable; the configuration management is integrated in most environments, that communicate with other applications via web services; it integrates all the information resources into a single operating platform; Reports with an intuitive tool that allows clients to create reports easily; OLAP Mondrian with a consolidated engine widely used in environments of JAVA; Dashboard Designer makes dashboards Ad-hoc, dashboards based on SQL queries or Metadata and a great freedom by offering a wide range of components and options.Jaspersoft on the other side has JasperETL (Talend) with Java / Perl native, Web Application with a Java j2EE application 100% extensible, adaptable and customizable; the management settings are very well resolved, it allows almost all through the same Web application; It integrates all information resources into a single operating platform; the editor Ad-hoc reports and Box Editor Ad-hoc command are best resolved; Reports are fast; Ad hoc and have a nice interface, with good flexibility and power, simple, intuitive and easy to use.
The experimental analysis has focussed on their ETL and Reporting processes by measuring their performance s using the two Computer Algebra Systems, Sage and Matlab.During the ETL analysis results, clearly showed that it could observe Jaspersoft BI and has an increment of CPU time in the process of data over Pentaho BI, represented in an average of 42.28% of performance metrics over six databases.Meanwhile, Pentaho BI had a marked increment CPU time in the process of data over Jaspersoft evidenced by the Reporting analysis outcomes with an average 43.12% over the databases.This study is a useful reference for many researchers and those who are supporting decisions of Big Data processing and the implementation of BI open source tool based on their process expectations.The future work of the author would involve new studies and implementations of BI with Data warehousing to create a technological tool to support the decision-making at the enterprise level by taking this paper as a base.

Fig. 1 .
Fig. 1.Pentaho Data Integration Interface, the ETL solution allows transformations and works in a very simple and intuitive way b) Web Application-BI Server:The BI Pentaho Server is a 100% Java2EE allows us to manage all BI resources[10].It has a BI user interface available where reports are stored, OLAP views and dashboards as it is illustrated in Fig.2.In addition, it offers access to a management support that allows managing and monitoring both application and usage.

Fig. 4 .
Fig. 4. Pentaho Analyser Interface to create OLAP views that provide a comprehensive reporting solution.Covering all aspects needed in any reporting environment

Fig. 5 .
Fig. 5. Pentaho Dashboards Interface that provides the possibility of making dashboards through the web interface by using the dashboard designer

Fig. 8 .
Fig.8.Jaspersoft Reporting Interface that includes report development environment, a system of metadata (Domains) web, web interface for ad-hoc reports and runtime JasperReports d) OLAP: The OLAP engine that uses JasperServer is Mondrian and uses a Viewfinder-JasperAnalysis[18], which is no longer JPivot but with a layer of makeup as shown in Fig.9.Already mentioned in Pentaho paragraph.

Fig. 10 .
Fig. 10.Jaspersoft Dashboards Designers Interface able to select predefined or ad-hoc III.COMPUTER ALGEBRA SYSTEMS

Fig. 11 .
Fig. 11.The ETL process measurement using the Sage computer algebra system for all databases

Fig. 13 .
Fig. 13.Sage code using CPUtime functions and Walltime to measure the ETL process in both toolsThe following code saves the list of the CPU times used to run the factorial function with data of different sizes as shown in Fig.14.

Fig. 14 .
Fig. 14.Sage code to save lists of the CPU times used to run the factorial function with data of different sizes B. Reporting Measurements With Matlab the measured run time and efficiency of Reporting processes in both BI tools mentioned previously is shown in the flow chart in Fig. 15.

Fig. 15 .
Fig. 15.The Reporting measurement process using the Matlab computer algebra system for all databases

Fig. 16 .
Fig. 16.Code to measure the reporting process In this case, Query Performance Counter acts as a clock () and Query Performance Frequency as CLOCKS_PER_SEC.That is the first function that gives the counter value and the second frequency (in cycles per second, hertz).It is clear that an LARGE_INTEGER is a way to represent a 64-bit integer by a union.

Fig. 17 .
Fig. 17.CPU Time of the ETL process with the times it took per tool in processing data in the different databases

Fig. 18 .
Fig. 18.CPU Time of the Reporting process with the times took per tool in the process data in different databases

TABLE I .
DESCRIPTION OF THE SIX EXCEL DATABASES INCLUDING THEIR NUMBER OF ATTRIBUTES, INSTANCES AND SIZES

TABLE II .
RESULTS OF THE CPU TIME OF THE ETL PROCESS WITH THE TIMES TOOK PER TOOL AND THE INCREMENT IN THE PROCESS DATA IN THE DIFFERENT DATABASES

Table 3 )
, we can detect and see that the result of the Pentaho Reporting process measured by Matlab was: 3.75 min; 5.35 min; 8.47 min; 12.03 min; 17.07 min and 22.60 min.Conversely, the reporting process for Jaspersoft were 3 min; 4.02 min; 6.05 min; 8.13 min; 11.16 min and 14.15 min, processing 0.009 Mb from DB1; 0.017 Mb from DB2; 0.134 Mb from DB3; 1.321 Mb from DB4; 35.278 Mb from DB5 and 144.195Mb from DB6, respectively for both tools.The results of the CPU time of the Reporting process are shown in Table3and it presents the time it took per tool in different databases.

TABLE III .
RESULTS OF THE CPU TIME OF THE REPORTING PROCESS WITH THE TIMES TOOK PER TOOL AND THE INCREMENT IN THE PROCESS DATA IN THE DIFFERENT DATABASES