Prediction of Tourist Visit in Taman Negara Pahang, Malaysia using Regression Models

Tourism is among the significant sources of income for Malaysia, and Taman Negara Pahang is one of Malaysia's tourism spots and a national heritage site that contributes to achieving the Sustainable Development Goals (SDG). It has attracted many international and local tourists for its richness in flora and fauna. Currently, the information on tourists' visits is not properly analyzed. This study integrates internal and public information to analyze the visits. Three regression models, namely multiple linear regression, support vector regression, and decision tree regression, were used to predict the tourism demand for Taman Negara, Malaysia, and the best model was deployed. Predictive analytics can support the decision-making process for tourism destination management. When the management gets a heads-up on future demand, they can plan strategically and be more aware of the factors influencing tourism demand, such as tourists' web search behavior for accommodation, facilities, and attractions. Determining the factors affecting tourism demand was the first objective. The total number of visitors was set as the dependent (target) variable in the modeling process. A total of 30 models were generated by tuning the cross-validation parameters. This study concluded that the best model is multiple linear regression, owing to its lower root mean square error (RMSE) value.
Keywords: Regression models; SDG; Taman Negara Pahang; tourist analytics


I. INTRODUCTION
Attractions of tourism destinations produce economic value because they affect the number of tourist arrivals in Malaysia. Today, the tourism industry is one of the largest service sectors and has become one of the highest revenue industries after automobiles and oil (REF). This significant growth is a result of the efforts undertaken by the Ministry of Tourism, where the planning and execution of policy outlined by the government spearhead the success of the tourism industry. It is the long-term aim of the government to make Malaysia one of the most popular tourism destinations. The success of the tourism industry is defined by demand and supply, which can be measured by tourist arrivals and receipts [1].
One of the famous tourism destinations is Taman Negara Malaysia, also known as Taman Negara National Park. This natural park protects diverse flora and fauna and is renowned for its nature trails and adventure activities, making it a valuable tourism resource [2]. Encompassing an area of 4,343 km², Taman Negara National Park straddles three states of Malaysia as Taman Negara Pahang, Taman Negara Kelantan, and Taman Negara Terengganu, of which Taman Negara Pahang takes up around 57% of the total national park area [3]. Taman Negara Pahang has two main entrances, namely Kuala Tahan and Sungai Relau. Taman Negara offers many activities, such as jungle trekking, hiking, and fishing. Panoramic scenery and captivating places such as waterfall cascades and the canopy walkway attract many people to visit the Taman Negara National Park. Therefore, the service industry needs to be concerned about visitor management at the tourism destination. In 2013, a study examined the issues and challenges of ecotourism in Taman Negara National Park [4]. One of the issues found was the lack of visitor management, especially regarding overcrowding and excess visitors during certain periods. Moreover, a news report [5] pointed out that statistics show a slightly higher number of visitors to Taman Negara from January to February. The lack of proper service management at popular places in the park leads to overcrowding, mainly due to the inability to optimize staff workload and working hours. In addition, overcrowding leads to a loss of authenticity and poses a significant risk to the destination's future attractiveness, especially for vulnerable destinations such as the Taman Negara National Park.
The attractiveness of Taman Negara Malaysia as a tourism destination has been studied by Universiti Putra Malaysia [6]. The study identified a total of thirteen attractions in Taman Negara, namely oral history, local culture and lifestyle, flora, fauna, building architecture, nature trails, shopping opportunity, canopy walkway, caves, stream, fishing, mountain, and adventure activities [6]. The attractions of tourism destinations produce economic value because they also affect the number of tourist arrivals in Malaysia. These attractions have been characterized as demand structures. Thus, demand studies are needed to support decision making in tourism destination management. Therefore, tourism demand forecasting is vital and will greatly benefit the nation's tourism industry. The study of regression techniques helps to forecast future and seasonal demand for tourism growth, management, and planning purposes. Regression analysis is a collection of statistical methods for estimating relationships between a dependent variable and one or more independent variables, which can be used to determine the strength of a relationship between variables and to predict how they will interact in the future [7]. This study was conducted to determine the factors affecting tourism demand, to compare regression techniques, and to develop an analytical dashboard based on the best regression model that helps in forecasting tourism demand components, incorporating the applicable criteria in the following sub-sections.
*Corresponding Author. www.ijacsa.thesai.org

A. Factors Affecting Tourism Visit
In demand studies, the factors affecting tourism demand are treated as the independent variables that might affect the target variable, which is the total number of visitors. The comparison study was made to find the best model for predicting the target variable. The tourism industry has been a major contributor to the economy, although many factors affect tourism demand in Malaysia. Several studies have shown that economic variables play an important role. Based on [8] and [9], the key economic factors are the exchange rate, income, consumer price index (CPI), and population of the country. The studies found a strong relationship between these key economic factors and the volume of tourists. However, they also showed a negative correlation between the exchange rate and the number of tourists: the higher the exchange rate, the lower the volume of tourist arrivals. The exchange rate is related to the depreciation of the Malaysian Ringgit (RM), which affects the cost of living in Malaysia.
While many researchers used economic factors as contributors to demand, some researchers [10][11][12] observed tourists' web search behavior using Google Trends data. Specific keywords were identified for web scraping related to tourism activities, such as "skiing", "skiing in sweden", "sweden skiing", or "ski sweden" [10]. Meanwhile, [12] made use of these keywords: "destination", "destination + guide", "destination + travel guides", "destination + tickets", "destination + weather", depending on the tourism destination, strategy, ticket price, scenic spots, weather, and accommodation, among many other factors. In the end, 50 initial keywords related to the decision-making process were selected [11]. Due to the time gaps between web search activity and tourist arrivals, this new approach is truly relevant. Web-based data sources, such as search engine traffic, often have a natural relationship with tourism demand. For instance, potential tourists with a strong interest in certain tourism destinations browse websites extensively before visiting these destinations.
In another study, climate change was examined as an additional factor related to tourism demand. Research by [13] examined the dimensions of climate change in Malaysia based on tourists' perception. Generally, Malaysia has an equatorial climate. Extreme weather and seasonality are commonly associated with climate change in Malaysia. Temperature, rainfall, and, to a smaller extent, wind are all examples of extreme weather factors. Seasonality, on the other hand, is always linked to the dry and wet (monsoon) seasons. The main variables for this factor are the average temperature and the average precipitation. The weather in Malaysia is hot and humid all year, with the average daily temperature ranging from 21°C to 32°C. Precipitation is the measure of water falling from the sky, that is, rainfall. The findings of the study revealed that tourists had sufficient knowledge of climate change, which influenced their travel decisions [13].

B. Prediction in Tourist Domain
Many past studies have been conducted to predict tourism demand. A comparative study by [14] forecast tourism demand in Turkey using data mining techniques based on regression modelling. The techniques used include multiple linear regression (MLR), multilayer perceptron (MLP) regression, and support vector regression (SVR). The author in [14] used monthly data points, unlike previous studies, which usually use yearly or quarterly data. The author in [14] chose these regression models because of their nonlinear mapping capabilities. In addition, conventional methods such as these are efficient for data that are likely to exhibit trends, seasonality, and cyclicality. SVR originates from the support-vector machine (SVM), a machine learning model that can also be applied to regression tasks, and has been suggested for forecasting tourism demand. Unlike most conventional neural network models, SVR applies the theory of structural risk minimization, which builds on the idea of empirical risk minimization, to minimize the upper bound of the generalization error instead of minimizing the training error [15][16][17].
A study on tourist visits in the province of West Sumatra was done using MLR and an Artificial Neural Network (ANN) [25] with inflation rates and Rupiah exchange rates as inputs. The results show an accuracy of 96 to 99 percent. Other researchers from Indonesia used seven independent variables describing the characteristics of foreign tourists (sex, age, occupation/profession, length of stay, nationality, purpose of visit, and accommodation) to identify their effect on total expenditure [26]. They also found that American and European tourists contributed the largest average total expenditure for vacation purposes. A regression tree was used by [27] to segment tourists' length of stay in Barbados, using the socio-demographic profile of the tourist, trip-related characteristics, distance, and economic conditions in the source country. Another study modelled the revenue from international tourism using the foreign trade balance of the country and showed a positive correlation in the case of Azerbaijan [28]. The result also showed that tourism will increase the country's foreign trade turnover. More advanced methods were explored in the tourism domain in China [29], with a social evaluation index as the attributes and a hybrid of back propagation and fuzzy methods as the model.
Another angle on predicting tourism trends is to look at the flow of tourists' movement in a specified area, as studied by [30][31]. User-generated content helped narrow down specific criteria for forecasting tourism demand, apart from projecting possible points of interest for tourists [30]. By creating trajectory graphs from past data, the study by [30] yielded better results than traditional machine-learning-based algorithms for forecasting the next tourist movement, which is useful for predicting tourism demand in certain areas. The study by [31], in contrast, focuses more on statistical techniques: it applied a statistical method with a BPNN model (SMBPNN) to variously collected data such as historical tourism flow, weather, and temperature. Their hybrid model, which combines a statistical technique with a neural network, showed improved forecasts compared to standalone neural network models [31].
The methodologies used by other researchers suggest using various techniques while incorporating a multidimensional dataset. Hence, for this study, an integrated dataset was created from various sources that will be described later in this paper. The suitability of the techniques for the available real-world dataset was also considered; hence, this study focuses on the application of regression techniques.
In the rest of the paper, we first provide the necessary background on the proposed modelling methods in Section II, and the data sources and evaluation are discussed in Section III. The experimental results are presented and discussed in Section IV. Finally, the conclusion and future work are given in Section V.

II. MODELLING METHODS FOR REGRESSION
Regression analysis is used for predicting real values, for instance, to forecast the daily sales of a business, which makes the number of sales the target or dependent variable. To determine the relationship between the variables of interest, a regression model is trained on the collected data. To determine the optimal model, goodness-of-fit measurements, such as the square of the correlation coefficient (r² or squared correlation), were used to examine the scatter of data points around the fitted values. This number denotes the percentage of variance in one variable that can be explained by the other. The higher the r² score, the more precise the prediction. However, the number does not tell how precise the forecasts are in the dependent variable's units.

A. Multiple Linear Regression
Multiple Linear Regression (MLR) aims to model the linear relationship between several independent (explanatory) variables and a dependent (response) variable, whereas simple linear regression has only a single input. The MLR model helps to predict an outcome based on multiple explanatory variables [18]. Multiple linear regression is represented as follows:
$$Y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n \tag{1}$$
where $Y$ is the dependent variable, $x_1, \dots, x_n$ are the independent variables, $a_1, \dots, a_n$ are the regression coefficients, and $a_0$ is the y-intercept (a constant term).
In multiple linear regression, the slope and regression coefficients are estimated so that the relationship can be expressed as a mathematical equation. Using the regression coefficients, the strength and direction of the relationship between each independent variable and the dependent variable can be calculated [19]. Hence, compared with simple linear regression, MLR can perform better with a lower error rate when several factors influence the outcome. Moreover, multiple regression can be implemented in linear and non-linear modelling. Multiple regression is based on the assumption that there is a linear relationship between the dependent and independent variables and that there is no major correlation among the independent variables [18][19][20].
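As an illustration of how the intercept a_0 and the coefficients a_1, …, a_n can be estimated, the following is a minimal sketch of an ordinary least-squares fit through the normal equations in plain Python. It is not the RapidMiner procedure used in the study; the function names `fit_mlr` and `predict` and the toy data are assumptions made for the example.

```python
def fit_mlr(rows, targets):
    """Least-squares fit of Y = a0 + a1*x1 + ... + an*xn via the normal equations."""
    # Prepend a constant 1.0 so that the first coefficient is the intercept a0.
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Build the augmented system [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coeffs, row):
    """Evaluate a0 + a1*x1 + ... + an*xn for one observation."""
    return coeffs[0] + sum(a * x for a, x in zip(coeffs[1:], row))


# Toy data generated from Y = 2 + 3*x1 - 1*x2 (illustrative only, not study data).
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
targets = [2 + 3 * x1 - x2 for x1, x2 in rows]
coeffs = fit_mlr(rows, targets)
```

Because the toy targets are noise-free, the fit recovers the generating coefficients; the singular-system check reflects the assumption that the independent variables are not strongly correlated with each other.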

B. Support Vector Regression
Support-vector regression (SVR) derives from a machine learning model, namely the support-vector machine (SVM). SVM can work for regression tasks and has been suggested for forecasting tourism demand. SVM is a class of linear algorithms that can be used for classification, regression, density estimation, novelty detection, and other applications. In classification, the main purpose of the algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. Many possible hyperplanes may separate two classes of data points. The hyperplane equation is given in (2) and illustrated in Fig. 1:
$$w \cdot x + b = 0 \tag{2}$$
where $w$ is the weight vector, $x$ is the input vector, and $b$ is the bias. SVM searches for the hyperplane with the largest margin in separating the circle objects and square objects with minimum classification error.
An optimal network structure can be achieved by SVR on the basis of the theory of structural risk minimization, in which the margin of the hyperplane is maximized [15,21]. The main difference between the Least Squares Support Vector Machine (LSSVM) and SVM is that LSSVM uses equality constraints instead of inequality constraints and is based on a least squares cost function [22]. SVR has been successfully implemented to solve forecasting problems in many fields, such as financial time series (stock index and exchange rate) forecasting, engineering and software (production values and reliability) forecasting, atmospheric science forecasting, electric load forecasting, and commodity demand forecasting [16].
The SVR model has also been successfully applied to forecast tourist arrivals. Empirical research has shown that the choice of parameters in an SVR model significantly influences forecasting accuracy. SVR solves the problems of estimation, classification, and nonlinearity via its loss function [17]. Tourism data often exhibit nonlinear characteristics, and SVR is widely used in tourism demand forecasting. Low speed, however, is the key drawback of SVR in the training process [23], due to the various hyperparameter settings in the model. Inappropriate SVR parameters can cause overfitting or underfitting.
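To make the role of the loss function concrete, the sketch below computes the ε-insensitive loss that is standard in SVR formulations: errors smaller than ε are ignored, and larger errors are penalized linearly. The text does not state which loss or parameter values were used in the study, so the function name, the value of `epsilon`, and the numbers are illustrative assumptions.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Per-observation loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]


# Hypothetical visitor counts (not study data) with a tube width of 5 visitors.
actual = [100.0, 150.0, 200.0]
predicted = [104.0, 150.0, 185.0]
losses = epsilon_insensitive_loss(actual, predicted, epsilon=5.0)
```

A wider tube tolerates more deviation and tends toward underfitting, while a very narrow tube tracks the training data closely and can overfit, which is one way the parameter choice influences accuracy.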

C. Decision Tree Regression
The decision tree algorithm used for classification or regression predictive modelling problems is called Classification and Regression Trees (CART). A decision tree is a supervised learning algorithm with a tree-like structure. It consists of root, interior, and leaf nodes, in which the outcomes are represented at the leaf nodes. CART is relatively straightforward for making predictions. The algorithm works through multiple iterations until the tree is able to predict a proper value for the data point. Among the benefits of the CART algorithm are that it is easy to understand, requires less data cleaning, is not affected by non-linearity in the data, and has almost no hyperparameters to tune [21,24]. The drawback is that it may overfit, which can be addressed using the Random Forest algorithm.
The split attribute in the tree is chosen based on the standard deviation for the independent variable and the dependent variable (outcome). The formula for the standard deviation based on an attribute x is shown in equation (3):
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \tag{3}$$
where $S$ is the standard deviation, $x_i$ is the value of the attribute, $\bar{x}$ is the mean value of $x$, and $n$ is the number of records of $x$.
Standard deviation reduction (SDR) is the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is basically a matter of finding the attribute that yields the highest SDR, that is, the attribute whose split produces the most homogeneous branches according to the calculation in equation (3). In other words, the standard deviation of the target is compared with the standard deviation of the target after splitting on each independent variable in the dataset.
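The split selection described above can be sketched as follows. This is a minimal illustration that assumes categorical attributes and the population standard deviation (dividing by n); numeric attributes would additionally need a threshold search. The function name `sdr`, the attribute names, and the records are hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def sdr(records, attribute, target):
    """Standard deviation reduction obtained by splitting on one attribute."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attribute]].append(rec[target])
    total = len(records)
    before = pstdev([rec[target] for rec in records])
    # Weighted standard deviation of the target across the resulting branches.
    after = sum(len(vals) / total * pstdev(vals) for vals in groups.values())
    return before - after


# Hypothetical records, not taken from the study's dataset.
records = [
    {"season": "dry", "holiday": "yes", "visitors": 900},
    {"season": "dry", "holiday": "no", "visitors": 1000},
    {"season": "wet", "holiday": "yes", "visitors": 400},
    {"season": "wet", "holiday": "no", "visitors": 500},
]
best_attribute = max(["season", "holiday"], key=lambda a: sdr(records, a, "visitors"))
```

In this toy data, splitting on `season` leaves branches with much lower spread than splitting on `holiday`, so it would be chosen as the split attribute.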

III. DATA SOURCE AND EVALUATION

A. Data Preparation
The data were obtained from different sources that are open to both public and private use. This study used monthly data points with an observation period from 2012 until 2018. The first dataset was collected from Google Trends, a search trends feature that displays the frequency with which a certain search term is entered into Google's search engine relative to the site's total search volume over time. Six keywords or search terms were used for this study: 1) "Taman Negara", 2) "taman negara accommodation", 3) "taman negara resort", 4) "taman negara canopy walkway", 5) "Cuti Cuti Malaysia", and 6) "visit Malaysia".
The second dataset was retrieved from the Federal Reserve Bank of St. Louis (FRED), which records the monthly currency exchange rate in units of Malaysian Ringgit to one U.S. Dollar. The third dataset was extracted from the Visual Crossing website, which provides historical weather public data. To generate the worldwide weather observation database, the website processes millions of hourly weather observations from thousands of observation stations. However, the dataset extracted from the website is in weekly frequency. Hence, the monthly averages of the attributes were calculated using the average formula in Microsoft Excel. The attributes extracted from the source are Location, Date and Time, Maximum Temperature (degC), Minimum Temperature (degC), Temperature (degC), Heat Index (degC), Chance Precipitation (%), Precipitation (mm), Wind Speed (kph), Wind Direction, Wind Gust (kph), Visibility (km), Cloud Cover, Relative Humidity, and Conditions. However, for this study, only two attributes were selected: 1) Temperature (degC); 2) Precipitation (mm).
The last and most important dataset is the total visitor arrivals to Taman Negara Pahang, Malaysia. This public dataset was found in the Malaysia Open Data Portal at data.gov.my. However, the data are on a yearly basis and only cover the years 2012 until 2018. Thus, additional data obtained from Jabatan Perlindungan Hidupan Liar dan Taman Negara (PERHILITAN) Pahang were used to identify the monthly number of visitors to Taman Negara Pahang. Since the data collected for this research were large and obtained from different sources, and were therefore unstructured, a data transformation process was needed. Data preprocessing is the process of converting the various raw datasets into one dataset suitable for use in software such as RapidMiner. For this study, the regression modelling was done with the data analytics software tool RapidMiner. Fig. 2 shows the conceptual framework for the regression modelling, which illustrates the sources of data, the independent variables, and the dependent variable. The variables for the observation period (2012-2018) were Date, Currency Exchange, Visit Malaysia, Cuti Cuti Malaysia, Taman Negara, Taman Negara accommodation, Taman Negara Resort, Taman Negara canopy walkway, Average Temperature, Average Precipitation, and Total of visitors.
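The weekly-to-monthly averaging step was done in Microsoft Excel in this study; an equivalent aggregation could be sketched in Python as below. The function name `monthly_average` and the sample readings are hypothetical and only illustrate the grouping by calendar month.

```python
from collections import defaultdict
from datetime import date


def monthly_average(weekly_records):
    """Average weekly (date, value) readings within each (year, month)."""
    buckets = defaultdict(list)
    for day, value in weekly_records:
        buckets[(day.year, day.month)].append(value)
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}


# Hypothetical weekly temperature readings in degC (not study data).
weekly_temperature = [
    (date(2012, 1, 1), 26.0),
    (date(2012, 1, 8), 27.0),
    (date(2012, 1, 15), 25.0),
    (date(2012, 1, 22), 26.0),
    (date(2012, 2, 5), 28.0),
    (date(2012, 2, 12), 30.0),
]
monthly = monthly_average(weekly_temperature)
```

The same function could be applied to the precipitation column, giving one average temperature and one average precipitation value per month to align with the other monthly series.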

B. Model Evaluation
For this study, a comparison was made between three predictive models, namely Multiple Linear Regression (MLR), Support-Vector Regression (SVR), and Decision Tree Regression (DTR), as shown in Fig. 3. The main tasks conducted were data preparation from four data sources, data preprocessing, training and testing data split, modeling with three algorithms, and finally evaluation and deployment. The model development in this study used RapidMiner's default values. The models' performances were compared by varying the sampling type and the number of folds in cross-validation.
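The two cross-validation settings that were varied can be sketched as follows. The sketch assumes that linear sampling keeps the rows in their original order and forms consecutive folds, while shuffled sampling assigns rows to folds at random, which is how these options are commonly described for RapidMiner; the function names and the seed are illustrative.

```python
import random


def k_fold_indices(n_rows, k, sampling="linear", seed=0):
    """Partition row indices 0..n_rows-1 into k folds of near-equal size."""
    indices = list(range(n_rows))
    if sampling == "shuffled":
        random.Random(seed).shuffle(indices)
    elif sampling != "linear":
        raise ValueError("sampling must be 'linear' or 'shuffled'")
    base, extra = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 84 monthly rows with k = 8 folds, matching the dataset size reported in the paper.
folds = k_fold_indices(84, 8, sampling="linear")
splits = list(train_test_splits(folds))
```

With time-ordered monthly data, linear sampling tests each model on a contiguous block of months, whereas shuffled sampling mixes months from all years into every fold.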
The RMSE (Root Mean Squared Error) is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data, that is, how close the observed data points are to the model's predicted values. While squared_correlation is a relative measure of fit, RMSE is an absolute measure of fit. Thus, the lower the value of RMSE, the better the model. The formula for RMSE is given in (4):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_{\mathrm{actual},i} - x_{\mathrm{model},i}\right)^2} \tag{4}$$
where $x_{\mathrm{actual}}$ is an observed value, $x_{\mathrm{model}}$ is the predicted value, and $n$ is the number of observations. Another useful measure for determining the optimal model is a goodness-of-fit measurement, such as the square of the correlation coefficient (r² or squared_correlation). This measure is used to examine the scatter of data points around the fitted values. The number denotes the percentage of variance in one variable that can be explained by the other. The higher the r² score, the more precise the prediction. It does not, however, tell how precise the forecasts are in the dependent variable's units.
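The two evaluation measures can be computed as in the following minimal sketch. The function names and the numbers are illustrative assumptions; the second prediction set is included to show why a high squared correlation alone does not guarantee accurate forecasts in the target's own units.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def squared_correlation(actual, predicted):
    """Square of the Pearson correlation between observed and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_a * var_p)


# Hypothetical visitor counts (not study data).
actual = [100.0, 200.0, 300.0, 400.0]
close = [110.0, 190.0, 310.0, 390.0]    # small errors in visitor units
doubled = [2 * a for a in actual]        # perfectly correlated, but far off

close_rmse, close_r2 = rmse(actual, close), squared_correlation(actual, close)
doubled_rmse, doubled_r2 = rmse(actual, doubled), squared_correlation(actual, doubled)
```

The doubled predictions have a squared correlation of 1 yet a much larger RMSE than the close predictions, which is why both measures are reported together.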

IV. RESULT AND DISCUSSION
This section presents the results and provides a discussion based on the outlined framework given in Fig. 2 and Fig. 3.

A. Data Description
The final dataset was constructed based on the following data sources:
• The Google Trends dataset, shown in Fig. 4, recorded 504 data points generated from the monthly frequency from 2012 until 2018 for the six related keywords mentioned in Section III (12 months/year × 7 years × 6 keywords = 504 data points).
• The third dataset was extracted from the Visual Crossing website, which provides historical weather public data. To generate the worldwide weather observation database, the website processes millions of hourly weather observations from thousands of observation stations. However, the dataset extracted from the website is in weekly frequency.
• The fourth dataset consists of the number of visitors to Taman Negara National Park, including nearby places and Gunung Tahan, obtained yearly from the Malaysia Open Data Portal and monthly, from January to December, from PERHILITAN Pahang, as in Table I. The highest number of visitors was recorded in March.
Based on the gathered information, the final dataset consists of 11 variables or attributes with 84 rows of monthly data points for the observation period (2012-2018). The variables/columns are observation date (month), Currency exchange, the six keywords (Visit Malaysia, Cuti Cuti Malaysia, Taman Negara, Taman Negara accommodation, Taman Negara Resort, and Taman Negara canopy walkway), Average Temperature, Average Precipitation, and Total Visitors. Fig. 6 shows a snippet of 10 rows from 2012 as a sample of the real dataset used in the study, while Fig. 7 to 10 show statistical measures of selected attributes. In Fig. 6, the currency exchange is represented as a real value, while the six keywords are represented as the count of the words mentioned every month. The count for "Taman Negara accommodation" was found to be higher than for the other words. The total word count in the dataset is shown in Fig. 10, with "Visit Malaysia" having the highest count and "Taman Negara canopy walkway" the lowest.
In Fig. 8, the average temperature and average precipitation are presented from 2012 until 2018, based on information gathered from Visual Crossing. The average temperature is between 25 and 30 degrees Celsius. The precipitation shows its highest value in 2015, and low values appear in every year due to the dry season.
A total of 30 experiments were performed by tuning five different numbers of folds with two types of sampling in three predictive models (5×2×3 = 30). The comparison table shows the overall performance recorded during the experiments for the modelling.
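The experiment count can be reproduced by enumerating the configuration grid, as in the sketch below. The text names k = 4, 8, and 9 among the fold counts; the other two values in `fold_counts` are placeholders, since the full list of five is not given here.

```python
from itertools import product

models = ["MLR", "SVR", "DTR"]
sampling_types = ["linear", "shuffled"]
# 4, 8, and 9 appear in the reported results; 6 and 10 are placeholder assumptions.
fold_counts = [4, 6, 8, 9, 10]

# One experiment per (number of folds, sampling type, model) combination.
experiments = list(product(fold_counts, sampling_types, models))
n_experiments = len(experiments)
```

Each tuple in `experiments` corresponds to one cross-validated run whose RMSE and r² would be recorded in the comparison table.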

B. Multiple Linear Regression Results
The best model, multiple linear regression, yielded an RMSE of 2545.977 and an r² of 0.276, which is the highest value with linear sampling among four models, as shown in Table II.

C. Decision Tree Regression Results and Rules
Modeling of the decision tree resulted in a set of rules represented in a tree-like structure. Each node corresponds to a splitting rule for a single attribute. Fig. 11 shows the extracted tree and the conditions for each predicted total of tourist visits. Table III shows the RMSE and r² results for the decision tree regression, in which the RMSE is higher than that of multiple linear regression. As can be seen, average temperature and the counts of the keywords appear to be among the important variables. Based on the lowest RMSE, the best decision tree is obtained with linear sampling and k = 4, at 3278.578, though its r² is the lowest among the experiments.

D. Support Vector Regression Results
Support Vector Regression produces a result that falls between Multiple Linear Regression and Decision Tree Regression. Table IV displays the best results for linear sampling and shuffled sampling. Based on the lowest RMSE, the best SVR is obtained with linear sampling and k = 9, at 2727.532, and its r² is the second best among the experiments conducted.

E. Best Model Deployment
The best model is selected by the lowest RMSE value and the highest squared correlation between the predicted and the actual values. The experiments showed that the decision tree model performed weakly in this study, as it produced a greater error than the other models under every parameter setting. Between MLR and SVR, the error produced by MLR is almost the lowest, but its squared correlation value is lower than that of the SVR model. Additionally, cross-validation with k = 8 folds and the linear sampling type indicated that the best model is MLR, which outperforms SVR and Decision Tree. It is assumed that the number of folds depends on the number of instances in the dataset and that the sampling type depends on the problem model; in this study, the input is a linear problem of finding the relationship between these variables.
Fig. 12 presents the real (actual) number of visitors and the number predicted using multiple linear regression. The advantage of using multiple linear regression is that, with values given in decimal points, the results can be easily interpreted by the decision maker. The graph also clearly visualizes the gap between the actual and predicted number of visitors. Subsequently, Fig. 13 shows the predicted total number of visitors until 2030 with some simulated values for each variable in the best MLR model. In future, this research can be extended by providing the estimated values for each relevant variable, and the predicted total number of visitors can then be provided to the tourism management.
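The RMSE-based part of the selection can be expressed as a short ranking over the best RMSE reported for each model. The sketch assumes that the 3278.578 figure reported in the decision tree subsection belongs to the decision tree model; the dictionary and variable names are illustrative.

```python
# Best RMSE per model as reported in the results subsections.
best_rmse = {"MLR": 2545.977, "SVR": 2727.532, "DTR": 3278.578}

# Rank models from lowest to highest error; the first entry is the selected model.
ranking = sorted(best_rmse, key=best_rmse.get)
best_model = ranking[0]
```

This considers RMSE only; the squared correlation would be compared in the same way, with higher values preferred.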

V. CONCLUSION
This paper presented the implementation of regression models, namely Multiple Linear Regression (MLR), Support Vector Regression (SVR), and Decision Tree Regression (DTR). A set of variables was constructed based on the selected keywords, the currency exchange rate, and the weather variables for predicting the total number of visitors. The experiments indicated that these regression algorithms were able to predict the total number of visitors to Taman Negara National Park. The results after parameter tuning demonstrated improved model accuracy, since tuning controls model complexity, which indirectly prevents overfitting. In this study, the linear problem (input) was to find the relationship between the factors affecting demand and the total number of visitors to Taman Negara National Park. The Multiple Linear Regression model with the linear sampling type and 8-fold cross-validation appeared to be the best model. The experiments showed that the best parameter setting depended on the instances of the dataset itself. Some suggestions for future work to improve the quality of the study were identified. Firstly, advanced visualization tools could be applied to bring real-time data to the dashboard. Next, more data ought to be collected to improve the performance of the predictive models, such as different keywords and other related campaigns. Lastly, hybrid machine learning and optimization algorithms can be considered to optimize the parameter tuning for better accuracy. The developed model is useful to the tourism management for predicting the number of visitors to Taman Negara National Park, Malaysia. The tourism management, as the user, can improve their operations by making strategic decisions based on the predicted outcomes.
If the tourism destination can operate more smoothly, visitors can benefit from a more meaningful experience when visiting the national park. Other studies can also be performed, such as on the length of stay and recommended activities based on the tourists' profiles.