Machine Learning based Forecasting Systems for Worldwide International Tourists Arrival

International tourist movement has grown rapidly in recent decades, and travelers are a significant source of income for the tourism economy. When tourists visit a place, they spend considerable money on entertainment, travel, and hotel accommodation. In this research, tourist data from 2010 to 2020 were extracted and analyzed in depth along different dimensions to identify valuable features. The research applies machine learning regression techniques, namely Support Vector Regression (SVR) and Random Forest Regression (RFR), to forecast worldwide international tourist arrivals; the forecasting accuracy achieved is 99.4% with SVR and 84.7% with RFR. The study also analyzes the forecasting deadlock that followed Covid-19, when international visitor numbers dropped suddenly because of lockdowns enforced by all countries.
Keywords: Tourists; forecasting; machine learning; Covid-19


I. INTRODUCTION
The tourism industry plays a significant role in economic development, and several countries focus on building the best possible policies for international travelers. Tourism contributes to multi-dimensional economic growth [1]. Economies across the globe rely on tourism in multiple business sectors to create employment opportunities, improve infrastructure, and foster cultural interchange between visitors and residents. Tourism can reap more benefits through a multi-stakeholder engagement approach [2]. Tourists rely on local transportation, accommodation, food and beverage, and entertainment, and, very importantly, visitors may want to buy things that are not available where they live. Such transactions help mobilize the local economy. According to a World Tourism Organization (UNWTO) study in 2020, the percentage of people who travel for enjoyment, on family or solo trips, increased from 50% in 2000 to 55% in 2019 [3]. The revenue generated from international tourist arrivals can help a country's economy and contribute significantly to the balance of payments and to struggling areas such as unemployment, transportation, and healthcare [4].
One of the main motives for tourists to travel is to visit a new place and escape the monotony of routine life. A solo or family trip helps ease stress and offers a new environment for a happier and healthier experience. The host country aims to provide the best possible facilities to tourists, even during periods of high demand. A forecasting system can help host countries prepare for tourist requirements well in advance. Forecasting is a technique for creating accurate, optimized predictions [5] based on previous data. Fig. 1 shows the process of forecasting systems. Many business stakeholders adopt forecasting for various purposes, including projecting future costs and quantities or planning the budget. The major problem researchers face in developing a forecasting system is collecting the actual data. There are two sources of data. The first, primary sources, contain firsthand information gathered directly by the organization. Such data is generally collected through surveys, focus groups, or interviews, and these direct methods make the data more reliable and accurate for building the systems. The second, secondary sources, are data that have already been collected and processed by a third party. Receiving data in a well-organized and compiled format speeds up the forecasting process. A tourism forecasting system helps the administration plan and arrange essential things for tourists. When infrastructure, the economy, and politics change rapidly, forecasting systems help to get things done ahead of deadlines. Government organizations and associated stakeholders involved in tourism planning require highly accurate forecasting systems. With the help of a forecasting system, they can adopt the required changes in a much better and faster way. When highly accurate forecasting systems are not available, these organizations face difficulties [6]. Put simply, the aim is to minimize the possibility that a decision fails to attain its intended goals. Hence, an accurate prediction is essential to the government.
Machine learning methods have attracted significant attention in tourism research [7] because they give better results than traditional approaches. Some machine learning methods, such as Neural Networks (NN) and SVR, play a major role in time series forecasting.
Most of the techniques applied in prediction and tourism modelling fall into four categories: time series models, econometric models, Artificial Intelligence (AI) techniques, and qualitative methods. AI techniques have been applied across several domains and to a variety of data structures.
In the current study, we have collected ten years of prior visitor history data to develop forecasting systems. Many countries can use the proposed methods to forecast tourist arrivals and arrange the required facilities. The forecasts can inform tourism policy by predicting tourism revenue. They can also assist the government in creating temporary job opportunities in the tourism sector for a particular period of the year, which promotes seasonal work in tourism and assists those whose livelihoods depend on it. Most of the work in tourism has focused on domestic tourism forecasting [8]-[10]. Because data are scarce, developing proper tourist forecasting systems remains challenging. This study develops a worldwide tourist forecasting system by applying machine learning techniques such as SVR and RFR. The machine learning methods [11], [12] have been tested with different kinds of feature selection techniques and a clear identification of attributes before the data are fed into the model. The developed tourist forecasting systems will help to analyze the flow of tourists internationally and in the host country. The system will also help to identify transport traffic and plan the number of flights between two countries, arrange or extend local transport systems, and estimate the required number of hotel rooms. The COVID-19 pandemic brought sudden travel restrictions across borders. The year 2020 was the worst in tourism history for international tourist arrivals. The SARS-CoV-2 virus has led to a setback in current forecasting systems. In this study, we retrieve the forecasting data and compare the results with the current Covid-19 scenario.

II. LITERATURE SURVEY
A forecasting system for tourism provides direct and indirect benefits to the government, society, people, businesses, services, and economy of a country. Tourism contributes to GDP, employment, visa services, and tourism-related businesses. Given these significant positive impacts, predicting the number of tourist visitors, the time at which they visit, and the duration of their visits provides crucial information to the government. Researchers are keen to develop accurate forecasting systems and to find novel approaches for dealing with datasets of different sizes. The seasonal ARIMA, ν-Support Vector Regression, and Multi-Layer Perceptron (MLP) Neural Network models were applied to monthly tourist arrival data for Turkey, and an approach was proposed for selecting the model for a given time series [11]. Combined techniques have been discussed for predicting tourism demand [18]; the authors combined ACF, NN, and Genetic Algorithms (GA) to perform the classification. A framework based on the Generalized Dynamic Factor Model (GDFM) has been suggested for generating a composite search index [19]. It improved forecast accuracy compared with the traditional time series model and the Principal Component Analysis (PCA) model. Eigen-decomposition was used to reduce the dimensions [20] in time series data prediction. Wang Jun et al. [21] proposed a forecasting model that combines an ANN with a clustering algorithm and compared it with other ANN-based models and the ARIMA model; the proposed model performed better than the other related methods. For multi-source data and passenger flow volume, the authors proposed a new algorithm that merges a non-linear model, a genetic algorithm, and SVR. Karo Solat et al. [22] used elliptically symmetric principal components for predicting exchange rates.
Forecasting data belongs to the regression category; researchers have applied methods such as regression, the Delphi method, moving average models, ARIMA, MLP, GRNN, and the radial basis function (RBF), among others. Shaolong Sun et al. developed a tourist arrival forecasting model. One of the most widely used time series forecasting models is ARIMA; however, it does not perform well with multi-source data [23]. The authors proposed Kernel Extreme Learning Machine (KELM) models to improve forecasting accuracy and carried out a robustness analysis on Baidu Index and Google Index data. According to the authors in [24], ARIMA is the most used time series analysis model for predicting tourist arrivals and has been used extensively in the last few years. Seasonal Autoregressive Integrated Moving Average (SARIMA) with Generalised Autoregressive Conditional Heteroskedasticity (GARCH) was used to forecast tourist arrivals in Taiwan [25]. In [26], the authors used SARIMA to predict the demand for air travel. Hence, all these studies demonstrate that enhanced ARIMA models lead to better predictions.
Accurate tourist forecasts are essential because they provide crucial information to tourism practitioners and academics when making decisions about resource allocation, priority, and risk assessment. Based on an extensive review of the literature [27], prediction methods for tourist arrivals can be categorized into Machine Learning (ML) models and time series analysis techniques. With reference to model building, tourism demand prediction studies depend strongly on the variables that are input to the model [28]. These variables are supposed to be strongly connected to tourism demand, with no missing or incorrect values. Tourism demand prediction components can be defined in several ways using various parameters. They can be classified into indicators and determinants, depending on their relationship with tourism demand. One study establishes ML methods for estimating the number of tourists coming to Turkey. In that work, Linear Regression and NN-MLP are implemented to create multivariate tourism predictions for Turkey. The predictions are compared in terms of Relative Absolute Error (RAE), Root Relative Squared Error (RRSE), and Correlation Coefficient (R), showing that MLP regression produces better performance. Extensively used ML models include Artificial Neural Networks (ANN) and SVR. Authors in [29] used SVR together with the "Fly Optimization Algorithm (FOA)" for predicting tourism arrivals. In [13], a prediction model has been suggested that combines a "Back-Propagation Neural Network (BPNN)" with "Empirical Mode Decomposition (EMD)". This model predicts how many tourists will visit a place. Li et al. [14] enhanced BPNN by incorporating PCA and the DE (ADE) algorithm to predict how many tourists are willing to visit a place in the future.
The outcomes of the work in [13] and [14] revealed that the enhanced BPNN outperformed ARIMA. Authors in [30] created a new structural NN model for forecasting the number of tourists willing to visit a place in the future. The outcomes demonstrated that no model was consistently superior across the situations considered.
Fernandes et al. state that Artificial Intelligence (AI) has played an essential role in achieving outstanding applications in predicting tourist demand in a region. Despite that, many of the AI methods used until now are not deep architectures. They have little ability to capture stronger non-linearities, especially when the data are large-scale and the patterns are vague [31]. The authors have come up with a new deep learning technique, the "Stacked Autoencoder" with "Echo-State Regression" (SAEN), which helps in predicting demand for tourism [32]. SAEN is employed in four different tourism situations, and the prediction outcomes reveal that SAEN is better than the standard models. A big-data-based system for tourism forecasting is proposed in [33]. The authors included leading indicators, such as the price index, which improved the performance of the model. In [34], the authors present the Real-Value Genetic Algorithm (RGA) to select the free parameters of SVR; the resulting method is called GA-SVR. It optimizes all SVR parameters simultaneously from the training data. Afterward, they forecast tourism demand in China. They also carried out a comparison with BPNN and time series models, which showed that SVR has good predictive capability. The paper is organized into eight sections: one presents studies associated with SVR, another summarizes the current methods for choosing hyperparameters, another details the GA-SVR technique, and the remaining sections deal with the analysis of outcomes, the origin of the data, and so on. Various NN models based on cross learning were developed in [34] to predict time series data. The principal components of prediction are the "determinants". Traditional economic ideas, such as "consumption behavior theory" and "utility theory", indicate that factors such as cost, income, and advertising affect the demand for tourist arrivals.
For a complete examination of tourism demand prediction studies, see [1], [16], [29]. All these studies note that the performance of prediction models differs according to various considerations, for example, the frequency of the data, the length of the prediction horizon, the source countries, and the destination. The lack of agreement about the most accurate model for predicting tourism demand is what made us concentrate on data-driven methods based on ML. In [35], the authors note that SVM-based techniques are well suited to the characteristics of tourist data. They compare the predictive accuracy of various ML models with that of ARMA using monthly tourist visits to Hong Kong from 13 countries. They considered the years from 1985 to 2008 and obtained the most accurate results with ML techniques. The need for more accurate predictions has given rise to a greater dependency on ML models to obtain more sophisticated tourist forecasts.
In [29], the authors employed a "Rough Sets Approach" to predict demand for tourism in Hong Kong from the US and UK. Gaussian Process Regression (GPR) has been used for prediction in past years. It is a supervised learning approach, followed by generalized linear regression, to forecast data locally. To bridge the gap, the authors in [13], [19] designed a prediction test to compare GPR with NN and SVR. Their primary objective was to examine the relative improvement in the prediction accuracy of ML techniques over a linear stochastic procedure, employing two alternative approaches. The first is the direct one, which predicts the aggregate series. The second uses the same models to predict the individual series for the regions one by one. Finally, the predictive performance of both approaches was compared. Deep Learning (DL) techniques can also be applied to weather forecasting. DL can be used in various fields such as entertainment and visual recognition, as well as in forecasting tasks such as tourism forecasting and stock price forecasting. The authors have compared the prediction performance of the "Recurrent Neural Network (RNN)", "Conditional Restricted Boltzmann Machines (CRBM)", and "Convolution Neural Network (CNN)" [15]. Authors in [13] give an introduction to the principles of ANN and provide a stage-by-stage guide to the methods they applied for building an NN to predict tourist arrivals. They include many rules and some points of discussion on applying ANN effectively. A comparison of various forecasting methods is shown in Table I.

III. METHODOLOGY
The tourist forecasting system proposed in this research is shown in Fig. 2. To predict international tourist arrivals, the adopted steps are data collection from globally trusted sources, followed by data analysis, data processing, and the creation of a machine learning model. The machine learning techniques include SVR and RFR.

1) Data Collection and Analysis:
This research draws on historical data to tackle the forecasting challenges and develop the predictive model. A substantial amount of data gathered by governments or other public entities is made available. These datasets are referred to as public data since no specific authorization is required to use them. The data is gathered from reliable online sources and the official tourist websites of countries. The dataset contains tourist arrivals for most countries between 2010 and 2020. Since data for nearly 13 countries are not available on the internet, those countries are not included in the proposed forecasting system. In addition, the tourism industry suffered greatly from Covid-19 in terms of visitor arrivals. As a result, data for 2020 is not available for most countries. Whatever data could be found for 2020 has been used for the Covid-19 analysis with respect to international tourist arrivals. The year-versus-average analysis explains the distribution of the arrival data as well as the year-by-year association of annual arrivals. In most cases the year-wise data matches the average data, and there are not many changes in tourist arrivals, as depicted in Fig. 6. The scatter plot depicts the distribution of the annual data points, and it can be seen in Fig. 7(a) and 7(b) that the data do not align evenly in the years 2011 and 2012.
The correlation matrix, shown in Fig. 8, gives the relationship between each pair of variables. If two variables are highly correlated, the value is close to 1. In the table, the data range from the year 2010 to 2019, plus the average.
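As an illustration of how such a matrix can be computed, the following minimal sketch uses the Pearson coefficient on a few year columns. The function names `pearson` and `correlation_matrix` and the arrival figures are hypothetical; they are not taken from the study's dataset, and the paper does not state which correlation coefficient was used.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equally long columns."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("columns must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations, returned as a nested dict."""
    return {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}


# Hypothetical arrivals (millions) for four countries; one list per year column.
columns = {
    "2010": [10.0, 20.0, 30.0, 40.0],
    "2011": [11.0, 19.0, 33.0, 41.0],
    "average": [10.5, 19.5, 31.5, 40.5],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

Values near 1 on the off-diagonal indicate that countries with high arrivals in one year also tend to have high arrivals in the other.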

A. Data Preprocessing
The significance of data preprocessing must be understood before moving on to developing a forecasting system, because it has the potential to make or ruin the forecast. A self-lag differencing method has been used to preprocess the data, where the previous 3 years are used for training and the 4th year for forecasting. Table II shows the annual data originally collected for 10 countries. This raw data cannot be fed directly to the forecasting model, so preprocessing steps are applied first. The previous day's, month's, and year's data are very important for making a prediction. In other words, the value at time t-1 has a significant impact on the value at time t. Lags are the past values; therefore t-1 is lag 1, t-2 is lag 2, and so on. Lag-feature-based data preparation techniques have been used, and the result of this process is shown in Table III. The variations in the top 5 arrival countries are shown in Fig. 9: the United States is at the top and Spain is in second place, and their tourist arrivals are growing year by year. France shows variations every year, with arrivals rising and falling from year to year.
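The lag-feature idea described above can be sketched as follows. This is a minimal illustration, assuming a window of three lags as stated in the text; the helper name `make_lag_features` and the arrival figures are hypothetical and do not come from Table II or Table III.

```python
from typing import List, Tuple


def make_lag_features(series: List[float], n_lags: int = 3) -> Tuple[List[List[float]], List[float]]:
    """Build (X, y) from a yearly series.

    Each row of X holds the previous `n_lags` values (oldest first),
    and the matching entry of y is the value that follows them.
    """
    if len(series) <= n_lags:
        raise ValueError("series must be longer than n_lags")
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])  # values at t-3, t-2, t-1 when n_lags == 3
        y.append(series[t])             # value at time t, the forecasting target
    return X, y


# Hypothetical yearly arrivals (millions) for one country.
arrivals = [10.0, 11.0, 12.5, 13.0, 14.2, 15.1]
X, y = make_lag_features(arrivals, n_lags=3)
for features, target in zip(X, y):
    print(features, "->", target)
```

With six yearly values and three lags, this yields three training rows; the first row uses the first three years to predict the fourth.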

B. Machine Learning Models
Machine learning has gained attention because of its wide variety of applications in multiple domains. It has been shown to perform well on complicated data and tasks, which is the reason for adopting it in forecasting systems. The models below, with different parameters, have been applied in this research.
a) Support Vector Regression: Support Vector Regression (SVR) is adapted from the Support Vector Machine (SVM) to predict values for regression-type data. When dealing with real-valued data, the SVM takes its regression variant. The output for real-valued data has infinitely many possibilities, and all possible solutions have to be considered to decide the final prediction. With real-valued data, the primary idea is to minimize equation (1); if the problem is linear, support vector regression is represented by equation (2), and the error minimization is given in equation (3). The associated constraints need to be respected in linear support vector regression.
b) Random Forest Regressor: A tree-structured arrangement of the data gives the actual estimator. Random forest follows the pattern of the decision tree, where each data node is split into daughter nodes. When a node is split, a split criterion is chosen so that the data are partitioned appropriately. All the data nodes at the bottom are terminal. For regression data, the predicted value at a node is the average of the response variable over all observations in that node. The splitting criterion for regression is given by equation (4).
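The equations cited in this subsection are not reproduced in the text. As a hedged sketch only, assuming the standard ε-insensitive formulation of linear SVR and the usual least-squares split rule for regression trees, they may take a form like the following. The symbols $w$, $b$, $C$, $\varepsilon$, $\xi_i$, $\xi_i^*$ and the index sets $L$ and $R$ (observations sent to the left and right daughter nodes) are assumptions and are not taken from the original; the equations are deliberately left unnumbered because their correspondence to equations (1) to (4) cannot be confirmed.

```latex
\begin{align*}
  % linear SVR prediction function
  f(x) &= \langle w, x \rangle + b \\
  % flatness term plus penalized slack (error minimization)
  \min_{w,\,b,\,\xi,\,\xi^*} \quad & \frac{1}{2}\lVert w \rVert^2
      + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right) \\
  \text{subject to} \quad
      & y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \\
      & \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \\
      & \xi_i,\ \xi_i^* \ge 0, \qquad i = 1, \dots, n \\
  % regression-tree split: residual sum of squares over the two daughter nodes
  \mathrm{RSS}(\text{split}) &= \sum_{i \in L} \left( y_i - y_l^* \right)^2
      + \sum_{i \in R} \left( y_i - y_r^* \right)^2
\end{align*}
```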
Here, $y_l^*$ is the mean y value for the left node and $y_r^*$ is the mean y value for the right node. For classification data, the predicted class is the most common class in the node, which is also known as the majority vote. For a classification tree, the estimated probability of each class is calculated from the members of that class in the node. The splitting criterion for classification data is the Gini index, which is shown in equation (5). A random forest is a meta-estimator that uses averaging to increase predictive precision and control over-fitting by fitting several decision trees on different sub-samples of the dataset. Although the sub-sample size is the same as the initial input sample size, the samples are drawn with replacement. The Decision Tree and Random Forest models are often used for classification tasks. However, the concept of Random Forest as a regularizing meta-estimator over a single decision tree is better illustrated by extending it to regression problems. In this way, it can be shown that a single decision tree is vulnerable to overfitting and to learning false associations in the face of random noise, whereas an adequately built Random Forest model is more resistant to such overfitting.
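The two splitting criteria discussed above can be illustrated with a short sketch. It assumes the common definitions, namely the residual sum of squares around each daughter node's mean for regression and the Gini impurity $1 - \sum_k p_k^2$ for classification; the paper's own equations (4) and (5) may differ in form. The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def split_rss(y_left: Sequence[float], y_right: Sequence[float]) -> float:
    """Residual sum of squares around the mean of each daughter node."""
    mean_left = sum(y_left) / len(y_left)
    mean_right = sum(y_right) / len(y_right)
    return (sum((v - mean_left) ** 2 for v in y_left)
            + sum((v - mean_right) ** 2 for v in y_right))


def best_threshold(x: List[float], y: List[float]) -> Optional[Tuple[float, float]]:
    """Scan midpoints between distinct sorted x values; return (threshold, rss)."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        rss = split_rss([p[1] for p in pairs[:i]], [p[1] for p in pairs[i:]])
        if best is None or rss < best[1]:
            best = (threshold, rss)
    return best


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Toy data: two clearly separated groups of responses.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, rss = best_threshold(x, y)
print("best threshold:", threshold, "rss:", round(rss, 4))
print("gini of mixed node:", gini(["a", "a", "b", "b"]))
print("gini of pure node:", gini(["a", "a", "a", "a"]))
```

A random forest repeats this kind of split search on many bootstrap samples and averages the resulting trees, which is what gives it its resistance to overfitting.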

IV. EXPERIMENTAL RESULT
The experimental results were collected using the following setup. The dataset contains tourist arrivals for most countries worldwide. Python 3.7 was used along with the scikit-learn, NLTK, and NumPy libraries for each learning algorithm, the regression techniques, and the confusion matrix. First, baseline results were obtained using SVR, and then the RFR model was trained on the data. The number of features is 11, and the data are partitioned into 2/3 for training and 1/3 for testing.
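A setup of this kind can be sketched with scikit-learn as shown below. This is an illustration under stated assumptions, not the study's actual code: the data are synthetic (three random features with a nearly linear target, rather than the 11 features of the real dataset), and the hyperparameters `C`, `epsilon`, `n_estimators`, and `random_state` are placeholder choices that the paper does not report.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 90 samples, 3 scaled features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(90, 3))
y = X @ np.array([0.2, 0.3, 0.5]) + rng.normal(0.0, 0.01, size=90)

# 2/3 of the rows for training, 1/3 for testing, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

# Placeholder hyperparameters (assumptions, not taken from the paper).
models = {
    "SVR (linear)": SVR(kernel="linear", C=10.0, epsilon=0.01),
    "SVR (rbf)": SVR(kernel="rbf", C=10.0, epsilon=0.01),
    "RFR": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))  # R-squared on held-out data

for name, score in scores.items():
    print(f"{name}: R-squared = {score:.3f}")
```

Because SVR is sensitive to feature scale, real arrival counts would normally be standardized before fitting; the synthetic features here are already in the unit interval.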
It is essential to evaluate the model on testing data once it has been trained. To verify the model's correctness, numerous evaluation metrics have been utilized. This study focuses mostly on R-squared, a commonly used accuracy metric, as shown in Table V. R-squared indicates how close the data are to the fitted regression line. In regression, it is also known as the coefficient of determination or, for multiple regression, the coefficient of multiple determination. R-squared is defined as the percentage of the variance in the response variable that is explained by a linear model. For a least-squares linear fit with an intercept, evaluated on its own training data, the R-squared value lies between 0 and 100%. In this research two models have been considered: SVR with different kernels and RFR with a tree model. The data are partitioned into 67% for training and 33% for testing. The accuracy, shown in Table V, is 0.994 for SVR with a linear kernel and 0.863 for SVR with an RBF kernel. Random forest regression works well for small datasets, and its R-squared result is 0.847.
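For reference, the R-squared metric can be computed directly from its definition, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below is a minimal illustration with made-up numbers; the function name `r_squared` is hypothetical. It also shows that, on data a model was not fitted to, the value can fall below zero when the predictions are worse than simply predicting the mean.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean: negative
```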

V. FORECASTING BEFORE AND AFTER COVID-19
The graphs plot the forecasting regression pre-Covid and post-Covid for the next 4 years. The normal upper-bound forecast line continues as usual; however, once Covid-19 restrictions were enforced around the world, tourist arrivals fell drastically to almost nothing. Fig. 10(a) shows the forecast before Covid-19 for the USA, and Fig. 10(b) depicts the forecast during Covid-19. According to the graph, the upper-bound forecast revives in the year 2022. The same can be seen in Fig. 10(c) without Covid-19 and Fig. 10

VI. DISCUSSION
The worldwide tourist arrival data collected from different trusted and official web portals are analysed to forecast future international tourist arrivals. Such analysis can mobilize the tourism industry. Table II shows the different kinds of collected data and time frequencies, together with the methods applied by the researchers in [16], and the studies in [36]-[38] show that most of the collected data focus on a specific region within a country. The focus here is to align with the objective of the United Nations World Tourism Organization (UNWTO) to work on collective universal data and to analyze the impact of worldwide tourist movements on tourism.
The collected data have a yearly frequency, and building optimized machine learning models [39] on data of this kind poses many challenges. The actual data cover the years 2010 to 2019; in 2020 international travel was heavily impacted by lockdown confinement [37]. The whole world has gone through a very difficult situation because of Covid-19, and many technological techniques have been used to analyze and predict it. This study emphasizes the comparison between actual forecasting before the Covid-19 pandemic and how suddenly forecasting systems stopped predicting correct values once the global health pandemic started. In handling future pandemic situations and fulfilling the basic requirements of new arrivals, forecasting models will help not only governing bodies but also hospitality service providers such as hotels, restaurants, and transportation services. The pandemic also created a crisis in the healthcare industry and raised the question of how basic medical facilities can be provided to tourists who could not return to their countries because of lockdown enforcement.

VII. CONCLUSION
Digitalization has made the whole world a village, so it remains important to have collective forecasting based on data that represent the whole globe. The UNWTO and the World Travel and Tourism Council (WTTC) are working continuously on improving global tourism facilities by analyzing demand and the increasing number of arrivals. This research focused on overall worldwide data with machine learning approaches, namely support vector regression and random forest regression, and the results show that support vector regression gives better results than random forest regression.
Since the number of visitors to any country is not exactly known, building the model with multiple techniques gives an analytic view for the comparative study. This is the reason for developing the model using machine learning. Since the collected data have an annual frequency, they do not fit well with deep learning techniques, so this work considers the machine learning techniques support vector regression and random forest regression. A future extension of this work would be a clustering-based forecasting system in which the data are grouped into countries with the most arrivals, mid-arrival countries, and low-arrival countries. A further aim is to collect monthly data in order to forecast season-wise and identify the most popular month for tourist visits.