A Japanese Tourism Recommender System with Automatic Generation of Seasonal Feature Vectors

Tourism recommender systems are widely used in daily life to recommend tourist spots that meet users' preferences. In this paper, we propose a content-based tourism recommender system that considers the user's travel season. To characterize the seasonally varying features of spots, the proposed system generates seasonal feature vectors in three steps: 1) it identifies the relevant vocabulary through Wikipedia; 2) it identifies the trend over all spots through Twitter for each season; and 3) it highlights the weight of the words contained in each identified trend. When deciding recommendations, it not only matches the user profile with the features of spots but also takes the user's travel season into account. The effectiveness of the proposed system is evaluated by a series of experiments, i.e., a computer simulation and a questionnaire evaluation. The results indicate that: 1) the vectors reflect the similarity of spots for a designated time period, and 2) using such vectors, the system successfully realizes seasonal tourism recommendation.

Keywords: Tourism recommender system; seasonal feature vector; Wikipedia; Twitter


I. INTRODUCTION
In recent years, tourism recommender systems have been widely used to support users' choices of tourist spots. Unlike commodities such as books or movies, many features of Japanese tourist spots change with the season. A typical example is the famous tourist spot Niseko in Hokkaido, Japan, whose red leaves attract visitors in autumn. In winter, snow changes the scenery completely, and visitors come to enjoy skiing. Therefore, the spots chosen for a user must not only fit his/her interests but also reflect the seasonal trend.
In this paper, we focus on content-based seasonal recommendation. Although content-based recommendation has been applied successfully in the tourism domain [2], [3], [4], [5], existing systems seldom take the season into account.
We propose a content-based seasonal tourism recommender system that fits the designated travel season. For example, for a user who likes Matsushima island because of its cherry-blossom viewing and wishes to travel in spring, the system can recommend other spots that offer cherry-blossom viewing. However, such spots may not be recommended to another user who wishes to travel in autumn, even if he/she is fond of Matsushima's sea and coast. To characterize these dynamically changing seasonal features, the proposed system generates a seasonal feature vector for each spot and each given season. The generation consists of three steps. First, it identifies the relevant vocabulary through the Wikipedia document about the spot. The reason for using Wikipedia is that each tourist spot has its own document that introduces its features in detail (see Fig. 1, which shows the description of the spot "Ritsurin Garden" in Kagawa prefecture, including its features for each season). Second, it identifies the trend (i.e., the seasonal variance of features for a designated period) over all spots in Japan through Twitter. Finally, among the words (i.e., the features) in the vocabulary, it highlights those contained in the identified trend corresponding to the given season. The proposed system then matches these vectors with the user's profile, which represents his/her preference, to decide the recommendations for the user's travel period.
For the implementation, the proposed system gathered 6,057 Wikipedia documents to cover almost all sightseeing spots in Japan. It also collected more than 500 thousand tweets published over 7 months. We conducted a series of experiments including both a computer simulation and a questionnaire evaluation.
The contributions of this paper are as follows:
• For a designated season, we provide a method to generate a seasonal feature vector for each sightseeing spot as a characterization of its features. The experimental results demonstrate that these vectors reflect the similarity of spots.
• The proposed recommender system provides the user with spot recommendations that are aware of the travel season. The questionnaire experiment shows that seasonal recommendations match users' actual choices with higher precision than recommendations made without seasonal feature vectors.
The remainder of this paper is organized as follows. Section II overviews related work. Section III describes the proposed system in detail, including the generation of seasonal feature vectors and the recommendation process. Section IV presents the implementation of the prototype system. Section V presents the evaluation method and its results. Finally, Section VI concludes the paper and discusses future work.

II. RELATED WORK

A. Content-based Tourism Recommendation
Content-based tourism recommender systems try to recommend spots similar to those a user has liked in the past (i.e., the user's history) [6]. From the history, the user's profile is built to represent his/her preference. The features of spots, in turn, are characterized so that they can be matched with the user's profile to decide the recommendation. Some existing studies aim to provide the user with an appropriate tour plan that meets his/her constraints, such as time or cost [7], [8], [3]. The features of spots are given by tourism experts and simply include the available time, the normal visiting time, geographical information, etc. The recommendation of a tour plan therefore reduces to an integer programming problem or a traveling salesman problem, which is solved approximately to find a combination of spots that minimizes the travel path or the time spent in movement. Győrödi et al. [9] propose spot recommendation with a mobile application. To determine the user's interests and the features of spots, they use tags such as food and music, which can be created by users and assigned to a specific spot. The recommendation is produced by matching the tags of the given user with those of the spots.
In existing content-based studies, much effort is devoted to providing tourist spots or plans that meet the user's needs. However, the travel season, which is an essential factor in choosing spots, is seldom taken into account.
By utilizing the proposed method not only in content-based recommendation but also in hybrid approaches [10], [11], [12], seasonal recommendation can be easily realized.

B. Tourism Recommendation using Wikipedia
In many recent studies of tourism recommendation, Wikipedia is integrated as an external source for identifying spots. It is effective at reducing the cost of manually constructing or maintaining the information about spots. A common idea is to take advantage of the geographic information included in Wikipedia documents about spots to filter users' geotagged photos (e.g., photos on Flickr) and extract their visiting trajectories [13], [14], [15]. Techniques such as the T-pattern tree [16] are exploited to mine the traveling patterns contained in the extracted trajectories. Additionally, since categories are assigned to each tourist spot in Wikipedia, some studies further transform the user's trajectories into sequences of categories to represent the user's preference [14], [15]. Although many existing studies combine Wikipedia with SNS to improve the performance of recommendations, few take advantage of the textual content of Wikipedia articles, even though it contains detailed descriptions of both permanent and seasonal features.

C. Identification and Analysis of Tweets for Tourism
Recently, Twitter has attracted much attention as a source for data mining and for the characterization of spots in tourism informatics. To detect tourism-related tweets posted at specific spots, Shimada et al. [17] apply a Support Vector Machine (SVM) to their gathered tweets. Their idea is that the target tweets are similar in their textual content. With the aid of geo-tags, Oku et al. [18] propose another SVM-based method for detecting tweets relevant to tourism and extract temporal features of spots from them. They regard the tweets issued within a week as a single document and obtain a temporal feature vector for each week by calculating the TF-IDF weights of the keywords contained in such documents. However, this method does not generate vectors for a sufficient number of spots because it is based solely on tweets. Similarly, Menchavez et al. [19] focus on the identification of tourism-related tweets and apply a Naive Bayes based sentiment analysis to mine the opinions of tourists visiting spots in the Philippines. The mined opinions are classified into positive and negative polarity and presented on a geographical map as references for tourists. A similar study is done by Claster et al. [20] for Thailand, in which the sentiment analysis of tweets is applied in time series. Although the problem definitions of these studies differ from that of the current paper, their objectives are similar: to discover useful tourism information via Twitter.
In such studies, much effort has been made to reduce the noise in tweets. However, due to the irregularity of tweets, many meaningless tokens such as punctuation or prefixes are extracted and significantly influence the accuracy of the analysis. In addition, techniques based on machine learning are sometimes difficult to apply to minor spots with few related tweets. Since the proposed system uses Wikipedia as the corpus in combination with Twitter, it is free of the influence of such noise. Furthermore, for minor spots that are seldom tweeted about, Wikipedia can cover their features and prevent their recommendation from failing.

III. SEASONAL RECOMMENDATION OF TOURIST SPOT
In this section, we present the proposed recommender system in detail. Fig. 2 shows the architecture, which consists of two processes: 1) generating seasonal feature vectors for each spot; and 2) identifying the user's preference as a profile and matching it with the vectors of spots to produce recommendations. The following subsections detail each part.

A. Generation of Seasonal Feature Vectors
First, the time axis is assumed to be separated into several ranges within each of which the features of spots are regarded as invariant, such as the year-end season, the cherry-blossom viewing season or the bathing season. Each range is called a season. The proposed system generates one seasonal feature vector (SFV, for short) for each spot and each season. The SFV is calculated by extending the basic feature vector (BFV, for short) in such a way that it reflects the trend of words in each season. More concretely, the BFV is a vector of TF-IDF weights (defined below) and the SFV is its extension.
Let $O$ be the set of spots and $d_i$ be the Wikipedia document about spot $o_i \in O$. Generally, $d_i$ is a summary of the entire information about $o_i$. Therefore, the reader should note that $d_i$ is the union of statements on spot $o_i$ relevant to various seasons. In other words, to generate an SFV for each season, the system needs to distinguish the word sets relevant to each season in document $d_i$. Let $W_i$ be the set of words included in document $d_i$ and $W = \bigcup_i W_i$ (i.e., $W$ is the set of words included in the Wikipedia documents about $O$). Then, the term frequency (TF, for short) weight of word $w_j$ in document $d_i$ is defined as

$$TF_{i,w_j} = \frac{n_{i,w_j}}{\sum_{w \in W} n_{i,w}} \tag{1}$$

and the inverse document frequency (IDF, for short) weight of word $w_j$ over the $|O|$ documents is defined as

$$IDF_{w_j} = \log \frac{|O|}{m_j}$$

where $n_{i,w_j}$ is the number of occurrences of $w_j$ in $d_i$ and $m_j$ ($\le |O|$) is the number of documents containing $w_j$. With these notions, the BFV $v^b_i$ of spot $o_i$ is defined as

$$v^b_i = \left( TF_{i,w_1} \cdot IDF_{w_1},\ \ldots,\ TF_{i,w_{|W|}} \cdot IDF_{w_{|W|}} \right).$$

Words that are frequently mentioned in $d_i$ and seldom contained in other documents have high weights in the BFV.
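The BFV computation can be sketched in a few lines of Python. This is a minimal illustration rather than the prototype's implementation: it assumes the standard logarithmic IDF, stores each vector sparsely as a word-to-weight mapping, and uses invented toy documents; the function name `basic_feature_vectors` is hypothetical.

```python
import math
from collections import Counter

def basic_feature_vectors(docs):
    """docs maps a spot name to the list of nouns in its Wikipedia document.
    Returns {spot: {word: TF * IDF}}; words absent from a document are
    implicitly zero and are not stored."""
    n_docs = len(docs)
    # m_j: number of documents containing word w_j
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))
    vectors = {}
    for spot, words in docs.items():
        counts = Counter(words)          # n_{i,w}
        total = sum(counts.values())     # sum over w of n_{i,w}
        vectors[spot] = {
            w: (n / total) * math.log(n_docs / doc_freq[w])
            for w, n in counts.items()
        }
    return vectors

# Toy data, invented for illustration only.
docs = {
    "spot_a": ["garden", "cherry", "cherry", "pond"],
    "spot_b": ["garden", "ski", "snow"],
    "spot_c": ["garden", "temple", "cherry"],
}
bfv = basic_feature_vectors(docs)
```

In this toy example "garden" appears in every document, so its IDF and hence its weight is zero, while "pond", which appears only in `spot_a`, receives the largest IDF.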
For a given season, the key idea is to extend the definition of the TF weight in (1) by considering the trend of words. Let $t_k$ be the collection of tweets issued in season $s_k$. By considering $t_k$ as a single document, the TF weight of word $w_j$ in season $s_k$ is defined as

$$TF_{k,w_j} = \frac{n_{k,w_j}}{\sum_{w \in W} n_{k,w}}$$

where $n_{k,w_j}$ is the number of occurrences of word $w_j$ in $t_k$. Because $W$ is the set of words contained in the Wikipedia documents about $O$, the proposed system omits words in tweets that do not appear in any Wikipedia document. With the above notions, the SFV $v^s_{i,k}$ of spot $o_i$ for season $s_k$ is defined as in (3), where $0 \le \alpha \le 1$ is an appropriate parameter. Note that, for $o_i$, $TF_{i,w_j}$ is nonzero only for words $w_j \in W_i$. If word $w_j \in W_i$ has not been tweeted in a specific season $s_k$, then $TF_{k,w_j} = 0$; otherwise $TF_{k,w_j} > 0$, and we say that $w_j$ is highlighted in $s_k$.
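The following Python sketch illustrates how a seasonal TF and an SFV could be computed. The displayed form of (3) is not available in this text, so the combination rule used here, weight = ((1 − α) + α · TF_k) · TF_i · IDF, is an assumption made only for illustration. It was chosen because it agrees with two properties stated in the paper: for α < 1 the nonzero entries are exactly the words of $W_i$, and for α = 1 they are the words of $W_i \cap W_k$. All function names and numbers are hypothetical.

```python
from collections import Counter

def seasonal_tf(tweet_words, vocabulary):
    """TF of each vocabulary word in one season's tweets, treated as one document.
    Words outside the Wikipedia vocabulary are dropped before counting."""
    counts = Counter(w for w in tweet_words if w in vocabulary)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}

def seasonal_feature_vector(tf_i, idf, tf_k, alpha):
    """ASSUMED combination rule (not taken from the paper):
    weight(w) = ((1 - alpha) + alpha * TF_k(w)) * TF_i(w) * IDF(w)."""
    return {
        w: ((1 - alpha) + alpha * tf_k.get(w, 0.0)) * tf * idf[w]
        for w, tf in tf_i.items()
    }

def l0_norm(vector):
    """Number of nonzero entries of a sparse vector."""
    return sum(1 for x in vector.values() if x != 0)

# Toy numbers, invented for illustration only.
tf_i = {"cherry": 0.5, "pond": 0.25, "garden": 0.25}
idf = {"cherry": 0.4, "pond": 1.1, "garden": 0.7}
tf_k = seasonal_tf(["cherry", "cherry", "pond", "lol"], set(tf_i))
sfv = seasonal_feature_vector(tf_i, idf, tf_k, alpha=0.995)
```

Here "cherry" and "pond" are highlighted because they occur in the season's tweets, "garden" keeps only its down-weighted static contribution, and "lol" is ignored because it is not in the Wikipedia vocabulary.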

B. Identification of User's Preference and Recommendation Process
Although an analysis of the user's tweet history would help extract his/her preference for the features of sightseeing spots, it may fail for users who do not have Twitter accounts or seldom tweet about travel. To accommodate such users, the proposed system extracts the user's preference explicitly by directly asking for a travel history. In other words, the user answers two easy questions when he/she begins to use the system: 1) the season in which he/she wishes to travel; and 2) his/her favorite spot among those he/she has visited during that season so far. Assume that user $u$ chooses tourist spot $o_i$ and season $s_k$. His/her profile $U_k$, which represents the preference, is defined as the SFV $v^s_{i,k}$. Although the user profiling is simple, it effectively characterizes the user's seasonal preference and has several benefits: first, it does not suffer from the cold-start problem; second, the questions are easy to answer and time-saving.
With the constructed user profile, the proposed system matches it with the SFVs of spots to decide recommendations for season $s_k$. To quantify the correspondence between $o_i$ and a given spot $o_l$ ($o_l \neq o_i$) in $s_k$, the system calculates their cosine similarity as follows:

$$\mathrm{sim}(o_i, o_l) = \frac{v^s_{i,k} \cdot v^s_{l,k}}{|v^s_{i,k}|\,|v^s_{l,k}|}$$

The spots with the Top-$t$ similarities are recommended to user $u$ in $s_k$ as a ranking. Note that the recommendations vary with the designated season.
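The matching step can be sketched as follows. The example assumes that the SFVs for the chosen season are already available as sparse word-to-weight mappings; the names `cosine` and `recommend` and the toy vectors are hypothetical and serve only to show the Top-$t$ ranking by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {word: weight}."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(favorite, sfvs, t):
    """Rank all spots except the favorite by similarity to the user profile,
    which is the favorite spot's SFV for the chosen season."""
    profile = sfvs[favorite]
    scored = [(cosine(profile, vec), spot)
              for spot, vec in sfvs.items() if spot != favorite]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [spot for _, spot in scored[:t]]

# Toy SFVs for one season, invented for illustration only.
sfvs = {
    "fav": {"cherry": 0.9, "pond": 0.1},
    "spot_x": {"cherry": 0.8, "castle": 0.2},
    "spot_y": {"snow": 0.7, "ski": 0.3},
    "spot_z": {"pond": 0.5, "cherry": 0.1},
}
top2 = recommend("fav", sfvs, t=2)
```

Running the same function on the SFVs of a different season would generally yield a different ranking, which is the intended seasonal behavior.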

IV. IMPLEMENTATION

A. Datasets Description
Since the target of recommendation is the entire set of tourist spots in Japan, this prototype system focuses on the 6,057 spots listed in the category "tourist spots in Japan" in Wikipedia and downloads the Japanese document for each of them from the Wikipedia server. The prototype system uses only nouns as the words of each document $d_i$. The set of words $W_i$ in $d_i$ is obtained by morphological analysis using MeCab with the default IPA dictionary. From all collected Wikipedia documents, 608,390 words are extracted overall, on average 100.5 words per document. The relationship between the number of spots and the number of words extracted from their documents is shown in Fig. 3. It indicates that most spots are introduced in detail in their Wikipedia documents.
A set of tweets relevant to tourism is obtained from Twitter using the Twitter Streaming API. More concretely, 50 million Japanese tweets issued from September 2013 to March 2014 are acquired. The textual content of each tweet is matched against the names of the collected spots. As a result, about 500 thousand tweets containing at least one name of the 6,057 spots are regarded as tweets relevant to tourism and extracted as part of the dataset. Although this set may contain tweets that are not relevant to tourism and may miss tweets that are, we did not evaluate the precision of this naive extraction since it is outside the scope of this paper. Let $T$ be the resulting set of tweets.
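The name-matching extraction can be sketched as a plain substring filter. This is an illustrative assumption about how the matching might be done, not the prototype's code; the function name and the English toy tweets are hypothetical.

```python
def extract_tourism_tweets(tweets, spot_names):
    """Return (text, matched_names) for tweets mentioning at least one spot name.
    Matching is plain substring search, so unrelated uses of a name also pass."""
    selected = []
    for text in tweets:
        matched = [name for name in spot_names if name in text]
        if matched:
            selected.append((text, matched))
    return selected

# Toy data, invented for illustration only.
spot_names = ["Ritsurin Garden", "Niseko"]
tweets = [
    "Skiing in Niseko this weekend",
    "Lunch was great",
    "Autumn colors at Ritsurin Garden, then off to Niseko",
]
relevant = extract_tourism_tweets(tweets, spot_names)
```

A linear scan over all names per tweet is adequate for a sketch; at the scale of 50 million tweets and 6,057 names a multi-pattern matcher would likely be preferable.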

B. Parameter Assignments
Since seasons last more than one month and differ in length, we assume that one year is divided into 12 disjoint seasons of (almost) equal length: the first season runs from January 1st to January 31st, the second from February 1st to February 28th (or 29th), and so on. More precisely, seven documents, each representing the trend of one season, are derived from $T$ because the collected tweets cover seven months. For a given season and its corresponding document derived from $T$, the prototype system generates one SFV for each spot. As a result, 170,978 words are highlighted by tweets across all spots' SFVs, 28.3 words per spot on average.
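Deriving one document per monthly season from $T$ can be sketched as a simple grouping step. The sketch assumes each tweet is available as a posting date plus a list of already extracted words; the function name `season_documents` and the toy data are hypothetical.

```python
from collections import defaultdict
from datetime import date

def season_documents(dated_tweets):
    """Group tokenized tweets by (year, month); each group is one season document."""
    docs = defaultdict(list)
    for day, words in dated_tweets:
        docs[(day.year, day.month)].extend(words)
    return dict(docs)

# Toy data, invented for illustration only.
dated_tweets = [
    (date(2013, 11, 3), ["red", "leaves"]),
    (date(2013, 11, 20), ["leaves", "temple"]),
    (date(2014, 1, 5), ["snow", "ski"]),
]
docs = season_documents(dated_tweets)
```

Each resulting word list plays the role of $t_k$ when the seasonal TF weights are computed.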
Finally, another task is to identify an appropriate value for $\alpha$ in (3). The value of $\alpha$ is assumed to be greater than 0 without loss of generality. Recall that $W_i$ is the word set contained in the Wikipedia document about spot $o_i$. Correspondingly, let $W_k$ be the word set contained in the tweets of season $s_k$. When $\alpha < 1$, the $L_0$-norm of SFV $v^s_{i,k}$ coincides with $|W_i|$, which is independent of $\alpha$ and $k$; when $\alpha = 1$, it coincides with $|W_i \cap W_k|$, which varies depending on $k$. The latter may cause recommendation to fail for spots that have little vocabulary in their Wikipedia documents and have seldom been tweeted about. A spot is defined as minor if the $L_0$-norm of its SFV is smaller than or equal to a given threshold (3 in Table I). Table I shows the number of minor spots for given $\alpha$ and $k$. Although the number of minor spots is 61 when $\alpha < 1$, it exceeds 170 when $\alpha$ is increased to 1. Thus, in the following, we restrict our attention to the case of $\alpha < 1$.
Next, consider the variance of SFVs under various $\alpha$ to decide its value. Let $\Omega$ be the vector space spanned by all SFVs (of all spots). In the following, each vector is normalized by its length in the $L_2$-norm to have unit length in $\Omega$. Each spot $o_i$ is therefore mapped to a point in $\Omega$ by its SFV $v^s_{i,k}$ for each season $s_k$. This implies that the "intensity" of the variance of SFVs is characterized by

$$\delta_i = \max_{k,\,k'} \left| v^s_{i,k} - v^s_{i,k'} \right|$$

where $|\cdot|$ denotes the $L_2$-norm. See Fig. 4 for an illustration. $\delta_i$ is affected by $\alpha$ and is called the diameter of $o_i$ hereafter. Fig. 5 illustrates the cumulative distributions of $\delta$ for $\alpha$ = 0.99, 0.995 and 0.999, where the horizontal axis is the length of the diameter and the vertical axis is the cumulative number of spots having a diameter less than or equal to a specific value. It indicates that the diameter follows a Gaussian distribution and that its mean and variance increase as $\alpha$ increases. In general, a large $\delta$ implies that the seasonal features of the corresponding spot are well highlighted in its SFVs.
Additionally, for each of the word sets $W_i \setminus W_k$ and $W_i \cap W_k$, we observe the average weight of its words in SFVs. When $\alpha = 0.995$, the two averages are both in the range of $2.3 \times 10^{-4}$. When $\alpha$ is increased to 0.999, the words in $W_i \setminus W_k$ are weighted about one-fifth as much as those in $W_i \cap W_k$ overall. This implies that, although the seasonal features are well highlighted in the latter case, static features unrelated to the seasons are weakened significantly, so that they fail to be characterized. On the other hand, the average distance from a spot's BFV to its nearest neighbor is almost 1.2, which is nearly twice the average of all spots' $\delta$ in the case of $\alpha = 0.995$ (almost 0.55). Therefore, $\alpha$ is fixed to 0.995.
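The diameter of a spot can be sketched in Python under the assumption that it is the largest $L_2$ distance between any two of the spot's unit-normalized seasonal vectors. The helper names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations

def normalize(vec):
    """Scale a sparse vector to unit L2 length (zero vectors are returned as is)."""
    length = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / length for w, x in vec.items()} if length else dict(vec)

def distance(u, v):
    """L2 distance between two sparse vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(w, 0.0) - v.get(w, 0.0)) ** 2 for w in keys))

def diameter(seasonal_vectors):
    """ASSUMED definition: the largest pairwise distance among one spot's
    unit-normalized SFVs, one vector per season."""
    unit = [normalize(v) for v in seasonal_vectors]
    return max((distance(a, b) for a, b in combinations(unit, 2)), default=0.0)

# Toy SFVs of one spot for three seasons, invented for illustration only.
spot_sfvs = [{"a": 1.0}, {"b": 2.0}, {"a": 1.0, "b": 1.0}]
delta = diameter(spot_sfvs)
```

A spot whose SFVs barely change across seasons yields a diameter close to zero, while for nonnegative unit vectors the diameter is at most $\sqrt{2}$.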

V. EVALUATION
In this section, the effectiveness of the proposed system is evaluated with respect to the following two aspects: 1) whether the proposed SFV indeed extracts and characterizes seasonal features from Wikipedia and Twitter; and 2) whether the proposed system effectively provides seasonal recommendations of tourist spots.
A. Variance of SFVs

1) Evaluation Methodology: In this subsection, rather than directly observing the difference between the SFVs of a given spot, we evaluate the time transition of the similarity of spots. Clusters of similar spots are obtained by applying the K-means method [21] to the SFVs of all spots. More concretely, if the spots contained in a cluster in season $s_k$ are separated into several clusters in other seasons, those spots are given similar SFVs for $s_k$, and the set of words characterizing the cluster should represent the features of those spots for $s_k$. Considering that the number of spots is over 6,000, the value of K is set to 70 in the clustering. This evaluation examines the mean of each resulting cluster and focuses on several typical ones for convenience of presentation.
2) Result: From the resulting clusters, four typical clusters, say $C_r$, $C_i$, $C_s$ and $C_c$, are identified. Their details are summarized in Table II. Note that each of these four clusters is defined only for a specific season. Since red leaves and cherry blossoms are more popular than illumination and snow in Japan, the corresponding clusters are also larger than the others.
Results on the time transition of the similarity of spots are summarized in Fig. 6. The left-most figure in the first row of Fig. 6 shows the result for cluster $C_r$, and the other three figures concern clusters $C_i$, $C_s$ and $C_c$, respectively. In November, all spots in $C_r$ form a distinct cluster, but in other seasons they separate into different ones. This indicates that the common features of those spots are highlighted in November, although they have various features in other seasons. A similar phenomenon can also be found in the other clusters. On the other hand, several spots in $C_c$ are confirmed to remain in the same cluster through all seasons. This indicates that the SFVs of those spots are close to each other in the vector space $\Omega$ regardless of the transition of seasons.
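Once cluster labels are available for every season, the time transition described above can be quantified by counting how many clusters the spots of one focus cluster fall into in each season. The sketch below assumes the labels have already been produced by some clustering run; the function name `split_counts` and the toy labels are hypothetical.

```python
def split_counts(focus_spots, assignments):
    """For each season, count the distinct clusters that the focus spots fall into.
    assignments maps season -> {spot: cluster id}."""
    return {season: len({labels[spot] for spot in focus_spots})
            for season, labels in assignments.items()}

# Toy cluster labels per season, invented for illustration only.
assignments = {
    "Nov": {"a": 7, "b": 7, "c": 7, "d": 2},
    "Dec": {"a": 1, "b": 4, "c": 4, "d": 2},
    "Jan": {"a": 3, "b": 5, "c": 9, "d": 2},
}
# Focus on the spots that share cluster 7 in November.
focus = [spot for spot, label in assignments["Nov"].items() if label == 7]
counts = split_counts(focus, assignments)
```

A count of 1 in the defining season together with larger counts elsewhere corresponds to spots that are similar only in that season.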

B. Impact on Recommendation
1) Evaluation Methodology: In this subsection, the performance of the proposed seasonal tourism recommender system is evaluated with simulated users. Although a questionnaire evaluation is also conducted in the next subsection, here we aim to compare in detail the recommendations generated with the season taken into account (i.e., with SFVs) and without it (i.e., with BFVs). The evaluation focuses on the aforementioned clusters $C_r$, $C_i$, $C_s$ and $C_c$ and regards the mean of the SFVs contained in each cluster as the preference of a user. In other words, there are four users who are fond of red leaves, illumination, snow and cherry blossoms, with the spots in clusters $C_r$, $C_i$, $C_s$ and $C_c$ as the respective answers. The performance of the proposed system is evaluated by analyzing the Top-$t$ spots recommended for each designated point (i.e., cluster mean) in each season $k$. This subset of spots is denoted as $Q^k_t$ hereafter. For comparison, the Top-$t$ spots according to the cosine similarity of the corresponding BFVs to the designated points (denoted as $P_t$) are also calculated.
For the mean of a given cluster $C$, the goodness of a subset $X$ of spots is measured by $|C \cap X|$. Thus, the advantage of using SFVs instead of ordinary BFVs can be measured by a quantity $\xi(t, k)$ that compares $|C \cap Q^k_t|$ with $|C \cap P_t|$ and depends on the value of parameter $t$ and the selection of season $k$.
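As a rough illustration, the comparison can be sketched in Python. The exact expression for $\xi(t, k)$ is not available in this text, so the sketch assumes the simplest form consistent with the description, namely the difference $|C \cap Q^k_t| - |C \cap P_t|$; the function name `advantage` and the toy sets are hypothetical.

```python
def advantage(cluster, q_kt, p_t):
    """ASSUMED form of xi(t, k): hits of the SFV-based list minus hits of the
    BFV-based list, where a hit is a recommended spot that belongs to the cluster."""
    cluster = set(cluster)
    return len(cluster & set(q_kt)) - len(cluster & set(p_t))

# Toy data, invented for illustration only.
cluster = {"a", "b", "c", "d"}
p_t = ["a", "x", "y"]                      # season-independent BFV ranking
q_by_season = {"Nov": ["a", "b", "c"],     # SFV rankings per season
               "Jan": ["a", "x", "z"]}
xi = {k: advantage(cluster, q, p_t) for k, q in q_by_season.items()}
```

Under this assumed form, a positive value means the SFV-based list recovers more spots of the cluster than the BFV-based list for that season.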
2) Result: Table III summarizes the results for $t = 30$, where the emphasized numbers designate the seasons in which the corresponding clusters are defined (e.g., cluster $C_r$ is defined for November). The result implies that, by using SFVs, the proposed system can recommend more spots that fit the simulated users' preferences, and that the effect is maximized when the designated season coincides with the one defining the cluster. Recall that the value of $\xi(t, k)$ depends on parameter $t$. Table IV summarizes the results for each cluster, where the value of $t$ is fixed to the cluster's size, e.g., $t = 109$ for cluster $C_r$. Compared with Table III, a larger gap of $\xi(t, k)$ is observed in each row for each season. This indicates that $Q^k_{|C|}$ varies with the season for a given cluster $C$. In other words, if the designated season is not relevant to the given $C$, fewer spots contained in $C$ will be recommended.
C. Questionnaire Evaluation

1) Evaluation Methodology: Finally, a two-step seasonal tourism questionnaire is conducted to evaluate whether the proposed system can provide seasonal recommendations of spots in an actual case. Recall from Section III-B that the proposed system extracts the user's preference from his/her favorite visited tourist spot. Therefore, in the first step of the questionnaire, the participant selects the spot and the season (i.e., month) $s_k$ in which he/she wishes to travel. According to these selections, the system generates a list of recommendations for $s_k$, denoted as $Q^k_t$. For comparison, another recommendation list $P_t$ is generated using the BFVs instead of the SFVs of spots. Here $t$ denotes the number of spots included in the recommendations, i.e., the lengths of $Q^k_t$ and $P_t$. $Q^k_t$ and $P_t$ are randomly combined into one list of recommendations, $Q^k_t \cup P_t$. In the second step, the participant chooses from $Q^k_t \cup P_t$ at most 5 spots that he/she wishes to visit in $s_k$. Since the following discussion concerns all participants and their results together, the superscript $k$ in $Q^k_t$ is omitted for convenience. For quantification, this evaluation calculates the average precision and recall of all participants' choices in $Q_t$ and $P_t$, where $h_t$ is the number of spots chosen from $Q_t \cup P_t$ by a participant and $H$ is that number for $t = 10$.
In this evaluation, 55 participants cooperated, including 17 college students majoring in information engineering and 38 second-year high school students. In total, 64 spots were chosen as favorite spots, i.e., 64 trials by 55 participants. Table V summarizes the favorite-spot choices of the participants by season $s_k$.
2) Result: Table VI shows the details of the participants' choices from the recommendations $Q_t \cup P_t$. For either value of $t$, the number of spots chosen from $Q_t$ is higher than that from $P_t$. This shows that participants prefer the spots recommended in $Q_t$ to those in $P_t$. Also note that 107 and 185 spots are chosen from $Q_t \cup P_t$ when $t = 5$ and $t = 10$, respectively, and these counts include spots that appear in both $Q_t$ and $P_t$. More concretely, over all trials, 6 spots are chosen from $Q_t \cap P_t$ when $t = 5$ and 11 spots when $t = 10$. On average, almost 2.89 spots are chosen from $Q_t \cup P_t$ in one trial.
The results for precision and recall are given in Fig. 7. When $t \le 2$, although the precision of $Q_t$ is worse than that of $P_t$, the recalls of $Q_t$ and $P_t$ are almost at the same level. This suggests that the users who wish to visit the spots at the top of $Q_t$ have little interest in the spots included in $P_t$. Furthermore, such users tend to choose fewer spots overall than those who have not chosen the spots at the top of $Q_t$. On the other hand, for $t > 2$, the proposed seasonal recommendation outperforms the ordinary recommendation that utilizes only BFVs. Considering also that the recommendation lists of most recommender systems include more than 3 spots, the recommendations provided by the proposed system fit the user's demand better than ordinary ones that do not consider the travel season.
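The per-participant precision and recall can be sketched as follows. The exact formulas are not available in this text, so the sketch assumes the conventional definitions: precision at $t$ is the fraction of the top $t$ recommended spots that the participant chose, and recall at $t$ is the fraction of the participant's chosen spots that appear in the top $t$. The function name and toy data are hypothetical.

```python
def precision_recall_at(ranked, chosen, t):
    """ASSUMED conventional definitions, not taken from the paper:
    precision = hits / t, recall = hits / number of chosen spots."""
    hits = len(set(ranked[:t]) & set(chosen))
    precision = hits / t
    recall = hits / len(chosen) if chosen else 0.0
    return precision, recall

# Toy data for one trial, invented for illustration only.
ranked = ["a", "b", "c", "d", "e"]   # a recommendation list such as Q_t
chosen = {"b", "e", "z"}             # spots the participant wished to visit
p2, r2 = precision_recall_at(ranked, chosen, t=2)
p5, r5 = precision_recall_at(ranked, chosen, t=5)
```

Averaging these values over all trials for each $t$, separately for the SFV-based and BFV-based lists, would give curves comparable in spirit to those discussed for Fig. 7.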

VI. CONCLUDING REMARKS
This paper proposes a seasonal tourism recommender system that uses Wikipedia and Twitter to provide a list of tourist spots as a seasonal recommendation. The effectiveness of the proposed system is experimentally evaluated by a detailed observation of the seasonal feature vectors of spots and by questionnaires on users' actual choices of spots. The results of the evaluations indicate that SFVs characterize the variable seasonal features of the spots. More concretely, the variance of SFVs follows a Gaussian distribution, and the similarity of SFVs reflects the similarity of the features of the corresponding spots in a designated season. Furthermore, the result of the questionnaire verifies that, in most cases, the proposed system successfully provides seasonal spot recommendations that fit the user's demand in tourism.
One direction for future work is to extend the proposed recommender system to extract and characterize spatio-temporal features of spots. Another is to integrate user modeling techniques into the proposed recommender system in order to improve the accuracy of recommendations. In addition, we consider that the proposed method can be used as a component of hybrid recommender systems such as [11], [22] so that they can achieve seasonal recommendation. In the future, we wish to combine the proposed method with such approaches and evaluate the performance of the resulting recommendations.

Fig. 1. A part of the Wikipedia article about Ritsurin Garden. A detailed description of the main features is included.

Fig. 3. The number of spots for each number of words extracted from their documents.

Fig. 4. Diameter of a spot in the vector space.

Fig. 6. The distributions of clusters for the spots included in $C_r$, $C_i$, $C_s$ and $C_c$. The bars with various colors represent different clusters and their lengths depend on the numbers of focused spots. For a given cluster and season, the number of clusters that the focused spots separate into is also given in parentheses at the bottom.

Fig. 7. Precision and recall of $Q_t$ and $P_t$ as functions of $t$.

TABLE I. THE NUMBER OF MINOR SFVS FOR EACH $s_k$ WHEN THE THRESHOLD IS 3.

TABLE IV. THE VALUE OF $\xi(t, k)$, WHERE THE VALUE OF $t$ IS FIXED TO BE EQUAL TO THE CORRESPONDING CLUSTER SIZE

TABLE V. THE NUMBERS OF TRIALS FOR EACH CHOSEN $s_k$. FOR EXAMPLE, 10 QUESTIONNAIRES ARE SUBMITTED WITH $s_k$ = DEC.

TABLE VI. THE NUMBERS OF SPOTS CHOSEN FROM $P_t$ AND $Q_t$ BY ALL PARTICIPANTS