Hashtag the Tweets: Experimental Evaluation of Semantic Relatedness Measures

On Twitter, hashtags are used to summarize the topics of tweet content and to help users search for tweets. However, hashtags are created in a free style and are therefore heterogeneous, which makes them harder to use. It is thus important to evaluate whether they really represent the content they are attached to. In this work, we perform detailed experiments to answer this question. In addition, we compare different semantic relatedness measures for computing the similarity between hashtags and tweets. Experiments are performed using ten different measures, and Adapted Lesk is found to be the best.


INTRODUCTION
According to Danah M. Boyd & Nicole B. Ellison [1], a social network is a community, place or platform where people gather for a similar cause or interest. Online social network users can interact with their social contacts and use a number of the services provided. For example, they can create a profile, share posts, pictures, videos and status updates, and use messaging [2,3]. Huge numbers of online social network users share their feelings, thoughts, activities, knowledge, emotions and information online. Among popular online social networks, Facebook [4] and Twitter [5] are at the top. Facebook holds the leading position with more than 1 billion registered users, 968 million daily active users, 844 million users who log in through mobile devices every day, and 4.75 billion posts per day (http://newsroom.fb.com/company-info/).
Twitter is another rapidly growing social networking site, with over 600 million registered users, 316 million active users, 500 million tweets posted by users, and 80% of users accessing Twitter's services on mobile phones (https://about.twitter.com/company). The messages posted by a registered user on Twitter are called tweets and can have a maximum of 140 characters. A user can post many tweets daily using the web interface, SMS or a smartphone app. Unregistered users can only view and read tweets posted by others [5]. The popularity of Twitter can be estimated from the fact that, just two years after its launch, the company had grown rapidly, with 100 million tweets posted per quarter in 2008. By 2010, this figure had reached fifty million tweets per day [6], with the company having 7000 registered applications [7]. It reached the phenomenal number of 600 million by 2015 (footnote 3).
A tweet is a short message of at most 140 characters that may consist of text, photos, web links and videos [15]. A hashtag (like a label or metadata tag) is used on social networking media and other micro-blogging websites to help users find messages and topics with specific content or a specific theme. Users create or use hashtags by putting a "#" sign before a word or a phrase without blanks, and a hashtag can be placed anywhere in a tweet. Internationally, the hashtag became a common writing practice for Twitter posts during the 2009-2010 Iranian election protests, as both English- and Persian-language hashtags became useful for Twitter users inside and outside Iran [36]. On July 2, 2009, Twitter began to hyperlink all hashtags in tweets to Twitter search results for the hashtagged word. In 2010, Twitter introduced "Trending Topics" on its front page, displaying hashtags that are rapidly becoming popular (footnote 4). Trending topics, the most discussed topics on Twitter at a given point in time, have been seen as an opportunity to generate traffic and revenue. Spammers post tweets containing typical words of a trending topic to attract clicks. This kind of spam can devalue real-time search services [37].
Hashtags help users search for relevant tweets on Twitter. Spammers also use hashtags for their own purposes by attaching popular hashtags to spam content. In such cases, the tweet and its hashtags are unrelated. For this reason, knowing how relevant hashtags are to their tweets is important not only for searching relevant information but also for determining the authenticity of that information. The purpose of this work is to find the relatedness of hashtags to the tweets they are attached to. Many semantic relatedness measures have been proposed in the literature, but their effectiveness for finding the relatedness between hashtags and tweets is uncertain. Hence, we compare the findings of these measures on our data collection.
To counter this kind of spam, we need to determine whether the hashtags attached to a tweet are really relevant to it, i.e. whether the hashtags describe the actual content of the tweet or irrelevant hashtags have been attached for commercial benefit or with bad intentions. In this paper, we perform detailed experiments to find the relatedness between a tweet and the hashtags attached to it. We compare the findings of several relatedness measures (already proposed in the literature) for computing this relatedness. The task is complicated by several complex sub-tasks, such as the segmentation of compound hashtags. For this particular sub-task, we use three methods and compare their results.

A. Research Questions
We focus on answering the following three research questions in this work:
• Do hashtags represent their tweets or not?
• Which is the best method for segregating compound hashtags?
• Which relatedness measure is good for estimating relatedness between tweets and hashtags?

I. CONTRIBUTIONS
• We develop a web application that extracts tweets from Twitter on the basis of screen names (users) through the Twitter REST API and saves the tweets into a MySQL database.
• This application can help to find which types of POS are used in the majority of tweets.
• This application can compute the arithmetic mean and the standard deviation of the calculated results, from which we can draw conclusions about hashtags.

II. RELATED WORK
Hashtags have been used for several purposes in Twitter-related work. We discuss some important works in this section.
Piyush Bansal et al. [25] proposed a system that analyzes and segments hashtags and returns Wikipedia pages for the corresponding context of a hashtag and the tweet text given as input. Their system has three components. The first, the segmentation seeder, generates a list of possible segmentations using a variable-length window technique. The second component handles two tasks: feature extraction from the segmentations and entity linking on the segmentations. This component also computes several scores: a bigram score for accurate context matching, a context score to find the maximum contextual similarity with the tweet's content, a capitalization score to exploit the information in hashtags written with capital letters, and a relatedness score to measure the relatedness between the tweet context and the hashtag segmentations. The third component is responsible for ranking the segmentations and produces a ranked list of segmentations along with entity linking [25].
Related work also includes that of Ilknur Celik et al. [26]. Given the widespread use of social networks, research efforts to retrieve information from social network communications using tagging have increased. On Twitter in particular, hashtags are widely used to define a shared context for events or topics. While this is common practice, the hashtags freely introduced by users often become biased. Costa et al. [27] proposed to deal with this bias by clustering similar messages to define semantic meta-hashtags and thereby improve classification. They used user-defined hashtags as the class labels of Twitter messages and applied the meta-hashtag approach to boost the performance of message classification. The meta-hashtag approach is tested on a Twitter-based dataset built by requesting public tweets from the Twitter API. The experimental results, which compare a baseline model based on user-defined hashtags with the clustered meta-hashtag approach, show that the overall classification is improved. The authors conclude that incorporating semantics in the meta-hashtag model can have an impact on different applications, e.g. recommendation systems, event detection or crowdsourcing. Their proposed system has two models: a baseline model that deals with user-defined hashtags, and a meta-hashtag model in which they define meta-hashtags. They obtain a dataset using the Twitter API, use the support vector machine method for classification, and use the van Rijsbergen measure for classification. To evaluate their method, they use a Support Vector Machine to discover the optimal hyperplane between positive and negative cases [27]. Another related work is by Planck et al. [28] and is worth reading.
Kywe et al. [30] studied hashtags by analyzing a Twitter dataset of more than 150,000 users. They proposed a filtering-based method that considers both user preferences and tweet content when selecting hashtags to recommend. They use a collaborative filtering approach, referred to as "user to user" (U2U) filtering, which rates both items and users in order to assign an item to a target user, and another "item to item" approach based on correlation to assign an item to a target user. They also recommend another approach based on measuring the similarity between items by comparing their features [30].
The authors of [31] proposed a novel method that recommends hashtags for tweets written in English. They use a skip-gram model for distributed word representation, which uses a log-linear classifier to predict words within a range. A feed-forward neural network (FFNN) model is used for language modeling and natural language processing (NLP) tasks. They created a neural network that applies a non-linear function at each of its layers. The network has between 300 and 1000 computational units, and the dimension of each input and output layer is fixed at 300. In the feature vector generation component, they obtain a 300-dimensional vector for each word of the tweet, including hashtags, and then average the word feature vectors of a tweet to produce a single tweet feature vector that is used as input to the FFNN. The dimension of this tweet feature vector is the same as that of a word feature vector. Because of the averaging, the tweet feature vector is semantically close to the tweet. They used the Batch Gradient Descent algorithm with Mean Squared Error (MSE) as the objective function.
We have observed that many researchers have exploited the potential of hashtags on Twitter. However, to the best of our knowledge, hardly any attempt focuses on their similarity to the actual content of tweets, i.e. how well hashtags represent the original tweet content. In the next section, we describe our experiments to answer this question.

A. Data Collection
The Twitter REST APIs (footnote 5) provide a facility to extract tweets by user (screen name). The steps to extract tweets are as follows:
• A user account on the Twitter website is required as the initial step towards tweet extraction; we already have a registered Twitter user.
• We create an application on Twitter to obtain the authentication information with which a user authorizes himself: the consumer key, secret key, access token and access secret token. Any user can obtain this confidential information by registering and creating an application on Twitter.
• OAuth, an open standard, authorizes the user with the confidential information mentioned above.
• The next step is to write code that connects to Twitter using the REST API. For this purpose we use the PHP web programming language, with Adobe Dreamweaver CS6 as the editor. The application is hosted on a temporary domain for extracting tweets.
We extract more than 15,000 tweets as a sample data set. We extract general tweets posted in different domains, rather than tweets from a specific domain or community such as politicians or businessmen. These are mixed tweets whose hashtags contain English words, different terminologies, local-language words, symbols and more.

B. Data Processing and Storage
Once the data collection has been prepared, we perform the following steps:
• Remove duplicate tweets,
• Remove digits from hashtags,
• Remove tweets containing words in languages other than English,
• Remove tweets with hashtags that include abbreviations or acronyms.
After this processing, we are left with a total of 8001 tweets to work on. We use the MySQL database management system to store the processed tweets. We store the tweets, the separated hashtags and the segmented hashtag words. The database schema is shown in Figure 1.
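The cleaning steps above can be sketched as follows. This is a minimal illustration in Python rather than the PHP application described in the paper; the function names are hypothetical, the ASCII check is only a crude stand-in for the language filter, and the abbreviation/acronym filter is omitted because the paper does not say how it was decided.

```python
import re

def strip_digits_from_hashtags(text):
    """Remove digits that appear inside hashtags, leaving the rest of the tweet intact."""
    return re.sub(r"#\w+", lambda m: re.sub(r"\d", "", m.group(0)), text)

def looks_english(text):
    """Assumption: treat any non-ASCII tweet as non-English (a crude proxy)."""
    return text.isascii()

def preprocess(tweets):
    """Drop duplicates and non-English tweets, then clean the hashtags."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        if tweet in seen:
            continue  # duplicate tweet
        seen.add(tweet)
        if not looks_english(tweet):
            continue  # tweet contains non-ASCII characters
        cleaned.append(strip_digits_from_hashtags(tweet))
    return cleaned
```

A real language filter would need a dictionary or a language-identification step, since ASCII-only text can still be non-English (for example, romanized words).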

C. Segregating Compound Hashtags
As discussed earlier, the task of computing relatedness between tweets and their hashtags involves some sub-tasks. One of them is to extract the individual words from a compound hashtag. For example, given the hashtag #savetheworld, we need to identify all three individual words it contains, i.e. 'save', 'the' and 'world'. Identifying the correct words is very important because the performance of the main task, finding relatedness between hashtags and tweets, depends on it.

IV. SEGREGATING COMPOUND HASHTAGS
The sub-task of extracting individual words from compound hashtags is performed in two steps. The first step is to extract the hashtags from the tweets themselves. For this purpose, we write simple code that separates hashtags from tweets with 100 percent accuracy. The second step is the separation of compound hashtags. A careful analysis of hashtags reveals that most people use capitalized, well-defined words to form hashtags. However, some use words in lowercase letters, while many others combine lowercase and uppercase letters. We categorized hashtags according to these forms. It is very easy to identify the words in a hashtag written in capitalized form (e.g. #SaveTheWorld), while it is difficult for the other forms. Therefore, we decided to experiment with four different methods to identify and extract the individual words in hashtags. Each method is described in detail in the following sub-sections.
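The first step, separating hashtags from a tweet, might look like the sketch below. It is written in Python for illustration and is not the authors' PHP code. The category labels returned by `case_form` are assumptions based on the forms mentioned in the text (capitalized, lowercase, uppercase, mixed).

```python
import re

HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hashtags(tweet):
    """Return the hashtags found in a tweet, without the leading '#'."""
    return HASHTAG_PATTERN.findall(tweet)

def case_form(tag):
    """Classify a hashtag by its letter case (labels are illustrative)."""
    if tag.islower():
        return "lowercase"
    if tag.isupper():
        return "uppercase"
    if re.fullmatch(r"(?:[A-Z][a-z]+)+", tag):
        return "capitalized"
    return "mixed"
```

For example, `extract_hashtags("Please #SaveTheWorld now")` yields `["SaveTheWorld"]`, which `case_form` labels as capitalized.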

A. Extracting Individual Words Using Regular Expression
This method is based on regular expressions together with custom code, and it is the most successful method because the majority of people use capitalization to form compound hashtags. This method can segment hashtags that contain the following types of words.
We use the pooling method to measure the accuracy of this method on a pool of 1000 hashtags. The accuracy of this method is 911/1000 = 91.1%.
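A regular expression of the kind described here could be sketched as follows. The pattern is an assumption chosen for illustration, since the paper does not give the expression it used. It splits on capital letters, keeps runs of capitals together, and leaves an all-lowercase hashtag unsegmented, which is exactly the case this method cannot handle.

```python
import re

# Capitalized word | run of capitals not followed by lowercase | lowercase run | digits
WORD_PATTERN = re.compile(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+")

def split_capitalized(hashtag):
    """Split a compound hashtag into words using capitalization boundaries."""
    return WORD_PATTERN.findall(hashtag.lstrip("#"))
```

With this pattern, `split_capitalized("#SaveTheWorld")` gives `["Save", "The", "World"]`, while `split_capitalized("#savetheworld")` returns the single token `["savetheworld"]`.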

B. Extracting Individual Words Using Google Search Engine
In this method, we use the Google search engine to separate hashtags. The method is query-based: we issue a unique query to the Google search engine for each hashtag. This method can segment hashtags that contain the following types of words

D. Extracting Individual Words Using Lexicon Method
This method uses code together with a lexicon database to segment hashtags. It is effective at segmenting compound hashtags written in uppercase, capitalized, mixed and especially lowercase letters, but it is not easy to implement. It issues many queries to the database to find possible matches, then generates permutations of all retrieved matches, and finally searches for the correct permutation. This takes a lot of time and processing, and the time and processing cost depends on the hardware. It takes a long time to segment even a two-word hashtag such as #blindlycartouche, which consists of the words blindly and cartouche: the method segments this hashtag successfully but slowly. Compound hashtags may consist of different numbers of words, and each word may have a different length. The method fails if a hashtag contains words that are not present in the WordNet lexicon database. We were unable to implement this method in practice because of its time consumption; the following calculations show how costly it is. We choose random hashtags containing two to five words, with each word two to five characters long. We assume that all hashtag words belong to the lexicon database and that hashtags are two to four words long; the cost of the lexicon method is shown in the tables (Table 5 onwards). We measure the accuracy of this method by testing and analyzing a sample of the first 1000 hashtags, which gives 512/1000 = 51.2%.
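To illustrate lexicon-based segmentation, the sketch below uses dynamic programming over an in-memory word set. This is a different technique from the query-and-permutation procedure described above, swapped in because it avoids enumerating permutations. The small `lexicon` set is a hypothetical stand-in for the WordNet database.

```python
def segment_with_lexicon(hashtag, lexicon):
    """Return the segmentation with the fewest lexicon words, or None if none exists."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] holds the shortest known segmentation of text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in lexicon:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]
```

The loop examines every substring once, so the work grows quadratically with the hashtag length rather than with the number of permutations. As with the method described in the text, a hashtag containing a word missing from the lexicon cannot be segmented and yields `None`.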

E. Extraction of Parts of Speech (POS) from Tweets
At this step, we extract parts of speech (POS) from tweets in order to find the semantic relatedness between segmented hashtags and the content of tweets, which consists of nouns, adverbs, adjectives, verbs and their sub-forms. To perform this task, we used a PHP library named "PHP Tagger", which extracts POS from any given sentence. We pass tweets to the Tagger function, extract nouns, verbs, adverbs, adjectives and their respective sub-forms, and save them into the MySQL database. The types of POS tags considered for matching with hashtags are listed in Table 1.

The semantic relatedness measures used include Hirst & St-Onge (HSO), Lin, Path Length, and Jiang & Conrath (Jcn).

A. Normalization
The algorithms used to match tweets with their hashtags use different scoring mechanisms, so their results cannot be compared unless they are brought to the same scale. We use the following formula to normalize the results:

$$W_i = \frac{X_i - \min(X)}{\max(X) - \min(X)}$$

Here,
• $W_i$ is the normalized value.
• $X_i$ is a value belonging to the variable $X$.
• $\min(X)$ is the smallest value of the variable $X$.
• $\max(X)$ is the largest value of the variable $X$.
As a result of normalization, the results of all algorithms are brought onto a scale from 0 to 1 and hence are comparable. These normalized values are then averaged (arithmetic mean) for each algorithm.
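The normalization and averaging step can be illustrated with a short Python sketch. The function names are hypothetical, and the case in which all scores are equal (where the formula would divide by zero) is handled by returning zeros, which is an assumption not discussed in the paper.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: when every score is identical, map them all to 0.0.
        return [0.0 for _ in scores]
    return [(x - lo) / (hi - lo) for x in scores]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)
```

For instance, the raw scores 2, 4 and 6 normalize to 0, 0.5 and 1, and their normalized mean is 0.5.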

B. Final Results
After normalizing the semantic relatedness scores and averaging them, the final results for the different algorithms are given in Table 3.
Another major reason why our data needs to be evaluated by users is that numerous users do not follow netiquette. For example, they tag irrelevant words with the hash sign (#), which makes it difficult to assign a hashtag to any generic field. Because there are no standards for writing hashtags, people add any kind of hashtag to represent their tweets. Even English-speaking users (the focus of our data set) create hashtags that have no meaning in any sense but are simply preceded by the # sign.
Such hashtags, and many more, are written by users as they wish, without following any standard or rules. People use special characters, pure uppercase or lowercase, mixed case, digits, roman words and so on. Our web application cannot judge all types of hashtags, so we decided to perform a user evaluation.

VII. CONCLUSIONS AND FUTURE WORK
On the basis of our experimental results and user evaluation, we conclude that hashtags mostly represent the tweets they are attached to. As far as the correctness of the semantic algorithms is concerned, our findings are given below:
• Adapted Extended Lesk is found to be the best at finding the relatedness between hashtags and tweets, with a score of 0.701.
We keep the following points as part of our future work:
• Integrate other Twitter APIs, especially the search API, to obtain a large number of screen names as well as tweets.
• Write code to search for screen names by any given keyword and save them into the database.
• Write code to deal automatically with Twitter's data rate limit restriction, to save time and get better results.
• Upgrade the web application to deal with every kind of compound hashtag in lowercase and uppercase.
• Upgrade the web application to deal with roman hashtags as well as hashtags written in languages other than English.
• Improve the lexicon method to get better results in every aspect, such as speed, accuracy and time.
• Improve the regular expression method so that it also handles hashtags that are not written in capitalized form.

• This application can separate hashtags from tweets and save them into the selected database.
It is a hybrid method that combines the regular expression method with the Google search engine method. First, we apply the regular expression. While segmenting hashtags, it checks every segmented word against the WordNet lexicon database, which we have already stored in our local schema. If a match is found, the word is stored in a separate "splithashwords" table with flag value 0; if no match is found, the word is stored in the same table with flag value 2. After the regular expression pass finishes, the second method, the Google search engine, starts: it selects all words with flag value 2 from the table and tries to segment them. Successfully segmented words are saved into the same table with flag value 0, and the compound word with flag value 2 is deleted once the search completes; otherwise it is kept. Its accuracy is much higher than that of the first two methods. We measure the accuracy of this method by testing and analyzing a sample of the first 1000 hashtags, which gives 978/1000 = 97.8%.
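The two-pass flow with flag values 0 and 2 might be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the regular expression is a guess, `lexicon` is an in-memory stand-in for the WordNet table, the returned list stands in for rows of the "splithashwords" table, and `fallback` is a hypothetical callable representing the Google search step (no real search API is called).

```python
import re

WORD_PATTERN = re.compile(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+")
FLAG_RESOLVED = 0    # word found in the lexicon
FLAG_UNRESOLVED = 2  # word not found; needs the second pass

def hybrid_segment(hashtag, lexicon, fallback):
    """Return (word, flag) rows after the regex pass and the fallback pass."""
    # Pass 1: split on capitalization and flag each piece against the lexicon.
    rows = []
    for piece in WORD_PATTERN.findall(hashtag.lstrip("#")):
        flag = FLAG_RESOLVED if piece.lower() in lexicon else FLAG_UNRESOLVED
        rows.append((piece, flag))
    # Pass 2: try to segment every unresolved piece with the fallback.
    result = []
    for word, flag in rows:
        if flag == FLAG_RESOLVED:
            result.append((word, FLAG_RESOLVED))
            continue
        pieces = fallback(word)
        if pieces:
            # Replace the compound word by its successfully segmented parts.
            result.extend((p, FLAG_RESOLVED) for p in pieces)
        else:
            # Segmentation failed: keep the compound word with flag 2.
            result.append((word, FLAG_UNRESOLVED))
    return result
```

In a test setting, `fallback` can be as simple as a dictionary lookup, e.g. `{"savetheworld": ["save", "the", "world"]}.get`.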

Table 1: Part of Speech

Table 2: Data Size after Processing

Table 3: Algorithm-wise results

• Improve the Google Search Engine method to find the context or meaning of abbreviations and acronyms.
• We want to restructure our database schema for better performance.
• Improve our hashtag segmentation methods and our web application to handle
  o Hashtags including abbreviations and acronyms.
  o Hashtags having words or symbols from languages other than English.
  o Hashtags in pure lowercase, i.e. not using capitalization.
  o Hashtags mixing abbreviations, acronyms and English words.
  o Hashtags mixing English words and Roman Urdu words.
  o Hashtags in Roman Urdu (Pakistanis and Indians write their Urdu and Hindi in English letters).

Table 5: Cost using two words in hashtags

Table 6: Cost using three words in hashtags

Table 7: Cost using four words in hashtags

Table 8: Average Cost of lexicon method

Table 9: Cost of overall dataset using lexicon method