Micro-Blog Emotion Classification Method Research Based on Cross-Media Features

Although sentiment analysis of tweets has attracted increasing attention in recent years, most existing methods analyze only the text. Because emotional expression is often ambiguous, users tend to mix modalities, such as words and images, to express their feelings. This paper proposes a tweet emotion classification method based on fused features, which combines textual and image features. Because the data are sparse and the classification features are highly redundant, we adopt canonical correlation analysis (CCA) to reduce the dimensionality of the text emotional features and the image features. The reduced data retain as much as possible of the high-level semantic correlation between the text features and the emotional image features, and a support vector machine (SVM) is trained and tested on the fused feature set. Experiments on Sina tweet data show that the algorithm achieves better classification results than methods based on a single type of feature.
Keywords: tweet sentiment classification; CCA; text emotion; image emotion


I. INTRODUCTION
In recent years, social networks such as Twitter, Facebook, Sina tweet and Tencent tweet have developed rapidly. According to reports, the number of registered Twitter users had exceeded 500,000,000 by July 1, 2012. Tweet services attract more than 500,000,000 users and carry about 100,000,000 messages every day. As the number of users grows, tweets are gaining attention and have wide room for development. Mining the emotional value of tweets has been studied extensively and applied in many fields, such as business advertising, social network analysis, public opinion monitoring and accident cause analysis. At present, practical sentiment analysis techniques fall into two categories: (1) dictionary-based methods [1], which count the positive and negative emotional words in a text with the help of an emotional dictionary and determine the emotional polarity of the text from the difference between the two counts; and (2) machine learning methods [2], which label a training corpus and a test corpus and classify emotions with classifiers such as the support vector machine (SVM), maximum entropy and KNN. Wang et al. [3] constructed a Twitter sentiment analysis system that analyzes the emotional tendency of comments about the presidential election in real time. Agarwal et al. [4] used the polarity and part of speech of words as features and studied emotion classification of tweet text with a tree kernel model, with some success. Jiang et al. [5] classified emotional polarity as positive or negative using both topic-dependent and topic-independent approaches. Zhiming Liu et al. [6] studied three machine learning algorithms, three feature selection methods and three feature weighting methods for micro-blog emotion classification, but their method does not consider the impact of emoticons on the emotional polarity of the whole micro-blog message. Lixing Xie et al. [7] proposed a multi-strategy, hierarchical sentiment analysis method for micro-blogs that improves classification results over emoticon-based rules, but it ignores the polarity characteristics of the micro-blog text.
A new trend in micro-blog messages is the growth of visual content: users often post status text together with pictures. This is especially common among mobile phone users, who find it more convenient to express a mood with a photograph than with lengthy words. Fig. 1, for example, contains two posts: the one on the left is positive and the one on the right is negative. The emotion of each message is expressed more clearly by the image than by the words, which shows that images are meaningful for micro-blog emotion classification. To understand how visual emotion is conveyed, Borth et al. [8] put forward a visual emotional ontology that contains a library of emotion detectors. Their methods concentrate on image analysis, but the accuracy of image emotion analysis relies on the accuracy of the image semantic labels and of the machine learning. Weining Wang et al. [9] studied the relationship between image lines and emotion, chose the line direction histogram to describe "dynamic" and "static" emotional images, and completed a line-based emotion classification. Dai [10] studied the HSV color components and the texture parameters of the gray-level co-occurrence matrix and found the effects of texture on five kinds of emotion. H. W. Yoo [11] proposed feature extraction methods that combine color and texture features as the core technology, and set up a general framework for emotion-based image retrieval. T. Hayashi et al. [12] first segmented an image into L*L blocks, then took the average color of each block as the color feature of the image, and finally mapped the low-level image features to emotional keywords with a neural network. Weiwei Lu [13] used CSIFT to generate emotional visual words and combined them with a global HSV color histogram to form a multi-feature semantic representation of the image. Xia Mao et al. [14] considered the link between image fluctuations, emotional reactions and the 1/f fluctuation theory, and obtained the relationship between image features and emotion information from the power spectrum characteristics of the image. Yali Fu [15] investigated the distinctive shape characteristics of wood, extracted wood image texture features such as directivity, roughness, strength and contrast, extracted color features in the L*a*b color space, and classified wood images into the "gorgeous" and "natural" emotions.
Existing emotion classification for micro-blogs primarily considers the emotion of the text. Because a growing number of micro-blog users express their feelings through images, we consider the emotional characteristics of images and text together. The textual features mainly comprise emoticons, emotional dictionary words and network language; the image emotional features comprise color, texture and shape. We classify micro-blog emotions with an SVM. The main contributions of this paper are: (1) we consider textual and image emotional characteristics more comprehensively and classify the emotions of Chinese micro-blogs accurately; (2) we select features with canonical correlation analysis, so that after dimension reduction the features retain as much as possible of the semantic emotional correlation between the original text and image feature matrices. The proposed algorithm reduces redundancy and improves efficiency and accuracy.
The rest of this paper is organized as follows. Sections II and III extract the textual and image emotional features. Section IV reduces the dimension of the original features with CCA. Section V presents the experimental results and analysis of the new approach. Finally, we summarize the main results of the paper.

II. TEXT EMOTIONAL CHARACTERISTICS IN TWEET
We summarize related research and extract several emotional features of micro-blog text that reflect its unique characteristics.

A. The characteristic of emotional dictionary
HowNet (known as 知网 in Chinese), established by Zhendong Dong and Qiang Dong, is a commonsense knowledge base that describes the concepts represented by Chinese and English words [16]. In HowNet, each concept and its description form one record, and each word is explained by a number of concepts. A concept is a semantic description of a word, also called a meaning, and each concept is explained by several sememes (primitive meanings). The HowNet set of Chinese emotional analysis words has six parts, including 3730 positive evaluation words, 3116 negative evaluation words, 836 positive emotional words and 1254 negative emotional words, as well as opinion words and degree-level words. The National Taiwan University Sentiment Dictionary (NTUSD), organized and published by National Taiwan University, is available in traditional and simplified Chinese and contains 2810 positive words and 8276 negative words [17]. This paper selects the simplified Chinese version as the emotional dictionary for feature extraction.
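As an illustration of how a dictionary feature of this kind might be computed, the following minimal sketch counts positive and negative dictionary hits in one message. The function name `lexicon_features` and the tiny word sets are hypothetical and are not entries quoted from NTUSD or HowNet; plain substring matching is also an assumption, since the paper does not state how words are matched or whether the text is segmented first.

```python
def lexicon_features(text, positive_words, negative_words):
    """Return (positive count, negative count, polarity sign) for one message.

    Assumption: words are matched as plain substrings, so no word
    segmentation is needed; a real system would likely segment first.
    """
    pos = sum(text.count(word) for word in positive_words)
    neg = sum(text.count(word) for word in negative_words)
    polarity = (pos > neg) - (pos < neg)  # +1, 0 or -1
    return pos, neg, polarity


# Toy lexicons for illustration only (hypothetical, not dictionary excerpts).
POSITIVE = {"漂亮", "幸运"}
NEGATIVE = {"难受", "可怕"}

if __name__ == "__main__":
    print(lexicon_features("今天真幸运，烟花很漂亮", POSITIVE, NEGATIVE))
```

The two counts can be used directly as feature values, and the sign of their difference corresponds to the dictionary-based polarity rule described in the introduction.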

B. The features of emoticon
The Sina micro-blog platform provides a set of default emoticons. In the crawled text, an emoticon appears as its name enclosed in square brackets; for example, one emoticon image corresponds to the text "[happy]". A message may contain multiple emoticons.
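Because emoticons appear in the crawled text in bracketed form, they can be pulled out with a regular expression. The sketch below is one possible way to do this; the pattern, the function names and the example emoticon names other than "[happy]" are assumptions made for illustration.

```python
import re

# Matches a bracketed emoticon name such as "[happy]" and captures the name.
EMOTICON_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_emoticons(text):
    """Return the emoticon names found in the text, in order of appearance."""
    return EMOTICON_PATTERN.findall(text)


def emoticon_features(text, positive_names, negative_names):
    """Count emoticons that belong to hypothetical positive and negative name sets."""
    names = extract_emoticons(text)
    pos = sum(name in positive_names for name in names)
    neg = sum(name in negative_names for name in names)
    return pos, neg


if __name__ == "__main__":
    message = "Great day [happy][happy] but traffic [angry]"
    print(extract_emoticons(message))
    print(emoticon_features(message, {"happy"}, {"angry"}))
```

Repeated emoticons are counted separately, which matches the observation that a message may contain several of them.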

C. Network language features
With the rapid development of the Internet, online communication has generated a great deal of novel network language. We collected 16 positive emotional network words and 24 negative emotional network words. Some of these words are shown in Table II.

III. IMAGE EMOTIONAL FEATURES OF MICRO BLOG
Low-level features such as color, texture and shape can express rich emotional information; different colors, textures and shapes arouse different associations and emotional reactions. However, not every low-level feature is needed, and the selection of low-level image features has a vital effect on the high-level affective semantic expression of the image [18].

A. Color feature
Color is a visual feature of an object's surface, a basic element of image content and one of the main perceptual features in human recognition. Of all the visual features of an image, color can be said to be the most emotional. In general, a conspicuous color attracts people's attention and evokes certain subjective feelings.
Color is represented in a color space. The HSV color space conforms to human visual and psychological perception, its color features are not affected by illumination or viewing angle, and its quantized form yields a visual feature of small dimension. The HSV model is therefore appropriate for representing the color feature when classifying the emotional semantics of images. In this paper, we use a 64-bin histogram based on the HSV color space to represent the color feature of the image. According to visual discrimination ability, the hue H, saturation S and brightness V are quantized into 16, 4 and 1 levels respectively. Among the three components, the sensitivity of the human visual system increases in the order V, S, H, so the three quantized components are combined, with weights determined by their quantization levels, into a one-dimensional feature value L.
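The quantization and combination step can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact formula: each component is quantized uniformly (the paper's own quantization boundaries are not available here), and the combined index is taken as L = H·Q_S·Q_V + S·Q_V + V with Q_S = 4 and Q_V = 1, a common way to weight components by their quantization levels. All function names are hypothetical.

```python
import colorsys

H_BINS, S_BINS, V_BINS = 16, 4, 1  # quantization levels stated in the text


def quantize(value, bins):
    """Map a value in [0, 1] to an integer level in [0, bins - 1] (uniform, assumed)."""
    return min(int(value * bins), bins - 1)


def hsv_bin(r, g, b):
    """Return the one-dimensional bin index L of an RGB pixel with 8-bit channels."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hq = quantize(h, H_BINS)
    sq = quantize(s, S_BINS)
    vq = quantize(v, V_BINS)
    # Assumed weighting: L = H * Q_S * Q_V + S * Q_V + V
    return hq * S_BINS * V_BINS + sq * V_BINS + vq


def hsv_histogram(pixels):
    """Return the normalized 64-bin histogram of an iterable of (r, g, b) pixels."""
    hist = [0.0] * (H_BINS * S_BINS * V_BINS)
    count = 0
    for r, g, b in pixels:
        hist[hsv_bin(r, g, b)] += 1.0
        count += 1
    return [x / count for x in hist] if count else hist


if __name__ == "__main__":
    pixels = [(255, 0, 0), (255, 0, 0), (0, 255, 0), (0, 0, 0)]
    histogram = hsv_histogram(pixels)
    print(len(histogram), histogram[3], histogram[23])
```

With one brightness level, the index reduces to 4H + S, so the 16 hue levels and 4 saturation levels give exactly 64 bins.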

B. Texture feature
Texture reflects homogeneous visual patterns in an image; it is usually irregular locally but regular as a whole, as in clouds or distant lakes. The coarseness, unevenness and other characteristics of texture can evoke psychological responses and emotional perception. Material determines the structure of an object's surface, so objects of different materials create different psychological feelings.
Tamura texture features were proposed on the basis of psychological research on human visual perception of texture [19]. They comprise six components corresponding to six psychological properties of texture: coarseness, contrast, directionality, linelikeness, regularity and roughness. The first three components are visually intuitive and can directly give rise to psychological changes and emotional reactions. This paper therefore extracts the coarseness, contrast and directionality of Tamura texture to represent the texture feature of images. The calculation of the three texture features is described in [20].
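Of the three components, contrast is the simplest to illustrate. The paper defers the calculation to [20]; the sketch below uses the commonly cited Tamura definition, contrast = σ / α4^(1/4), where σ is the standard deviation of the gray levels and α4 = μ4 / σ^4 is their kurtosis. Whether [20] uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def tamura_contrast(gray):
    """Tamura contrast of a grayscale image given as a list of rows of intensities.

    Assumed definition: sigma / kurtosis ** 0.25, with kurtosis = mu4 / sigma ** 4.
    Returns 0.0 for a perfectly flat image.
    """
    values = [float(v) for row in gray for v in row]
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0.0:
        return 0.0
    mu4 = sum((v - mean) ** 4 for v in values) / n
    kurtosis = mu4 / variance ** 2
    return math.sqrt(variance) / kurtosis ** 0.25


if __name__ == "__main__":
    strong = [[0, 255], [255, 0]]      # black and white checkerboard
    weak = [[100, 150], [150, 100]]    # low-contrast checkerboard
    print(tamura_contrast(strong), tamura_contrast(weak))
```

A high-contrast checkerboard scores higher than a low-contrast one, which is the behavior the feature is meant to capture.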

C. Analysis of image features in different corpora
These images contain a wealth of features and other semantic information [24]. We assume that color and texture features can discriminate between micro-blog emotion classes. We verified this assumption experimentally and found that the color and texture features of images in different micro-blog corpora clearly separate the emotion classes, which benefits micro-blog emotion classification. Fig. 2 shows how each visual feature is distributed over the different emotions under the proposed model. For example, images in the Happiness category tend to have high saturation and high brightness contrast, which convey a sense of peace and joy. In contrast, images in the Sadness category tend to have lower saturation and saturation contrast, which convey a sense of dullness and obscurity. Sad images also have low texture complexity, which gives a feeling of pithiness and coherence. The distributions of the color and texture feature values differ significantly between the two types of micro-blog corpus, so these two features are clearly discriminative and can be used for classification with good results.

IV. MICRO-BLOG FEATURE REDUCTION AND FUSION
Each micro-blog feature reflects, to a different degree, part of the information about the problem under study, but redundant features increase the amount of computation and the complexity of the problem. We therefore aim, through quantitative analysis, to express more information with a smaller feature subset, which is the purpose of feature selection. To fully exploit the advantages of each feature selection method, this paper proposes a new feature fusion algorithm based on CCA that combines two types of selection method to obtain effective fused features. The classification process is shown in Fig. 3. Feature selection methods can be divided into two categories: supervised and unsupervised. Common supervised feature selection methods include Document Frequency (DF), Information Gain (IG), the Chi-Square Statistic (CHI) and Mutual Information (MI) [21]. IG, CHI and MI all measure the degree of correlation of a feature item, but IG considers the importance of a feature item for the categories as a whole. The DF method uses thresholds to select features that are representative and strongly discriminative, using the class discrimination level to measure the importance of the features. These four methods select features that have a major impact on classification from different perspectives, but their drawback is that the metrics depend on the category labels of the corpus.
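The thresholding idea behind DF selection can be sketched in a few lines. This is a simplified illustration, assuming each message has already been converted to a list of feature tokens; the function name and the single `min_df` threshold are hypothetical, and the class discrimination measure mentioned above is not modeled here.

```python
from collections import Counter


def select_by_document_frequency(documents, min_df):
    """Keep the features that occur in at least `min_df` documents.

    `documents` is a list of token lists; each feature is counted once
    per document, however many times it appears in that document.
    """
    document_frequency = Counter()
    for features in documents:
        document_frequency.update(set(features))
    return {f for f, count in document_frequency.items() if count >= min_df}


if __name__ == "__main__":
    docs = [["happy", "sun"], ["happy", "rain"], ["happy", "sun", "sun"]]
    print(sorted(select_by_document_frequency(docs, min_df=2)))
```

Features that survive the threshold would then be passed to the CCA step described next.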
This paper first uses the DF method to select features with good class discrimination, and then uses the CCA method [22] to reduce redundancy between features while preserving as much of the original feature information as possible. After dimension reduction, the emotional semantic correlation between the text and image features is maintained. CCA is commonly used in cross-media retrieval [23].
Different types of multimedia data can jointly express similar emotional semantics, such as a "smog" image and the text "The fog is too terrible". In the statistical sense, for "terrible smog" there is a correlation between the corresponding image data and text data, and this section uses canonical correlation analysis to capture it. One field variable, with n samples and p variables, is denoted $X_{n \times p}$; the other, with n samples and q variables, is denoted $Y_{n \times q}$. Taking maximal extraction of the main correlated features of X and Y as the criterion, a combination variable L is extracted from X and a combination variable M is extracted from Y, as follows:

$$L = X W_X, \quad M = Y W_Y \tag{1}$$

where $W_X$ and $W_Y$ are the feature vector spaces, also known as the canonical variables. According to equation (1), the correlation between the many field variables in X and Y is reduced to the correlation between the fewer combination variables L and M, and the distribution of their values determines the form of the spatial correlation between X and Y. In turn, the values of $W_X$ and $W_Y$ determine the importance of the corresponding variables. The problem therefore reduces to finding the canonical variables: the correlation coefficient is defined as $\rho = r(L, M)$, and it is optimized subject to the constraint in equation (3).
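The constrained maximization described above can be written out in the standard CCA form. The following is a sketch under that assumption, for a single pair of weight vectors $w_X$ and $w_Y$ (one column each of $W_X$ and $W_Y$). The covariance symbols $\Sigma_{XX}$, $\Sigma_{YY}$, $\Sigma_{XY}$, the multipliers $\lambda_1$, $\lambda_2$ and the objective $J$ are notation introduced here for illustration and are not taken from the paper's own equations.

```latex
% Assumed: X and Y are column-centered; \Sigma_{XX}, \Sigma_{YY}, \Sigma_{XY} are
% sample covariance matrices and \Sigma_{YX} = \Sigma_{XY}^{\top}.
\begin{aligned}
\rho &= r(L, M)
      = \frac{w_X^{\top}\Sigma_{XY}\,w_Y}
             {\sqrt{w_X^{\top}\Sigma_{XX}\,w_X}\,\sqrt{w_Y^{\top}\Sigma_{YY}\,w_Y}}, \\
&\text{subject to}\quad w_X^{\top}\Sigma_{XX}\,w_X = 1, \qquad w_Y^{\top}\Sigma_{YY}\,w_Y = 1, \\
J &= w_X^{\top}\Sigma_{XY}\,w_Y
     - \tfrac{\lambda_1}{2}\bigl(w_X^{\top}\Sigma_{XX}\,w_X - 1\bigr)
     - \tfrac{\lambda_2}{2}\bigl(w_Y^{\top}\Sigma_{YY}\,w_Y - 1\bigr), \\
\frac{\partial J}{\partial w_X} = 0 &\;\Rightarrow\; \Sigma_{XY}\,w_Y = \lambda_1\,\Sigma_{XX}\,w_X, \\
\frac{\partial J}{\partial w_Y} = 0 &\;\Rightarrow\; \Sigma_{YX}\,w_X = \lambda_2\,\Sigma_{YY}\,w_Y, \\
\lambda_1 = \lambda_2 = \rho &\;\Rightarrow\;
     \Sigma_{XX}^{-1}\Sigma_{XY}\,\Sigma_{YY}^{-1}\Sigma_{YX}\,w_X = \rho^{2}\,w_X .
\end{aligned}
```

Under this formulation, the m eigenvectors with the largest eigenvalues $\rho^2$ would form the columns of $W_X$, and the matching columns of $W_Y$ follow from the second stationarity condition.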

B. Results and analysis of the experiment
For each micro-blog dataset, 70% of the data is randomly selected as the training set and the remaining 30% is used as the test set. Performance is assessed by classification accuracy, i.e., the percentage of correctly classified micro-blogs in the data set. Twenty experiments are run for each algorithm, and the classification accuracy of each experiment and the average accuracy are recorded. The simulation is implemented with the Matlab 2011a toolkit on Windows 7. In the results figure, the horizontal axis represents the experiment number (20 in total) and the vertical axis represents classification accuracy. The three methods compared are the text method, the text-image (TI) method and the CCA + TI method. The results show that the average accuracy of the text method and the text-image method is similar: although image emotional features are introduced, the increased feature redundancy and feature dimension prevent the accuracy from improving. The CCA method reduces the dimension of the emotional features to obtain the maximally correlated feature matrix, which is then classified. By comparison, accuracy improves by 4%. This illustrates that, after the method is applied, the statistical correlation between the text emotional features and the image emotional features is maintained after dimension reduction, redundancy among the emotional features is further reduced, and accuracy improves.
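The evaluation protocol described above (random 70/30 split, repeated 20 times, accuracy averaged) can be sketched independently of the classifier. The paper's experiments were run in Matlab with an SVM; in this Python illustration the function names are hypothetical and `majority_baseline` is only a stand-in so that the example runs without any external library.

```python
import random


def repeated_holdout_accuracy(samples, labels, train_and_predict,
                              runs=20, train_fraction=0.7, seed=0):
    """Return (per-run accuracies, mean accuracy) over random train/test splits.

    `train_and_predict(train_x, train_y, test_x)` must return one predicted
    label per test sample; it stands in for whatever classifier is evaluated.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    split = int(len(indices) * train_fraction)
    accuracies = []
    for _ in range(runs):
        rng.shuffle(indices)
        train, test = indices[:split], indices[split:]
        predicted = train_and_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in test],
        )
        correct = sum(p == labels[i] for p, i in zip(predicted, test))
        accuracies.append(correct / len(test))
    return accuracies, sum(accuracies) / len(accuracies)


def majority_baseline(train_x, train_y, test_x):
    """Stand-in classifier: always predict the most common training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)


if __name__ == "__main__":
    xs = list(range(100))
    ys = [1 if x < 80 else 0 for x in xs]
    per_run, mean_accuracy = repeated_holdout_accuracy(xs, ys, majority_baseline)
    print(len(per_run), round(mean_accuracy, 3))
```

Fixing the seed makes the 20 splits reproducible, so different feature sets can be compared on identical partitions.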

VI. CONCLUSION
To address the sparsity of micro-blog text emotional features, we propose a novel approach to micro-blog sentiment analysis based on a CCA cross-media model (CBM). Previous research has focused on text emotion and neglected the effect of image emotion, even as the number of images in messages grows. Because more and more people express their feelings through images in micro-blogs, we take images into account in our model. Our method has three advantages. First, the sentiment of a message is analyzed by combining its images and text. Second, the model gives a unified representation of text and images for cross-media sentiment analysis using the CCA method. Finally, we use Logistic Regression to relax the conditional independence assumption. Experimental results illustrate the effectiveness of our model, whose classification accuracy is 4% higher than that of the text-based method.

Fig. 1. Micro-blog messages with images. Left: "It is so lucky, beautiful fireworks, Liuyang fireworks are awesome." Right: "People report that the air makes their eyes burn and their throats uncomfortable."

Fig. 2. Image interpretations. We demonstrate how each visual feature is distributed over each category of images under the proposed model. The visual features include saturation (SR), saturation contrast (SRC), bright contrast (BRC), cool color ratio (CCR), figure-ground color difference (FGC), figure-ground area difference (FGA), background texture complexity (BTC), and foreground texture complexity (FTC).

Fig. 3. Classification process.

$X_{n \times p}$ denotes the visual emotion feature matrix extracted from the micro-blog image data set, and $Y_{n \times q}$ denotes the emotional text feature matrix extracted from the text data set. The correlation between the two field variables X and Y is therefore defined as follows. The field variables with n samples are denoted $X_{n \times p}$ (p variables) and $Y_{n \times q}$ (q variables). Combining formulas (2) and (3) and applying the Lagrange multiplier method, the canonical variables can be obtained. Then, according to formula (1), we obtain the smallest variable combinations $L_{n \times m}$ and $M_{n \times m}$ that maximally reveal the correlation between $X_{n \times p}$ and $Y_{n \times q}$.

V. EXPERIMENT
This article aims to classify as positive or negative the texts related to the most discussed topics on Sina micro-blog. Data about film and television, people's life and products were collected. For each field, 2000 micro-blog comments were chosen as the corpus, and labels were added manually. The results are listed in

TABLE II. PART OF THE NETWORK EMOTION WORDS

TABLE III. THE STATISTICAL RESULTS OF CORPUS

A. Validation method
This paper uses accuracy, recall, F-value and macro-averaging as the evaluation parameters. Suppose classify