A Comprehensive Science Mapping Analysis of Textual Emotion Mining in Online Social Networks

Textual Emotion Mining (TEM) tackles the problem of analyzing text in terms of the emotions it expresses or evokes. It comprises a series of approaches, methods, and tools that help in understanding human emotions. Such understanding plays a pivotal role in developing systems that meet human needs, and the topic has drawn significant interest from researchers worldwide. This article carries out a science mapping analysis of the TEM literature indexed in the Web of Science (WoS) to provide quantitative and qualitative insight into TEM research. To explain the evolution of mainstream content, various bibliometric indicators and metrics are used to identify annual publication counts, authorship patterns, and the performance of countries/regions and institutes. To supplement this study, several types of network analysis are also performed, including co-citation analysis, co-occurrence analysis, bibliographic coupling, and co-authorship pattern analysis. Additionally, a fairly comprehensive manual analysis of top-cited and most-used journal and proceedings papers is conducted to understand the growth and evolution of this domain. To the best of the authors' knowledge, this manuscript provides the first thorough investigation of the research status of TEM through a bibliometric examination of scientific publications. The results will allow TEM researchers to uncover growth patterns, seek collaborations, improve their selection of research topics, and gain a holistic view of the aggregate progress in the domain. The presented facts and analysis will help the research community carry out future studies.
Keywords: Emotion mining; emotion models; bibliometric analysis; science mapping analysis; co-citation analysis; network analysis


I. INTRODUCTION
With the rapid proliferation of social media services in recent years, the amount of data stored in electronic media is increasing exponentially. In this era of digitization, most people have an online life alongside their daily routine, in which they show an insatiable desire to share their opinions, thoughts, ideas, and feelings. This has created a large amount of User Generated Content (UGC), in which researchers are taking an active interest [1]. This user data is of paramount importance to computer science researchers because it is a key to unlocking the potential of computing in which machines can understand highly emotional human beings and respond and assist accordingly. A great deal of online social media communication is textual; hence, Text Analysis, Opinion Mining, Sentiment Analysis, and Emotion Mining come into play. All of these areas are reasonably mature except Emotion Mining [2].
Emotions are affective-cognitive states that are fundamental to the human experience and present in every communication. Mining these emotional states is an interesting topic with wide theoretical and practical applications. In the neurosciences, emotion mining can support a deeper understanding of a patient's mental health, including the detection of stress, anxiety, depression levels, and mental health disorders, which can help to adapt medication and, in extreme cases, prevent suicides. In customer service, customer satisfaction is the utmost priority for a company selling its products and services. Emotion mining can not only help to gauge customer satisfaction but also help to introduce improvement measures and study their impact on users. A successful attempt at mining user emotions can lead to smart user interfaces that understand and respond to human emotions.
According to psychological studies, every human action has one or more emotion(s) attached to it, for example, writing, reading, facial expressions, speech, music, body movements, and gestures. Emotion mining can be performed on each of these media, and each is a separate field of study with its own research challenges. Research efforts in this domain date back to the early 1990s; however, they were limited to audio and video data captured using various sensors [3]. With the advent of Web 2.0, most user data is in the form of text, and its great potential in affective computing has driven the growth of Textual Emotion Mining (TEM). Emotion mining from text is more challenging than from any other medium because text lacks the cues that are implicit in audio and video data. This paper considers only the problem of TEM.
Research in the field of TEM started attracting the attention of affective computing researchers with the work of Alm et al. [4] in 2005. They targeted the narrative text of children's fairy tales, automatically classifying fairy tale sentences into one of Ekman's six emotional categories. Their work was followed by a massive amount of literature targeting the classification of textual emotions in a variety of data domains, including news headlines, news articles, web blogs, novels, chat messages, microblog texts, and suicide notes. Owing to this enormous growth, the existing literature opens up many research avenues but also creates information overload, making it difficult to obtain a clear picture of the process of TEM. Taking into account the substantial accomplishments of TEM research and the strength of bibliometric and scientometric techniques [5]-[8], this paper aims to chart a landscape of the TEM domain visually and to scrupulously examine the evolution of research in this sector. Specifically, the present study applies a scientific method to carry out a systematic bibliometric analysis of TEM-related academic publications over the past 15+ years (Jan 2005 to Apr 2020). The results will enable scholars to understand the knowledge structure and recent trends of TEM research and to decide on or alter the direction of further study.
Currently, there is no scientific and comprehensive analysis of TEM research from a quantitative and statistical perspective. Therefore, this article employs different bibliometric methods [7], [9] to address a set of research questions. Besides these research questions, this manuscript also presents a manual analysis of the top-cited papers of this domain to discuss the major approaches, emotion models, and data sources used in these studies. It also reports the level at which emotion analysis was done, listing the datasets and lexicons utilized. The motivation and major contributions of the proposed work are as follows. The article attempts to answer the above-mentioned questions. The answers to these queries may prove to be of significant importance for understanding the emergence and development of the field of TEM. They provide a visualization of the evolution trends of the domain and an understanding of various aspects of TEM research. Readers of this manuscript will be able to trace the panorama of the TEM research field.
The contribution of this paper is fourfold. First, it attempts to help readers understand the concept and terminology of emotion. For many years, the term "emotion" was not properly understood or was used synonymously with terms like sentiment and mood. Second, it demonstrates the progress of the TEM domain across the demi-decades since 2005. Third, the use of various bibliometric indicators sheds light on the TEM literature from various angles by documenting the most popular authors, publication venues, top institutions, etc., leaving newcomers with an indication of venues that welcome the topic. Fourth, by reviewing the top-cited papers according to WoS, it tries to show the hallmarks of TEM research.
This paper is organized into six sections. Section 1 introduces the field, giving insights into the basic definitions, and discusses the motivation behind this study. Section 2 discusses the preliminary background, listing the fundamental concepts that ground the TEM literature. Section 3 explains the methodology used to collect and analyze the data. Section 4 describes the empirical findings from the science mapping of the TEM field. A comprehensive manual analysis of the TEM field is provided in Section 5. Section 6 presents the conclusion, discusses the limitations, and highlights future work.

II. RELATED WORK AND BACKGROUND
This section intends to present the preliminary concepts that describe the origin and significance of this domain. It also presents the related work describing the previous survey articles on TEM published so far.

A. The Concept of Emotion
Before recognizing emotions in text, we should seek an answer to a very important question: what do we understand by emotion? This is considered to be the first step towards developing any effective emotion mining system. Kleinginna and Kleinginna [10] reviewed 92 different definitions of emotion and suggested this broad formal definition: "Emotion is a complex set of interactions among subjective and objective factors, mediated by neural/hormonal systems, which can (a) give rise to affective experiences such as feelings of arousal, pleasure/displeasure; (b) generate cognitive processes such as emotionally relevant perceptual effects, appraisals, labelling processes; (c) activate widespread physiological adjustments to the arousing conditions; and (d) lead to behaviour that is often, but not always, expressive, goal-directed, and adaptive."

B. Emotion-Related Terms
Socrates [11] is quoted as saying, 'The beginning of wisdom is the definition of terms'. Research in the area of emotion mining revolves around many words that look synonymous but differ considerably in meaning. These include subjectivity terms like opinions, sentiments, feelings, emotions, and affect, which are commonly used interchangeably in most of the literature. However, a proper understanding of these terms and a clear differentiation among them is crucial. Scherer [12] also stated that inconsistencies in the definitions of emotion-related terms lead to failure in their proper apprehension and usage. It is also noted that blurred definition boundaries often introduce unwanted noise into scientific investigation and hence lower the performance of automatic emotion detectors. Having noted the interchangeable use of the above-mentioned emotion-related terms, this part of the section explores these terms and attempts to distinguish between them. Table I presents a comparison of these terms (affect, opinion, sentiment, emotion, and mood) for better understanding and proper apprehension.

C. Related Work
This study intends to present an exploratory analysis, investigating the field of textual emotion recognition by pulling together most of the existing literature of this domain. Although there exist some surveys devoted to the topic of TEM, these lack the perspective of bibliometric inspection of literature.
One of the earliest surveys on TEM is the contribution of Kao et al. [13]. They classified emotion mining works into three categories, namely keyword-based, learning-based, and hybrid methods. Another work by Binali and Potdar [14] discussed the then-current emotion theories and techniques that lay the ground for textual emotion recognition. They also designed an evaluation framework for the meticulous evaluation of existing approaches. Jain and Kulkarni [15] presented a review of the TEM literature, listing some information retrieval methods utilized for research in text mining, and then proposed a system, TextEmo. Tripathi et al. [16] reported the different approaches, datasets, and lexicons that have been used by TEM researchers to bring about a collective understanding of this domain. Another detailed survey article dedicated to this domain is given by Yadollahi et al. [2], who presented the current state of text sentiment analysis from opinion mining to emotion mining. Their study documented the sentiment analysis literature from a new perspective, i.e., with an emphasis on emotion mining. The paper begins with a taxonomy of sentiment analysis, through which they shed light on the different tasks under opinion mining and emotion mining, and then presents a thorough survey of publications discussing popular computational resources, i.e., datasets and lexicons. A fairly recent yet comprehensive review article on emotion mining is the contribution of Sailunaz et al. [17]. They focused on reviewing emotion mining research efforts based on text and speech and hence presented a very detailed survey covering various models, datasets, techniques, their features, and possible extensions for a better outcome.
Yet another addition to TEM surveys, by Apte and Khetwat [18], covered various aspects of emotion detection such as feature extraction/reduction techniques and the approaches utilized for emotion analysis, including the challenges encountered in the studied domain. The most recent and widest review article, by Nourah and Mohamed [19], studied the implicit and explicit approaches to emotion detection: keyword-based, rule-based, machine learning-based, deep learning-based, and hybrid approaches. They also report the best-performing feature sets and point out some open challenges.

III. METHODOLOGY
The current study uses the method of science mapping to examine the TEM research domain. Science mapping, "a general process of domain analysis and visualization", aims at detecting the intellectual structure of a scientific domain [6], [7]. This method typically applies several bibliometric analysis techniques to visualize significant patterns and trends within a large body of literature. This section covers the following phases of our study: study setup and data collection, data preprocessing, selection of science mapping tools, and the procedure used for further analysis.

A. Data Collection
The current study uses bibliographic data obtained from the Clarivate Analytics Web of Science (WoS) database [20], [21]. More specifically, the WoS Core Collection is used in this analysis. This is because, compared to other databases like Google Scholar, Scopus, and ResearchGate, WoS is internationally recognized among the research community for indexing the highest-quality articles [22]. Bibliometric analysts find WoS to be a valuable database for both finding and assessing various types of publications, since it offers a collection of essential metadata including abstracts, references, citation counts, authors, institutions, and countries.
To search for articles in the WoS database, keywords were selected with the aim of optimizing the search so as to locate every related article. We use the "Topic" filter to get the maximum number of appropriate TEM-related documents. "Topic" in WoS means that a record is returned when the supplied search terms are present in its Title, Abstract, Author Keywords, or Keywords Plus. The search date range was fixed to 2005-2020, and only articles published during these 15+ years were taken into account. We used several search strings to collect the published literature in WoS. Table II depicts the search queries used and the statistics of the data downloaded.
Although the present article focuses on research literature covering the domain of emotion mining, we can see that the topic is attracting attention from the general public as well. To obtain a clear picture of public interest, searches were made with different search strings (refer to TS in Table II) in the Google search engine. Fig. 1 illustrates the year-wise increase in searches on Google.

B. Data Preprocessing
EM_DS is meticulously preprocessed to detect and fix possible typographical mistakes in the titles of the publications, the names of the authors, and the dates of publication. After this, the content of each paper, including the title, abstract, and author-supplied keywords, is manually verified to check whether the search term is actually present. Papers giving negative results are excluded. Once the preprocessing phase has been completed, only 280 articles remain in the dataset, and these documents are used to mine the knowledge required to perform the bibliometric analysis. This new preprocessed dataset is named TEM_DS.
TEM_DS includes journal articles (~87%), proceedings papers (~5%), and reviews, editorial materials, and book chapters (~8%). Each article in WoS is assigned to one or more subject categories. As TEM is a subfield of 'computer science', these statistics are in line with the main venue of publications being the computer science subject category (~93%). Other major subjects include engineering, telecommunications, linguistics, management science, information science, library science, business, and economics.

C. Selection of Tools and Metrics
Analysis of TEM_DS is done in the following manner. The exported "Plaintext" files are first converted to CSV format and then imported into a MongoDB (version 4.0) database. Then, we merge the data into a single collection through mongo shell scripts, followed by the execution of various aggregation and find queries. The results are then fed to mongo shell scripts to obtain the desired outputs. Further analysis was done in Microsoft Excel.
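The first step of this pipeline, turning a tagged plain-text export into CSV, can be sketched as follows. This is a minimal illustration and not the authors' script: it assumes the usual WoS plain-text layout (two-letter field tags, continuation lines indented by three spaces, `ER` closing each record), and the function names, the selected tags, and the sample record are all hypothetical.

```python
import csv
import io

# Tags whose continuation lines are separate values (one author or reference per line).
LIST_TAGS = ("AU", "AF", "CR")


def parse_wos_plaintext(text):
    """Parse tagged WoS-style records into a list of {tag: [values]} dicts."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        head, value = line[:2], line[3:].strip()
        if head == "ER":                      # end of the current record
            records.append(current)
            current, tag = {}, None
        elif head in ("FN", "VR", "EF"):      # file header and footer lines
            continue
        elif head == "  ":                    # continuation of the previous tag
            if tag is not None:
                current[tag].append(value)
        else:                                 # a new two-letter field tag
            tag = head
            current.setdefault(tag, []).append(value)
    return records


def records_to_csv(records, tags=("AU", "TI", "PY", "TC")):
    """Flatten the selected tags into CSV text, one row per record."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(tags)
    for rec in records:
        row = []
        for t in tags:
            sep = "; " if t in LIST_TAGS else " "   # wrapped text is rejoined with spaces
            row.append(sep.join(rec.get(t, [])))
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical sample record, written line by line to keep the indentation explicit.
SAMPLE = "\n".join([
    "FN Clarivate Analytics Web of Science",
    "VR 1.0",
    "PT J",
    "AU Doe, J",
    "   Roe, R",
    "TI A hypothetical title that wraps",
    "   onto a second line",
    "PY 2019",
    "TC 7",
    "ER",
    "",
    "EF",
])

records = parse_wos_plaintext(SAMPLE)
csv_text = records_to_csv(records)
```

The resulting CSV text could then be written to disk and loaded into a database. Real exports may need extra handling (for example, a byte-order mark at the start of the file) that this sketch leaves out.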
As for visualization, this study opts for the popular information visualization software VOSviewer. VOSviewer 1.6.15 [23], [24] is employed to handle WoS data and to perform network analysis based on the co-citation of references and journals, co-authorship, co-occurrence of keywords, and the bibliographic coupling of cited references. The visualizations presented in figure … are created through this software. Microsoft Word and Excel, along with Python scripts, were used for the manual investigation of content.
Additionally, we use Google Trends, a web facility powered by Google which provides the data related to the frequency of usage of a search term.
Apart from this, we also employ various standard bibliometric indicators, such as total publications (TP), total citations (TC), and ACPP.

Emotion Prediction, Emotion Detection, Emotion Recognition, Emotion Mining, Emotion Analysis, Emotion Identification, Emotion Extraction, Affect Analysis, Affect Detection, Emotion Classification
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 5, 2020 222 | P a g e www.ijacsa.thesai.org

D. Analytical Procedure
The analysis procedure involves both computational and manual investigation of publications. As depicted in Fig. 2, computational analysis of TEM_DS is done using three techniques based on the metadata present in WoS publication records, viz. occurrence-based, content-based, and network-based analysis. Each type of analysis uncovers a different aspect of the domain, leaving scholars with the information necessary to grasp the evolution footprints of TEM authors. For example, researchers may obtain valuable information about the authors, countries, and affiliated institutes that are influential and productive.

IV. SCIENCE MAPPING ANALYSIS
In this section, the task of computational analysis of the TEM_DS dataset is described, along with the various bibliometric indicators used. The subsections below present details of various types of analytical methods used along with the tables and figures illustrating the results.

A. Occurrence-Based Analysis
The computational analysis using occurrence-based metadata aims at observing year-wise research publication trends as well as the predominant institutions, countries, and authors.

1) Annual publication distribution:
First, we measured the total number of published articles on TEM for each of the years from 2005 to 2020 (up to 15 April 2020). Fig. 3 shows the total publication count in TEM on a year-wise plot. An increase in the number of publications can be observed since 2005. The lower count of articles in 2020 is expected, since the year is still in progress and some works published in 2020 are yet to be indexed in WoS.
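A year-wise count of this kind can be sketched in a few lines. The record layout (a `year` field) and the sample values below are hypothetical and are not drawn from TEM_DS.

```python
from collections import Counter


def publications_per_year(records, start=2005, end=2020):
    """Return {year: number of records} for every year in [start, end]."""
    counts = Counter(int(rec["year"]) for rec in records)
    # Years without publications are reported as 0 so that a plot has no gaps.
    return {year: counts.get(year, 0) for year in range(start, end + 1)}


# Hypothetical records; the real dataset is not reproduced here.
example = [{"year": 2005}, {"year": 2019}, {"year": "2019"}, {"year": 2020}]
yearly = publications_per_year(example)
```

Filling missing years with zero keeps the series aligned with the fixed 2005-2020 observation window.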
2) Country-wise distribution: Table III presents the 10 most productive countries/regions in terms of total publications (TP). China has emerged as the leading contributor to TEM research and is far ahead of other countries. The USA (42), Japan (33), and India (18) stand at the second, third, and fourth positions, respectively.
3) Institute-wise distribution: The institutions that contributed most to the field during 2005-2020 are important for visualizing the development dynamics at the institution level. Table IV lists the most influential institutions in decreasing order of publication count (TP). Tokushima University contributes the largest number of research publications. Three of the top-performing institutions are located in China, which again reflects the country's dominant rank in this research domain. The National Institute of Informatics, Japan, records the highest citation count (TC), and the highest ACPP is recorded by the National Research Council, Canada.
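The indicators in such a table can be computed as sketched below. Two assumptions are made here that the text does not spell out: ACPP is taken to mean average citations per paper (TC divided by TP), and full counting is used, so every institution on a paper receives full credit for it. The record layout and the sample data are hypothetical.

```python
from collections import defaultdict


def institution_indicators(records):
    """Compute TP, TC and ACPP per institution, sorted by decreasing TP."""
    tp = defaultdict(int)   # total publications
    tc = defaultdict(int)   # total citations
    for rec in records:
        # set() ensures an institution listed twice on one paper is counted once
        for inst in set(rec["institutions"]):
            tp[inst] += 1
            tc[inst] += rec["citations"]
    rows = [
        {"institution": inst, "TP": tp[inst], "TC": tc[inst],
         "ACPP": tc[inst] / tp[inst]}   # assumed definition: TC / TP
        for inst in tp
    ]
    return sorted(rows, key=lambda row: (-row["TP"], row["institution"]))


# Hypothetical data for illustration only.
example = [
    {"institutions": ["Univ A", "Univ B"], "citations": 10},
    {"institutions": ["Univ A"], "citations": 4},
    {"institutions": ["Univ C", "Univ C"], "citations": 30},
]
table = institution_indicators(example)
```

As in the table described above, the institution with the most papers need not be the one with the highest TC or ACPP: in this toy data "Univ A" leads on TP while "Univ C" leads on both TC and ACPP.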

4) Most influential authors:
The authors who are responsible for a significant share of the published literature over the studied period are referred to as highly productive. Similarly, authors whose published articles are cited the most are termed the top-cited authors of the domain. We analyzed the TEM_DS dataset to recognize the most productive and most cited authors (refer to Table V). During the study period, Ren Fuji is the most active author in TEM and Saif M. Mohammad is the most cited author in terms of total citations, followed by Yanghui Rao and Qing Li.

B. Content-Based Analysis
The keywords of academic publications represent the core content of a paper and hence provide an opportunity to understand the content characteristics and the direction of academic research. Keywords may be derived from a publication's title and description, or they can be obtained from the list of keywords supplied by the author. In the older literature, keywords were restricted to individual words; over time, keywords started to include multiple words. In this section, we first report the 10 most frequently used author-supplied keywords (refer to Table VI) and then present the keyword cloud in Fig. 4. Word clouds offer an interesting visual summary of a text: the larger a keyword appears in the cloud, the more frequently it is used.
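A frequency table of author-supplied keywords, which also provides the weights a word cloud needs, can be sketched as follows. The normalisation (lower-casing and trimming) and the once-per-paper counting are illustrative choices, not necessarily those used in the study, and the sample keywords are hypothetical.

```python
from collections import Counter


def top_keywords(records, n=10):
    """Return the n most frequent author keywords as (keyword, paper count) pairs."""
    counts = Counter()
    for rec in records:
        # Normalise case and whitespace; the set counts each keyword once per paper.
        counts.update({kw.strip().lower() for kw in rec["keywords"] if kw.strip()})
    return counts.most_common(n)


# Hypothetical keyword lists for illustration only.
example = [
    {"keywords": ["Emotion Detection", "sentiment analysis"]},
    {"keywords": ["emotion detection ", "Deep Learning"]},
    {"keywords": ["emotion detection", "sentiment analysis", ""]},
]
ranking = top_keywords(example, n=2)
```

Without normalisation, variants such as "Emotion Detection" and "emotion detection" would be counted as different keywords and would split the weight of one concept across several entries.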

C. Network-Based Analysis
A bibliometric network is composed of edges and nodes. The nodes may be, for example, authors, keywords, journals, or publications. The edges demonstrate relationships among pairs of nodes like citation relations, keyword co-occurrence relations, and co-authorship relations.

1) Co-citation network:
Two publications are co-cited if a third publication cites both of them [6], [25]. The greater the number of publications that cite two publications together, the stronger the co-citation relationship between them. Co-citation thus represents a semantic relationship between the two articles. Small and colleagues proposed an approach for visualizing relations between documents by using co-citations. Later, co-citations were also used to analyze relationships among authors and journals, as introduced by White et al. [26] and McCain [27], respectively. The co-citation network of publications is presented in Fig. 5.
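The co-citation count described here can be sketched directly from reference lists: every citing paper contributes one count to each pair of references it cites together. The input layout (a mapping from citing paper to its cited references) and the identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count, for each pair of cited works, how many papers cite both."""
    counts = Counter()
    for refs in reference_lists.values():
        # sorted(set(...)) removes duplicates and gives each pair a canonical order
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical citing papers P1..P3 and cited references A..C.
citing = {"P1": ["A", "B", "C"], "P2": ["A", "B"], "P3": ["B", "C"]}
cocited = cocitation_counts(citing)
```

Here references A and B are co-cited by P1 and P2, so their link weight is 2. In a network view, such weights become the edge strengths between reference nodes.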
2) Bibliographic coupling: Bibliographic coupling, which is the reverse of co-citation, is a measure of the similarity between the reference lists of two articles. A publication that is cited by two publications creates a bibliographic link between those two documents [8]. The larger the set of overlapping references between any two documents, the stronger the bibliographic connection between them. Fig. 6 shows the bibliographic coupling network of publications.
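Coupling strength, as defined above, is the size of the overlap between two reference lists. The sketch below compares every pair of citing papers. The input layout and identifiers are hypothetical, and the quadratic pairwise loop is meant only for small collections.

```python
from itertools import combinations


def coupling_strength(reference_lists):
    """Return {(paper_p, paper_q): number of shared references} for coupled pairs."""
    strengths = {}
    for p, q in combinations(sorted(reference_lists), 2):
        shared = set(reference_lists[p]) & set(reference_lists[q])
        if shared:                       # papers with no overlap are not linked
            strengths[(p, q)] = len(shared)
    return strengths


# Hypothetical citing papers P1..P3 and cited references A..C.
citing = {"P1": ["A", "B", "C"], "P2": ["A", "B"], "P3": ["B", "C"]}
coupled = coupling_strength(citing)
```

Note the change of viewpoint relative to co-citation: the linked nodes are now the citing papers (P1, P2, P3) rather than the cited references, and a link is fixed once both papers are published because reference lists do not change.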
3) Keyword co-occurrence network: The number of co-occurrences of two keywords is the number of documents in which both keywords appear together in the title, abstract, or keyword list [28], [29]. In this section, the keyword co-occurrence network is created (see Fig. 7) to provide a graphical visualization of potential relationships between keywords and hence their publications. This kind of network helps to explore the research hotspots in this domain.
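Following the definition above, a co-occurrence count can be sketched by checking, for each document, which keywords from a fixed vocabulary appear in its title, abstract, or keyword list. The substring matching used here is deliberately naive and is an assumption of this sketch (for example, "emotion" would also match "emotions"). The field names, the vocabulary, and the documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(documents, vocabulary):
    """Count documents in which each pair of vocabulary keywords appears together."""
    vocab = sorted({kw.lower() for kw in vocabulary})
    counts = Counter()
    for doc in documents:
        text = " ".join([
            doc.get("title", ""),
            doc.get("abstract", ""),
            " ".join(doc.get("keywords", [])),
        ]).lower()
        present = [kw for kw in vocab if kw in text]   # naive substring match
        counts.update(combinations(present, 2))        # pairs are already ordered
    return counts


# Hypothetical documents for illustration only.
docs = [
    {"title": "Emotion detection in tweets", "abstract": "We use deep learning.",
     "keywords": ["emotion detection", "twitter"]},
    {"title": "Deep learning for text", "abstract": "",
     "keywords": ["emotion detection"]},
    {"title": "Lexicons", "abstract": "twitter data", "keywords": []},
]
pairs = keyword_cooccurrence(docs, ["emotion detection", "deep learning", "twitter"])
```

Pairs with high counts correspond to thick links in a co-occurrence map and give a first indication of which topics are studied together.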
4) Co-authorship network: Lastly, we briefly discuss co-authorship-based bibliometric networks. Authors, their affiliated institutions, or countries in these networks are connected on the basis of the number of documents they have jointly published. These networks have been widely studied, but their analysis has gained very little attention. Fig. 8 presents a co-authorship network.
To broaden the investigation, the growth of TEM is observed by dividing the observation period into three demi-decades (refer to Fig. 9). The period 2005 to 2009 is referred to as the first demi-decade, representing the origin of research work, followed by the second and third demi-decades covering 2010-14 and 2015-19, respectively. The year 2020 is termed the latest period of observation. Despite the enormous growth observed in this sector, most authors believe that the field remains at a nascent stage. Hence, the demi-decade serves as a good observation period, since a decade seems too large for such a field.
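Grouping publication years into these observation periods can be sketched as below. The sketch assumes non-overlapping five-year windows (2005-2009, 2010-2014, 2015-2019) with 2020 as a separate latest period. The names and the sample years are hypothetical.

```python
from collections import Counter

# (label, first year, last year); the window boundaries are an assumption of this sketch.
PERIODS = (
    ("2005-2009", 2005, 2009),
    ("2010-2014", 2010, 2014),
    ("2015-2019", 2015, 2019),
    ("2020", 2020, 2020),
)


def period_of(year):
    """Return the label of the observation period containing year, or None."""
    for label, first, last in PERIODS:
        if first <= year <= last:
            return label
    return None


def publications_per_period(years):
    """Count publications per observation period, ignoring out-of-range years."""
    counts = Counter(period_of(y) for y in years)
    counts.pop(None, None)
    return {label: counts.get(label, 0) for label, _, _ in PERIODS}


# Hypothetical publication years for illustration only.
by_period = publications_per_period([2005, 2009, 2010, 2014, 2015, 2019, 2019, 2020, 2021])
```

Keeping the boundaries in one table makes it easy to check that every year falls into exactly one period.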
TEM literature during these years primarily used three kinds of approaches: lexicon-based, learning-based, or hybrid (lexicon- and learning-based). Also, work on emotion analysis has been carried out on a variety of data sources (for example, blogs, microblogs, news headlines and news articles, literary texts, and discussion forums), creating a list of benchmark datasets and lexicons that can be utilized for further research and experimentation (for a detailed list of these computational resources, refer to the recent survey by Nourah and Mohamed [19]). In line with the above aspects of TEM research, this section reports a thorough investigation of the text of top-cited and most-used articles (journal and proceedings papers) from every demi-decade to identify which of the publications in TEM_DS use which kind of approach, data source, dataset, and lexicon (refer to Tables VII to IX). Additionally, this section analyses the level at which emotions were mined in the respective publications, i.e., word level, topic level, sentence level, paragraph level, or document level, and reports the emotion model utilized (categorical or dimensional) [30]. We also analyze the latest period (2020) to understand some of the recent trends of TEM research by examining a few of the latest publications of this ongoing year. Table X presents an analysis of the latest papers.
After observing Tables VII to X, the following findings can be reported. First, machine learning seems to be the most popular choice of approach. Second, most works use a categorical model for emotion classification. Third, there has been a substantial increase in the number of datasets and lexicons. Finally, the latest period (2020) has witnessed a shift from conventional machine learning to deep learning for developing automated emotion recognition systems.

VI. CONCLUSIONS
This manuscript presented an overview of TEM research by conducting an exhaustive science mapping analysis based on a dataset of 280 publications obtained from WoS for the years 2005 to 2020. The analysis was conducted using various bibliometric indicators, taking into account several dimensions of analysis, including countries/regions, institutions, authors, and keywords. In answering the queries mentioned in Section 1, this manuscript provides a thorough investigation of the TEM literature published to date. In this study, two different kinds of analyses, computational and manual, are combined to shape the logical structure of the field of TEM. This provides the community of social scientists and researchers with the knowledge they need to illuminate the development path and start devising strategies to tackle the challenges that prevail to date.
Several findings can be extracted from the presented work:
• The number of annual publications increases in every demi-decade. The year 2019 recorded the highest peak.
• China is the most influential country, recording the highest TP.
• Tokushima University contributed the highest number of papers in this domain.
• Machine learning emerged as the favorite approach of the TEM research community, with a recent focus on deep learning.
• Most top-cited publications utilized a categorical approach for emotion modeling, and a variety of datasets and lexicons have been explored to date.
Regardless of its contributions, this study has the following limitations. As the analysis depends on the dataset collected from WoS, it might be influenced by any inherent limitation in WoS's coverage of publications. Thus, the outcomes may not completely reflect the entire literature on TEM. Another limitation is the set of search phrases that we utilized, which may have excluded some relevant data: if an article about emotion detection did not use the keywords we searched for, it does not show up in our data collection.
Future research may, however, build upon the work presented here and try to address these shortcomings by utilizing data from varied databases and a larger set of indicators to assess influence, quality, and inter-connections in the literature.