Intelligent Parallel Mixed Method Approach for Characterising Viral YouTube Videos in Saudi Arabia

Comprehending virality on social networking platforms, as exemplified by YouTube, is of great importance, because it helps in understanding which characteristics are utilised to create content and which dynamics contribute to YouTube’s strength as a platform for sharing content. The current literature on the virality problem appears sparse with respect to theory development, empirical investigation, and an understanding of what makes videos go viral. The overarching objective is to understand deeply the phenomenon of viral YouTube videos in Saudi Arabia. Hence, we propose an intelligent convergent parallel mixed-methods approach that begins, as an internal step, with a qualitative thematic analysis method and an NLP-based quantitative method applied independently, followed by training an unsupervised clustering model that integrates the internal analysis outputs for deeper insights. To study this phenomenon, we have empirically analysed a set of trending YouTube videos along with their contents. One of our main findings is that boosting entertainment, traditions, politics, and/or religious issues when making a video, in association with sarcastic or rude remarks, is likely the preeminent impulse for letting a regular video go viral.
Keywords: Virality; text mining; sentiment analysis; social media analysis; mixed method approach


I. INTRODUCTION
Knowing how digital content, such as a viral video, can spread rapidly worldwide is of great importance in perfecting our e-services. In the scope of social networking platforms, virality can be loosely defined as the ability of content to spread rapidly in society from one person to another. Given the present time's propensity for communication via electronics, content spreads like wildfire thanks to the Internet. This virality is exemplified by YouTube, whose user-generated content allows users to freely create and share content both on its own platform and on other social media platforms [1]. Given YouTube's success, there is interest in understanding which characteristics are utilised to create content and which dynamics contribute to YouTube's strength as a platform for sharing content. Although scholars agree on the characteristics that make up viral content, there is less certainty about what makes a video extremely popular [2]. Despite a growing interest in this field, the current literature on this topic also remains sparse with respect to theory development, empirical investigation, and an understanding of what makes videos go viral.
Understanding why and how a video becomes extremely popular (i.e., how it goes viral) can maximise how consumers can benefit from a video's popularity along with how users can deal with the threats associated with virality such as spreading rumors or violating others' privacy. Analysing a large amount of data from YouTube's video collection would also allow for a deeper understanding of social behavior, dynamics, and processes at play when people consume and create content.
Broadly speaking, there exist two principal conceptual analyses in research on virality, formulated coherently in a valuable theoretical framework by [3]: a top-down mechanism, which considers virality as the result of highly influential individuals who can use their power to promote their videos (e.g., through existing mainstream media); and a bottom-up mechanism, which argues that virality relies instead on the characteristics of the content that actually engage individuals to spread the content in a self-motivated way [4]. Interestingly, [5] (cited in [6]) mention that the latter mechanism prompts virality more often.
In a general sense, this research attempts to contribute to the bottom-up mechanism by focusing solely on Arabic videos, particularly videos that have gone viral. The overarching objective behind our attempt is to provide an intelligent solution that helps in understanding deeply the phenomenon of viral YouTube videos in Saudi Arabia, and that can be used in future research as a guideline or for comparison purposes. Thus, we propose a convergent parallel mixed-methods approach that begins, as an internal step, with a qualitative thematic analysis method and an NLP-based quantitative method applied independently, followed by training an unsupervised clustering model that integrates the internal analysis outputs for deeper insights. To be more precise, the proposed complex approach depends on (1) our optimised lexicon-based Bag of Words sentiment classifier for analysing viewer's shared
• A qualitative study on a variety of video categories and themes propagated in Saudi Arabia.
• A lexicon-based Bag of Words sentiment classifier, whose novelty lies in our optimised algorithms, implemented in Java, that support any text written in Arabic without translation.
• An innovative idea of utilising an unsupervised machine learning technique, based on a distance matrix and hierarchical clustering, for integrating our internal findings. This idea could be a promising research paradigm that fundamentally contributes to social media intelligence approaches.
The next section investigates prior scholarship on phenomena that have gone viral, examines gaps in the literature regarding the virality process, and presents noteworthy questions of the current research. The following section introduces the methodology utilised in the examination and presents the subsequent analysis and results. Lastly, the final section discusses the conclusions of this research and outlines our intentions for future research.

II. REVIEW OF RELATED LITERATURE
This section provides an overview of the phenomenon of viral content by drawing on scholars who have sought to understand the processes and dynamics of virality. In 1997, the firm Draper Fisher Jurvetson coined the phrase "viral marketing" to describe Hotmail's use of advertisements to promote the fact that its emailing service was free [7]. [8] then noted that viral marketing was described as a type of marketing that infects its customers with an advertising message that passes from one customer to the next like a rampant flu virus (p. 93). More generally, "viral marketing" and "viral content" have since become catchphrases for online advertising success. A variety of other definitions have also been offered for virality, each coupled with a specific approach in examining its nature.
According to [9], examinations and definitions of virality can be categorised in three ways. The first seeks to examine how the content is accessed, disseminated, and propagated over a short time period. The second seeks to examine how virality is spread via electronic sharing (i.e., word of mouth) by focusing on the content shared. Lastly, the third concentrates on users' behaviors and engagement with the viral content in question and gauges their likes/dislikes, shares, and comments. [10] argued that the term "virality" includes a host of aspects and exchanges such as the number of people who have access to the content, the appreciation of the content, and how many people have liked or shared the content. The popularity of the content depends exclusively on those who share it and the reactions it garners (positive, negative, and, to a lesser extent, neutral). The current research defines viral content as that which spreads to the greatest degree possible over the shortest amount of time.
YouTube has been chosen as the topic of study for the present research due to the double-sided nature of its platform (i.e., the ability to share and participate through comments as well as to react to content via word of mouth). Sharing content on YouTube requires interacting with others online, which in turn affects the popularity of said content. Content spread online generates greater audience numbers than content spread through some other means. YouTube also affords the distinct opportunity to study both the activities of YouTube users' interactions and their social network ties. According to [10], a number of elements play a part in this sharing process, including the nature of the shared content, the nature of the user who shares it, the nature of the audience who receives it, and the structure of the network through which the content is spreading. The present research aims to provide an AI-based solution to help in understanding the phenomenon of viral YouTube videos.
These studies used several methods to collect data. While some of them utilised questionnaires to obtain users' responses, others used data-mining tools. A few studies manually conducted content analysis. Most of the previous researchers developed their own models to explain the phenomenon of virality. Only two studies have borrowed theories from other fields to explain virality; these theories included uses and gratifications theory, the persuasion model, and the memory-based model ([9] and [32]). Thus, the current study aims to fill the gap in the field by proposing a convergent parallel mixed-methods approach that begins, as an internal step, with a qualitative thematic analysis method and an NLP-based quantitative method (used independently), followed by training an unsupervised clustering model that integrates the internal analysis outputs, in order to provide a deep understanding of the phenomenon of viral YouTube videos in Saudi Arabia.

III. RESEARCH METHODOLOGY
The original work carried out in this paper aims to better understand the rapid spread of viral YouTube videos in Saudi Arabia. We have considered a variety of video categories and qualitative themes as input factors for our experiment, which was conducted on a dataset collected from the top 13 viral videos that trended between 2016 and 2017, as reported in Think-with-Google. Through the stages of this study, we have investigated the importance of sentiment analysis and mining opinions from YouTube comments, which allows us to classify the viewer's
The convergent parallel mixed-methods approach presented in this paper integrates a qualitative thematic analysis method for analysing the video content view with a quantitative NLP opinion-mining method for investigating viewers' textual comments. The outcomes of these independent methods (i.e., the qualitative and quantitative methods) are then integrated and fed into our unsupervised machine learning model (i.e., Hierarchical Cluster Analysis) for comprehensive understanding and more accurate predictions. The overall flow of the proposed approach is described in three phases, illustrated in Fig. 1. In the rest of this section, we first discuss our data collection methodology, including the selection criteria of YouTube videos for experiments, as shown under Phase 1. We next introduce our analysis methods for both viewers' comments and video contents for answering our research questions; see Phase 2 and Phase 3 of Fig. 1.

A. Acquiring Data for Experiments
YouTube is the second-most popular video-sharing website in the world, according to the Alexa website. It provides official API services to access and fetch specific data that are available under their authorisation credentials. The publicly available data (i.e., free to fetch with restrictions) include general video meta-data, comment threads, limited user profile details, etc. We have developed a Java application with a MySQL database as a back end to fetch and store only the publicly available data.
We crawled all obtainable data related to the top 13 trending YouTube videos in Saudi Arabia, uploaded/posted within a one-year timeframe (i.e., between 2016 and 2017). The gathered dataset includes more than 51,697 comments and all the available details about reviewer profiles, such as location and the devices used for posting comments. Critical demographic variables such as user age and gender are unfortunately not

B. NLP-Sentiment Analysis for Classifying Textual Comments
As the standard YouTube API does not provide sentiment information for each posted comment, we implement our own multi-class sentiment classification algorithm for text written in Arabic. Our sentiment classifier is modelled using an optimised version of the bag-of-words approach and analyses sentiment scores on a five-pole scale (i.e., Positive, Negative, Mixed, Criticism, Neutral), taking their polarities into consideration. The bag-of-words approach is a popular feature extraction method for textual data in natural language processing and machine learning [35]. Rather than measuring only the presence and/or frequency of known words in a given textual comment, we also consider the sentiment score of each matched word from our predefined lexicon dictionary. We build a rich Arabic lexicon dictionary that includes more than 72,000 sentimentally classified units, some of which have been extracted from publicly available datasets such as SemEval [36,37] and from review repositories of several domains, including Movies, Hotels, Restaurants and Products [38]. In principle, these units have been generated by mining varieties of Arabic texts that are currently in use, and their average accuracy is approximately 72%.
Moreover, our text classifier algorithm allows a detailed analysis of viewers' comments by predicting user genders as well as classifying comments into another three high-level categories (i.e., Information, Conversation, Non-response comments) using specific predefined keywords. In this paper, these high-level categories, introduced and explained in [39], could give influential facts that help in understanding the currently dominant phenomena of Saudi society.

Algorithm 1
Inputs: Comments = {c_1, ..., c_k}, k ∈ N: a set of all extracted comments from the datasets.
Outputs: Bag: a set consisting of a cleaned bag of Arabic words, such that each word t has a numerical attribute t.count holding the number of comments in which t appears.
Begin
1: Bag := ∅ : initialise the empty bag set for creating distinct words.
2: for each posted comment c_i ∈ Comments do
3:     t_cleaned ← clean(c_i) : remove all non-Arabic characters, conjunctions, punctuation, and repeated stressing characters from c_i, except empty spaces.
4:     t_tokenized ← Tokenize(t_cleaned) : tokenise the cleaned text by splitting it on single spaces.
5:     for each cleaned token t_i ∈ t_tokenized do
6:         if t_i ∉ Bag then
7:             Bag ← t_i : append the token t_i to the list Bag.
8:         if exist and first count(c_i, t_i) then : count how many comments in Comments t_i appears in, i.e., at most once for each c_i.
9:             set t_i.count = t_i.count + 1
10: return Bag.
End

Algorithm 2 Lexicon-based Bag of Words Sentiment Classifier.
Inputs: Lex is a lexicon dictionary.
Bag is the bag of words created by Algorithm 1.
tc and pc are the total number of comments and a specific posted comment, respectively.
Outputs: S_class is the class that has the maximum probabilistic sentiment score on the five-pole scale (i.e., Positive, Negative, Mixed, Criticism, Neutral).
compute the sentiment score for each row in Lex according to their probabilities using Equation 3.
9: S_class ← dataFrame.maxSentimentScore() : sum the total sentiment scores for each class in Lex and return the class with the highest value.
10: return S_class
End

The proposed algorithms are given explicitly in Algorithms 1 and 2. Given a broad set of textual comments, our approach begins by generating a bag of Arabic words from all observed comments using Algorithm 1. We then generate a data-frame, representing a two-dimensional array-like structure, by mapping each token (word) from the bag to our predefined lexicon dictionary. Here, all columns are vectors of equal length, such that the first two vectors contain the token values and their occurrences, respectively. The subsequent vectors correspond to measurements of the sentimental classes, obtained from our lexicon dictionary; see Figure 2 for an illustration with a self-explanatory example. The generation of the data-frame is stated in the first line of Algorithm 2.
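The bag construction of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' Java implementation: the Unicode range used to detect Arabic letters, the tiny conjunction list, and the function names `clean` and `build_bag` are all assumptions made for the example.

```python
import re

# Assumption: Arabic letters are matched with the basic Unicode range U+0621..U+064A.
ARABIC_TOKEN = re.compile(r"[\u0621-\u064A]+")
# Assumption: a tiny illustrative conjunction list, not the authors' list.
CONJUNCTIONS = {"و", "أو", "ثم"}


def clean(comment: str) -> str:
    """Keep Arabic words only, collapse stretched letters, drop conjunctions."""
    # Collapse any character repeated three or more times ("stressing") to one.
    collapsed = re.sub(r"(.)\1{2,}", r"\1", comment)
    words = ARABIC_TOKEN.findall(collapsed)
    return " ".join(w for w in words if w not in CONJUNCTIONS)


def build_bag(comments: list[str]) -> dict[str, int]:
    """Map each distinct token to the number of comments it appears in."""
    bag: dict[str, int] = {}
    for comment in comments:
        tokens = clean(comment).split(" ")
        # set() ensures a token is counted at most once per comment.
        for token in set(tokens):
            if token:  # skip the empty string left by comments with no Arabic text
                bag[token] = bag.get(token, 0) + 1
    return bag
```

Calling `build_bag` on a list of raw comment strings returns a dictionary whose values play the role of the per-token comment counts in Algorithm 1.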
To be more precise, the data values for the probabilities of each sentimental class are rescaled in terms of $cp_{max}$ and $cp_{min}$, which are determined vertically in the vectors that hold the sentimental tokens' probabilities. Finally, the predicted sentimental class for c, stated in line 9 of Algorithm 2, is chosen according to the highest total score $\sum_{i=1}^{b_{size}} cp_i$, where $b_{size}$ is the size of the bag. Consider the example shown in Figure 2: based on the total probabilities alone, the algorithm would classify the mentioned impolite comment as Mixed (i.e., with a total of 1.63). However, our optimised solution gives a more precise classification, as it takes into consideration the rates of the tokens' importance; see the correct prediction of Negative, with a total score of 406.98.
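The scoring step can be illustrated with a short Python sketch. Since the exact rescaling and weighting formulas are not reproduced in the text, the sketch makes two loudly stated assumptions: the rescaling is a plain min-max normalisation of each class column, and a token's importance is approximated by its comment count from the bag. The names `rescale_columns` and `classify`, and the lexicon layout, are hypothetical.

```python
CLASSES = ("Positive", "Negative", "Mixed", "Criticism", "Neutral")


def rescale_columns(lexicon):
    """Min-max rescale every class column of the lexicon to [0, 1] (assumed form)."""
    rescaled = {token: {} for token in lexicon}
    for cls in CLASSES:
        column = [scores[cls] for scores in lexicon.values()]
        cp_min, cp_max = min(column), max(column)
        span = (cp_max - cp_min) or 1.0  # avoid division by zero for a constant column
        for token, scores in lexicon.items():
            rescaled[token][cls] = (scores[cls] - cp_min) / span
    return rescaled


def classify(comment_tokens, lexicon, bag_counts):
    """Return the class with the highest weighted total score for one comment."""
    rescaled = rescale_columns(lexicon)
    totals = dict.fromkeys(CLASSES, 0.0)
    for token in comment_tokens:
        if token not in rescaled:
            continue  # tokens missing from the lexicon contribute nothing
        weight = bag_counts.get(token, 1)  # hypothetical importance weight
        for cls in CLASSES:
            totals[cls] += weight * rescaled[token][cls]
    return max(totals, key=totals.get)
```

Here `lexicon` maps each token to a dictionary of per-class probabilities, and `bag_counts` is the token-to-count mapping produced from the bag. Other weighting schemes would change the totals, so the sketch only shows the overall shape of the computation.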
The same algorithms (i.e., Algorithms 1 and 2) are applied for classifying comments into three high-level categories (i.e., Information, Conversation, Non-response comments), but using a different, manually defined lexicon dictionary. Here, we carefully collect a large set of keywords that are often used in each category. For instance, comments that begin with WH-questions (as predefined keywords) will likely be classified into the Information category [39]. Furthermore, we have a rich database dictionary of male and female names, and we use it for predicting user genders from user nicknames.
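As a rough illustration of the two keyword-driven rules mentioned above, the following Python sketch checks whether a comment opens with a WH-question word and looks up a nickname in a name dictionary. The keyword set, the two-entry name dictionary, and the function names are hypothetical placeholders; the manually defined dictionaries used in the study are not given in the text.

```python
from typing import Optional

# Hypothetical WH-question keywords (what, when, where, how, why).
WH_KEYWORDS = {"ماذا", "متى", "أين", "كيف", "لماذا"}
# Hypothetical two-entry name dictionary; the real one is described as large.
NAME_GENDER = {"محمد": "male", "فاطمة": "female"}


def is_information(comment: str) -> bool:
    """True when the comment begins with one of the WH-question keywords."""
    tokens = comment.split()
    return bool(tokens) and tokens[0] in WH_KEYWORDS


def predict_gender(nickname: str) -> Optional[str]:
    """Return the gender of the first known name found in the nickname, if any."""
    for part in nickname.split():
        gender = NAME_GENDER.get(part)
        if gender is not None:
            return gender
    return None
```

A nickname with no known name yields `None`, which keeps unknown cases explicit instead of forcing a guess.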
Tables II and III show the generated predictions when applying our Algorithms 1 and 2 on the collected data, summarised in Table I.

C. Manual Inspiration for Quantitatively Analysing YouTube Content View
To our knowledge, there has been no ideal method for performing video content analysis directly at the visual level. Accordingly, we have implemented a generic subjective method of interpretation by a panel of three reviewers, including the authors of this paper, moderately related to the QualCA research method [41]. In essence, this subjective method involves three core independent phases:
1) The identification of the (global) most expressive themes and video categories that characterise the intentions deduced from the audio and/or visual components of video contents.
2) The coding frame, formulated as a two-dimensional thematic vector that maps the identified themes $th_i$ to each observed video $vid_i$ on a five-level Likert scale (i.e., from 1 to 5) [42].
3) Checking the validity of the constructed thematic vectors.
The selected 13 videos were distributed to the authors of this paper individually, and they were instructed to identify the global themes depending on what they observed in each video, regarded as a whole. Subsequently, the authors held several remote meetings to unify all the agreed themes embedded in the video contents, wrapped into 10 distinct themes, described in Table IV. This phase was carried out during January 2018. In the coding phase, the authors were individually requested to re-observe each video and scale all 10 themes. To resolve conflicts between the authors in scaling the same theme $th_i$ vs. $vid_i$, the average scale has been calculated, and then rounded to the  shown in Table IV) to a panel of three reviewers for checking and validating the identified list of themes as well as the coding scales for each video, and no critical comments were noticed.
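The conflict-resolution step for the Likert scales can be sketched as below. The rounding rule is an assumption (round half up to the nearest integer level), since the text does not state the exact rule, and the function name `consensus_scale` is hypothetical.

```python
def consensus_scale(scales):
    """Average the coders' Likert scales (1 to 5) and round half up (assumed rule)."""
    if not scales:
        raise ValueError("at least one scale is required")
    mean = sum(scales) / len(scales)
    # int(x + 0.5) rounds half up for positive values, unlike Python's round().
    return int(mean + 0.5)


def build_thematic_vector(ratings):
    """Map each theme to its consensus scale for one video.

    `ratings` maps a theme name to the list of scales given by the coders.
    """
    return {theme: consensus_scale(scales) for theme, scales in ratings.items()}
```

For one video, `build_thematic_vector({"Sarcastic": [4, 5, 5], "Traditions": [2, 3]})` would give one integer level per theme, which is the form needed for a thematic vector.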

D. Unsupervised Learning Model for Integrating the Quantitative and Qualitative Findings
The third phase of our concept-level mixed-methods design, shown in Section III, makes the quantitative and qualitative findings rationally interdependent. It involves the integration of the outputs of the NLP-Sentiment analysis (cf. subsection III-B) and the thematic analysis (cf. subsection III-C) as the centerpiece inputs to our unsupervised machine learning model. This unsupervised learning model gives a more in-depth insight into the relations between the quantitative and qualitative variables, allowing us to better understand the nature of viral videos. The proposed model, in the third phase, is designed based on a Distance Matrix (see footnote 5) and Hierarchical Clustering [43]. In data mining, a distance matrix is typically essential for building a hierarchy of clusters. Here, we consider the Cosine similarity formula (i.e., usually used to measure the angle between two variables) to generate our distance matrix. The formula is expressed by a dot product [44] as follows:
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
where $A_i$ and $B_i$ are the values of the pairwise vectors $A$ and $B$, taken from two compared variables (e.g., a sentiment type as a variable vs. a specific user-gender).
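For readers who prefer code to notation, the standard cosine similarity between two equal-length vectors can be computed as in the sketch below. The function names are illustrative, and turning similarity into a distance as one minus the similarity is a common convention assumed here, not a detail stated in the text.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)
```

Because the measure depends only on the angle between the vectors, two variables whose values differ by a constant positive factor have a distance of zero.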
    # importing the required libraries ..
    import pandas as pd
    from scipy.spatial import distance_matrix
    from sklearn.cluster import AgglomerativeClustering
    ...
    # Loading the dataset as a CSV format and
    # computing the distance matrix using the 'scipy' library.
    dataSet = pd.read_csv('cleaned-dataset.csv')
    distanceMatrix = pd.DataFrame(distance_matri

Listing 1. Python code fragments for computing distance matrix and generating a hierarchical clustering model

Footnote 5: Distance Matrix is a mathematical square matrix that contains the numerical distances between the items in a two-dimensional array.

To clarify more, we have implemented a Python script to integrate all the inferred information acquired during the second phase (i.e., presented in Table II, Table III and Table IV). In Listing 1, we give a descriptive code fragment for creating a distance matrix between the qualitative thematic variables across the quantitative variables. Additionally, we have used an existing interactive data analysis tool called Orange (i.e., a visual Python programming language for data analysis) to generate a distance matrix and hierarchical clustering figures, see them in Figure 3 and Figure 4. In principle, both figures illustrate the relations between the qualitative thematic variables across the quantitative variables. However, Figure 4 divides relations at different levels represented as a tree structure. We expand this and discuss our findings in the discussions section.
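A self-contained sketch of the same idea, using SciPy only, is given below. It is not the authors' script: the feature rows are made-up numbers, average linkage is an assumed choice, and `cluster_by_cosine` is a hypothetical helper name. It builds a cosine distance matrix and then cuts the resulting hierarchy into a fixed number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical rows: one theme per row, with made-up shares of
# positive, negative and neutral comments as columns.
themes = ["Sarcastic", "Traditions", "Political", "Religious"]
features = np.array([
    [0.60, 0.30, 0.10],
    [0.55, 0.35, 0.10],
    [0.10, 0.70, 0.20],
    [0.15, 0.65, 0.20],
])


def cluster_by_cosine(data, n_clusters):
    """Return the square cosine distance matrix and a cluster label per row."""
    condensed = pdist(data, metric="cosine")    # pairwise cosine distances
    tree = linkage(condensed, method="average")  # assumed linkage method
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return squareform(condensed), labels


matrix, labels = cluster_by_cosine(features, n_clusters=2)
```

The `tree` returned by `linkage` is also what `scipy.cluster.hierarchy.dendrogram` consumes, so a tree-structured figure can be drawn from the same object.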

IV. EXPERIMENTAL RESULTS AND ANALYSIS
This paper sets out to investigate the types of comments posted on viral YouTube videos in Saudi Arabia, proposing a thematic classification schema to understand the rates of concern of the social community in Saudi Arabia. Without paying too much attention to our technical contribution to this study, which includes implementations of our own optimised learning algorithms and metrics, the focus in this section is to give revealing insights into the figures reported in Figure 3 and
3) Predicting the next shift wave of concerns of social commenters through observing all the event times (i.e., the time of posting or replying to a comment) on the timestamp.
After exploring the aforementioned perspectives in the next subsection, we discuss the threats that can impact the validity of our findings, and then we give a brief consideration regarding the ethical issues.

A. Understanding the Categorisations and Concerns of Saudi Society
Figure 5 shows four different distributions of our thematic categorisations, resulting from clustering our internal outputs (i.e., obtained after performing the qualitative and quantitative analysis parts independently). By taking a closer look at the mentioned figures as well as the disparity in percentages, one can observe, at a glance, the following points:
• The three highest categories in terms of social community concerns are Sport and Entertainment, Traditions, and Sarcastic, which constitute roughly half of the society's concerns, with a total of 49% (i.e., 18% + 16% + 15%); see (A) at the top left of Figure 5.
• The differences between males and females, as shown in (B), look slight, at an average of about 11%, except for Celebrities and figures scandal, which looks more common among females than males by approximately 26%. This result is in line with the clustering dendrogram presented in Figure 4. Here, the clustering figure gives different analytical readings, one of which is the overall behaviors of males against females. Roughly speaking, the produced clusters indicate that males appear more involved in making positive, mixed and criticism comments than females. These comments seem associated with all categorical themes apart from Traditions and Celebrities and figures scandal. In contrast, females tend to post more negative and neutral comments, associated only with the Traditions and Celebrities and figures scandal categories.
• Interaction between commentators and their responses to each other is high in Political or Sarcastic issues, and gradually decreases in the other categories. Unsurprisingly, this is visibly analogous with the high presence of irrelevant comments, which can be a result of exploitation by advertising owners in these categories; see (C) and (D).
• Commenters make little attempt to delve into and engage in Political, Religious, or Opposite Sex issues. However, there appears to be an advertising focus on these categories, which could be the reason behind the boosted level of communication between commenters.
To understand the phenomenon of viral YouTube videos entirely, one should collect data from several reliable resources that provide, e.g., tracing data on the sharing of video links through external social networking platforms (or through existing mainstream media), or data that describe how much robot software tools are being used for spreading video links globally. Since such data are outside the scope of the YouTube platform, let us assume a hypothesis with a typical scenario where the genuine reason that led a particular video to go viral is just the content. This trivial hypothesis simplifies our understanding of this phenomenon by precisely examining one aspect (i.e., a video's content in addition to its comments) while neglecting all other aspects that are difficult to obtain. In this context, we see the leading cause, confined to having attractive positive or negative content, as the implication of what is in line with (1) the main interests of regular viewers or (2) the things that advertising organisations care about. On such rational grounds, we infer from the results reported in Figure 5 that the prevalent categorical themes are Traditions and Sarcastic; thus, supporting these categories may contribute significantly to making an extremely viral video. Furthermore, what seems to attract the social community, in particular, is the promotion of entertainment and/or traditions issues associated with sarcastic or rude remarks. Advertisers, however, are keen to exploit content related to politics, religion, and opposite-sex issues that is, at the same time, also surrounded by rude remarks. Therefore, boosting these circumstances is likely the main reason behind letting regular videos go viral.
Concerning our prediction of the changes in the distributed thematic categorisations, we have conducted a specific experiment to measure the changes. The concept of this experiment lies in adding event times as an additional dimension to our dataset. To avoid ambiguity, we first broke the timeline down into several equal intervals, such that all our selected videos were accessible online during the first interval. Then, for each interval i, we generate a distance matrix by extracting the comments posted within i and analysing them using our sentiment classifier. In (A) of Figure 6, we describe how the classified comments fluctuate over time. By computing the generated set of distance metrics, using Forecast and Trendline equation 7 Table I) over time. (B) illustrates that the predicted change in each thematic category would be inconsiderable, at around 6%, except for Sport and Entertainment, which is expected to grow by almost 10%.
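The interval-based experiment can be sketched in two small steps: assign each comment timestamp to one of several equal-width intervals, then extrapolate a per-interval value one step ahead. The sketch assumes a linear least-squares trendline, which may differ from the exact forecasting equation used in the study, and both function names are hypothetical.

```python
def split_into_intervals(timestamps, n_intervals):
    """Assign each timestamp to one of n equal-width intervals (0-based index)."""
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / n_intervals
    if width == 0:
        return [0 for _ in timestamps]  # all events share one timestamp
    # The latest timestamp falls on the upper edge, so clamp it to the last interval.
    return [min(int((t - start) / width), n_intervals - 1) for t in timestamps]


def linear_forecast(values):
    """Fit y = slope * x + intercept over x = 0..n-1 and predict the value at x = n."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * n + intercept
```

In this setting, `values` would hold one number per interval, such as the share of comments in a thematic category, and the forecast estimates that share for the next interval.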

B. Discussion and Threats to Validity
The feasibility of using our complex AI-based approach in analysing the behaviors of YouTube communities depends primarily on the quality of the collected data, and this probably applies to most machine learning solutions in use. Consequently, our view is that viral YouTube videos can be considered a fertile place for extracting high-quality datasets, which produce accurate readings once the required analysis is conducted correctly. Concerning the soundness of our experiment, we discuss the threats that can impact the internal as well as the external validity of our results.
Regarding internal validity (i.e., aspects that could have affected our findings), the threats may include (1) inaccurate predictions by our sentiment classifier, and (2) the improper use of the cosine metric for generating distance matrices and hierarchical clustering (i.e., a different metric, such as Manhattan or Euclidean, may fit our approach better). On the issue of inaccurate predictions, we do not claim that our sentiment classifier gives 100% correct predictions (no text classifier could reach this percentage); rather, accepting a specific prediction would often be based on a predefined threshold for a particular domain. The threshold considered in this paper is relatively close to the lowest accuracy of our lexicon units (i.e., about 63%), as we excluded all lexicon units that have poor accuracy. This means that the accuracy of our predictions should be above 63%, and such a percentage is acceptable from the authors' point of view. Regarding the use of the cosine formula, using a different formula will intuitively generate different results. However, the cosine metric has been widely used for precisely measuring lexical similarity, and it is a typical metric for examining short text [45].
Threats to external validity concern the scope for generalising the research findings. Here, a potential threat is represented by having incomplete data, collected from a limited number of videos (13). While our approach deals with a single social networking platform (YouTube) in collecting data, there are still relevant data left unconsidered, e.g., data from other social networking platforms as well as from chatbot software tools. However, as explained in the previous subsection, obtaining such data from external resources (i.e., outside the scope of the YouTube platform) is not possible. Hence, we attempted to apply a robust and sophisticated research methodology using unsupervised machine learning for in-depth analysis and understanding. Besides, we are aware that our findings are based on a small number of carefully selected viral videos; to support a proper generalisation of our findings, we have extracted all shared comments (i.e., more than 51,697 comments, see the details in Table I) within a one-year timeframe.

C. Ethical Considerations
Emotional feelings are private to the individual, and mining such private information about a particular person raises a significant and legitimate ethical concern. As intelligent machine learning systems become more powerful and better at understanding human conversations and relationships, they could go beyond human ability in revealing private matters, hence raising critical questions to be addressed around security and privacy [46]. Technically speaking, mining what people express emotionally in the virtual worlds of social media, as conducted in our experiments, is prone to random errors in disclosing the reality of the physical world. This means that the information predicted by mining algorithms is not highly reliable and, therefore, could result in ill-informed decisions.
Text mining and sentiment analysis approaches applied to public social media resources, as knowledge-driven techniques, are meant to give analytics at a high societal level. Despite this fact, our proposed approach is not designed to support oppressive regimes in identifying dissent and/or applying censorship. In this research, the datasets collected from YouTube's API are public and do not contain any details related to the identity of commenters. Moreover, we make no attempt to use the inferred information, such as user genders, to evaluate the private intellectual orientations of commenters.
To the best of our knowledge, no standard ethical guidelines exist to be implemented during the development of an artificial intelligence tool. However, there appears a promising attempt, which is not finalised yet, by a research group called Partnership on AI (PAI) to study the regulations and create such important guidelines [47].

V. CONCLUSION
This paper contributes a convergent analysis approach that can be applied, with negligible customisation, to any social video platform for in-depth analysis and comprehension. The principle underlying this approach depends on an unsupervised machine learning technique that integrates the internal outputs obtained by applying qualitative and quantitative methods independently. For the latter method, we have introduced an optimised version of the well-known Bag of Words algorithm to sentimentally classify any given Arabic text on a five-pole scale using a rich lexicon dictionary. Our work also rationalised the importance of artificial intelligence (including NLP and machine learning) when dealing with a complex dataset that requires text mining analysis or the analysis of user behaviors.
We have empirically analysed 51,697 comments left on 13 trending YouTube videos, along with the videos' contents, to study the phenomenon of virality in Saudi Arabia. One of our main findings is that boosting entertainment, traditions, politics, and/or religious issues when making a video, in association with sarcastic or rude remarks, is likely the preeminent impulse for letting a regular video go viral.
In the future, we intend to further optimise our parallel mixed-methods approach by semi-automating all the internal parts in a web-based application. We will also investigate optimising our sentiment classifier by taking into consideration the linguistic structure and grammar of texts.