Post Classification in the Social Networks using the Map-reduce Algorithm

Wrongdoing is increasing on social media. Detecting it requires highlighting the most interesting topics in the posts. This essential step in characterizing social network users can be carried out by classifying posts. For this, we use a tuple of keywords and the map-reduce algorithm for data collection and extraction. The main purpose is a software implementation that establishes a hub between social networks to extract data and to speed up the classification of posts. The proposed method consists of verifying a sequence of keywords in the posts, following a grammar, in order to determine classes. It allows the categorization of posts and the monitoring of social networks. The categorization facilitates the search for a particular post containing specific words. Thus, we contribute to increasing the capacity for wrongdoing prevention and to strengthening cybersecurity.
Keywords: Big Data; map-reduce; social network; cybersecurity; classification


I. INTRODUCTION
Nowadays, data plays an essential role in enterprise analysis and decision making. Data has become the basic resource for all applications. In particular, applications based on deep learning and data mining techniques need ever more data.
Conversely, the Internet of Things and social networks produce enormous quantities of data, known as Big Data. Owing to the number of posts, social networks offer a diversity of Big Data of several types, such as text, images and video.
Big Data involves problems at several levels, such as data collection, data processing and data visualization. The problem of Big Data visualization has been analyzed by Alexandre Perrot in his thesis [1].
Data extraction is essential to build collections of data on infrastructures for analysis. Frederic and others [3] used Netvizz to extract data from Facebook in order to carry out surveys on the 2015 presidential election in Burkina Faso.
Big Data processing needs the parallelization of data and tasks in order to reduce processing time. In 2004, Dean and others [2] proposed the map-reduce framework, for instance to count words in enormous documents. The techniques for processing big data have since multiplied, with several implementations such as Hadoop and Spark [5], [7]. Map-reduce has also been used in [6] for data mining and in [4] for parallel algorithms. A survey on the map-reduce framework has been proposed by N. Alamelu Menaka and others in [10].
Both the map-reduce algorithm and classification are the subject of several works [8], [9]. For example, in [8] Ouatik and others classified students into four classes (scientific, literary, technical and original) using the map-reduce algorithm.
Hadoop has been extended to take image processing into account. For example, the Hadoop Image Processing Interface (HIPI) was proposed by the University of Virginia Computer Graphics Lab in 2016: it extends Hadoop with new functions that allow distributed images to be processed with image processing techniques. SERE and others [11] introduced an application of the Hough Transform based on the map-reduce algorithm to improve processing time for straight line detection in large distributed images. This work has been extended by Mateus Coelho and others [12] to circle detection using the map-reduce algorithm.
In another direction, in recent years, scientists have worked on multi-label classification for web page categorization. In [13], Yaya and others studied multi-label classification using an ontology in order to classify web pages.
Every day, users post relevant information on social networks that can be used to characterize them and to determine their behaviour. The posts provide a way to obtain information on users, to predict their future behaviour, to profile them and to know what they are doing or planning to do.
A selection of keywords generally allows the detection of posts related to specific subjects such as wrongdoing, fake news, terrorism or COVID-19. For example, a list of keywords in the posts makes it possible to track user behaviour.
A hub between social networks that implements the map-reduce algorithm and uses a set of keywords to extract data quickly from social networks does not exist.
However, interface layers such as Application Programming Interfaces (APIs) are available to extract posts from social networks using keywords. For instance, the Twitter API can be accessed in Python through the tweepy library, which allows the extracted posts to be saved into a file.
In this paper, we address the issue of post categorization in social networks. The objective is to facilitate profiling by classifying the most discussed topics in social networks. To do so, the map-reduce algorithm is used to establish data parallelism. Also, as is usual in big data, we use parallelism in data extraction to speed up data processing. The purpose is the monitoring of social networks through software connected simultaneously to several networks.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 12, 2020
The paper is organized as follows: Section II presents preliminaries, followed by the method description in Section III. Finally, Section IV deals with the simulation of parallelism using the thread library and of parallelism through the map-reduce framework.

II. PRELIMINARIES
This section focuses on concepts from the theory of words and languages and explains the map-reduce algorithm, to support the understanding of the following sections.
In the theory of words and languages, a word is a finite sequence of symbols of an alphabet, while a language is a set of words. A language can be defined by a regular expression or by a grammar. For instance, let $A = \{a, b, c\}$ be an alphabet. "aa" and "abc" are words over the alphabet $A$. $A^*$ is the universal language: it is the set of all the words over the alphabet $A$. Any word over $A$ is a member of $A^*$, and any language over $A$ is a subset of $A^*$.
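The definitions above can be illustrated with a minimal Python sketch. The function names `words_up_to` and `is_word_over` are hypothetical helpers introduced only for this illustration; since $A^*$ is infinite, the sketch enumerates only the words up to a chosen length, and it follows the usual convention that the empty word belongs to $A^*$.

```python
from itertools import product

ALPHABET = ("a", "b", "c")  # the alphabet A from the text


def words_up_to(alphabet, max_len):
    """Yield every word over `alphabet` of length 0..max_len (a finite slice of A*)."""
    for length in range(max_len + 1):
        for symbols in product(alphabet, repeat=length):
            yield "".join(symbols)


def is_word_over(word, alphabet):
    """A word belongs to A* exactly when each of its symbols is in the alphabet."""
    return all(symbol in alphabet for symbol in word)


slice_of_a_star = list(words_up_to(ALPHABET, 2))  # 1 + 3 + 9 words
language = {"aa", "abc"}  # a language is any subset of A*
```

Here `language` is a subset of $A^*$ because `is_word_over` holds for each of its words, whereas a string such as "abd" is not a word over $A$.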
In this paper, the keywords can be words over any alphabet, in any language used to write posts, such as French or English.
The map-reduce algorithm was initially proposed in 2004 by Dean and others [2], when working at Google. It is made up of three phases: the mapping phase, the shuffle phase and the reducing phase. The map function defines a transformation that accepts as input a single key-value pair $(k, v)$ and produces as output a set of intermediate key-value pairs $(k_i, v_i)$. At the end of all the map functions, several pairs $(k_i, v_i)$ have been produced for the shuffle phase. The shuffle phase forms groups of the pairs that have the same key $k_i$. For instance, from the pairs produced by all the map functions, the shuffle may create three groups, each identified by its key $k_i$, as summarized in Table I:

TABLE I (columns: Groups, Data)
Later, in the method description, the value $v_i$ will correspond to a post, and $k_i$ will denote the criterion common to all the posts in a class.
The reduce phase processes each group created by the shuffle phase: the reduce function applies an operation to the values $v_i$ of each group. For example, with the operation +, the values of each of the previous groups are added together, and the map-reduce algorithm outputs one pair per group, made of the key $k_i$ and the sum of the values of its group. There is a similarity between the wordcount of Dean and others [2] and the classification proposed in our method.
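The three phases can be sketched in plain Python as a single-process wordcount. This is only an illustration of the data flow, not a distributed implementation; the input documents and the names `map_fn`, `shuffle`, `reduce_fn` and `map_reduce` are assumptions made for this example.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: turn one (k, v) pair into intermediate (k_i, v_i) pairs.
    Here k is a document id and v its text; each word is emitted with the value 1."""
    return [(word, 1) for word in value.split()]


def shuffle(pairs):
    """Shuffle: group the intermediate values that share the same key k_i."""
    groups = defaultdict(list)
    for k_i, v_i in pairs:
        groups[k_i].append(v_i)
    return dict(groups)


def reduce_fn(key, values):
    """Reduce: combine the values of one group, here with the operation +."""
    return key, sum(values)


def map_reduce(inputs):
    """Run the map, shuffle and reduce phases one after the other."""
    intermediate = [pair for k, v in inputs for pair in map_fn(k, v)]
    return dict(reduce_fn(k, vs) for k, vs in shuffle(intermediate).items())


documents = [("d1", "a b a"), ("d2", "b c")]  # illustrative input, not from the paper
counts = map_reduce(documents)
```

With this input, the shuffle creates three groups (one per distinct word), and the reduce phase replaces each group by the sum of its values.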

III. METHOD DESCRIPTION
This section describes the data extraction technique, which uses an API layer connected to social networks, and the classification of posts based on the map-reduce algorithm.
In social networks, a post is made of text, images and video. Often, a text comments on an image or a video, which means that text analysis is useful for describing an image or a video. A text is a set of sentences or a sequence of words. For instance, the following message, posted in French on Facebook by the "Ministère du Développement de l'Economie Numérique et des Postes", presents two sentences, sentence 1 and sentence 2:
• sentence 2 is: "Le lien pour l'inscription: https://bit.ly/3gRDsy0" ("The registration link: https://bit.ly/3gRDsy0").
A sentence is a sequence of words. Conversely, a sequence of words is not always a correct sentence. Formally, a sequence of words is represented by a tuple of keywords $(w_1, w_2, \dots, w_{n-1}, w_n)$ where $w_i$ is a keyword. In the previous post, such a tuple can be (panélistes, stratégie, développement).
Data extraction is based on the definition of a criterion.

A. Data Extraction
Data extraction consists of regularly extracting posts from social networks according to the keywords and storing them in a NoSQL database, in preparation for the search phase on the posts. The problem is also how to perform data extraction and collection quickly.
Social networks offer facilities through APIs to access posts, under certain constraints. One problem of data extraction is that there are no connections between social networks through a common hub to facilitate the extraction and collection of posts using keywords. This leads us to establish a layer to perform data extraction.
Due to the large volume of posts from social networks, it is necessary to use appropriate techniques in this layer, such as the parallelism of data or tasks, to process all the posts. For instance, Twitter provides an Application Programming Interface (API) that allows the extraction of posts using keywords. A solution is to parallelize the functions available in this API to accelerate data extraction.
Thus, the proposed method introduces a layer, based on the map-reduce algorithm, that performs data extraction and transfers the data to a NoSQL database for data collection. It consists of injecting code into the map and reduce functions to define the criteria for data extraction. The layer communicates with the API layer in order to access the social network's database of posts. Fig. 1 illustrates the global architecture of data extraction and collection.
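The flow from the API layer to the extraction layer and then to the document store can be sketched as follows. This is a simplified, sequential illustration under stated assumptions: `fake_api_search` is a hypothetical stand-in for a real social-network API call and returns canned posts, and a Python dictionary stands in for the NoSQL database.

```python
def fake_api_search(network, keyword):
    """Hypothetical stand-in for a social-network API call; returns canned posts."""
    sample = {
        "twitter": ["covid19 cases rise", "terrorism alert issued", "nice weather"],
        "facebook": ["covid19 vaccine news", "holiday photos"],
    }
    return [text for text in sample[network] if keyword in text.lower()]


def extract(networks, keywords, store):
    """Extraction layer: query every network for every keyword and
    save each matching post as one document in `store`."""
    for network in networks:
        for keyword in keywords:
            for text in fake_api_search(network, keyword):
                doc_id = f"{network}:{text}"  # the same post is stored only once
                doc = store.setdefault(
                    doc_id, {"network": network, "text": text, "keywords": []}
                )
                if keyword not in doc["keywords"]:
                    doc["keywords"].append(keyword)
    return store


# The dict plays the role of the NoSQL collection in this sketch.
store = extract(["twitter", "facebook"], ["terrorism", "covid19"], {})
```

Keeping the API call behind a single function means that the extraction layer does not depend on any particular network, which is the role the text assigns to a common hub.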

B. Data Classification
In classification, a criterion establishes conditional instructions that describe how to put data together in the same group: it gives the common characteristics or properties of the data, or the conditions that data must satisfy to be considered in the same class. The criteria are based on the tuple $(w_1, w_2, \dots, w_{n-1}, w_n)$ and on a grammar whose final rule is $A_n :- w_n$. The grammar enforces the ordering of the keywords, aligning them syntactically. The empty sequence, however, is not considered a keyword in this context. The method consists of verifying that a sequence of keywords appears in the posts. The sequences of keywords are produced by the criteria based on the tuple $(w_1, w_2, \dots, w_{n-1}, w_n)$. The grammar generates sequences of the form $(w_i, \dots, w_j)$ where $i < j$. The tuple defines the keywords and establishes their order, which is enforced by the grammar. The grammar also generates the groups of keywords. For instance, the four groups of keywords $w_1$, $(w_1, w_4, w_5)$, $w_2$ and $(w_4, w_{10})$ correspond respectively to four classes, as summarized in Table II, which is similar to the wordcount [2] and to Table I in the preliminaries:
• class 1 is the set P1 of posts that match the keyword $w_1$;
• class 2 is the set P2 of posts that match the sequence $(w_1, w_4, w_5)$ in that order;
• class 3 is the set P3 of posts that match the keyword $w_2$;
• class 4 is the set P4 of posts that match the sequence $(w_4, w_{10})$ in that order.
If two posts contain the same sequence of words, they belong to the same class. Thus, the order of the words in the tuple influences the result of the classification. The order $(w_i, \dots, w_j)$ where $i > j$ is not considered in this analysis and could be studied in future work.
In terms of combinations, the number of combinations of the keywords in the tuple $(w_1, w_2, \dots, w_{n-1}, w_n)$ is $C_n^1 + C_n^2 + \dots + C_n^{n-1} + C_n^n$, and corresponds to the number of post classes.
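As a side remark, this sum has a closed form. By the binomial theorem, the sum of all binomial coefficients $C_n^k$ for $k = 0, \dots, n$ equals $2^n$, so removing the term $C_n^0 = 1$ (the empty combination) gives the number of classes $N$:

```latex
\[
N \;=\; \sum_{k=1}^{n} C_n^k
  \;=\; \sum_{k=0}^{n} C_n^k \;-\; C_n^0
  \;=\; 2^n - 1,
\qquad \text{e.g. } n = 2 \;\Rightarrow\; N = C_2^1 + C_2^2 = 2 + 1 = 3 .
\]
```

The number of classes therefore grows exponentially with the number of keywords in the tuple.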
The ordering of the keywords, following the tuple, determines a semantic meaning associated with a post that matches these keywords. For instance, a post that matches the pair $(w_2, w_1)$, which is not ordered according to the tuple, is a member of both the class $w_1$ and the class $w_2$. In this manner, a post can be a member of two different classes.
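The generation of keyword groups and the ordered matching can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: posts are split on whitespace, the keywords of a group need not be adjacent in the post, and the names `keyword_groups`, `matches_in_order` and `classes_of` are hypothetical.

```python
from itertools import combinations


def keyword_groups(keywords):
    """All non-empty groups of keywords that keep the order of the tuple,
    i.e. C(n,1) + ... + C(n,n) groups."""
    return [
        group
        for size in range(1, len(keywords) + 1)
        for group in combinations(keywords, size)
    ]


def matches_in_order(post, group):
    """True when every keyword of `group` occurs in `post`, in the same order.
    The shared iterator is consumed left to right, which enforces the order."""
    words = iter(post.lower().split())
    return all(keyword in words for keyword in group)


def classes_of(post, keywords):
    """Return the keyword groups (classes) that the post belongs to."""
    return [g for g in keyword_groups(keywords) if matches_in_order(post, g)]
```

For the tuple `("w1", "w2")`, a post such as `"w2 then w1"` is placed in the classes `("w1",)` and `("w2",)` but not in `("w1", "w2")`, since the keywords appear in the reverse order.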
The classification is performed by the map-reduce algorithm, introduced here as a template to implement the parallelization of data and tasks and to accelerate data processing. By analogy with the wordcount of Dean and others, which counts the number of words in a document [2], the proposed model processes the tuple of keywords to verify whether they appear in order in the posts, and outputs the classes of posts. Algorithms 1 and 2 show respectively the content of the map function and of the reduce function, which deal with the parallelization of data and tasks. The grammar is implemented in the map function. The classification itself is done automatically by the shuffle phase. The reduce function outputs the classes.
In Algorithm 1, tab[1..N] is a vector of keywords where each element corresponds to a class: the keywords generated by the grammar are inserted into the vector tab[1..N], so that $N = C_n^1 + C_n^2 + \dots + C_n^{n-1} + C_n^n$. The values produced have the type vector<Post>, in order to prepare the reducing phase. As illustrated in Algorithm 2, the reducing phase removes duplicate posts from a vector of posts. Simulations of the proposed algorithms are presented in Section IV.
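The roles described for the map, shuffle and reduce phases can be sketched in single-process Python. This is a hedged reading of the description above, not a transcription of Algorithms 1 and 2: the function names, the whitespace tokenization and the sample posts are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_tab(keywords):
    """tab[1..N]: one entry per keyword group generated from the tuple."""
    return [g for r in range(1, len(keywords) + 1) for g in combinations(keywords, r)]


def in_order(post, group):
    """True when the keywords of `group` appear in `post` in the given order."""
    words = iter(post.lower().split())
    return all(k in words for k in group)


def map_posts(post, tab):
    """Map: emit (class key, [post]) for every group of tab matched in order."""
    return [(group, [post]) for group in tab if in_order(post, group)]


def shuffle(pairs):
    """Shuffle: gather the posts emitted under the same class key."""
    groups = defaultdict(list)
    for key, posts in pairs:
        groups[key].extend(posts)
    return groups


def reduce_posts(key, posts):
    """Reduce: drop duplicate posts while keeping first-seen order."""
    return key, list(dict.fromkeys(posts))


def classify(posts, keywords):
    tab = build_tab(keywords)
    pairs = [pair for post in posts for pair in map_posts(post, tab)]
    return dict(reduce_posts(k, v) for k, v in shuffle(pairs).items())


posts = ["terrorism fears grow", "covid19 update", "covid19 update",
         "terrorism and covid19 dominate the news"]  # invented sample posts
classes = classify(posts, ("terrorism", "covid19"))
```

In this sketch the duplicated post "covid19 update" appears only once in its class after the reduce phase, and the last post belongs to all three classes.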

IV. NUMERICAL SIMULATION AND TIME EVALUATION
This section evaluates processing time as a function of the number of posts extracted from social networks.
Consider the tuple $(w_1, w_2, \dots, w_{n-1}, w_n)$ where $n = 2$, that is, the tuple $(w_1, w_2)$. Suppose that $(w_1, w_2)$ = (terrorism, covid19). The number of classes is then $N = C_2^1 + C_2^2 = 3$. That leads to three classes, as summarized in Table IV. Considering the pair of keywords (terrorism, covid19), a layer written in Python is used to extract data from Twitter and Facebook, implementing data parallelism. The layer has been connected to the Twitter server and the Facebook server through the tweepy and Facebook APIs respectively.
Here, parallelism has been implemented in two alternative ways: data parallelism based on the thread library in Python, and data parallelism through the map-reduce algorithm with Spark Streaming.

A. Data Parallelism using the Library Thread (in Python)
In this approach, three threads, task 1, task 2 and task 3, have been created. Task 1 is connected to Twitter to search for the posts that correspond to the keyword "Covid19". Task 2, also connected to Twitter, searches for the keyword "terrorism", while task 3 processes the posts of Facebook. Thus, a common platform is created between Twitter and Facebook to extract the posts. Fig. 2 shows, step by step as the number of posts grows, the processing time in the serial case and in the parallel case. It clearly shows that, as the number of extracted posts increases, processing time improves in the parallel case.
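The three-task setup can be sketched with Python's standard `threading` module. This is a minimal sketch under stated assumptions: `fake_search` is a hypothetical stand-in that sleeps to mimic the latency of an API request instead of calling Twitter or Facebook, and the keyword given to the Facebook task is chosen arbitrarily.

```python
import threading
import time


def fake_search(network, keyword, delay=0.05):
    """Hypothetical stand-in for an API request; sleeps to mimic network latency."""
    time.sleep(delay)
    return [f"{network} post about {keyword}"]


def run_serial(jobs):
    """Serial case: one request after the other."""
    return [post for network, keyword in jobs for post in fake_search(network, keyword)]


def run_parallel(jobs):
    """Parallel case: one thread per job, results gathered in a shared list."""
    results, lock = [], threading.Lock()

    def task(network, keyword):
        posts = fake_search(network, keyword)
        with lock:  # protect the shared result list
            results.extend(posts)

    threads = [threading.Thread(target=task, args=job) for job in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every task has finished
    return results


# task 1, task 2 and task 3 (the Facebook keyword is an assumption)
jobs = [("twitter", "covid19"), ("twitter", "terrorism"), ("facebook", "covid19")]
```

Because the simulated requests spend their time waiting, the three threads overlap their delays; wrapping `run_serial(jobs)` and `run_parallel(jobs)` with `time.perf_counter()` gives a rough view of the kind of gap the text reports between the serial and parallel cases.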

B. Data Parallelism using Spark Streaming
Many tools can be used for data parallelism in real time. For instance, Apache Storm [14], Apache Flink [15] and Spark Streaming [16] implement the map-reduce framework and can be connected directly to Twitter, as a data source, to obtain the posts. Here, map-reduce in Spark Streaming, connected to Twitter, has been implemented following Algorithms 1 and 2: the keywords used in the map and reduce functions are "terrorism" and "Covid19". Processing times in the three cases (serial, parallel and Spark Streaming) are illustrated in Fig. 3. It shows that:
• from 0 to 70 tweets, the tweet collection time in Spark Streaming is lower than in the serial case and higher than in the parallel case;
• beyond 70 tweets, the collection time curve for Spark Streaming decreases;
• from 80 to 100 collected tweets, the curve for Spark Streaming keeps decreasing and stays below both the curve of the serial case and the curve of the parallel case.
Finally, the experimental results show that data parallelism using a map-reduce tool in streaming yields more posts in a short time. Thus, the map-reduce tool Spark Streaming is better than the serial case and the parallel case for post extraction in terms of processing time, once the number of posts increases beyond 80.
Future research will focus on a deep analysis of the network frames produced by Spark Streaming. We will study in detail the content and the functionalities of Spark Streaming, notably the number of dynamic threads generated for scalability control as the number of posts increases. This will contribute to a better understanding of the advantages of Spark Streaming over the direct use of the thread library.
Moreover, extending Spark Streaming or other tools to cover other social networks would readily lead to software for monitoring the content of posts on social networks. Unfortunately, many social networks do not allow streaming access to posts through a map-reduce tool.

V. CONCLUSION
This paper has proposed the use of the map-reduce algorithm to quickly extract posts from social networks such as Twitter and Facebook, in order to classify posts based on a tuple of keywords and a grammar. The tuple defines the order of the keywords, while the grammar generates the groups of keywords corresponding to the classes. The keywords also serve to index the classes.
The experimental results reveal that Spark Streaming is better than parallelism using the thread library (in Python) for post extraction in terms of processing time as the number of posts to extract increases, for instance above 80 posts in our study.
In future work, data extraction will be extended to more social networks in order to obtain more posts, depending on the availability of an API layer to connect to each social network. Through Spark Streaming, for instance, it will be interesting to give access to the posts of more social networks, not only Twitter, using the API layer proposed by each social network, in order to strengthen the common platform, which is the main purpose.
Future work will also focus on analysing the functionalities of Spark Streaming and on the study of class indexation, to facilitate searches on stored posts and to highlight data characteristics.
The tuples of keywords $(w_1, w_2, \dots, w_{n-1}, w_n)$ could serve as parameters for tuples of ontologies in a specific domain. That leads to other deep analyses with the map-reduce algorithm, in future work.