A Decision Support System for Detecting Age and Gender from Twitter Feeds Based on Comparative Experiments

Author profiling aims to correlate writing style with author demographics. This paper presents an approach to building a Decision Support System (DSS) for detecting age and gender from Twitter feeds. The system uses Deep Learning (DL) and Machine Learning (ML) algorithms to distinguish between age and gender classes. The results show that each algorithm performs differently on age and gender detection, depending on its model architecture and strengths. The proposed decision support system, which adopts deep learning models based on CNN and LSTM, is the most accurate at predicting the age and gender of an author from his or her tweets. Our results outperform those obtained in the PAN at CLEF 2019 competition.
Keywords: Decision support system; age detection; gender detection; author profiling; deep learning; machine learning


I. INTRODUCTION
Recently, many studies have focused on the field of Information Retrieval (IR). The primary goal of IR is to extract critical information in a given field by predicting a person's behavior, preferences, and characteristics, and then to filter the massive amounts of available data into useful information. This information is used in decision support systems, which can support marketing strategies [1], give viewing suggestions to users [2], and filter and translate languages [3]. Such systems can also support investigations by analyzing sensitive text for national security, detecting the source of a threat, and helping police infer the characteristics of a criminal from his or her language [4] [5].
In addition, decision support systems (DSS) play an important role nowadays. Bidding websites, for example, need to know the age and gender of a user in order to propose suitable services or clothes. Obtaining this information is becoming a challenging problem, which motivates a decision support system based on author profiling.
The aim of this work is to address the problem of author profiling (AP) by proposing a decision support system (DSS) for detecting age and gender from Twitter feeds. Author profiling means determining the gender and age group of the author of a text by examining his or her writing and identifying its stylistic features. English tweets from the PAN-AP-2019 dataset are taken as input, and the system predicts age and gender from extracted features that distinguish between the gender and age classes. To achieve the best result, multiple algorithms were implemented: Deep Learning algorithms (Convolutional Neural Network (CNN), Deep Neural Network (DNN), and Long Short-Term Memory (LSTM)) and Machine Learning algorithms (Naïve Bayes (NB), k-Nearest Neighbors (KNN), Decision Trees (DT), Support Vector Machines (SVM), Neural Network, and Random Forest (RF)). The paper is organized as follows. Section 2 presents a brief review of author profile prediction methods. Section 3 introduces the method used to discriminate between authors. Section 4 presents the experiments and evaluations. Finally, the conclusion summarizes the work and is followed by future work.

II. LITERATURE REVIEW
Text classification depends on statistical procedures and text mining, which produce outputs from the computed frequencies of extracted terms. Author profiling distinguishes between classes of authors by studying their sociolect aspect, that is, how people share language. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of increasing significance for applications in forensics, security, and marketing. Table I summarizes recent research on detecting age and gender based on DL and ML algorithms.
In addition, some works by pioneering researchers need to be covered. The authors in [14] explored the possibility of automatically classifying an author by gender or age using the AP method. AP predicts features linked to the author of a text and covers many dimensions, for example gender, age, native language, personality, and education level. As stated in [14], men use more determiners (this, the, an, etc.) and quantifiers (more, little, two, etc.), whereas women are more concerned with relationships and thus use personal pronouns (I, me, her, you, etc.) more than men do. Furthermore, [14] modeled author profiling as a prediction system and proposed a technique known as the Koppel algorithm. It counts the occurrences of 467 English keywords (such as too, a, their, yourself, etc.) in a text to infer the gender of its writer. Writing styles were analyzed from textbooks, fiction, and tests, and the program was able to deliver four out of five correct answers.
The authors in [15] worked on the automatic classification of messages in Arabic and English. They used datasets covering three demographic and four psychometric author traits. Classifiers were trained on linguistic features with the WEKA toolkit, using the following machine learning algorithms: 1) decision trees (J48), 2) Random Forest, 3) lazy learners (IBk), 4) rule-based learners (JRip), 5) Support Vector Machines (SMO), 6) ensemble/meta-learners (Bagging), and 7) AdaBoostM1. Bagging appears to benefit from feature selection, whereas the SVM-based SMO algorithm does not improve when combined with feature selection. Overall, SMO and Bagging gave the best performance for all traits examined. The study correctly classified 81.5% of documents by gender and 72% by age.
The authors in [16] first treat gender prediction as a text classification problem and second examine the influence of stylistic features (e.g., word lengths) on predicting gender. They used a dataset from a chat server (Heaven BBS), where users communicate peer-to-peer via textual messages. The dataset consists of message logs that store the users' outgoing messages. They applied two techniques to the gender classes: term-based classification and style-based classification. In term-based classification, they used supervised learning with four algorithms (k-NN, naive Bayesian, covering rules, and back-propagation), and every experiment was repeated five times. In style-based classification, they used many stylistic features from the message dataset. Because chat messages include emotion carriers called smileys and emoticons, they also studied these, along with word and phrase lengths, as characteristics of computer-mediated text.
Regarding decision support systems (DSS), many systems have been proposed or improved for areas such as the diagnosis of periodontal disease, sales, and business idea competitions. A DSS is a computer program used to support decisions, judgments, and courses of action in an organization or business. It collects and analyzes large amounts of unstructured data to predict information that can help in taking the right step in the future or in solving problems [36] [37]. As mentioned above, the proposed decision support system helps to solve the problem of author profiling. In the next section, our data and methodology are explained in detail.

III. DATA AND METHODOLOGY

A. Corpus
The data was collected from the CLEF (Cross-Language Evaluation Forum) initiative, a self-organized body whose main goal is to promote research, innovation, and the development of information access systems, with an emphasis on multilingual and multimodal information with various levels of structure. Since 2010, CLEF has hosted the evaluation labs of PAN (Plagiarism, Authorship and Near-Duplicate Content). Our corpus is therefore taken from the PAN-AP-2019 dataset. The feeds are taken from Twitter and are in English. The dataset is split into two subsets: "training" and "test". It is labeled with age and gender classes [6]. For age classification, there are four classes ("18-24", "25-34", "35-49", and "50 and more"). For gender classification, there are two classes ("Female" and "Male"). There are 14,166 tweets, unevenly distributed across the classes, as shown in Table II.

B. Architecture of the System
The proposed system consists of the following steps (see Fig. 1).
• Text Processing: There are two main problems with the corpus. First, the model cannot take the tweets directly as input. Second, the raw text can be messy. The raw data therefore needs to be converted into a clean dataset by removing noise, such as HTML tags, white spaces, inflected words, and symbols, that would reduce detection accuracy [17].
• Text Mining and Feature Extraction: In this step, patterns and trends are extracted from the text through linguistic and statistical processing that studies the frequency distribution of words in the corpus, and the words are sorted by frequency [17]. These terms are then grouped into thematic classes to distinguish between age groups and gender classes, based on both stylistic and content-based features [18]. To identify the most frequent words in the corpus, remove redundant attributes, reduce the computational cost of processing, and improve classification performance, TF*IDF is calculated for each class, and this step is iterated until the most discriminating attributes are obtained [17], as given in its standard form in Eq. (1) [19]:
TF-IDF(t, d) = TF(t, d) × log(N / DF(t))    (1)
where TF(t, d) is the frequency of term t in document d, N is the number of documents, and DF(t) is the number of documents that contain t.
• Classification Models: As mentioned above, DL and ML algorithms were used to implement the classification step for the gender and age data. These methods are explained next.
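The text-processing and TF*IDF steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `clean_tweet` and `tf_idf`, the regular expressions, and the choice of the standard weighting tf × log(N / df) are assumptions made for illustration only.

```python
import html
import math
import re
from collections import Counter


def clean_tweet(text):
    """Strip HTML tags, URLs, symbols and extra spaces (illustrative rules only)."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags
    text = html.unescape(text)                         # entities such as &amp;
    text = re.sub(r"https?://\S+", " ", text)          # URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # symbols and punctuation
    return re.sub(r"\s+", " ", text).strip()           # collapse white space


def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [clean_tweet(doc).split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty tweets
        weights.append(
            {term: (count / total) * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights


# Hypothetical toy corpus: terms shared by every tweet get weight zero,
# terms that occur in only one tweet get the highest weight.
sample = ["<b>Great</b> match today!! http://example.com", "Great dinner today"]
for tweet_weights in tf_idf(sample):
    print(sorted(tweet_weights.items(), key=lambda item: -item[1]))
```

Terms with the highest weights within a class are the candidates for the "most discriminating attributes" mentioned above; how many of them to keep is a tuning choice that the sketch leaves open.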

1) Deep learning (DL):
Deep learning is primarily a statistical technique for classifying patterns, based on sample data, using neural networks with multiple layers. The input layer holds a set of input units, which represent things such as words or pixels. These are followed by hidden layers containing hidden units (the more such layers, the deeper the network is said to be) and, finally, a set of output units, with connections running between the nodes. Over time, the back-propagation algorithm allows a process called gradient descent to adjust the connections between units so that any given input tends to produce the corresponding output [20]. The deep learning methods used include the following:
• Deep Neural Network (DNN): An artificial neural network that simulates the activity of neural networks in the human brain. It contains an input layer, an output layer, and multiple hidden layers that find the correct mathematical manipulation to turn the input into the output; data flows forward without looping back [21].
• Convolutional Neural Network (CNN): A technique widely used in image recognition. It is a regularized variant of the multilayer perceptron, that is, of fully connected networks in which each neuron in one layer is linked to all neurons in the next layer. Instead of full connectivity, a CNN applies convolutional filters with shared weights to local regions of the input [22].
• Long Short-Term Memory (LSTM): A class of artificial recurrent neural network (RNN) that addresses the problem of training RNNs with back-propagation through time (BPTT). It can process long sequences with long-range dependencies, such as speech or video, and is therefore considered a solution to the short-term memory of RNNs. It uses distinct units called memory blocks in the recurrent hidden layer [23]. The memory blocks contain [24]:
  o Memory Cells: These remember the historical state of the network. Each cell has input weights, output weights, and an internal state, as follows:
    • Input Weight: Weights applied to the input at the current time step.
    • Output Weight: Weights applied to the output from the previous time step.
    • Internal Weight: The internal state used to calculate the output.
  The input gate and forget gate update the internal state, and the output gate acts as the final filter on the output. Together with a constant flow of data through the cell, called the constant error carousel (CEC), these gates protect the cells from both exploding gradients (where the values of the model's weights quickly become very large during training) and vanishing gradients.
Most of the advantages of LSTM come from training with back-propagation through time (BPTT), a supervised learning algorithm that adjusts the weights to minimize the error between the actual output and the expected output for the corresponding input [24].
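To make the roles of the gates and the internal state concrete, the following is a minimal sketch of one forward step of a single-unit LSTM cell in the commonly used formulation. The scalar simplification, the function name `lstm_step`, and the parameter layout are assumptions for illustration; they are not taken from the implementation used in this paper, and training with BPTT is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell with scalar input and state.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)         # how much new information to write
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # candidate value for the cell state

    c = f * c_prev + i * g             # additive update of the internal state
    h = o * math.tanh(c)               # output, filtered by the output gate
    return h, c


# Hypothetical weights, chosen only to run the cell over a short sequence.
params = {name: (0.5, 0.5, 0.0)
          for name in ("input", "forget", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

The additive form of the cell-state update is what lets error signals flow across many time steps, which is the role attributed to the constant error carousel above.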

2) Machine learning (ML):
Machine learning is a research field of artificial intelligence that extracts knowledge from data by analyzing it in order to make predictions and decisions. Machine learning models have become part of everyday life, from automatic movie recommendations to suggesting what food to order or recognizing people in photos [25]. The following ML techniques are used in our approach:
• K-Nearest Neighbors (K-NN): A classification algorithm that predicts the class of a new data point by finding the fixed number k of training points nearest to it and assigning it to the majority class among them. It is one of the simplest algorithms to use and understand, and it often performs well without significant tuning. However, it is a lazy learner and does not handle a large number of features well [25].
• Naïve Bayes Classifiers (NB): A family of classification algorithms based on Bayes' theorem. They assume that the presence of a particular feature in a class is unrelated to the presence of any other feature. Their advantages are fast prediction and the ability to deal with large datasets [25].
• Decision Trees (DT): Classification and regression algorithms that perform classification without requiring much computation. They also give a clear indication of which fields are most important for prediction or classification. The goal of the model is to predict the value of a target variable from several input variables. Unlike linear models, they can adapt to either type of problem (classification or regression) [25].
• Random Forest (RF): A classification algorithm that combines several decision trees, each slightly different from the others. It addresses the problem of over-fitting the training data: over-fitting is reduced by averaging the results of the trees [25].
• Support Vector Machine (SVM): An efficient supervised ML algorithm that performs well on both classification and regression tasks. It is accurate on small, clean datasets but is less suitable for large datasets because it requires more training time. SVM is trained on data consisting of features and their class labels, in two stages:
  o Learning Stage: SVM finds the data points closest to the decision boundary, known as Support Vectors (SVs), which define the best separation between the classes [26].
  o Prediction Stage: These SVs are used to predict the class of test records [26].
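As an illustration of the simplest of the ML techniques listed above, the following minimal sketch implements k-nearest-neighbors classification by majority vote. The two-dimensional feature vectors and their labels are hypothetical toy values, not features extracted from the PAN-AP-2019 corpus, and the function name `knn_predict` is an assumption.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Assign `query` the majority label among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two stylistic feature values per author.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["female", "female", "female", "male", "male", "male"]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

Because every prediction scans the whole training set, the cost grows with both the number of tweets and the number of features, which is the sense in which the method is described above as lazy.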

C. Evaluation of the System
Using the confusion matrix shown in Table III, accuracy, precision, recall, and F-score can be calculated with Eqs. (2), (3), (4), and (5), which evaluate the performance of the ML algorithms on our unbalanced dataset [27]. For the DL algorithms, accuracy was calculated [28]. True Positives (TP) and True Negatives (TN) are correctly labeled positive and negative predictions. Conversely, False Positives (FP) and False Negatives (FN) are incorrectly labeled positive and negative predictions.
• Accuracy: Measures the proportion of all predictions that the system gets right [34].
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2)
• Precision: The number of true positives divided by the number of all positive predictions [34].
Precision = TP / (TP + FP)    (3)
• Recall: The proportion of actual positive instances that the system correctly identifies [34].
Recall = TP / (TP + FN)    (4)
• F-score: The harmonic mean of recall and precision [34].
F-score = 2 × Precision × Recall / (Precision + Recall)    (5)
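The four measures above can be computed directly from the counts in a binary confusion matrix. The following minimal sketch uses their standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not results from this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators, e.g. when no positive predictions are made.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On an unbalanced dataset, accuracy alone can look good while one class is rarely detected, which is why precision, recall, and F-score are reported alongside it.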

IV. EXPERIMENTATION AND RESULTS
In this research, DL and ML algorithms were implemented to detect the age and gender of an author. The DL algorithms gave better results for our model than the ML algorithms. In the first implementation of our approach, however, the CNN age model and the LSTM age and gender models did not give the expected results.
Note that using either "sigmoid" or "softmax" as the activation function does not affect the gender model, because it has only two classes (binary classification) [31]. For the age model (multi-class classification), on the other hand, we used "softmax", but the accuracy and loss stayed the same at every step in both models. Following the solution given in [32], we changed the activation function to "sigmoid", but this produced negative loss values. We therefore changed the loss function from "binary_crossentropy" to "mean_squared_error", which gave good accuracy, as shown in Table IV. We then investigated the causes of these problems and solved them as described below:
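The remark about the two activation functions can be checked numerically: for two classes, a softmax over the logits (z, 0) gives the same probability as a sigmoid applied to z. The sketch below also shows one way a binary cross-entropy can become negative, namely when the target is an integer class index outside the range [0, 1]. That this is what happened in the experiments above is an assumption, not something the source confirms; all function names and values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, p):
    """Binary cross-entropy for one prediction p in (0, 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# Two-class case: sigmoid(z) equals the first softmax output for logits (z, 0).
z = 1.3
print(sigmoid(z), softmax([z, 0.0])[0])

# With a proper binary target the loss is non-negative.
print(binary_crossentropy(1, 0.9))

# Assumed failure mode: an age-class index such as 3 used as the target
# makes the same formula return a negative value.
print(binary_crossentropy(3, 0.9))
```

For more than two classes, the usual remedy is to one-hot encode the labels and pair a softmax output with a categorical cross-entropy; the mean-squared-error loss adopted above is the alternative chosen in this work.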

A. LSTM Gender's Code Problem and Solution
Using the 50,000 most frequent words in the LSTM, together with 10 hidden units in the "Dense" layer, did not provide accuracy as good as that of the CNN algorithm. The number of hidden units in the "Dense" layer was therefore increased to 20, and the number of frequent words was reduced to 5,000, following the solution in [29]. This change increased the accuracy from 0.5068 to 0.9906, as shown in Table V.
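The effect of capping the vocabulary at the most frequent words can be sketched as follows. The helper names `build_vocabulary` and `encode`, and the convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens, are assumptions for illustration and do not describe the tokenizer actually used in the experiments.

```python
from collections import Counter


def build_vocabulary(tokenized_tweets, max_words):
    """Map the `max_words` most frequent tokens to integer indices starting at 2."""
    counts = Counter(token for tweet in tokenized_tweets for token in tweet)
    # Assumed convention: 0 is reserved for padding, 1 for out-of-vocabulary tokens.
    return {
        word: index
        for index, (word, _) in enumerate(counts.most_common(max_words), start=2)
    }


def encode(tweet, vocabulary, oov_index=1):
    """Replace each token by its index, or by `oov_index` if it was cut off."""
    return [vocabulary.get(token, oov_index) for token in tweet]


# Hypothetical toy corpus; the paper's setting corresponds to max_words=5000.
tweets = [["good", "morning", "world"], ["good", "night"], ["good", "morning"]]
vocabulary = build_vocabulary(tweets, max_words=2)
print(vocabulary)
print(encode(["good", "night", "morning"], vocabulary))
```

A smaller cap maps rare words to the shared out-of-vocabulary index, which shrinks the embedding table the network has to learn from a limited number of tweets.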

B. Age's Code Problem and Solution
The Convolutional Neural Network and Long Short-Term Memory algorithms provided very good results for the gender model, owing to the balance of the gender tweets, as shown in Table VI. Table IV shows how we applied the "binary_crossentropy" loss function to the model; this loss is used when there are only two label classes (Male and Female), as mentioned in [30]. In addition, "sigmoid" was used as the activation function.
Several classification models were tried, as shown in Table IV and Table V. The best result was achieved by the Convolutional Neural Network (CNN), followed by the LSTM algorithm, in both age and gender detection. Based on our investigation, we believe that with a larger dataset the LSTM algorithm would be more powerful than the CNN algorithm, as mentioned in [35]. In Table VI, our approach is compared with the best gender detection accuracy achieved in the "PAN at CLEF 2019" competition, obtained by the team of Valencia et al. for English tweets. Our system obtains more accurate results than the others in predicting gender from author profiling.

V. CONCLUSION
In this research, DL and ML algorithms were used to solve the AP problem for English tweets taken from the PAN-AP-2019 dataset. To achieve the best result, multiple Deep Learning algorithms were implemented, namely DNN, CNN, and LSTM. In addition, Machine Learning algorithms were implemented, namely KNN, NB, DT, SVM, Neural Network, and RF. The results showed that each algorithm performs differently on age and gender detection, depending on its model architecture and strengths.
Moreover, our decision support system and model achieve better results than those of the participants in the CLEF 2019 competition (Table VI). The system is based on deep learning models using CNN and LSTM, and it is more accurate in predicting the age and gender of an author from his or her tweets. It would therefore be valuable to adopt it in bidding websites that need to know the age and gender of their users in order to propose suitable services or products, such as books or clothes.