Detecting Hate Speech on Twitter Network Using Ensemble Machine Learning
www.ijacsa.thesai.org

Abstract: Twitter is nowadays habitually exploited to propagate torrents of hate speech, including misogynistic and misandrist tweets, that are often written in slang. Machine learning methods have been explored in manifold studies to address the inherent challenges of hate speech detection in online spaces. Nevertheless, language has subtleties that can make it difficult for machines to adequately comprehend and disambiguate the semantics of words that are heavily dependent on the usage context. Deep learning methods have demonstrated promising results for automatic hate speech detection, but they require a significant volume of training data. Classical machine learning methods suffer from the innate problem of high variance, which in turn affects the performance of hate speech detection systems. This study presents a voting ensemble machine learning method that harnesses the strengths of logistic regression, decision trees, and support vector machines for the automatic detection of hate speech in tweets. The method was evaluated against ten widely used machine learning methods on two standard tweet data sets using well-known performance evaluation metrics, and it achieved an improved average F1-score of 94.2%.


I. INTRODUCTION
Twitter is a popular microblogging and social networking platform created to connect geographically dispersed people so that they can seamlessly collaborate, communicate, microblog, network, socialise and share information. It has recently also been used by business entities as a way of reaching out to a throng of clients and retaining them. However, despite its popularity and usefulness, there is a rapid rise in its usage for propagating hateful speech and aiding torrents of invectives against innocent people. The level of anonymity that social media platforms grant to accounts has made them havens for promoting hateful, discriminating, and vulgar speech. Considering that Twitter generates a high volume of tweets daily, hate speech propagation should be curbed to avoid people deactivating their accounts and quitting the platform. Human annotators are currently employed by Twitter and Facebook to delete nocuous tweets perceived to be hateful in order to curtail the spread of hate speech on social media platforms. In addition, the public is requested to report nocuous tweets to the service providers. However, these manual methods are laborious, sentimental, and susceptible to a subjective human judgement of what truly constitutes hate speech [1].
The repercussions of hateful tweets, the limitations of legislation, and the ineffectiveness of human annotators have created the necessity to apply machine learning methods for automatic hate speech detection. Classical and deep learning methods can be employed to automatically detect hate speech in text documents. The classical machine learning methods mostly use a vector-based representation of handcrafted features, which is time-consuming to craft and is typically incomplete [2]. Moreover, the vector space model often fails to effectively capture the semantic and syntactic representations of text. Deep learning methods generally allow for more accurate prediction through auto-generation of suitable feature representations. Recurrent neural networks (RNN) are deep learning methods that can preserve sequence information over time. Contextual information can be considered in the task of object classification using deep learning methods [3]. However, deep learning requires a large amount of data to obtain accurate results. Furthermore, the end-to-end mechanism through which deep learning methods make decisions may not be suitable for text processing in the discipline of natural language processing because of the lack of interpretability. This is particularly pertinent to hate speech detection, where a manual appeal process is needed for a system that censors the speech of a person [4].
Research studies in machine learning have evolved towards ensemble learning methods that combine multiple learning methods to improve the performance of a detection system. This allows for harnessing the strengths of multiple learning methods and the optimisation of classical machine learning methods in an object classification task. In general, ensemble learning methods can be classified into the four main categories of bagging, boosting, stacking, and voting [5]. The predictions from many decision trees are combined in a bagging ensemble learning method. Boosting involves adding classifiers sequentially to the ensemble, with each one correcting the errors of the prior classifiers. Since every classifier is obliged to fix the errors of its predecessors, boosting is sensitive to outliers, which is considered a disadvantage. Learning how to best combine the predictions from several inducers is achieved through a stacking meta-learning method. Like all meta-model ensemble methods, stacking is not feasible in many real-world situations because such ensembles can be expensive to train, deploy, and maintain [6]. A voting ensemble combines the predictions from multiple classifiers by adding up the votes for crisp class labels and predicting the class with the most votes. Different ensemble machine learning methods have been effectively applied to diverse application domains such as speech emotion recognition [8,9], product image classification [10], and lung cancer prediction [11]. However, it is more challenging to process highly unstructured text documents with the orthodox machine learning methods that are well developed for numerical data processing. Consequently, a voting ensemble machine learning method that combines logistic regression, support vector machines, and decision trees is proposed in this study for hate speech detection in tweets.
Logistic regression has shown positive results on binary text classification because of its ability to be easily tuned to accommodate new data. Support vector machines are widely used for many types of classification problems because of their ability to work in high dimensional spaces to address the overfitting logjam. Decision trees have shown promising results in dealing with highly unstructured data because they do not require data scaling.
In general, tweets are short messages, and their meanings are often rife with idioms, onomatopoeias, homophones, phonemes, and acronyms [12]. Hence, the work reported in this paper combines the strengths of logistic regression, support vector machines, and decision trees in a voting ensemble learning method for hate speech detection in tweets. It is envisaged that support vector machines will bring stability to the voting ensemble because they are not strongly influenced by outliers in a data set. The process of carefully choosing and configuring the parameters for an ensemble learning method is still an open research area. The parameter configuration in the proposed voting ensemble was therefore fine-tuned for optimal performance. This research study is aimed at enhancing the performance of hate speech detection systems using voting ensemble learning and at testing its performance against numerous baseline methods.
This paper is compactly structured as follows. In Section II, the related literature on hate speech detection is briefly reviewed. In Section III, the materials and methods of the study are discussed. In Section IV, the experimental results and discussion are explicated. The concluding statements are delineated in Section V of this paper.

II. RELATED LITERATURE
Hate speech detection is the automated task of determining whether a given piece of text contains hateful utterances. It is a difficult problem in the fields of natural language processing (NLP) and artificial intelligence (AI), for which both classical and deep learning methods have been experimented with. The classical machine learning methods depend heavily on a complex process of feature engineering in which features are rigorously extracted from an input text. Deep learning methods eliminate the need for feature engineering by automatically learning features from the input text [7]. There is ongoing research to increase the accuracy of text classification methods owing to the unstructured and complex nature of NLP problems. The review of related literature is organised under the themes of classical learning, deep learning, and ensemble learning as explicated in this section.

A. Classical Learning
The classical machine learning approach uses established vector-based models such as n-grams and bag of words for text representation, while support vector machine (SVM), decision tree (DT), and logistic regression (LR) are traditionally deployed for text classification. The SVM was originally designed for binary classification tasks [7], but its usage has long been extended to multiclass classification by breaking a given classification problem into several binary subproblems. The binary classification method divides the n-dimensional feature space into distinct regions that correspond to the two specified output classes [13]. The performance of the SVM is attributed to its ability to model nonlinear decision boundaries and its robustness against overfitting [14]. DT can achieve a good performance in several classification tasks while producing easily interpretable decisions. The knowledge learned by a DT during training is represented in a hierarchical structure that allows for easy comprehension and interpretation by non-experts. The LR method uses a sigmoid probability function whose output is limited to values between 0 and 1, which makes it well suited for binary classification problems. Davidson et al. [15] used a crowdsourced hate speech lexicon to collect and label tweets containing hate speech. They trained six classical learning methods to distinguish the three classes of hate speech contained in their data set. Their best result was an F1-score of 90.00%.

B. Deep Learning
Deep learning methods learn through a series of interconnected network layers wherein each layer receives input from a prior layer and provides input to a subsequent layer [2]. The raw data in a deep learning text classification task are vectorised to produce the desired input sequence [14]. The size of the input layer is defined by the number of inputs. The additional layers improve the learning capability to obtain a stable output. The output layer provides a result in the form of probabilities of the output classes and has the same number of neurons as there are output classes [16]. The long short-term memory (LSTM) network can model an ordered sequential input such as textual data [17]. The LSTM was specifically developed to address the vanishing gradient problem faced by the vanilla version of the recurrent neural network (RNN) [14] and it has been used in many classification tasks [1,16,18,19]. It has been shown to work well with text data, but it requires a large amount of data for training and validation [17]. The convolutional neural network (CNN) uses the pooling technique to reduce the outputs of network layers, but it is prone to high dimensionality in a text processing task. Mutanga et al. [20] explored the use of transformer methods to detect hate speech and obtained the best accuracy of 92.00% and F1-score of 75.00% using DistilBERT.

C. Ensemble Learning
It is promising to harness the strengths of different machine learning methods through the framework of ensemble learning for improving the performance of hate speech detection systems. Popular ensemble learning methods include bagging, boosting, and stacking. Bagging minimises variance by combining the verdicts from different decision trees [21, 22]. It has led to the development of many other decision tree-based ensemble learning methods. The idea behind the bagging ensemble is to create numerous subsets of data from the training sample, picked arbitrarily with replacement. Each of the subsets created is used to train its own decision tree, resulting in an ensemble of different models. However, the bagging approach does not necessarily lead to improved performance. It can result in a decline in performance, for example, when a model already has low variance. Empirical evidence has suggested that bagging can push an unstable method towards an optimal performance [23][24][25]. Conversely, it may lead to a decline in the performance of stable methods. In boosting, models are sequentially added to an ensemble, where each model rectifies the errors made by the prior model in the sequence [26,27]. However, one apparent drawback of boosting is that it is highly sensitive to outliers because each model is required to address the errors of its predecessor. Stacking ensembles are generally used to learn how to best combine predictions from multiple inducers. Stacking ensembles, like all meta-model ensemble learning methods, are not feasible in many real-world applications because they can be expensive to train, deploy and maintain.
There are relatively few studies conducted on hate speech detection using ensemble machine learning methods. In their work, MacAvaney et al. [4] evaluated the efficacy of support vector machines, bidirectional encoder representations from transformers, and an ensemble of neural networks for detecting hate speech. They trained their model on four hate speech data sets to achieve the best F1-score of 91.18% obtained using an ensemble of neural networks on a hate speech tweet data set. Ahluwalia et al. [19] used an ensemble learning method of LR, SVM, random forest (RF), and gradient boosting machine (GBM) to detect English hate speech against women. They trained their model on a data set of binary classes and a data set of multiple classes to achieve the best accuracy of 65.10% for binary classification and an F1-score of 40.60% for multiclass classification. The said works have employed ensemble learning methods for hate speech detection, but it should be noted that none of them has combined logistic regression, decision trees, and support vector machines in an ensemble architecture despite the efficacy shown by the algorithms when used solitarily [4,15,19]. The contribution of this paper is the development of a new robust voting ensemble method that harnesses the capabilities of LR, DT, and SVM [14,15] to address overfitting, accommodate new data, and allow for interpretability of a hate speech detection system.
The review of the related literature has generally indicated that relatively few studies have focused on using ensemble learning for hate speech detection in online spaces. Most of these few studies have reported performance results that require further improvement. The current method based on voting ensemble learning gave the state-of-the-art results of 94.20% accuracy and an F1-score of 94.21% surpassing the results of earlier studies that used the same data set. The results have reflected an improvement over the F1-score of 90.00% reported in [15] and the highest benchmarked accuracy result of 92.00% reported in [20].

III. MATERIALS AND METHODS
The materials and methods used in this study are lucidly presented in this section based on experimental data sets with baseline methods, and essential steps of the proposed voting ensemble method.

A. Experimental Materials
The publicly available data sets of hate speech offensive (HSO) language and Kaggle were used for experimentation in this study. The HSO data set comprises 11310 tweets that were labelled as 'Hate' or 'Neutral', as made available on the GitHub repository [15]. The Kaggle data set is made up of 8778 neutral tweets and 1155 hate tweets. The Kaggle data set was grossly imbalanced, and it was also important to measure the performance of machine learning methods on a smaller data set. Consequently, the data set was reduced programmatically to 2300 tweets. The balanced version of the data set consisted of 1150 hate tweets and 1150 neutral tweets that were used for experimentation in this study.
The baseline experimental methods and the proposed voting ensemble method were all implemented using the Python programming language. The Keras library was used to implement the CNN and LSTM deep learning methods, while the scikit-learn Python library was used to implement the baseline classical and ensemble learning methods. Specifically, the sklearn.tree, sklearn.linear_model, and sklearn.svm submodules were used to implement DT, LR, and SVM respectively. All the baseline ensemble learning methods were implemented using the sklearn.ensemble submodule. The experiments were conducted on a computer running the Windows 10 operating system with a configuration of Intel (R) Core (TM) i5-8250U CPU @ 1.60GHz (8 CPUs), 1.8GHz, 8 GB RAM, and a 500 Gigabyte hard disk drive.
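The programmatic reduction of the imbalanced Kaggle data set can be sketched as random undersampling. This is a minimal illustration, assuming that tweets were drawn uniformly at random within each class; the paper does not state the sampling procedure, and the function name `undersample` and the seed value are hypothetical.

```python
import random
from collections import defaultdict


def undersample(tweets, labels, per_class, seed=42):
    """Return a shuffled list of (tweet, label) pairs with `per_class` items per label.

    Assumption: uniform random sampling without replacement within each class.
    """
    by_label = defaultdict(list)
    for tweet, label in zip(tweets, labels):
        by_label[label].append(tweet)

    rng = random.Random(seed)  # fixed seed so the reduced data set is reproducible
    balanced = []
    for label, items in by_label.items():
        if len(items) < per_class:
            raise ValueError(f"class {label!r} has only {len(items)} tweets")
        balanced.extend((tweet, label) for tweet in rng.sample(items, per_class))

    rng.shuffle(balanced)  # avoid blocks of identical labels
    return balanced
```

Called with `per_class=1150` on 8778 neutral and 1155 hate tweets, the sketch yields the 2300-tweet balanced set described above.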

B. Proposed Method
The proposed voting ensemble method comprises the phases of pre-processing, feature representation, and feature classification. The essential steps of the pre-processing include the removal of special characters and punctuation, normalisation of hashtags, lowercasing of the characters of the input text, removal of short words, and text tokenisation. The feature representations were based on the widely used bag of words and word embedding. They were applied after pre-processing to convert the raw tweet data into a form amenable to machine learning. The bag of words representation converts a text document into a fixed-length vector of occurrences of words in the input text and it was used to implement the classical learning methods. The regularity presented by specific keywords has provided a solid foundation for a bag of words representation to focus on specific words in a data set [28]. Since hate speech is generally expressed through largely homogenous words, it is envisaged that a bag of words representation can effectively capture and represent the vocabulary of known hate words such as black, white, Indian, Jews, foreigners, strangers, enemies, and so on.
Word embedding is a more promising text vocabulary representation that is used by deep learning methods to encode the meanings of words into real-valued vectors such that highly similar words are closer in the vector space. It is a foundation for sentence embedding that presents a huge advantage over the bag of words vector model. It can capture word context as well as syntactic and semantic relationships between words in a text document. Moreover, it eliminates the sparse representation problem often associated with the bag of words representation. The word embedding approach follows the distributional hypothesis, where semantically similar words are found in the same context [2].
The word embedding layer for text classification is usually the first data processing layer of a deep learning model and word embedding methods have been demonstrated to perform well in different NLP tasks [29][30][31]. In this study, word embedding was implemented using the Keras embedding layer of deep learning because of its ability to capture contextual words and syntactic similarities to enhance the interpretation of tweet meanings.
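The pre-processing steps and the bag of words representation described in this subsection can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, hashtag normalisation is assumed to mean stripping the `#` sign, and "short words" are assumed to be tokens with fewer than three characters.

```python
import re
from collections import Counter


def preprocess(tweet, min_len=3):
    """Lowercase, normalise hashtags, strip punctuation, drop short words, tokenise."""
    text = tweet.lower()
    text = re.sub(r"#(\w+)", r"\1", text)     # assumption: "#topic" becomes "topic"
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters and punctuation
    return [tok for tok in text.split() if len(tok) >= min_len]


def build_vocabulary(token_lists):
    """Map every distinct token to a fixed column index (alphabetical order)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def bag_of_words(tokens, vocab):
    """Fixed-length vector of word occurrence counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]  # dict order follows the sorted vocabulary
```

Every tweet is mapped to a vector of the same length as the vocabulary, which is what makes the representation directly usable by LR, DT, and SVM.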
The basic idea behind the proposed method of this study lies in the selection of an optimal bias-variance trade-off. The presence of high variance can lead to the problem of overfitting, while high bias may result in underfitting. Due to the nature of tweets, variance is likely to occur, particularly in fora that focus on a specific type of hate speech. Islamophobic users, for instance, may express hate speech in largely similar terms that are difficult to detect using a learning method. The proposed voting ensemble method aggregates the decisions from three classical inducers, namely LR, DT, and SVM, to obtain accurate classification decisions. Fig. 1 shows the architecture to illustrate the steps of the proposed voting ensemble learning method. The base inducer of DT is used when the dependent variable is qualitative, as in the case of a text classification task. DT is highly interpretable, fast to train, and works well with decision boundaries [14]. The inclusion of the DT method in the proposed ensemble is based on its appropriateness when dealing with categorical data such as distinguishing hate tweets from innocuous tweets. Earlier studies have investigated the use of DT methods in hate speech detection tasks and recorded satisfactory performance [32]. The important DT parameters tuned with the grid search cross-validation technique are max_depth and random_state. The max_depth parameter determines the depth of a tree. The deeper the tree, the more splits it has, and the more information it captures about the data. In our experiments, the optimal max_depth value found by the search was 10. The depth parameter is also used as a regularisation scheme to prevent overfitting. This step is crucial in our study because tweets are generally regarded as noisy and highly dimensional. The random_state parameter that controls the random choices for the training sample was set at 42.
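The grid search over max_depth for the DT inducer can be sketched with scikit-learn as follows. This is a hedged sketch: the candidate depths, the five-fold setting, the F1 scoring choice, and the synthetic data are all assumptions made for illustration, since the paper reports only the selected depth of 10 and a random_state of 42.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for vectorised tweets (assumption); real inputs would be
# bag of words vectors with binary hate/neutral labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical candidate depths; the paper does not list the searched grid.
param_grid = {"max_depth": [2, 5, 10, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),  # random_state value taken from the text
    param_grid,
    cv=5,          # assumption: 5-fold cross-validation
    scoring="f1",  # assumption: F1 used as the selection criterion
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print("selected max_depth:", best_depth)
```

Limiting the grid to shallow and moderately deep trees reflects the role of max_depth as a regulariser: deeper trees fit the training folds more closely but tend to score worse on the held-out folds when the data are noisy.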
The LR inducer attempts to find a probability-based relationship between the independent variables and the class label in each data set. It aims to create a probability function that uses features as inputs and returns the probability of an instance belonging to a given class [33,34]. The LR does not require scaling of input features and it requires comparatively fewer computational resources [14]. L2 regularisation, which adds the squared magnitude of the coefficients as a penalty to the loss function, was used for optimisation. The 'fit_intercept' parameter was set to 'True' to incorporate the intercept value into the LR method. The 'solver' parameter that defines the method to be used in the optimisation problem was set to 'sag', which is compatible with the L2 penalty.
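The probability function and the L2-penalised loss described for the LR inducer can be written out as a small sketch. The formulation below is one common way of stating the objective, with a weight `C` on the data term; the function names and the default value of `C` are assumptions, and the sketch evaluates the loss only (it does not implement the 'sag' solver).

```python
import math


def sigmoid(z):
    """Numerically stable logistic function with output in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, intercept, features):
    """Probability that an instance belongs to the positive (hate) class."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def l2_penalised_log_loss(weights, intercept, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of log losses; the intercept is not penalised."""
    eps = 1e-12  # guards against log(0)
    data_loss = 0.0
    for features, label in zip(X, y):
        p = min(max(predict_proba(weights, intercept, features), eps), 1.0 - eps)
        data_loss -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    penalty = 0.5 * sum(w * w for w in weights)  # squared magnitude of the coefficients
    return penalty + C * data_loss
```

With all weights at zero the model outputs a probability of 0.5 for every tweet, so the loss for a single example reduces to log 2; training moves the weights away from zero only when the reduction in log loss outweighs the penalty.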
The optimisation of the SVM inducer employed the grid search cross-validation scheme to find the best parameters for model fitting. The optimal value for parameter C, which defines the tolerance threshold for misclassification, was set at 0.1. Moreover, the linear kernel, which works on the assumption that the input data is linearly separable, was applied. The 'auto_deprecated' gamma setting, which is the recommended default value, was used in conjunction with the linear kernel. The optimal value of the degree parameter was set at 3. The learning rate for the proposed ensemble learner was specified in the Python program before training. The low learning rate specified in Table I was used for preventing the ensemble model from converging to an undesirable optimum [35]. The tolerance setting is a stopping criterion that ends the iteration process once the specified value is reached and it affects the training time of a model [36]. The parameters for the inducers and voting ensemble (VE) method are succinctly summarised in Table I. The results computed by the voting ensemble learning method can be based on either hard or soft voting. Soft voting considers the probability score that each inducer assigns to each class for the current sample [34]. The soft voting criterion then determines the class with the highest probability by averaging the individual values of the inducers [37]. Hard voting involves summing the votes for crisp class labels from the individual inducers and predicting the class with the most votes. The class label Y can be decided by the majority voting of each classifier C as in the following example.
Formally, Y = mode{c1(x), c2(x), c3(x)}, where x is the input tweet. If the predictions from c1, c2, and c3 are 'hate', 'neutral', and 'hate', respectively, the final prediction will be 'hate' according to the principle of majority voting. Consequently, hard voting operates on distinct class labels, while soft voting operates on continuous class probabilities. This study was based on tweets labelled under distinct categories. Hence, hard voting is more desirable for this study than soft voting.
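The hard voting ensemble of LR, DT, and SVM can be assembled with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy tweets use a placeholder group name, `max_iter` and the LR `random_state` are added only to make the toy run stable, and parameters listed in Table I but not quoted in the text (learning rate, tolerance) are left at library defaults.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy placeholder tweets (assumption): 1 = hate, 0 = neutral.
tweets = [
    "i hate group_x they are disgusting",
    "group_x people are worthless trash",
    "get rid of all group_x now",
    "group_x are disgusting and worthless",
    "lovely weather for a walk today",
    "watching the football match tonight",
    "just finished reading a great book",
    "coffee with friends this morning",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ensemble = VotingClassifier(
    estimators=[
        # L2 is the library's default penalty; 'sag' and fit_intercept follow the text.
        ("lr", LogisticRegression(solver="sag", fit_intercept=True,
                                  max_iter=1000, random_state=42)),
        ("dt", DecisionTreeClassifier(max_depth=10, random_state=42)),
        ("svm", SVC(kernel="linear", C=0.1)),
    ],
    voting="hard",  # majority vote over crisp class labels
)

# Bag of words features feed all three inducers.
model = make_pipeline(CountVectorizer(), ensemble)
model.fit(tweets, labels)

predictions = model.predict(["group_x are trash", "a walk with friends today"])
print(predictions)
```

Switching to `voting="soft"` would additionally require probability estimates from every inducer, which for `SVC` means constructing it with `probability=True`; hard voting avoids that extra calibration step.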

IV. RESULTS AND DISCUSSION
This section presents a discussion of the comparative results of the proposed voting ensemble learning method against ten widely used machine learning methods. The baseline methods are AdaBoost, AdaBoost-DT, Bagging, Bagging-SVM, CNN, DT, LR, LSTM, RF, and SVM. The experimental data sets of Kaggle and HSO were each split into training and testing data in the ratios of 80:20 and 70:30. Although the proposed voting ensemble learning method comprises LR, SVM, and DT inducers, each inducer was also implemented separately to establish a comparison with the proposed voting ensemble learning method. In addition, other widely used machine learning methods were evaluated against the proposed method. The performances of the learning methods were analysed and discussed in terms of the four functional metrics of accuracy, precision, recall, and F1-score. In addition, the performances of the learning methods were evaluated and discussed in terms of the non-functional metrics of kappa, Hamming loss, Jaccard similarity, and execution time.
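The 80:20 and 70:30 partitions can be illustrated with a small index-based split. This is a sketch under the assumption of a simple shuffled split; the paper does not state whether the split was stratified or which seed was used, and the function name is hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_fraction, seed=42):
    """Shuffle the indices 0..n_samples-1 and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# The balanced Kaggle set of 2300 tweets under both ratios used in the study.
train_80, test_20 = train_test_split_indices(2300, 0.20)
train_70, test_30 = train_test_split_indices(2300, 0.30)
print(len(train_80), len(test_20), len(train_70), len(test_30))
```

Reusing the same seed for both ratios keeps the comparison between splits from being confounded by a different shuffle.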

A. Accuracy Results
This section presents the analysis of the accuracy of the experimental results of the proposed voting ensemble method along with several baseline methods. The accuracy scores calculated for the two experimental data sets are listed in Table II. It can be observed that the accuracy scores computed by the proposed voting ensemble learning method are consistently higher than the scores computed by other learning methods across the two experimental data sets. The proposed ensemble learning method recorded the highest average accuracy score of 94.212% across both data sets, irrespective of the train test split. Expectedly, all methods performed better with the larger HSO data set than with the smaller Kaggle data set, with the proposed voting ensemble learning method giving the highest accuracy score of 96.739% on the bigger data set using the 80:20 train test split. This trend is attributable to the fact that bigger data sets allow methods to learn data patterns more comprehensively during training, thereby improving overall performance, particularly in the case of deep learning methods, which generally require large data sets to perform well. Fig. 2 shows the plot of average accuracy scores computed by the learning methods across the experimental data sets to visually illustrate the extent to which one learning method gives better accuracy than another. This result implies that the voting ensemble method classifies more cases correctly than any other method, while AdaBoost performed worst in this case. The voting ensemble method is therefore the most useful when all classes are equally important, while AdaBoost is not useful in this scenario.

B. Precision Results
This section presents the precision scores computed by the proposed voting ensemble learning method against the baseline learning methods explored in this study. It can be observed in Table III that the voting ensemble learning method outperformed other learning methods across the two experimental data sets. The proposed voting ensemble learning method recorded the highest average precision score of 93.779%. The LSTM performed relatively well, scoring the second-highest precision score of 93.457%. The exceptional performance of the LSTM may be linked to its ability to capture long-term dependencies. This property makes it suitable for text classification tasks such as hate speech detection, where the semantics of a tweet can be derived from the arrangement of words in the tweeted document. The lowest average precision score of 89.743% was recorded by the AdaBoost method with the default parameter setting. The combination of AdaBoost with another classifier in ensemble learning improves the system performance. The AdaBoost method with DT as base learner outperformed the default AdaBoost because it gave an average precision of 91.006%, which is higher than the 89.743% recorded by the default AdaBoost method. Most methods performed better with the 80:20 train test split than with the 70:30 split. Only RF and DT performed better with the 70:30 split. The precision computed by DT and RF fell when the training data set was increased by 10% on the larger HSO data set. The drop in performance may be the result of overfitting because both DT and RF are susceptible to overfitting [17]. Table III shows the precision scores of all the learning methods experimentally compared in this study. Fig. 3 shows the plot of the average precision scores computed by the learning methods across the experimental data sets to visually illustrate the extent to which one method gives better precision than another.
This result implies that the voting ensemble method can correctly detect hate speeches from the predicted class of hate speeches better than any other method, while AdaBoost performed worst in this case. The voting ensemble method is therefore the most useful when the cost of false positives is high while AdaBoost is not useful in this scenario.

C. Recall Results
This section presents an evaluation of the learning methods investigated in this study based on the recall metric. Results from Table IV show that the proposed voting ensemble learning method gave an average recall value of 94.210%, which is superior to that of the baseline learning methods used in this study. In addition, it can be noted that the default meta classifiers of Bagging and AdaBoost were outperformed by their ensemble variants, which used different learning methods as base learners. The recall score for the default AdaBoost is 89.408%, while the recall score for the AdaBoost-DT is 90.959%. In addition, the recall score for the bagging method is 90.552%, while bagging-SVM had a recall score of 92.937%. These results indicate that combining meta classifiers with different learning methods can lead to improved performance, as shown in Table IV. Fig. 4 shows the plot of average recall scores computed by the learning methods across the experimental data sets to visually illustrate the extent to which one learning method gives a better recall than another. This result implies that the voting ensemble method can correctly detect cases of hate speech from all the actual cases of hate speech better than any other learning method, while AdaBoost performed worst in this case. The voting ensemble method is therefore the most useful when the cost of false negatives is high, while AdaBoost is not useful in this scenario.

D. F1-score Results
This section presents the results of the overall F1-score for the learning methods explored in this study. Table V shows that the proposed voting ensemble method consistently outperformed the baseline learning methods investigated by recording the highest average F1-score of 94.208%. The solitary bagging ensemble learning method recorded an average F1-score of 90.564%, while the bagging-SVM ensemble method recorded an average F1-score of 92.948%. Furthermore, the mean score of 89.897% of the average F1scores for both AdaBoost and Decision tree learning methods is inferior to the average F1-score of 90.692% for AdaBoost-DT ensemble learning method. The analysis of the F1-score for LSTM and CNN deep learning methods has shown that LSTM consistently outperforms CNN. It can be perceived in Table V that LSTM recorded an average F1-score of 93.035%, while CNN recorded an average F1-score of 91.439%. This difference in performance may come from the capability of LSTM to capture long-term dependencies that are necessary when extracting word meanings in a text. The superior performance of the ensemble learning methods as compared to any solitary methods, including deep learning has illustrated that agglutinating multiple learning methods through ensemble learning is highly promising for reducing the error rate of the final learner in a hate speech detection system.     5 shows the plot of average F1-scores computed by the learning methods across the experimental data sets to visually illustrate the extent to which one method gives a better F1score than another. This result implies that the proposed voting ensemble method can better detect incorrectly classified cases better than any other learning method, while AdaBoost performed worst in this case. The voting ensemble is therefore the most useful when the classes are imbalanced while AdaBoost is not useful in this scenario.
Results from the functional metrics used in this study indicate that the voting ensemble outperformed the benchmark algorithms used in the study. It is worth noting that the individual performance of the meta classifiers was inferior to that of the proposed voting ensemble model. This superior performance of the proposed model may be attributed to the minimal overfitting, model extensibility, and interpretability contributed by each of the base learners [14,15]. These results are consistent with the position in the literature that ensemble learning outperforms individual classifier algorithms in hate speech detection [38,39].

E. Non-functional Results
This section presents the evaluation of all the learning methods investigated using the non-functional metrics of Hamming loss, Jaccard, Kappa, and execution time. The proposed voting ensemble method recorded the best Hamming loss, Jaccard, and Kappa scores, as shown in Table VI. This result shows that the proposed ensemble learning method can maximise predictive capability while concomitantly minimising misclassification errors better than any of the baseline learning methods investigated. However, the proposed voting ensemble learning method recorded the second-highest training time of 0.095 hours, which represents a trade-off between efficiency and accuracy. The long execution time of the proposed ensemble method arises because each inducer was trained separately before the final aggregated decision was reached through the principle of majority voting.
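The majority voting principle mentioned above can be sketched as follows. This is an illustrative example only, not the implementation used in the study: the function name `majority_vote` and the prediction lists, which stand in for the outputs of three separately trained base learners, are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Return, for each sample, the label predicted by most base learners.

    `predictions_per_model` is a list of equal-length label lists, one per model.
    With an odd number of models and binary labels, ties cannot occur.
    """
    voted = []
    for votes in zip(*predictions_per_model):
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions from three base learners (1 = hate speech, 0 = not).
lr_pred = [1, 0, 1, 1, 0]
dt_pred = [1, 1, 0, 1, 0]
svm_pred = [0, 0, 1, 1, 1]

print(majority_vote([lr_pred, dt_pred, svm_pred]))  # [1, 0, 1, 1, 0]
```

Every base learner must be trained and queried before the vote can be taken, so the cost of the ensemble is roughly the sum of the costs of its members. This is consistent with the longer training time reported for the voting ensemble.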
It can be observed from Table VI that SVM recorded the lowest Kappa score, indicating a low level of agreement between its predictions and the annotated labels beyond what chance alone would produce. This may be the result of the minimal parameter tuning applied to the SVM method. SVM also recorded the worst Hamming loss of 10%, which suggests a poor selection of parameters for the method. It is interesting to observe that ensemble learning methods such as bagging-SVM and the voting ensemble took more time to train than the deep learning methods. This implies that, although they perform better, ensemble learning methods are computationally expensive. However, the benefits of improved performance can outweigh the need for increased resources in critical applications like hate speech detection.
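For reference, the three score-based metrics in this subsection can be computed as in the minimal sketch below. The function names and toy labels are hypothetical, single-label classification is assumed, and the Jaccard score is computed for one positive class only. The values in Table VI are not reproduced.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def jaccard_score(y_true, y_pred, positive=1):
    """Jaccard = TP / (TP + FP + FN) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    union = tp + fp + fn
    return tp / union if union else 0.0


def cohen_kappa(y_true, y_pred):
    """Kappa = (p_o - p_e) / (1 - p_e): agreement corrected for chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if p_e == 1.0:
        # Degenerate case with a single label everywhere; treated here as
        # perfect agreement, which is a convention assumed for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: 1 = hate speech, 0 = not hate speech.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print(round(hamming_loss(y_true, y_pred), 2))   # 0.3
print(round(jaccard_score(y_true, y_pred), 2))  # 0.5
print(round(cohen_kappa(y_true, y_pred), 2))    # 0.4
```

A lower Hamming loss is better, whereas higher Jaccard and Kappa scores are better. In this toy example the observed agreement is 0.7 but the agreement expected by chance is 0.5, so Kappa is only 0.4.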

V. CONCLUSION
The primary contribution of this study is the construction and validation of a voting ensemble learning method to improve the automatic detection of hate speech in tweets. This remains a challenging open research problem because the anaphora, synonymy, and polysemy in the slang of tweets make the interpretation of hate speech ambiguous, difficult, and controversial. The voting ensemble learning method has been demonstrated in this study to yield the best performance when compared to the other learning methods investigated.
However, one apparent limitation of this study is the bag of words representation of features, which suffers from the anaphora, synonymy, and polysemy of words. In addition, a bag of words representation cannot capture important information about the interdependencies that exist among words. Moreover, word embedding representation can fall short in enabling machines to draw adequate inferences from certain classes of sentences. In the future, a knowledge-based method with sentence embedding will be introduced for tweet hate speech detection, and its results will be compared against those of the existing word embedding learning methods. This envisioned method is intended to circumvent the intrinsic limitations of the bag of words and word embedding representations. It is expected to increase the confidence of social media prosecutors in determining whether a given tweet constitutes hate speech.