Multimodal Age-Group Recognition for Opinion Video Logs using Ensemble of Neural Networks

With the widespread use of smartphones and social media platforms, video logging is gaining increasing popularity, especially after the advent of YouTube in 2005, with hundreds of millions of views per day. It has attracted the interest of many people with a wide range of emerging applications, e.g. filmmakers, journalists, product advertisers, entrepreneurs, educators and many others. Nowadays, people express and share their opinions online on various daily issues using different forms of content including text, audio, images and videos. This study presents a multimodal approach for recognizing the speaker’s age group from social media videos. Several structures of Artificial Neural Networks (ANNs) are presented and evaluated using standalone modalities. Moreover, a two-stage ensemble network is proposed to combine multiple modalities. In addition, a corpus of videos has been collected and prepared for multimodal age-group recognition with a focus on Arabic language speakers. The experimental results demonstrate that combining different modalities can mitigate the limitations of unimodal recognition systems and lead to significant improvements in the results.

Keywords: Multimodal recognition; opinion mining; age groups; word embedding; acoustic features; visual features; information fusion; ensemble learning; Arabic speakers


I. INTRODUCTION
Due to the increasing adoption of mobile and web technologies, people tend to share their opinions and interact online on various aspects of their lives through a variety of social media platforms and websites, e.g. reviewing products, rating movies, or evaluating services [1]-[3]. Examples of major social media platforms and blogging websites include Twitter, Facebook, Google+, Instagram, Pinterest and LinkedIn. Over the past years, there has been a growing interest in social media analysis, ranging from simple statistics dashboards, to more advanced sentiment analysis and topic trending, to more sophisticated recommendation systems. The aim is to transform the available raw data into insightful information about relations and content to support decision making and guide strategic planning. This plays an important role in business intelligence, where it is used to set up plans and strategies that leverage marketing campaigns and enhance customer satisfaction.
Most of the state-of-the-art techniques for sentiment analysis have focused on textual analysis of people's comments or feedback. Sentiment analysis is concerned with analyzing, evaluating and understanding opinions, attitudes and appraisals towards different entities, aspects or features [4]. These techniques use natural language processing, text mining, and computational linguistics to identify subjective information, determine opinion polarities (e.g. positive or negative) or affective states (e.g. happiness, sadness, fear, anger, surprise or disgust) for a given text, recognize sentiments on different aspects of a product, etc. A large body of work has been carried out in this regard. For a rigorous survey on sentiment analysis, we refer the interested reader to [5], which reviews over one hundred articles published from 2002 to 2015 and organizes them based on tasks, approaches and applications. As Twitter has been one of the most prevalent microblogging platforms, several studies have focused on sentiment analysis of tweets. In [6], the authors presented a recent survey of algorithms and sentiment-related tasks for Twitter such as tracking sentiments over time, irony detection, emotion detection, and tweet sentiment quantification.
Due to the limitations and challenges facing text-based sentiment analysis, researchers have more recently been attracted to other sources of information that are becoming more popular in social media, such as voice, images and videos. For instance, there has been great interest in information fusion for affective computing utilizing more than one modality or information channel [7]. Several factors can negatively affect unimodal analysis and recognition systems, including noisy sensor data, non-universality, and lack of individuality. Each modality has its own challenges. For example, recognition systems based on voice might be affected by attributes such as low voice quality, background noise and the disposition of voice-recording devices. Text-based recognition systems also suffer from several issues related to morphological analysis, multiple dialects, ambiguity, temporal dependency, domain dependency, etc. This is also true regarding recognition systems based on the visual modality, which can suffer from illumination conditions, posture, cosmetics, resolution, etc. In consequence, this leads to inaccurate and insufficient representation of patterns. Incorporating different modalities for an entity can overcome such issues because each source of information can complement the others. This might result in more accurate and robust recognition systems. It provides multiple pieces of evidence for the same identity, which can significantly improve the performance compared to unimodal systems.
Several approaches, resources and techniques have been provided, designed and conducted to address sentiment analysis. Current sentiment and opinion mining approaches evaluate people's opinions at different analysis levels, including document, sentence and aspect/feature, using different approaches: lexicon-based, machine learning-based and hybrid approaches. However, they do not take into account the impact of users' age on the analyzed opinions. Multimodal age recognition systems can be beneficial in such applications to tune the analysis and decisions towards particular age groups in order to meet their needs. For example, some products are specific to young people, and reviews of these products by older people may be biased and may provide incorrect indicators for decision makers. Governments can also benefit from these systems to explore political decisions or services related to their citizens according to their age groups. Adaptive educational systems will be smarter when they consider the age of the learner alongside the emotion.
Detecting users' age groups through emotional modalities makes the problem more interesting and significant, especially with the current revolution in social media platforms. Several social media platforms support opinion videos, such as YouTube, Vimeo, Twitter, Facebook, Instagram, Flickr, etc. Thus, it is highly important to exploit such data for mining significant information and insights. User profiling, such as recognizing the age group from emotional modalities, is a challenging task because it relies on several attributes that are hard to model, such as feeling, thought, behavior, mood and temperament [8]. Research on user profiling and detection for the Arabic language is even scarcer [9], [10]. This is another motivation for this study to build a dataset for multimodal age-group identification from Arabic opinion videos and to present a multimodal age-group identification system. To our knowledge, this is the first study to present a multimodal age-group identification approach specific to opinion videos. Additionally, it is the first study to present a multimodal approach for Arabic videos in this regard. It evaluates the capability of audio, textual and visual features individually to detect the age group for the same entity. Then, it presents an ensemble neural network method to fuse different modalities in order to improve on the performance of the individual modalities. Several experiments are conducted to evaluate the proposed approach.
The rest of the paper is organized as follows. The most related work is briefly reviewed in Section II. Section III describes the methodology. Section IV presents the experimental work and results. Finally, Section V concludes the paper.

II. RELATED WORK
Age identification is considered a user profiling task and has received growing attention in social media and human-computer interaction systems with the rising need for personalized, reliable, and secure systems. In the literature, this problem is addressed in a variety of ways. Some research work considered it as a classification problem to predict the age group of a given user, e.g. [11], whereas others addressed it as a regression problem to predict the age in years, e.g. [12], [13]. Most existing methods have mainly focused on single modalities or single mediums including texts [14], [15], images [16]-[19], voice/speech [11], [20], and meta-data of users on Twitter [21].
Safavi et al. [11] presented a method to detect the age group from children's speech using the OGI Kids dataset. They applied a Gaussian Mixture Model-Universal Background Model (GMM-UBM) and a Gaussian Mixture-Support Vector Machine (GMM-SVM) with i-vector systems. Regions of the spectrum containing important age information for children are identified by conducting Age-ID experiments over 21 frequency subbands. The main finding was that the GMM-UBM and i-vector system performed significantly better than the GMM-SVM system, with an accuracy of 85.77%. An approach for age estimation from telephone speech patterns based on i-vectors was also presented in [13]. Each utterance was represented by an i-vector, and Support Vector Regression (SVR) was applied to estimate the age of speakers. Bocklet et al. [20] presented a method to detect a person's age and gender from his or her voice. As acoustic features, they applied Mel Frequency Cepstrum Coefficients (MFCCs), Perceptual Linear Prediction (PLPs) and Temporal Patterns (TRAPS)-based features. Different models were generated, combined using feature-level and score-level fusion, and evaluated using GMM. They reported that combining different acoustic models improved the results, with minor differences between feature-level and score-level fusion.
Multimodal recognition systems are still at an early stage and have started to be applied to different tasks such as gender detection [22] and sentiment analysis [7], [23]. Our work differs from the literature in several aspects. First, it recognizes age groups from three modalities for the same user and compares the effectiveness of these modalities with each other. It explores different features for representing modalities, such as word embedding-based features for the textual modality, dense optical flows for the visual modality and a combination of several types of acoustic features. It builds a corpus of opinion videos of Arabic speakers. Moreover, it explores a novel ensemble neural network approach to combine different modalities.

III. METHODOLOGY
In this study, the age-group recognition task is addressed as a classification problem. This can be useful in applications such as targeted marketing, which is directed to certain age groups rather than specific ages. For example, companies can tune their products to meet the needs of a specific age group of people. Fig. 1 depicts the general framework of the proposed multimodal age-group recognition system. Some preprocessing operations are conducted to come up with three modalities for each video: audio, text, and visual. Each audio input is in WAV format with 256 bits, 48000 Hz sampling frequency, and a mono channel. This is followed by the transcription task to generate texts corresponding to each video. Each video input is then resized to 240 × 320 after detecting faces. A feature extractor is constructed for each input source. The acoustic feature extractor constructs feature vectors of 68 features for each input. Moreover, a textual feature extractor is implemented to extract textual features with a dimensionality of 300 features for each instance. The visual feature extractor generates 800 features for each input.
A fusion method based on an ensemble neural network is proposed to combine the different modalities. It is based on two levels; the first level is trained using the training dataset and gives a score for each age group from each modality (visual, text and audio). The resulting scores from the first level are combined using a meta-learner in the second level to produce the final scores. The predicted age group is the one corresponding to the maximum final score.

Fig. 1. Multimodal age identification system from opinion videos.

A. Multimodal Age-groups Recognition Dataset
A video corpus is collected from YouTube. It is composed of 63 opinion videos expressed by both females and males in different domains including reviews of products, movies, cultural views, etc. The collected videos were recorded by users under various settings in real environments including houses, studios, offices, cars or outdoors. Users express their opinions in different periods. The videos are segmented into 524 utterances. The age groups are specified as four classes, as described in Table I. The instances are manually labeled with the considered age groups carefully and systematically. First, for the well-known speakers, we looked up their birth dates in their profiles and computed their age as the difference between the recording date of the video and the birth date. For the remaining speakers, whose birth dates could not be found, three human annotators were involved to assign their age-group labels, using majority votes to break ties.

B. Feature Extraction
1) Acoustic features extraction:
The input audio is split into frames with a size of 50 milliseconds and a frame step of 20 milliseconds. For each generated frame, a set of 34 features is computed: (1) ZCR (Zero Crossing Rate), (2) Energy, (3) Entropy of Energy, (4) Spectral Centroid, (5) Spectral Spread, (6) Spectral Entropy, (7) Spectral Flux, (8) Spectral Rolloff, (9-21) MFCCs, (22-33) Chroma Vector, and (34) Chroma Deviation. Then statistics are computed over the frames of each audio segment to represent the whole audio using one descriptor; in our study we used the mean and standard deviation. Thus, each input audio is represented by 34 × 2 = 68 features. This process of audio feature extraction is illustrated in Fig. 2.
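The framing and aggregation steps can be sketched as follows. This is a minimal illustration, not the authors' extractor: it computes only two of the 34 frame-level features (ZCR and energy), the function names are hypothetical, the test signal is synthetic, and the use of the population standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev

def frame_signal(samples, sample_rate, frame_ms=50, step_ms=20):
    """Split a mono signal into overlapping frames (50 ms window, 20 ms step)."""
    size = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    return [samples[s:s + size] for s in range(0, len(samples) - size + 1, step)]

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def describe(samples, sample_rate, extractors):
    """Return [means..., stds...] of each frame-level feature over all frames."""
    frames = frame_signal(samples, sample_rate)
    per_frame = [[f(frame) for f in extractors] for frame in frames]
    columns = list(zip(*per_frame))
    # Assumption: population standard deviation; the paper does not say which.
    return [mean(c) for c in columns] + [pstdev(c) for c in columns]

# One second of a 440 Hz tone sampled at 48 kHz (synthetic test signal)
rate = 48000
signal = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
descriptor = describe(signal, rate, [zero_crossing_rate, energy])
print(len(descriptor))  # 2 features x 2 statistics = 4
```

With all 34 frame-level extractors passed to `describe`, the same aggregation would yield the 34 × 2 = 68 values described above.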
2) Visual feature extraction: Two main steps are involved: face detection and visual feature extraction. The general frontal face and eye detectors [24] are utilized to detect the face of the speaker and segment it from the rest of the given frame based on Haar features [25], where increasingly complex classifiers are combined in a cascade to detect the face. In addition, an eye detector detects the eye positions, which provide useful values to crop and scale the frontal face to a size of 240 × 320 pixels in our case.
Then, optical flow is used to extract the visual features from the videos processed in the previous step. Optical flows are first computed for each frame in a video and then used to compute histograms. They measure the motion relative to an observer between two frames at each point. At each point in the scene, the magnitude and direction values are obtained, which describe the vector representing the motion between the two frames. This leads to NoF × W × H × 2 dimensions to describe each video, where NoF is the number of frames in a video and W × H is the resolution of the frame. In our case, frames are scaled to a resolution of 240 × 320. To describe each video as a single feature vector (descriptor), a histogram of the optical flows per video is calculated. The scene is split into a 10 × 10 grid, and eight directions are considered: {0-45, 46-90, 91-135, 136-180, 181-225, 226-270, 271-315, 316-360} degrees. Consequently, each scene is represented by 10 × 10 × 8 = 800 features, and the average of the histograms is calculated to represent the whole input video. The face detection and visual feature extraction process is illustrated in Fig. 3.
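The histogram construction can be sketched as follows, assuming the dense flow field is already available as per-pixel (dx, dy) displacements. The function names are hypothetical, and three choices are assumptions rather than details from the paper: each moving pixel casts one unweighted vote, bins are half-open 45-degree intervals, and pixels without motion are skipped.

```python
import math

def flow_histogram(flow, grid=10, bins=8):
    """Direction histogram of a dense flow field over a grid x grid layout.

    flow is a list of rows; each entry is a (dx, dy) displacement.
    Returns grid * grid * bins values (10 * 10 * 8 = 800 by default).
    """
    height, width = len(flow), len(flow[0])
    hist = [0.0] * (grid * grid * bins)
    for y, row in enumerate(flow):
        for x, (dx, dy) in enumerate(row):
            if dx == 0 and dy == 0:
                continue  # assumption: pixels without motion cast no vote
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            b = min(int(angle // (360.0 / bins)), bins - 1)
            cell = (y * grid // height) * grid + (x * grid // width)
            hist[cell * bins + b] += 1.0
    return hist

def video_descriptor(flows):
    """Average the per-frame histograms into one vector per video."""
    hists = [flow_histogram(f) for f in flows]
    return [sum(vals) / len(hists) for vals in zip(*hists)]

# Two tiny synthetic 20 x 20 flow fields with uniform motion (about 27 and 117 degrees)
first = [[(1.0, 0.5)] * 20 for _ in range(20)]
second = [[(-0.5, 1.0)] * 20 for _ in range(20)]
desc = video_descriptor([first, second])
print(len(desc))  # 800
```

Weighting each vote by the flow magnitude would be an equally plausible variant; the descriptor length stays at 800 either way.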

3) Textual features:
The skip-gram word2vec word embedding technique [26], [27] is employed to extract textual features. Embedding techniques are recognized as an efficient method for learning high-quality vector representations of words, terms or phrases from large amounts of unstructured text data. They map words, terms or phrases from the vocabulary to real-valued vectors such that words sharing common contexts and having similar semantics are positioned near each other in the vector space. Skip-gram (SG) is a neural network structure trained to predict a context given a word. Word embedding-based features have been adopted for different natural language processing tasks and achieved high results compared to other traditional features [28]. In our study, a skip-gram model trained on opinions expressed on Twitter with a dimensionality of 300 [29] is used to derive textual features. A feature vector is generated for each sample by averaging the embeddings of the words in that sample [30]. The main steps of textual feature extraction are shown in Fig. 4.
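The averaging step can be sketched as follows. The toy 3-dimensional vectors stand in for the 300-dimensional skip-gram model, the function name is hypothetical, and skipping out-of-vocabulary tokens is an assumption that the paper does not state.

```python
def average_embedding(tokens, vectors, dim):
    """Mean of the embeddings of the known tokens; zeros if none are known."""
    # Assumption: tokens missing from the embedding vocabulary are ignored.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical 3-dimensional vectors standing in for the 300-dimensional model
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "phone": [3.0, 2.0, 0.0],
}
features = average_embedding(["good", "phone", "unseen"], toy_vectors, dim=3)
print(features)  # [2.0, 1.0, 1.0]
```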

C. Classification Approach
This study deals with a multimodal identification system for three modalities. Therefore, seven different main models can be generated as follows. Three models are generated for the audio, textual and visual modalities. Three other models are generated for the bimodal approaches of audio-textual, textual-visual, and audio-visual modalities. The seventh model is for the trimodal approach of audio, textual and visual modalities.
Due to the theoretical foundation underlying neural network research and the strong practical results recently achieved on challenging problems, neural networks have been rediscovered as a significant alternative to several standard classification techniques [31]. However, the models need to be constructed carefully. Different systematic structures of neural networks are investigated to detect the age group from each standalone modality. Three feed-forward network models, Multilayer Perceptron (MLP) models, are applied. The first model is for the visual modality, the second is for the audio modality, and the third is for the textual modality. Several factors and decisions should be considered when configuring the neural network structures, including the number of hidden layers to use and the number of neurons in each hidden layer. Another issue for the multimodal approaches is whether the models should be homogeneous or heterogeneous; the former means using the same structure for each modality while the latter means using different structures. In the case of heterogeneous models, a further question is which attributes are considered. In this study, two hidden layers are considered for each model, while several criteria are considered to determine the number of neurons in each layer:
• Same number of neurons in each hidden layer with the same structure.
• Number of neurons is assigned according to the size of inputs for each structure. Three cases are considered, in which the number of neurons in a hidden layer is calculated as $N_{h1}$, $N_{h2}$ or $N_{h3}$, respectively, where i is the size of the input and o is the number of classes [the three defining equations are missing].
Other parameters are selected and remain the same for all structures: activation function = "relu", alpha = 0.0001, batch size = "auto", learning rate = 0.001, tol = 0.0001, momentum = 0.9, and epsilon = $10^{-8}$.

Table IV. Results of the pairwise t-tests (A: audio, T: textual, V: visual)

| Pair | p-value | Decision | Conclusion |
|---|---|---|---|
| A vs. AT | [missing] | Reject H0 | The accuracy rate of the audio-textual modality is significantly higher than that of the audio modality |
| A vs. AV | 0.00005 | Reject H0 | The accuracy rate of the audio-visual modality is significantly higher than that of the audio modality |
| A vs. ATV | 0.00001 | Reject H0 | The accuracy rate of the audio-textual-visual modality is significantly higher than that of the audio modality |
| T vs. AT | 0 | Reject H0 | The accuracy rate of the audio-textual modality is significantly higher than that of the textual modality |
| T vs. TV | 0 | Reject H0 | The accuracy rate of the textual-visual modality is significantly higher than that of the textual modality |
| T vs. ATV | 0 | Reject H0 | The accuracy rate of the audio-textual-visual modality is significantly higher than that of the textual modality |
| V vs. AV | 0 | Reject H0 | The accuracy rate of the audio-visual modality is significantly higher than that of the visual modality |
| V vs. TV | 0.279427 | Accept H0 | Combining the visual modality with the textual modality has no effect compared to the visual modality |
| V vs. ATV | 0 | Reject H0 | The accuracy rate of the audio-textual-visual modality is significantly higher than that of the visual modality |
| AV vs. ATV | 0.010979 | Reject H0 | The accuracy rate of the audio-textual-visual modality is significantly higher than that of the audio-visual modality |
Consequently, five different structures are defined from the aforementioned criteria. The first structure is denoted as NN1 and uses a constant number of neurons in both hidden layers for all modalities. Since the three modalities are trained and evaluated using the same structure, this type is homogeneous. The second structure is denoted as NN2 and uses a number of neurons equal to $N_{h1}$ in the first hidden layer and $N_{h2}$ in the second hidden layer. The third structure is denoted as NN3 and uses a number of neurons equal to $N_{h1}$ in both hidden layers. The fourth structure is denoted as NN4 and uses a number of neurons equal to $N_{h2}$ in both hidden layers. The fifth structure is denoted as NN5 and uses a number of neurons equal to $N_{h3}$ in both hidden layers. So, all structures except the first one (NN1) rely on the size of the input and are heterogeneous across modalities. These structures are considered as baseline models/classifiers.
Another MLP model is constructed as a meta-classifier to ensemble the base models of all modalities. In this study, a simple structure of one hidden layer is adopted in the second stage. As mentioned above, four models can be generated when combining the three different modalities: audio-textual, textual-visual, audio-visual, and audio-textual-visual. Since each base model outputs one score per age group and there are four age groups, the size of the meta-classifier input is eight for the bimodal approaches and 12 for the trimodal approach.
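How the second-stage input is assembled can be sketched as follows. The score values are hypothetical, and `mean_score_meta` is only a stand-in that averages the scores per class; the paper's meta-classifier is a trained MLP with one hidden layer, which is not reproduced here.

```python
AGE_GROUPS = 4  # number of age-group classes in the corpus

def stack_scores(*modality_scores):
    """Concatenate per-modality class scores into one meta-classifier input."""
    for scores in modality_scores:
        if len(scores) != AGE_GROUPS:
            raise ValueError("each modality must give one score per age group")
    return [s for scores in modality_scores for s in scores]

def mean_score_meta(meta_input):
    """Stand-in meta-learner: average each class score across modalities."""
    n_modalities = len(meta_input) // AGE_GROUPS
    return [
        sum(meta_input[m * AGE_GROUPS + c] for m in range(n_modalities)) / n_modalities
        for c in range(AGE_GROUPS)
    ]

def predict(meta_input, meta_learner):
    """Index of the age group with the maximum final score."""
    final = meta_learner(meta_input)
    return max(range(AGE_GROUPS), key=final.__getitem__)

# Hypothetical first-level scores for one utterance
audio = [0.10, 0.60, 0.20, 0.10]
text = [0.30, 0.30, 0.25, 0.15]
visual = [0.20, 0.50, 0.20, 0.10]

bimodal = stack_scores(audio, visual)
trimodal = stack_scores(audio, text, visual)
print(len(bimodal), len(trimodal))  # 8 12
print(predict(trimodal, mean_score_meta))  # 1
```

Replacing `mean_score_meta` with any trained model that maps the 8- or 12-dimensional input to four final scores leaves the rest of the pipeline unchanged.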

IV. EXPERIMENTS AND RESULTS
The proposed models are evaluated using 10-fold cross validation. A prototype is implemented and evaluated for each standalone modality and for the ensemble model in Python using the scikit-learn machine learning package [32]. Several well-known measures are reported to evaluate and compare the performance of the various models: Precision (Prc), Recall (Rec), and $F_1$, which is the harmonic mean of precision and recall and is a preferred performance measure for imbalanced class distributions. These measures are computed as follows for each class $c_i$:

$$\mathrm{Prc}_i = \frac{\#\ \text{instances correctly classified as class } c_i}{\#\ \text{instances classified as class } c_i} \qquad (4)$$

$$\mathrm{Rec}_i = \frac{\#\ \text{instances correctly classified as class } c_i}{\#\ \text{instances actually in class } c_i} \qquad (5)$$

$$F_{1,i} = \frac{2 \cdot \mathrm{Prc}_i \cdot \mathrm{Rec}_i}{\mathrm{Prc}_i + \mathrm{Rec}_i} \qquad (6)$$

Besides the per-class performance, we report the weighted and macro averages over all classes in each case. The macro-average is the unweighted mean of the per-class metrics and does not take class imbalance into account. This measure may overemphasize the low performance of infrequent classes. Hence, we also report the weighted average, where each class metric is weighted by the support (i.e. the number of true instances for each class).
Table II shows the results for each modality with different base neural network structures for unimodal age-group identification approaches. The results are presented in terms of precision, recall and $F_1$ for each class as well as weighted and macro averages over all classes. The audio modality achieves the highest results compared to the text and visual modalities in all cases. For the audio modality, NN1 achieves the highest results with a weighted $F_1$ average of 85.12%, followed by NN5, which reports a weighted $F_1$ average of 84.73%, while NN4 achieves the lowest results. The overall lowest results are obtained using the textual modality. The best performance in the case of the textual modality is obtained using NN1 with a weighted average of 57.02%. Regarding the visual modality, the highest results are achieved using NN4 with a weighted average of 63.83%.
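The per-class measures and the two averaging schemes can be sketched as follows. The labels are hypothetical toy data and the function names are illustrative; the paper's prototype relies on scikit-learn rather than this code.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for every class in y_true."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    report = {}
    for c in sorted(support):
        prc = correct[c] / predicted[c] if predicted[c] else 0.0
        rec = correct[c] / support[c]
        f1 = 2 * prc * rec / (prc + rec) if prc + rec else 0.0
        report[c] = {"prc": prc, "rec": rec, "f1": f1, "support": support[c]}
    return report

def macro_average(report, metric):
    """Unweighted mean over classes."""
    return sum(r[metric] for r in report.values()) / len(report)

def weighted_average(report, metric):
    """Mean over classes weighted by support."""
    total = sum(r["support"] for r in report.values())
    return sum(r[metric] * r["support"] for r in report.values()) / total

# Hypothetical labels for an imbalanced two-class toy problem
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]
report = per_class_report(y_true, y_pred)
print(round(macro_average(report, "f1"), 4))     # 0.625
print(round(weighted_average(report, "f1"), 4))  # 0.6667
```

In this toy case the minority class "b" has the lower F1 (0.5 against 0.75), so the macro average falls below the weighted average, which is the effect described above.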

B. Bimodal and Multimodal Age-group Recognition Results
The best structures evaluated in the first-level classification are then used to represent each modality and fed into the second-level classifier. NN1 for the audio modality, NN1 for the text modality and NN4 for the visual modality are fused using the meta-structure. Table III shows the results obtained using the bimodal and multimodal approaches. The highest results are achieved using the audio-visual (A-V) approach with a weighted average of 88.55%, followed by the audio-text-visual (A-T-V) approach with a weighted average of 87.98%. It can be seen that significant improvements are reported over the baseline performance in Table II. For example, in the worst case, the highest weighted $F_1$ obtained for the baseline textual modality is 57.02% and for the visual modality is 63.83%, whereas combining them improves the result to 69.25%. In the best case, the highest weighted $F_1$ obtained for the audio modality is 85.12% and for the visual modality is 63.83%, while combining them improves the result to 88.55%.
It is important to perform a statistical test to provide evidence that the improvement from combining different modalities is significant and not due to chance. To do so, we re-ran the 10-fold cross-validation 10 times for each model. We then used the pairwise t-test to determine how significant the improvements are. Table IV shows the results of the performed t-tests at the 95% confidence level. The reported p-values are less than 0.05 for all cases except one. Thus, the null hypothesis is rejected and a significant improvement is obtained in all cases except when the textual modality is combined with the visual modality, where no statistically significant improvement over the visual modality alone is observed.
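The significance check can be sketched as a paired t-test on the accuracies of the repeated runs. This is an illustration under stated assumptions: the accuracy values are hypothetical and not the paper's numbers, the test is assumed to be paired over the 10 run-level accuracies (df = 9), and the decision uses the tabulated two-sided critical value instead of an exact p-value.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences b - a (df = n - 1)."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical accuracies from 10 repeated cross-validation runs
unimodal = [0.80, 0.82, 0.81, 0.79, 0.83, 0.80, 0.82, 0.81, 0.80, 0.82]
fused = [0.84, 0.85, 0.83, 0.84, 0.86, 0.83, 0.85, 0.86, 0.82, 0.85]

T_CRITICAL = 2.262  # two-sided critical value of Student's t, alpha = 0.05, df = 9
t = paired_t_statistic(unimodal, fused)
print("reject H0" if abs(t) > T_CRITICAL else "fail to reject H0")
```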

V. CONCLUSION
We have presented a novel multimodal ensemble neural network model for detecting users' age group from opinion videos. For evaluation purposes, three modalities are extracted, namely audio, text and visual, from videos expressed in the Arabic language with different dialects. Various ways are adopted to construct different neural network structures for unimodal recognition as baselines. Then, all modalities are combined using the proposed ensemble neural network approach. For standalone modalities, the audio-based model has achieved the highest performance with the smallest number of features. However, the text modality reported the lowest results. Combining different modalities has led to significant improvements in the results in nearly all cases. The highest results have been achieved using the bimodal audio-visual and trimodal audio-textual-visual approaches. As future work, the authors are exploring the impact of age knowledge in opinion mining and sentiment analysis.

Fig. 3. Face detection and dense optical flow features extraction process.