An Improved Deep Learning Approach Based on Variant Two-State Gated Recurrent Unit and Word Embeddings for Sentiment Classification

Sentiment classification is an important but challenging task in natural language processing (NLP) and is widely used to determine the sentiment polarity of user opinions. Word embedding techniques, learned from various contexts, produce similar vector representations for words that appear in similar contexts and have been used extensively for NLP tasks. Recurrent neural networks (RNNs) are a common deep learning architecture for classifying variable-length sentences. In this paper, we investigate a variant Gated Recurrent Unit (GRU) that includes an encoder method to preprocess data and improve the impact of word embeddings for sentiment classification. The main contributions of this paper are a novel Two-State GRU and an encoder method, combined into an efficient architecture named E-TGRU, for sentiment classification. The empirical results demonstrate that the GRU model can efficiently learn how words are used in the context of user opinions, given sufficiently large training data. We compared the performance against three traditional recurrent models, GRU, LSTM, and Bi-LSTM, on two benchmark datasets, IMDB and Amazon Product Reviews. The results show that: 1) the proposed approach, E-TGRU, obtains higher accuracy than the three state-of-the-art recurrent approaches; 2) Word2Vec is more effective for producing word vectors for sentiment classification; 3) in implementation, the proposed strategy proves robust for text classification.


I. INTRODUCTION
Automated sentiment classification is the process of extracting the opinions that various users express in text, which carries the emotions or opinions behind the sentences, and identifying the positive or negative aspects of comments; this is very useful for analyzing user-generated content. Sentiment classification is a popular problem in natural language processing (NLP) and has recently attracted a great deal of attention. Large amounts of data are publicly available on various online platforms, allowing sentiment classification to be performed for academic purposes. Sentiment analysis is done both at the phrase level and at the document or paragraph level. Each level offers unique challenges and hence requires different techniques. Word embedding is a fundamental task in NLP and a commonly used method for encoding words into a high-dimensional space. In recent years, word embeddings have received considerable attention as a basis for extracting both semantic and syntactic information. The idea behind word embeddings has a long history, but it became well known through the work of Bengio et al. [1], in which each word is represented by a word vector and the concatenation of several previous word vectors is used to predict the following word based on a language model. Traditionally, each word was represented as a high-dimensional sparse vector whose dimensionality matches the number of unique terms in the vocabulary, using distributional approaches [2] such as context-based vector techniques [3], [4] and the Hyperspace Analogue to Language (HAL) model [5]. Recently, distributed word representation techniques known as word embeddings [6], [7], [8] have been established to represent words as low-dimensional real-valued vectors for text representation, which can effectively capture semantic and syntactic word similarities from large datasets. These word embedding approaches have been applied successfully to several tasks, including entity recognition [9], dependency parsing [10], text classification [11], and speech recognition [12].
In word embeddings, the meaning of a word learned from specific contexts tends to produce the same vector representation for words with similar contexts. This works well for semantic-oriented applications, but it is problematic for sentiment classification, because words with similar vector representations (due to similar contexts) may have opposite sentiment polarity, as in the examples of happy-sad discussed in [13] and positive-negative in [14]. Furthermore, many authors have investigated the distinctive features of vector representations of words through machine learning [15]. Two well-known word representation methods for neural network language models, the GloVe word vectors, introduced by Pennington et al. [16], and TF-IDF, by Aizawa [17], are widely used. However, traditional context-based word representations such as TF-IDF and GloVe often fail to capture adequate sentiment information, which may result in words with similar vector representations having opposite sentiment polarity. To handle this issue, this research suggests using Word2Vec, a word embedding method, to represent the words in a text as vectors for sentiment classification. In recent years, deep learning architectures have gradually delivered better performance in several data mining applications, such as text classification, entity recognition, and sentiment analysis. These architectures effectively address the feature representation issue because they learn features from contextual data automatically. Among deep feedforward neural networks, convolutional neural networks (CNNs) [18] have been shown to capture features from words or phrases, while recurrent neural networks (RNNs) [19] can capture temporal dependencies in sequential information and have proven to be strong semantic composition approaches for sentiment classification [18]. Given the large volume of text in social media, engineered features are scarce. To obtain progressively richer features, we adopt the concept of distributed representations of words, where each input is denoted by numerous features and each feature is involved in several potentially sequential inputs. In particular, we use the pre-trained Word2Vec word embedding method [6] for distributed representation of social media text.
RNNs have been used extensively in recent years for text classification tasks. Their key benefit is that they can process temporal sequential data of variable length, which provides flexibility when evaluating reviews of various lengths. Recently, a simplified architecture relative to LSTM, known as the Gated Recurrent Unit (GRU), was proposed by [20]. GRUs contain fewer parameters and are described by a simpler set of equations, and thus need significantly less computational power. The relative effectiveness of GRU and LSTM remains an open problem and an active area of research.
In this research, we investigate the effectiveness of GRU for sentiment classification with distributed representations in social media. We apply a variant-GRU model with the aim of preventing the gradient exploding or vanishing problem of existing RNNs and also overcoming the deficiencies of the standard GRU. We conduct experiments on two benchmark datasets, IMDB and Amazon Product Reviews, and compare the performance of the proposed Encoder Two-State Gated Recurrent Unit (E-TGRU) with three traditional RNN models, namely the Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (Bi-LSTM).
Our work further investigates the standard GRU through a variant GRU with an encoder that automatically preprocesses the data, providing a better representation of the inputs than the original raw inputs. Building on previous work, our main objective is to enhance the standard GRU structure in order to increase the accuracy of sentence classification and minimize information loss. In particular, the main contributions of this research are summarized as follows:
- Proposing a variant of GRU, the encoder gated recurrent unit (E-GRU), for sentiment analysis. This method automatically preprocesses text data through an encoder.
- Proposing a Two-State Gated Recurrent Unit (TGRU) for sentiment classification.
- Applying the pre-trained Word2Vec word embedding method to represent words as vectors for sentiment classification.
- Experimental results demonstrate that our network, the Encoder Two-State GRU (E-TGRU) architecture, performs well on both tasks: it takes advantage of the encoded local features, captures long-term dependencies, and further enhances sentiment classification performance while being significantly less computationally expensive than Bi-LSTM.

II. BACKGROUND

A. Recurrent Neural Network
A standard RNN is a kind of artificial neural network in which connections among the units form a directed cycle, and the units perform the same task for each element in the sequence. The RNN is well suited to sequential problems, especially capturing temporal information in the loop. The RNN uses a recurrent hidden state whose activation depends on the previous time step while processing the sequential information, and the current state relies on the current input. Thus, the current hidden state makes full use of past information. In this way, a standard RNN can handle variable-length inputs and compute sequential data in a dynamic process. The architecture of an RNN is shown in Fig. 1. Given a sequential input $X = [x_1, x_2, \ldots, x_t, \ldots, x_T]$ of length $T$ and an output vector $O_t$, an RNN defines the hidden state at time $t$ with the following equations:

$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$

$O_t = W_{ho} h_t + b_o$

where $W_{xh} \in \mathbb{R}^{h \times d}$ is the weight matrix connecting the input layer to the hidden layer, and $W_{hh} \in \mathbb{R}^{h \times h}$ is the hidden-layer weight matrix. $\sigma$ is the sigmoid activation function, and $W_{xh}$, $W_{hh}$, $W_{ho}$ are the parameters of the traditional RNN.
In Fig. 1, the input at time step $t$ is $x_t$, the word vector of the $t$-th word in the text; $h_t$ is the activation state at step $t$; $O_t$ is the output at time $t$, chosen according to the needs of the network; and $U$, $W$ are the weight parameters of the network that must be trained. Although the RNN is very powerful when dealing with sequential problems, it is difficult to train with the gradient descent method and suffers from the vanishing and exploding gradient problem [20]. Variants of the RNN have been developed to solve these issues, such as LSTM and GRU. Of the two, GRU tends to avoid overfitting and saves training time; therefore, GRU is adopted in our method.
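To make the recurrence above concrete, the following is a minimal NumPy sketch of one RNN step; the parameter names and the toy dimensions are our own illustration, not the authors' implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, W_xh, W_hh, W_ho, b_h, b_o):
    # One step of the vanilla RNN recurrence defined above.
    h_t = sigmoid(W_xh @ x_t + W_hh @ h_prev + b_h)   # new hidden state
    o_t = W_ho @ h_t + b_o                            # output at time t
    return h_t, o_t

# Toy dimensions: d-dimensional inputs, h-dimensional hidden state.
d, h = 4, 3
rng = np.random.default_rng(0)
W_xh, W_hh = rng.normal(size=(h, d)), rng.normal(size=(h, h))
W_ho, b_h, b_o = rng.normal(size=(1, h)), np.zeros(h), np.zeros(1)

h_t = np.zeros(h)
for x_t in rng.normal(size=(5, d)):   # a sequence of length T = 5
    h_t, o_t = rnn_step(x_t, h_t, W_xh, W_hh, W_ho, b_h, b_o)

The loop makes the key property visible: the same weights are reused at every time step, and each hidden state depends on the previous one.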

B. Long Short Term Memory
Long Short-Term Memory (LSTM) is an RNN architecture that was initially proposed by [21] and has been applied widely in natural language processing. LSTM contains a gated mechanism and an internal memory cell, which help address the well-known vanishing and exploding gradient problems. The fundamental concept of the "gates" used in LSTM is to manage sequential contextual data. A common LSTM unit is composed of an internal memory cell and three gates in the recurrent connection, which help the model decide how much information to pass on and how much to extract at each time step. The three gates are designed to insert or ignore information in the memory cell; they help the model regulate how to update the current memory cell $c_t$ and the current hidden state $h_t$.
This update proceeds in three steps. First, a sigmoid activation function (Sigm) processes the data that needs to be stored in the internal memory cell. Second, the tanh function produces a candidate vector over $h_{t-1}$ and $x_t$. In the final step, the output gate selects the appropriate information from the memory cell to output.
The LSTM transition functions are implemented as follows:

$i_t = \mathrm{Sigm}(W_i x_t + U_i h_{t-1} + b_i)$

$f_t = \mathrm{Sigm}(W_f x_t + U_f h_{t-1} + b_f)$

$o_t = \mathrm{Sigm}(W_o x_t + U_o h_{t-1} + b_o)$

$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$

$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$

$h_t = o_t * \tanh(c_t)$

where Sigm is the logistic sigmoid function, producing outputs in $[0, 1]$. The weights and biases to be learned during training are $W_i, W_f, W_o, W_c \in \mathbb{R}^{q \times p}$, $U_i, U_f, U_o, U_c \in \mathbb{R}^{q \times q}$, and $b_i, b_f, b_o, b_c \in \mathbb{R}^{q \times 1}$, and $*$ indicates element-wise multiplication of two vectors. Each gate is composed of a sigmoid layer and has the capability to remove or add information to the memory cell. At the current time $t$, $h_t$ denotes the hidden state, and $i_t$, $o_t$, $f_t$ denote the input, output, and forget gates. $W$ and $U$ represent the weight parameters of the LSTM, while $b_i$, $b_o$, $b_f$ are the biases of the gates.
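As a sketch only, one LSTM step following the gate equations above can be written in NumPy as below; the dictionary p holds the W, U, b parameters per gate, initialized in the same way as in the RNN sketch earlier.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # One LSTM step; p is a dict holding the W, U, b parameters per gate.
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])     # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])     # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])     # output gate
    c_hat = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])   # candidate cell
    c_t = f_t * c_prev + i_t * c_hat    # update the internal memory cell
    h_t = o_t * np.tanh(c_t)            # expose part of the cell as the hidden state
    return h_t, c_t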

C. Traditional Gated Recurrent unit
The Gated Recurrent Unit (GRU) is another advanced kind of RNN and a simplified variant of LSTM, proposed by Cho et al. [20]. A GRU contains only two gates, an update gate and a reset gate, which manage the flow of information inside the unit, without having a separate memory cell, as shown in Fig. 2. The GRU also demonstrates a powerful capability to capture long-term dependencies between the elements of a sequence. Each hidden state $h_t$ at time $t$, given input $x_t$, is calculated using the following equations:

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$

$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$

$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)$

$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$

where $z_t$ refers to the update gate, $r_t$ refers to the reset gate, and $W$, $U$, $b$ are the weight matrices and bias vectors.
$\sigma$ is the sigmoid activation function and tanh is the hyperbolic tangent. After designing the GRU model, a learning technique for it must be chosen. Currently, common training methods for RNNs such as the GRU include back-propagation through time (BPTT) and real-time recurrent learning (RTRL).
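For comparison with the LSTM step above, a matching sketch of one GRU step follows (parameter names are illustrative, following the equations in this subsection); note it carries a single state and has one fewer gate block.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    # One GRU step; p is a dict holding the W, U, b parameters per block.
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])            # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])            # reset gate
    h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_hat                               # blended new state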
However, recent studies point to two issues in the standard GRU network. First, in the data preprocessing phase, considerable manual experience is needed to preprocess the data for accurate classification; second, memory consumption is high. Therefore, we propose a GRU variant, the encoder GRU (E-GRU). Fig. 3 illustrates the structure of an E-GRU. The weights $W$, $U$ and the activation functions in the E-GRU are the same as in the standard GRU. The E-GRU applies an encoder to automatically preprocess large amounts of data. The encoder usually provides a better representation of the input than the original raw input, as it compresses the input data and selects significant features for training.

III. PROPOSED METHODOLOGY
In this section, we give a detailed description of the proposed architecture, which combines an encoder-based gated recurrent unit with word embeddings. The proposed methodology applies an encoder to preprocess the inputs, whose outputs are fed to a Two-State GRU that learns forward and backward context features across time steps, followed finally by a softmax classifier. The elements of the methodology for solving the overall sentiment classification problem are described as follows:

A. Gated Recurrent Unit (GRU)
The GRU is an advanced type of standard RNN with a gating structure. It has a powerful capability to process sequential data and capture long-term dependencies between elements by preserving the previous state in the internal state of the model through time step $t$.

B. Variant Gated Recurrent Unit
Existing variants include the binary gated recurrent unit (Bin-GRU) and the local feature-based GRU (LF-GRU), among others. Zhao et al. [29] introduced the LF-GRU, which first extracts local features from segments or windows of time-series data and then applies a GRU-based model to learn weighted-average features from the sequence of local features. However, these approaches require considerable manual experience for preprocessing massive data. Therefore, we propose the encoder GRU (E-GRU), which uses an encoder to preprocess data automatically and efficiently. The output of the encoder (E-GRU) then becomes the input of the Two-State GRU. In this paper, we mostly discuss the Two-State GRU and the E-GRU.

C. Traditional Auto-Encoder
An auto-encoder is a kind of unsupervised artificial neural network that learns to compress and encode data efficiently. Its main purpose is to identify useful and valuable features from a large dataset without applying any separate dimensionality reduction method. An auto-encoder operates in two stages, encoding and decoding: the encoding stage converts the input features to a new representation [22], while the decoding stage tries to convert this new representation back as closely as possible to the original inputs.

D. Encoder GRU (E-GRU)
The E-GRU uses the encoder part of the auto-encoder to reduce the dimensionality of the input data. This gives a better representation of the inputs than the original raw inputs, and the outputs of the encoder then become the inputs of the Two-State GRU, as sketched below.
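A minimal Keras sketch of this idea, assuming the encoder compresses fixed-size feature vectors before they reach the recurrent layers; the layer sizes and variable names are illustrative, not the authors' configuration.

from tensorflow.keras import layers, models

input_dim, encoding_dim = 300, 128  # illustrative sizes

# Train a standard auto-encoder: encode, then decode back to the input.
inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation="relu")(inputs)    # encoding stage
decoded = layers.Dense(input_dim, activation="linear")(encoded)    # decoding stage
autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="rmsprop", loss="mse")
# autoencoder.fit(X, X, epochs=..., batch_size=...)  # learn to reconstruct the inputs

# The E-GRU keeps only the encoder half; its output feeds the Two-State GRU.
encoder = models.Model(inputs, encoded)
# compressed = encoder.predict(X)

Discarding the decoder after training is the standard way to reuse an auto-encoder as a feature compressor, which matches the role the encoder plays here.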

E. Feature Extraction
In sentiment classification, feature extraction plays a significant role in identifying relevant features in raw data. It eliminates unnecessary features [31] while maintaining the important features that improve the accuracy of the model.

F. Automatic Preprocessing
This refers to automatically capturing features from the original raw data, removing the need for manual intervention.

G. Word Embedding Layer
In sentence classification, the initial stage is preprocessing the input sentence and the sentiment context words in order to better represent the limited content of long and short texts. In this paper, we apply the pre-trained Word2Vec [23] word embedding method in the embedding layer to extract the contextual correlation between words in the training data. Word2Vec acts as a predictive model that trains co-occurrence vectors to capture the correlation between the target word and its context words in a simple way: it tries to extract the relevant semantic regularities by learning from the activation of the target word and its context words. The word embedding layer of the model converts context words into real-valued feature vectors that capture semantic and syntactic information. Let $L \in \mathbb{R}^{V \times d}$ be the embedding lookup table produced by Word2Vec, where $d$ is the dimension of the word vectors and $V$ is the vocabulary size. Assume the input sentence contains $n$ words and the sentiment resource contains $m$ words. In this way, we obtain the matrix $X^c = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{n \times d}$ for the context words and the matrix $X^s = [s_1, s_2, \ldots, s_m] \in \mathbb{R}^{m \times d}$ for the sentiment resource words. Fig. 4 shows this process as a simple concatenation of all word embeddings in $V$.
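As a sketch of how the lookup table L might be assembled from the pre-trained Google News vectors: the file name, the toy vocabulary, and the random fallback for out-of-vocabulary words are assumptions for illustration, not details specified by the paper.

import numpy as np
from gensim.models import KeyedVectors

d = 300                                   # word vector dimension
vocab = ["good", "bad", "movie", "plot"]  # toy vocabulary of size V

# Assumed file name for Google's pre-trained Word2Vec vectors.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

L = np.zeros((len(vocab), d))             # the lookup table L in R^{V x d}
for i, word in enumerate(vocab):
    if word in w2v:
        L[i] = w2v[word]                            # copy the pre-trained vector
    else:
        L[i] = np.random.uniform(-0.25, 0.25, d)    # random init for OOV words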

H. The TGRU Network Architecture
The GRU recurrent layer has the ability to represent sequences, e.g., sentences, and is very useful for capturing long-term dependencies between the elements of a sequence. The GRU can be applied to sentiment classification in the same manner as in Cho et al. [24]. The GRU is a recent simplified variant of LSTM that combines the forget gate and input gate into a single update gate, improving the convergence time and the number of iterations of model training. First, an embedding layer of a suitable size is created; it represents each word by a real-valued vector of a fixed dimension. These values are the weights between the embedding layer and the hidden layer above it. The hidden units are connected not only to the layers below and above them but also to units within their own layer. At the end of the hidden layer we obtain a representation of the entire sequence, which can be used as input to a linear model or classifier. The GRU structure consists of an update gate and a reset gate. The reset gate $r_t$ determines how much of the previous memory can be discarded from the previous hidden state $h_{t-1}$. The update gate $z_t$ determines how much of the previous state is kept and blended with the newly computed state. With $x_t$ a $p$-dimensional input vector at time $t$ and $\sigma$ the logistic sigmoid activation, the update gate is

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$

where $W_z$ (a $q \times p$ matrix), $U_z$ (a $q \times q$ matrix), and $b_z$ (a $q \times 1$ vector) are fixed-size parameters shared across the whole model.
Each element of the update gate lies in $[0, 1]$; the same sigmoid activation is also applied in the dense (output) layer for binary classification.
The reset gate is calculated similarly to the update gate but with different weights, $W_r$ (a $q \times p$ matrix), $U_r$ (a $q \times q$ matrix), and $b_r$ (a $q \times 1$ vector):

$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$

The candidate state $\tilde{h}_t$ is computed similarly to the classic recurrent unit:

$\tilde{h}_t = \tanh(W_h x_t + r_t * U_h h_{t-1} + b_h)$

For the candidate state $\tilde{h}_t$ at the current time step $t$, the reset gate controls the flow of the previous hidden activation $h_{t-1}$ containing past information. If the reset gate is close to zero, the previous hidden state $h_{t-1}$ is discarded.
Output state:

$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$

The hidden state $h_t$ uses the update gate to blend the previous hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$. If the update gate is close to 1, the previous hidden state is retained and passed to the current instant. Here $*$ is the Hadamard (element-wise) product, applied between the previous state $h_{t-1}$ and $(1 - z_t)$ and between the update gate $z_t$ and the candidate activation state $\tilde{h}_t$.
The GRU can preserve memory substantially longer than a plain RNN thanks to its gating mechanism. However, based on recent studies and practical observation, we find that when a GRU examines a word it considers only the forward linguistic context, so the GRU also needs to learn the context through a backward pass. We observe that the meaning of a word in any language model is affected not only by the forward context but also by the backward context.
Therefore, we propose the Two-State GRU (TGRU) to handle this problem. The proposed TGRU network contains two directions, one for the positive time direction (the forward state) and one for the negative time direction (the backward state), as presented in Fig. 5. The TGRU learns the context of a word from both directions; it is inspired by the bidirectional recurrent neural network (BRNN) in [25]. In the training process, each training sequence is processed by two individual recurrent networks, in the forward and backward directions, whose outputs are finally combined in the output layer. The equations for the update gate $z_t$, reset gate $r_t$, candidate state $\tilde{h}_t$, and final output activation state $h_t$ of the forward and backward GRUs are as given above. In addition, we add the backward pass to our proposed approach to discover more useful information.
Backward pass: the representation of a word at time $t$ is

$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$

for a sequence $(x_1, x_2, \ldots, x_n)$ consisting of $n$ words, each word represented as a $d$-dimensional vector at time $t$.
The forward pass of the GRU computes $\overrightarrow{h_t}$, which captures the left-to-right context of the sentence, while the backward pass captures the right-to-left context $\overleftarrow{h_t}$. The context representations from the forward and backward directions are then combined into a single layer. Fig. 5 presents the detailed architecture of the TGRU.
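The Two-State behavior is close to what the Keras Bidirectional wrapper provides; a minimal sketch, with illustrative unit counts, is shown below.

from tensorflow.keras import layers, models

# Forward GRU reads left to right, backward GRU reads right to left;
# merge_mode="concat" concatenates the two states, h_t = [h_fwd ; h_bwd].
model = models.Sequential([
    layers.Input(shape=(None, 300)),                       # sequence of word vectors
    layers.Bidirectional(layers.GRU(128), merge_mode="concat"),
    layers.Dense(1, activation="sigmoid"),                 # binary sentiment output
])
model.summary()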

Generally, the complexity of a model is expressed as O(D), where D is the number of estimated parameters of the model.
Two pieces of information are used to calculate D: the dimension $p$ of the input vector and the dimension $q$ of the hidden layer. Table I details the computed parameters of GRU, LSTM, Bi-LSTM, and TGRU.
The complexity of the TGRU is double that of the standard GRU because the number of parameters is doubled. The TGRU therefore requires more time and resources for execution than GRU or LSTM, while requiring less time and resources than Bi-LSTM. However, our proposed E-TGRU approach is able to discover useful information that greatly improves the accuracy of sentiment classification.
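As a worked example of how D grows, assuming the simple counting convention of one W (q x p), one U (q x q), and one bias (q) per gate or candidate block, and illustrative sizes:

def gru_params(p, q):
    # 3 blocks (update gate, reset gate, candidate), each W + U + b
    return 3 * (q * p + q * q + q)

def lstm_params(p, q):
    # 4 blocks (input, forget, output gates and candidate cell)
    return 4 * (q * p + q * q + q)

p, q = 300, 128  # input and hidden dimensions (illustrative)
print("GRU:    ", gru_params(p, q))       # 164,736
print("LSTM:   ", lstm_params(p, q))      # 219,648
print("TGRU:   ", 2 * gru_params(p, q))   # two directions double the count
print("Bi-LSTM:", 2 * lstm_params(p, q))

This makes the ordering in the text explicit: GRU < LSTM < TGRU < Bi-LSTM in parameter count.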

I. Flowchart
In this sub-section we present the flowchart of the sentiment classification algorithm. Fig. 6 illustrates the flowchart, which contains three major phases. The 1st phase covers data formatting for sentiment classification. The 2nd phase preprocesses the data using the variant GRU: we apply the encoder to preprocess the text data, and the preprocessed data are then used as input to the TGRU. The 3rd phase is cross-validation. The main steps of the algorithm include:
5. Insert the 1st GRU layer of $L_1$ units with dropout $d_1$ and recurrent dropout $r_1$.
6. Insert the 2nd GRU layer of $L_2$ units with sigmoid activation, dropout $d_2$, and recurrent dropout $r_2$.
7. Forward pass:
8. Starting from the input layer, do a forward pass over the network and take the left-to-right context $\overrightarrow{h_t}$ of the sentence.
9. Backward pass:
10. Starting from the output layer, do a backward pass over the network and take the right-to-left context $\overleftarrow{h_t}$ of the sentence.
11. Step 3: Train and validate the model.
12. while the initial stop condition is not met do
13. while the training dataset is not empty do
14. Prepare a mini-batch as network input.
15. Update the weights and biases using the RMSprop optimizer algorithm.
17. end while
18. Validate the network on the validation set.
19. end while
20. Step 4: Test the model.
21. Test the fine-tuned hyper-parameters on the test dataset.
22. return the evaluation results on the test dataset.
Data preprocessing is very important for increasing the accuracy of the network, because the valuable features obtained by preprocessing directly affect the final performance of the model. Therefore, we introduced the variant GRU with an encoder method for preprocessing the data.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we conduct several experiments to show how E-TGRU performs compared with three well-known state-of-the-art recurrent models on two benchmark sentiment classification datasets: the Stanford Large Movie Review dataset (IMDB) and the Amazon Product Reviews (APR) dataset.

A. IMDB
The results in this paper were obtained on the IMDB dataset originally collected by Andrew Maas [26]. It consists of 50,000 labeled IMDB movie reviews specifically selected for sentiment classification, split 50:50 into training and test data. We preprocessed the dataset following an implementation similar to [27]. The dataset also includes 50,000 unlabeled movie reviews, which are useful for unsupervised training.

B. Amazon Product Reviews (APR)
We used a dataset containing Health and Personal Care product reviews from the Amazon website, made available by the University of California [28]. We trained on 10,000 reviews, 50% positive and 50% negative, making it a binary classification task. Of the dataset, 15% is used for testing and 15% for validation.

C. Implementation Detail
Data preprocessing and manipulation were performed in Python 3.6 with Anaconda. The network was trained for 30 epochs. All simulations were performed on a Windows PC with an Intel Core i7-3770 CPU @ 3.40 GHz and 4 GB RAM. We refer to the proposed approach, with pre-trained word vectors for sentiment classification, as E-TGRU.
In this sub-section, we describe each layer of the E-TGRU model in detail. To improve the performance of the proposed model, the first step is to improve the quality of the dataset: we enhance the text dataset with a preprocessing technique, and then obtain 300-dimensional word vectors using the pre-trained Word2Vec model from Google for sentiment classification. In our experiments, we used the RMSprop optimizer with its default optimal parameter settings: a learning rate of 0.001 and a decay factor of 0.9. The model is trained by mini-batch gradient descent with a batch size of 64. A sigmoid activation is applied in the dense (output) layer for binary sentiment classification, and a softmax activation function for multiclass sentiment classification. To avoid overfitting, we applied a dropout strategy to the TGRU layer, which has 128 memory units for each of the forward and backward directions. We set a dropout rate of 0.2 for the embedding layer, while the recurrent structure has a dropout of 0.4. After combining the forward and backward GRUs, one more dropout layer of 0.5 was added to further reduce overfitting. Moreover, 10-fold cross-validation was applied to minimize random effects in the model. Fig. 7 presents the accuracy and loss curves of the model. As illustrated in the figure, after 10 epochs the training and validation accuracy improve to over 86% and the training and validation loss drop below 35%; after 25 iterations the model finally achieves an accuracy above 88% with a loss below 30%. Fig. 8 summarizes the classification results on the IMDB and APR datasets. We evaluated the efficiency of our proposed E-TGRU model and compared it with three state-of-the-art existing RNN approaches: GRU, LSTM, and Bi-LSTM. The results show that our proposed model is suitable for sentence-level sentiment classification, with a higher accuracy of 89.37%. In our research, the model is trained using the Word2Vec word vector method to compute a real-valued vector representation of each word, and it performs excellently on both the IMDB and APR datasets. We fixed both the word embedding dimensions and the number of units to 64.
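Under the hyper-parameters reported above, a plausible Keras assembly of the full pipeline might look as follows; vocab_size and max_len are our assumptions, and the Bidirectional wrapper stands in for the Two-State GRU.

from tensorflow.keras import layers, models, optimizers

vocab_size, max_len, d = 20000, 400, 300  # assumed vocabulary size and sequence length

model = models.Sequential([
    # Embedding layer, ideally seeded with the 300-d Word2Vec table L.
    layers.Embedding(vocab_size, d, input_length=max_len),
    layers.Dropout(0.2),                                   # dropout after embedding layer
    layers.Bidirectional(                                  # two-state GRU,
        layers.GRU(128, recurrent_dropout=0.4)),           # 128 units per direction
    layers.Dropout(0.5),                                   # dropout after merging directions
    layers.Dense(1, activation="sigmoid"),                 # binary sentiment output
])
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001, rho=0.9),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=64, epochs=30, validation_split=0.1)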

D. Sentiment Analysis Results
The strong performance of E-TGRU, through the encoder method and the two-state GRU mechanism, demonstrates that our model is much better than other traditional models for the task of sentiment classification, and it obtains higher performance in both binary and multiclass sentiment classification tasks. In this paper, we evaluated the classification performance of the proposed models with three evaluation metrics: accuracy, F1-score, and mean square error (MSE). First, we conducted an experiment on the IMDB movie review dataset; the performance compared with the three baseline models is presented in Fig. 8.
Fig. 8 shows that the proposed model achieved the best performance on the IMDB dataset, with an accuracy of 89.37%; E-TGRU consistently outperforms GRU, LSTM, and Bi-LSTM. Next, we evaluate the classification performance on the Amazon Product Reviews (APR) dataset in Fig. 9. The proposed E-TGRU model also achieved the best performance on the APR dataset, with an accuracy of 87.58%, while GRU achieved 83.08%, LSTM attained 83.63%, and Bi-LSTM obtained a much better accuracy (84.96%) than GRU and LSTM. We also compare the F1-score performance of the traditional recurrent models in Fig. 10. It clearly shows that better performance is achieved with pre-trained word embeddings: the F1-score results indicate that pre-trained word vectors are good general-purpose feature extractors and can be used across datasets.

E. Comparing TGRU with Recent Studies
We evaluated the network's performance against recent existing research [18], [29], and [30]. Kim et al. [18] proposed static and non-static convolutional neural networks trained on top of pre-trained Word2Vec word vectors for sentence-level classification; with multiple channels they obtained 81.58% accuracy on the same dataset. Socher et al. [29] introduced a recursive matrix-vector RNN that learns compositional vector representations for phrases and assigns a vector and a matrix to every node in a parse tree, achieving 79.00%. Zulqarnain et al. [30] proposed a gated recurrent unit based on batch normalization for sentence classification; they applied batch normalization in the forward layer and used GloVe word vectors in the embedding layer. LSTM, GRU, and E-TGRU were executed on the same datasets and over the same period as these three approaches. Bi-LSTM (bidirectional LSTM), an extended variant of LSTM that has been applied extensively in recent studies [31], is also included for further comparison. We deployed LSTM and GRU on the same dataset with parameters similar to those in [32]. Fig. 11 shows that our proposed E-TGRU model outperforms the other existing models. This improvement is justifiable because not only do we build the encoder and the Word2Vec embedding method into the GRU, but we also combine the forward and backward contexts to learn more useful information.

F. Comparison of Error Rate with Traditional RNNs
In this section, we analyze the error rate of our proposed E-TGRU model against three state-of-the-art RNN approaches: the standard GRU, LSTM, and the more recent Bi-LSTM. The experiments show that as the number of epochs increases, the mean square error decreases continuously; the final MSE is 0.0162 on the IMDB dataset and 0.2713 on APR. We fixed both the word embedding dimensions and the number of units to 64 and ran the model for 30 epochs. We found that the proposed model converged faster than GRU, LSTM, and Bi-LSTM, achieving a very low error rate even after many epochs. To make these models comparable, we implemented them with identical structural designs. Finally, we evaluated our E-TGRU model against state-of-the-art existing RNN models on the IMDB and APR datasets. Table II shows that the proposed E-TGRU model achieves a much better error rate than GRU, LSTM, and Bi-LSTM.

V. CONCLUSION
Sentiment classification remains a popular and significant area of natural language processing. In this paper, we investigated a variant gated recurrent unit, the encoder GRU (E-GRU), to preprocess text data for sentiment classification. The E-GRU frequently provides a better representation of the input than the original raw input. Furthermore, we developed the Two-State Gated Recurrent Unit (TGRU), which includes forward and backward states and is capable of learning more valuable information, especially for text processing. We then used the pre-trained Word2Vec word embedding method, which makes it possible to learn the contextual semantics of words so that texts can be classified effectively. Based on our experimental observations, we found that recurrent models can effectively capture useful information from large arrays of sequential data and are the best choice in terms of accuracy. We conducted experiments on two benchmark sentiment analysis datasets, IMDB and APR. The proposed E-TGRU model achieved the highest accuracy: 89.37% on the IMDB dataset and 87.58% on the APR dataset. Our proposed model also achieves a much better error rate than GRU, LSTM, and Bi-LSTM as the number of epochs increases.
There are many ways to extend this work. Future research can apply the proposed approach with multiple sentiment lexicons and more powerful ranking approaches to enhance sentiment classification performance, and can also reduce the computational complexity of the proposed model.