Deep Gated Recurrent and Convolutional Network Hybrid Model for Univariate Time Series Classification

Hybrid LSTM-fully convolutional networks (LSTM-FCN) for time series classification have produced state-of-the-art classification results on univariate time series. We show that replacing the LSTM with a gated recurrent unit (GRU), creating a GRU-fully convolutional network hybrid model (GRU-FCN), can offer even better performance on many time series datasets. The proposed GRU-FCN model achieves state-of-the-art classification performance on many univariate time series datasets. In addition, since the GRU uses a simpler architecture than the LSTM, it has fewer training parameters, shorter training time, and a simpler hardware implementation than the LSTM-based models.


I. INTRODUCTION
A time series (TS) is a sequence of data points obtained at successive, ordinarily equally-spaced, time points [1]. TSs are used in several research and industrial fields where temporal measurements are involved, such as signal processing [2], pattern recognition [3], mathematics [1], psychological and physiological signal analysis [4], [5], earthquake prediction [6], weather readings [7], and statistics [1]. There are two types of time series: univariate and multivariate. A multivariate time series is more intricate than a univariate one because it has multiple variables, with dependencies among them, varying over a period of time, whereas a univariate time series has only one variable varying over time [1]. In this paper, we study univariate time series classification.
There have been several approaches to time series classification. The distance-based classifier, built on the k-nearest neighbor (KNN) algorithm, is considered a baseline technique for time series classification; it usually uses the Euclidean or Dynamic Time Warping (DTW) distance as the distance measure [8]. Feature-based time series classifiers are also widely used, such as the bag-of-SFA-symbols (BOSS) [9] and the bag-of-features framework (TSBF) [10] classifiers. Ensemble-based classifiers combine separate classifiers into one model to reach higher classification accuracy, such as the elastic ensemble (PROP) [11], the shapelet ensemble (SE) [12], and the collective of transform-based ensembles (COTE) [12] classifiers.
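As an illustration of the DTW baseline mentioned above, here is a minimal dynamic-programming sketch in Python; the function name and the squared-difference local cost are our own illustrative choices, not taken from any particular library:

```python
def dtw_distance(a, b):
    """Cumulative DTW cost between two 1-D sequences, using a
    squared-difference local cost and no warping window."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of match, insertion, or deletion
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]
```

A 1-NN classifier then simply assigns each test series the label of the training series with the smallest `dtw_distance`. Note that, unlike the Euclidean distance, DTW tolerates local stretching: `[1, 2, 3]` and `[1, 2, 2, 3]` have zero DTW cost.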
The present paper focuses on recurrent neural network based classification approaches such as LSTM-FCN [5] and ALSTM-FCN [5]. These models combine temporal CNNs and long short-term memory (LSTM) models so that the classifier performs both feature extraction and learning of temporal dependencies within the dataset during the classification process. These models use additional supporting algorithms, such as attention and fine-tuning, to enhance the LSTM learning due to its complex structure and data requirements.
This paper studies whether the use of gated recurrent units (GRUs) can improve the hybrid classifiers listed above. We create the GRU-FCN by replacing the LSTM with a GRU in the LSTM-FCN [5]. Like the LSTM-FCN, our model requires no feature engineering or data preprocessing before the training or testing stages. The GRU is able to learn the temporal dependencies within the dataset. Moreover, the GRU has a smaller block architecture and shows performance comparable to the LSTM without needing additional algorithms to support the model.
Although it is difficult to determine the best classifier for all time series types, the proposed model seeks to achieve accuracy equivalent to state-of-the-art models in univariate time series classification. Following [4] and [5], our tests use the UCR time series classification archive benchmark [14] to compare our model with other state-of-the-art univariate time series classification models. Our model achieved higher classification performance on several datasets compared to other state-of-the-art classification models.

II. BACKGROUND
A. Gated Recurrent Unit (GRU)
The gated recurrent unit (GRU) was introduced in [15] as another type of gate-based recurrent unit, with a smaller architecture and performance comparable to the LSTM unit. The GRU consists of two gates: reset and update. The architecture of an unrolled GRU block is shown in Fig. 1. $r^{(t)}$ and $z^{(t)}$ denote the values of the reset and update gates at time step $t$, respectively. $x^{(t)} \in \mathbb{R}^n$ is the 1-D input vector to the GRU block at time step $t$, $\tilde{h}^{(t)}$ is the output candidate of the GRU block, $h^{(t-1)}$ is the recurrent GRU block output at time step $t-1$, and $h^{(t)}$ is the current output at time $t$. Assuming a one-layer GRU, the reset gate, update gate, output candidate, and GRU output are calculated as follows [15]:

  $z^{(t)} = \sigma(W_{zx} x^{(t)} + U_{hz} h^{(t-1)} + b_z)$
  $r^{(t)} = \sigma(W_{rx} x^{(t)} + U_{hr} h^{(t-1)} + b_r)$
  $\tilde{h}^{(t)} = \tanh(W_x x^{(t)} + U_h (r^{(t)} \odot h^{(t-1)}) + b)$
  $h^{(t)} = (1 - z^{(t)}) \odot h^{(t-1)} + z^{(t)} \odot \tilde{h}^{(t)}$

where $\sigma$ is the gate (sigmoid) activation and $\odot$ denotes elementwise multiplication. $W_{zx}$, $W_{rx}$, and $W_x$ are the feedforward weights and $U_{hz}$, $U_{hr}$, and $U_h$ are the recurrent weights of the update gate, reset gate, and output-candidate activation, respectively. $b_z$, $b_r$, and $b$ are the biases of the update gate, reset gate, and output-candidate activation $\tilde{h}^{(t)}$, respectively. Figure 3 shows the GRU architecture with weights and biases made explicit. Like the RNN and LSTM, the GRU models temporal (sequential) data: it uses its previous output and the current input to calculate the next output. The GRU has the advantage of smaller size over the LSTM: the GRU has two gates (reset and update), while the LSTM has three (input, output, and forget); the GRU has one unit activation, while the LSTM has two (input-update and output); and the GRU does not contain the memory cell state which exists in the LSTM. Thus, the GRU requires fewer trainable parameters and shorter training time than the LSTM. Table I compares the GRU and LSTM architecture components.
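The GRU equations can be illustrated with a minimal scalar-valued Python sketch of a single time step (a real implementation uses matrix-vector products; all names here are illustrative, and the weights are passed in a plain dictionary):

```python
import math

def sigmoid(x):
    """Logistic gate activation."""
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU time step for scalar input and state, following the
    update-gate / reset-gate equations. `p` holds the weights
    W_zx, W_rx, W_x, U_hz, U_hr, U_h and biases b_z, b_r, b."""
    z = sigmoid(p["W_zx"] * x + p["U_hz"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["W_rx"] * x + p["U_hr"] * h_prev + p["b_r"])  # reset gate
    h_cand = math.tanh(p["W_x"] * x + p["U_h"] * (r * h_prev) + p["b"])
    # interpolate between the previous state and the candidate
    return (1.0 - z) * h_prev + z * h_cand
```

With all weights and biases at zero, both gates evaluate to 0.5 and the candidate to 0, so the new state is simply half the previous state, which makes the interpolation role of the update gate easy to see.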

B. Temporal Convolutional Neural Network
The convolutional neural network (CNN), introduced in 1989 [16], utilizes weight sharing over grid-structured datasets such as images and time series [17], [18]. The convolutional layers within the CNN learn to extract complex feature representations from the data with little or no preprocessing. The temporal FCN consists of several convolutional blocks, which may have the same or different kernel sizes, followed by a dense-layer softmax classifier. For time series problems, the values of each convolutional block $i$ in the FCN are calculated as follows [4]:

  $y_i = W_i \ast x_i + b_i$
  $z_i = \mathrm{BN}(y_i)$
  $out_i = \mathrm{ReLU}(z_i)$

where $x_i \in \mathbb{R}^n$ is a 1-D input vector representing a time series segment, $W_i$ is the 1-D convolutional kernel of weights, $b_i$ is the bias, and $y_i$ is the output vector of convolutional block $i$. $z_i$ is the intermediate result after applying batch normalization [19], which is then passed to the rectified linear unit (ReLU) [20] to calculate the output $out_i$ of the convolutional layer.
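A minimal Python sketch of one such convolutional block (convolution, batch normalization, ReLU) follows. For clarity, batch normalization is simplified here to standardizing the single output sequence; a real implementation normalizes over a mini-batch and has learnable scale and shift parameters. All names are illustrative:

```python
def conv_block(x, w, b, eps=1e-5):
    """One FCN-style block on a 1-D sequence: valid 1-D convolution
    (cross-correlation, as in deep learning frameworks), then a
    simplified batch normalization, then ReLU."""
    k = len(w)
    # valid 1-D convolution: slide the kernel over the input
    y = [sum(w[j] * x[i + j] for j in range(k)) + b
         for i in range(len(x) - k + 1)]
    # simplified batch norm: zero mean, unit variance over the sequence
    mean = sum(y) / len(y)
    var = sum((v - mean) ** 2 for v in y) / len(y)
    z = [(v - mean) / (var + eps) ** 0.5 for v in y]
    # ReLU: keep positive activations, zero out the rest
    return [max(0.0, v) for v in z]
```

With a length-1 identity kernel, the block reduces to standardizing and rectifying the input, so values below the mean are zeroed out while the ordering of the remaining values is preserved.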

III. MODEL ARCHITECTURE
As stated in the introduction, our model replaces the LSTM with a GRU in a hybrid gated-FCN. Our model is based on the framework introduced in [4], [5]. The actual implementation of the proposed architecture is shown in Figure 2. The architecture has two parallel parts: a GRU and a temporal FCN. Our model uses the three-layer FCN architecture proposed in [4]. We also used a global average pooling layer to interpret the classes and to reduce the number of trainable parameters compared to a fully connected layer, without any sacrifice in accuracy. The three convolutional layers of the FCN have 128, 256, and 128 filters, respectively. The weights were initialized using the He uniform variance scaling initializer [21]. In addition, we used a GRU instead of the LSTMs used in the models of [5] to reduce the number of trainable parameters, the memory footprint, and the training time. Moreover, we removed the masking and any extra supporting algorithms, such as the attention mechanism and fine-tuning, that were used in the LSTM-FCN and ALSTM-FCN models [5]. The GRU is unrolled for eight time steps, as in [5], for univariate time series. The hyperbolic tangent (tanh) function is used as the unit activation and the hard-sigmoid (hardSig) function [22] as the recurrent (gate) activation of the GRU. The GRU weights were initialized using the Glorot uniform initializer [23], [24] and the biases were initialized to zero. The input was fitted using the approach of [5] for fitting an input to a recurrent unit. We used the Adam optimizer [25] with β1 = 0.9, β2 = 0.999, and initial learning rate α = 0.01. The learning rate α was reduced by a factor of 0.8 every 100 training steps until it reached the minimum rate α = 0.0001. The dense layer uses the softmax classifier [26] with the categorical cross-entropy loss function [18]. The number of epochs varies from 400 to 1200.
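The step-wise learning-rate decay described above (multiply α = 0.01 by 0.8 every 100 training steps, floored at α = 0.0001) can be sketched as a small helper; the function name and signature are illustrative, not taken from the original implementation:

```python
def learning_rate(step, initial=0.01, factor=0.8, every=100, floor=1e-4):
    """Learning rate after `step` training steps: start at `initial`,
    multiply by `factor` every `every` steps, never dropping below `floor`."""
    rate = initial * factor ** (step // every)
    return max(rate, floor)
```

In a Keras-based setup such as the one used in this paper, a function like this would typically be wired in through a learning-rate scheduler callback rather than called manually.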
In this paper, our goal is to make a fair comparison between the LSTM-based models and our GRU-based model. Thus, we used the same number of epochs that was assigned by the original LSTM-FCN model [5] for each univariate time series.
The input to the model is the raw dataset, without any normalization or feature engineering. The FCN is responsible for extracting features from the time series [4], and the GRU enables the model to learn temporal dependencies within the time series. Therefore, the model learns both the features and the temporal dependencies needed to predict the correct class for each training example.

IV. METHOD AND RESULTS
We implemented our model by modifying the original LSTM-FCN [5] implementation available on GitHub: https://github.com/titu1994/LSTM-FCN. We found that the fine-tuning algorithm was not applied in the actual LSTM-FCN and ALSTM-FCN source code shared by the authors [5]. In addition, the LSTM-FCN [5] authors used a permutation algorithm for fitting the input to the FCN part, which was not mentioned in their paper. Therefore, we re-ran the actual LSTM-FCN and ALSTM-FCN implementations to record results based on their actual code. The Keras API [24] with the TensorFlow backend [27] was used in the implementation of the LSTM-FCN, ALSTM-FCN, and GRU-FCN models. The source code of our GRU-FCN implementation can be found on GitHub: https://github.com/NellyElsayed/GRU-FCNmodel-for-univariate-time-series-classification. We tested our model on the UCR time series archive [14], one of the standard benchmarks for time series classification, using 44 of its 85 time series datasets. Each dataset is divided into a training and a testing set. The 44 datasets come from different source types: 13 image, 3 spectro, 4 simulated, 11 sensor, 9 motion, and 4 ECG datasets. The number of classes in each time series and the lengths of both the training and test sets are shown in Table II, based on the dataset descriptions in [14] and their usage in our GRU-FCN implementation based on [5].
We divided the 44 UCR datasets according to the source from which each dataset was obtained, in order to show the accuracy of each classification model over time series of the same source type and to analyze the classifiers over the different source types separately. Figure 5 shows the accuracy of each classification model on the image source datasets, where the GRU-FCN achieves the highest classification accuracy on the largest number of datasets compared to the state-of-the-art classifiers. Figure 6 shows the accuracy of each classification method on the spectro source datasets, where our GRU-FCN outperforms the state-of-the-art classifiers on all of the spectro source datasets. Figures 7, 8, 9, and 10 show the accuracy of each classification method on the simulated, sensor, motion, and ECG source datasets, respectively. For the ECG source datasets, our model outperforms all the univariate time series state-of-the-art models except on the ECGFiveDays dataset; however, by increasing the kernel size of the FCN part of the GRU-FCN, the model can reach the highest accuracy on this dataset as well. We evaluated our model using the Mean Per-Class Error (MPCE) used in [4] to evaluate the performance of a classification method over multiple datasets. The MPCE for a given model is calculated from the per-class error (PCE) as follows:

  $PCE_m = e_m / c_m$
  $MPCE = \frac{1}{M} \sum_{m=1}^{M} PCE_m$

where $e_m$ is the error rate on dataset $m$ consisting of $c_m$ classes, and $M$ is the number of tested datasets. Table III shows the MPCE values for our GRU-FCN and other state-of-the-art models on the UCR benchmark datasets. The results were obtained by implementing the GRU-FCN and re-running the LSTM-FCN and ALSTM-FCN models from their actual GitHub implementations; for the other models, we took the results from their own publications. Our GRU-FCN has the smallest MPCE value compared to the other state-of-the-art classification models.
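The MPCE computation can be sketched as a short Python helper (names are illustrative):

```python
def mpce(error_rates, class_counts):
    """Mean per-class error over M datasets: PCE_m = e_m / c_m,
    MPCE = (1/M) * sum of PCE_m, where e_m is the error rate on
    dataset m and c_m its number of classes."""
    pces = [e / c for e, c in zip(error_rates, class_counts)]
    return sum(pces) / len(pces)
```

Normalizing each error rate by the number of classes keeps many-class datasets, where chance-level error is already high, from dominating the average.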
This means that, in general, our GRU-FCN performs better across the different datasets than the other state-of-the-art models. Figure 4 shows the critical difference diagram [29] for the Nemenyi (Bonferroni-Dunn) test [30] with α = 0.05 on our GRU-FCN and the state-of-the-art models, based on the arithmetic mean of the ranks on the UCR benchmark datasets. This diagram shows the significant classification accuracy improvement of our GRU-FCN compared to the other state-of-the-art models. Table IV shows the Wilcoxon signed-rank test results, which provide the overall accuracy evidence for each of the eleven models.

V. CONCLUSION

The proposed GRU-FCN classification model shows that replacing the LSTM with a GRU enhances the classification accuracy without needing extra algorithmic enhancements such as fine-tuning or attention. The GRU also has a smaller architecture, requiring fewer computations than the LSTM. Furthermore, the proposed GRU-FCN achieves the performance of state-of-the-art models and has the highest average arithmetic ranking and the lowest mean per-class error (MPCE) across the UCR benchmark time series datasets compared to the state-of-the-art models. Moreover, the proposed GRU-FCN achieved the highest accuracy on all of the spectro source datasets and on almost all of the ECG source datasets compared to the state-of-the-art models. Therefore, replacing the LSTM with a GRU in the LSTM-FCN for univariate time series classification can improve classification with a smaller model architecture.