A Hybrid 1D-CNN-Bi-LSTM based Model with Spatial Dropout for Multiple Fault Diagnosis of Roller Bearing

Abstract: Fault diagnosis of roller bearings is a crucial and challenging task to ensure the smooth functioning of modern industrial machinery under varying load conditions. Traditional fault diagnosis methods involve preprocessing of the vibration signals and manual feature extraction. This requires domain expertise and experience in extracting relevant features to accurately detect the fault. Hence, it is of great significance to implement an intelligent fault diagnosis method that involves appropriate automatic feature learning and fault identification. Recent research has shown that deep learning is an effective technique for fault diagnosis. In this paper, a hybrid model based on 1D-CNN (One-Dimensional Convolutional Neural Network) with Bi-LSTM (Bi-directional Long-Short Term Memory) is proposed to classify 12 different fault types. Firstly, vibration signals are given as input to the 1D-CNN to extract intrinsic features from the input signals. Then, the extracted features are fed into a Bi-LSTM model to identify the faults. The performance of the proposed method is enhanced by applying the Softsign activation function in the Bi-LSTM layer and Spatial Dropout in the neural network. To analyze the effectiveness of the proposed method, Case Western Reserve University (CWRU) bearing data is considered for experimentation. The results demonstrate that the proposed model attains an accuracy of 99.84% in classifying the various faults. The superiority of the proposed method is verified by comparing its predictive accuracy with that of existing fault diagnosis methods.


I. INTRODUCTION
The Roller Bearing (RB) is a key component of rotating machinery. It is widely used in industries such as transportation, agriculture, aerospace, and medicine. An RB is susceptible to damage because it rotates continuously under varying loads and pressure. Such damage can cause a breakdown of the entire machine, resulting in significant economic loss and severe safety accidents [1]. Therefore, it is essential to diagnose roller bearing faults accurately, because each fault type exhibits distinct characteristics and the fault may exist in any of the components, such as the Inner Race (IR), Outer Race (OR), and Ball.
Traditional vibration-based bearing fault diagnosis methods involve three main steps: data pre-processing, feature extraction, and fault classification. The vibration signals collected from sensors carry information about the bearing condition. In order to classify and detect the faults, many signal processing techniques have been discussed that analyze signal characteristics in the time, frequency, and time-frequency domains [2]. Due to the nonstationary nature of the vibration signal, various feature extraction techniques such as Short-Time Fourier Transform (STFT), Wavelet Analysis (WA), and Empirical Mode Decomposition (EMD) have been applied to extract the features [3]. Once the features are extracted and selected, they are fed into the network model for classification.
Recently, Deep Learning (DL) technology has gained importance in various domains such as image processing, natural language processing, and speech recognition. It uses multiple network layers to learn and extract relevant features from raw data and identifies patterns for classification or recognition problems. Roller bearing vibration data has a dimensionality similar to that of image or speech data. Hence, a DL architecture can be used to diagnose roller bearing faults by casting vibration signal analysis as a pattern recognition problem. A DL model can learn features automatically and identify faults accurately without manual feature extraction [4][5].
In this research, a hybrid method based on 1D-CNN-Bi-LSTM with Spatial Dropout is proposed for multiple fault diagnosis of roller bearings. Initially, the one-dimensional raw vibration signal is collected and fed into the CNN model. The CNN then extracts feature information from the signals, and these extracted features are provided to the Bi-LSTM network to acquire the failure information and identify 12 types of bearing faults. For experimentation, the CWRU dataset is used to analyze the effectiveness of the method.
The rest of this paper is organized as follows: Section II discusses related work; Section III describes the architecture of the proposed methodology, which includes the one-dimensional CNN and Bi-LSTM models; Section IV describes the experimental setup used for bearing data collection; and Section V presents and analyzes the results.

II. RELATED WORK
Many researchers have applied various deep learning models for fault diagnosis, such as Deep Neural Networks [6], Long Short-Term Memory [7], Deep Belief Networks [8], Deep Auto-encoders [9], Gated Recurrent Unit Networks [10], and Convolutional Neural Networks [11]. Among these, CNN has received more attention in the study of roller bearing defect diagnosis. Abed et al. [12] proposed a robust approach for fault diagnosis of Brushless DC Motors through feature extraction and reduction using discrete wavelet transform (DWT) and orthogonal fuzzy neighborhood discriminant analysis (OFNDA) on vibration and current signals, with an RNN model used for classification of faults. P. Zou et al. [13] focused on the empirical mode decomposition (EMD) method, which was combined with LSTM to obtain the kurtosis value by extracting intrinsic mode function (IMF) components and long-term dependencies from vibration signals to monitor the health status of an electrical machine. Cao, Lixiao et al. [14] constructed a fault diagnosis framework by extracting ten time-domain statistical features from vibration signals under varying load conditions; these features were fed into a deep Bi-directional LSTM to identify the faults of a Wind Turbine Gearbox. In [15], Mel Frequency Cepstral Coefficient features were obtained from vibration signals and given as input to Random Forest and eXtreme Gradient Boosting algorithms for diagnosis of roller bearing faults.
Zheng Wang et al. [16] discussed an architecture to obtain an unsupervised H-statistic value from sensor time-series data based on deep LSTM and CNN for performance degradation evaluation of roller bearings. Shichao and Haibin proposed a bearing fault diagnosis model in which 1D-CNN with LSTM is implemented. The model adaptively extracted potential features from the original vibration signal and ensured the validity of the features by merging max and average pooling layers to downsample the features. Then, LSTM was employed to acquire the dependencies among features of time-domain signals to perform fault classification [17]. Zhe Yuan et al. [18] presented a fault recognition approach for roller bearings using a Multiscale CNN and Gated Recurrent Unit Network (GRUN) by providing multiple time-scaled vibration data to the CNN to train the model, and added the gated recurrent unit network with an attention mechanism to make the model predictive. In [19], the proposed adaptive anti-noise neural network architecture employed a random sampling approach and a boosted CNN with the exponential linear activation function to enhance the adaptability of the network without manual feature selection. GRUN was implemented to learn the features processed by the CNN and classify the faults. This approach addressed the problem of bearing fault diagnosis under changing load conditions and heavy noise.
Wenbing Yu et al. [20] discussed an intelligent fault diagnosis method for identifying ten different bearing faults based on a lightweight MobileNet CNN, using the Case Western Reserve University dataset to evaluate the model; the reported average precision, recall, and F1 score were 96%, 82%, and 88%, respectively. Kai Gu et al. [21] discussed a novel diagnostic method to accurately identify the fault status of bearings based on LSTM and DWT for multiple sensors, obtaining fault details in both frequency and time scales through DWT, while the LSTM algorithm was used to characterize the long-term dependency information hidden in the time series data of a signal. In [22], a combined wavelet regional correlation threshold denoising (WRCTD) algorithm with CNN-LSTM was proposed for fault detection. The WRCTD algorithm utilized the regional association of the wavelet decomposition coefficients and the 3σ criterion to reduce noise in the raw sensor data, and the CNN-LSTM model reduced the hidden features of the pre-processed signal data to identify the fault type of the harmonic reducer under multiple working conditions. A novel fault diagnosis method was presented in [23], in which sliding window processing was applied to integrate the feature and time delay information from multivariate time series samples; the resulting samples were then fed into a CNN-LSTM model to perform feature learning and capture time delay information to diagnose faults of the Tennessee Eastman chemical process. A robust approach was proposed in [24] to predict the Remaining Useful Life (RUL) of roller bearings by combining Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and LSTM to detect the damage state and identify the abnormal state of the bearing, estimating the RUL through feature extraction from signals.
A new convolution-based bidirectional long short-term memory network method was proposed to predict RUL, in which CNN was used to obtain feature information and Bi-LSTM to acquire time-frequency information from the signals to construct health indicators (HI). The experiments conducted on the PRONOSTIA bearing dataset showed that the proposed method performed better than other methods [25].
This work uses a deep learning technique for fault diagnosis, motivated by the fact that deep learning performs automatic feature extraction, whereas conventional machine learning requires manual feature extraction, which demands prior domain knowledge and expertise.

III. PROPOSED METHODOLOGY
In this research, a multi-class fault diagnosis method is proposed based on 1D-CNN with Bi-LSTM to classify various faults. The advantage of CNN lies in automatic feature extraction, and that of Bi-LSTM in handling gradient loss and explosion. The main goal is to diagnose the 12 different fault types using the 1D-CNN with Bi-LSTM model on vibration signals from the CWRU dataset, which contains a set of ball bearings having localized faults. The proposed 1D-CNN-Bi-LSTM model consists of four convolutional and pooling layers, a Bi-LSTM layer, an LSTM layer, and one fully connected layer, as shown in Fig. 1. Firstly, the raw signals are input to the model, and the convolution and pooling layers perform automatic feature extraction. Next, these features are passed to the Bi-LSTM and LSTM layers to highlight the features, and finally a dense layer is used to classify the various faults. The Bi-LSTM layer uses the softsign activation function to improve the performance of the model. To verify the performance of the proposed method, a publicly available bearing dataset is considered [26]. The vibration data is measured for four operational conditions such as:
1) Normal bearing (no fault): sampling frequency of 12 kHz at 1797 rpm (rotations/min).
2) Fault in outer race: sampling frequency of 12 kHz at 1797 rpm.
Details of each layer used in the proposed method are explained in the following subsections.

A. 1D-CNN
CNN is a deep learning algorithm that was originally proposed for processing visual data. It is effective in identifying image patterns hierarchically, from simple to complex features, because of two important properties: weight sharing and spatial pooling. A CNN consists of three types of layers, namely the convolutional layer, the pooling layer, and the fully connected layer. The convolution layer converts the input data into smaller feature maps through convolutional kernels by performing a summation of multiplications between the vectors of input data and weight coefficients [27]. In this paper, a 1D-CNN is constructed, whose convolutional kernels and feature maps are all one-dimensional because of the one-dimensional characteristics of mechanical vibration signals.
Suppose 'x' is an input to the 1D-CNN; then the output of the convolutional layer is computed as given in (1). In equation (1), 'f' represents an activation function, which is typically a hyperbolic tangent, ReLU (Rectified Linear Unit), or sigmoid function; 'm' is the number of samples (1 ≤ i ≤ m); 'p' is the length of the convolutional kernels (1 ≤ j ≤ p); 'n' is the length of the input data (1 ≤ k ≤ n); * represents the convolution operation; 'w' is the weight and 'b' is the bias.
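As a minimal sketch of the operation described for (1), the following pure-Python function slides a kernel of length p over an input of length n, adds a bias, and applies an activation. The function names, the "valid" sliding scheme (no padding, stride 1), the choice of ReLU, and the example values are illustrative assumptions, not the layer configuration of the proposed model. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
def relu(v):
    """Rectified Linear Unit: max(0, v)."""
    return max(0.0, v)


def conv1d(x, w, b, f=relu):
    """Valid 1D convolution (no padding, stride 1) followed by activation f.

    x: input signal of length n
    w: kernel weights of length p
    b: scalar bias
    Returns a feature map of length n - p + 1.
    """
    n, p = len(x), len(w)
    return [
        f(sum(w[j] * x[i + j] for j in range(p)) + b)
        for i in range(n - p + 1)
    ]


signal = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0]   # hypothetical vibration samples
kernel = [1.0, 0.0, -1.0]                  # hypothetical kernel weights
feature_map = conv1d(signal, kernel, b=0.0)
```

Because the same kernel weights are reused at every position, the number of parameters depends on the kernel length p rather than on the signal length n, which is the weight-sharing property mentioned above.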
The pooling layer is the sub-sampling layer that compresses the size of the feature maps. Downsampling is performed to reduce the dimensionality of the output from the previous convolution layer by moving the filter window from the starting point to the end of the feature map. A maximum or average of each part of the feature map is then taken to represent the corresponding area. The role of the pooling layer is to reduce the number of parameters and the computation in the network, which helps prevent overfitting and improves the generalization ability of the model. Max pooling is frequently used in the pooling layer and is computed as the maximum over the previous feature maps. It is expressed as given in (2), where 1 ≤ l ≤ m/2.
The fully connected or dense layer plays the role of classifier in the CNN. For a multi-class classification problem, softmax is usually applied in the dense layer to ensure that each output value lies between 0 and 1 and that the outputs sum to 1. The predicted output represents a probability, and the class with the highest probability is taken as the final predicted result. The output of the 1D-CNN is an input to the Bi-LSTM model to reduce variance in time series.
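The pooling and softmax steps can be sketched as follows. The non-overlapping window of size 2 is an assumption chosen to match the index range up to m/2 mentioned for (2); the function names and example values are hypothetical.

```python
import math


def max_pool1d(feature_map, size=2):
    """Non-overlapping max pooling; a map of length m yields m // size values."""
    return [
        max(feature_map[i:i + size])
        for i in range(0, len(feature_map) - size + 1, size)
    ]


def softmax(scores):
    """Map raw scores to probabilities that lie in (0, 1) and sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


pooled = max_pool1d([0.1, 0.7, 0.3, 0.2, 0.9, 0.4])
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))    # class with the highest probability
```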

B. Batch Normalization
It is a regularization technique that helps avoid model overfitting. During the training of deep neural networks, the distribution of the inputs to layers deep in the network keeps changing for each mini-batch as the weights are updated. This problem is known as "internal covariate shift". It delays the convergence of the network during the training phase. To avoid this problem, batch normalization standardizes the input to each layer for every mini-batch and hence accelerates network training. It is usually applied either before or after the activation functions of each hidden layer.
The process of batch normalization is shown in Fig. 2 [28]. It shifts the values of the input distribution to a hidden layer, such that the mean of these values is zero (zero centered) and then normalizes the inputs. It creates two parameter vectors for each layer, one with the scaled values and the second vector with the shifted values of the inputs to the layers.
The output scale vector 'γ' and the output offset vector 'β' are learnt through backpropagation. The final input mean vector 'µ' and the final input standard deviation vector 'σ' are estimated using an exponential moving average during training.
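A minimal sketch of the normalization step for a single feature is given below. The small constant eps and the momentum value of 0.99 are assumptions (common library defaults), and gamma and beta are held fixed here, whereas in training they would be learnt.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    batch: activations of a single feature across the mini-batch
    gamma, beta: scale and offset (learnable in practice, fixed here)
    eps: small constant that avoids division by zero
    """
    mu = sum(batch) / len(batch)
    var = sum((v - mu) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in batch]


def update_moving_average(running, batch_value, momentum=0.99):
    """Exponential moving average that tracks a statistic for use at inference."""
    return momentum * running + (1.0 - momentum) * batch_value


normalized = batch_norm([2.0, 4.0, 6.0, 8.0])   # hypothetical activations
```

With gamma = 1 and beta = 0 the normalized values have approximately zero mean and unit variance; other values of gamma and beta let the network recover a different scale and offset if that suits the next layer better.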

C. Dropout
In a fully connected neural network, the probability of co-adaptation among the neurons is high. As a result, the features extracted by the neurons are more or less similar. This co-adaptation makes the model overfit the training data and generalize poorly on unseen test data. Dropout is a technique to overcome this problem. It randomly chooses a specified percentage of neurons to be dropped during training by setting their outputs to zero. Repeated application of this technique creates an ensemble of network architectures, each with a different set of dropped neurons, as shown in Fig. 3(a) and (b) [29]. The weights of the dropped neurons are not updated during backpropagation. When some of the neurons are dropped, the other neurons take the responsibility of propagating the features to the subsequent layers of the network in the forward pass. Hence, dropout prevents co-adaptation among the neurons and makes the network less reliant on any individual learning unit, its weights, and its existence. All these factors help the model generalize well on the test samples. Dropout applied to the feature maps of convolutional layers is called spatial dropout; in its usual formulation it drops entire feature maps rather than individual activations. Dropout applied to the hidden layers is regular dropout.
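The difference between regular and spatial dropout can be illustrated with a small sketch. It assumes the common formulation (as in, for example, the Keras SpatialDropout1D layer) in which spatial dropout zeroes whole feature maps, and it assumes "inverted" dropout, where the kept values are rescaled by 1 / (1 - rate). The function names, the 16 x 8 input, and the random seed are hypothetical; the rate of 0.25 is the value reported later in the parameter settings.

```python
import random


def dropout(feature_maps, rate, rng):
    """Regular dropout: each activation is zeroed independently."""
    keep = 1.0 - rate
    return [
        [v / keep if rng.random() >= rate else 0.0 for v in channel]
        for channel in feature_maps
    ]


def spatial_dropout(feature_maps, rate, rng):
    """Spatial dropout: each feature map (channel) is kept or zeroed as a whole."""
    keep = 1.0 - rate
    out = []
    for channel in feature_maps:
        if rng.random() >= rate:
            out.append([v / keep for v in channel])   # inverted-dropout rescaling
        else:
            out.append([0.0] * len(channel))
    return out


rng = random.Random(0)
maps = [[1.0] * 8 for _ in range(16)]      # 16 channels, 8 time steps each
dropped = spatial_dropout(maps, rate=0.25, rng=rng)
```

Dropping whole feature maps is motivated by the strong correlation between neighbouring time steps within one map: zeroing isolated activations removes little information, because the neighbours carry nearly the same signal.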

D. Bi-LSTM (Bi-Directional Long-Short Term Memory)
Bi-Directional Long-Short Term Memory is a type of Recurrent Neural Network (RNN), a deep learning technique used for classification and regression on time-series data in applications such as audio, text forecasting, and so on. Bi-LSTM combines LSTM layers from both directions. Hence, it captures long-term dependencies between signal patterns by letting information flow in both the forward and backward directions. An LSTM cell has three components, namely i) the forget gate, ii) the update (input) gate, and iii) the output gate. The forget gate eliminates the irrelevant information received from the preceding unit. The update gate adds information to the cell state, and the output gate selects the relevant information from the present cell state and gives the output [30]. The LSTM gating structure manages the information by enabling the memory cells to preserve long-term dependencies through selective passage. It avoids the problems of gradient loss and gradient explosion by strengthening the weight of relevant information and weakening the weight of irrelevant information. The structure of the LSTM cell is shown in Fig. 4(a).
The LSTM network cannot make use of the full data while processing the time series signals because it processes the data only in one direction. Hence, the Bi-LSTM network is implemented, which contains LSTM layers overlaid on each other in reverse direction. It improves the performance by enabling the model to make efficient use of the main features. The unit structure of the Bi-LSTM network is shown in Fig. 4(b).
The internal processing of the LSTM cell is shown in Fig. 4(c). The matrices $W_i$, $W_o$, $W_f$, $W_c$ represent the weights and $b_i$, $b_o$, $b_f$, $b_c$ represent the bias vectors of the input gate, output gate, forget gate, and cell state, respectively. The input of the current state and the output of the previous state are represented as $x_t$ and $h_{t-1}$. The input value $C'_t$ at moment $t$ is calculated by applying the tanh activation function to the result obtained by computing the matrix product of the vector $[h_{t-1}, x_t]$ with $W_c$ and adding $b_c$, as given in (3). The value of each gate at moment $t$, i.e., the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$, is calculated by applying the activation function as shown in (4), (5) and (6).
An element-wise product of $f_t$ with the last cell state $C_{t-1}$ determines the information that is to be forgotten and remembered by controlling $C_{t-1}$, and the element-wise product of $i_t$ with the current input cell state $C'_t$ determines the information in $C'_t$ that needs to be stored and used. The state value $C_t$ of the hidden node at time $t$ is calculated as given in (7).
The output value $h_t$ at time $t$ is computed as the product of the tanh function applied to the unit state $C_t$ and the output gate $o_t$, as given in (8).
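For reference, the computations described for (3) through (8) correspond to the standard LSTM cell equations, written out below as a sketch. Here $\sigma$ denotes the gate activation, taken to be the logistic sigmoid of the standard formulation; this is an assumption, since the text only refers to "the activation function". The symbol $\odot$ denotes the element-wise product.

```latex
\begin{align}
C'_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{3}\\
f_t  &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{4}\\
i_t  &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{5}\\
o_t  &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{6}\\
C_t  &= f_t \odot C_{t-1} + i_t \odot C'_t \tag{7}\\
h_t  &= o_t \odot \tanh(C_t) \tag{8}
\end{align}
```

In a Bi-LSTM, one such recurrence runs over the sequence from the first step to the last and a second one from the last step to the first, and the two hidden states at each step are combined, typically by concatenation.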

E. Softsign Activation Function
The softsign function squashes its input to a range of -1 to +1, like tanh. The function and its derivative are defined as given in (9) and (10).
Unlike tanh, this function has a flatter curve, its derivative decays slowly, and it saturates less. Functions that saturate more have gradients that vanish quickly before reaching the initial layers of the network during backpropagation [31]. Hence, softsign mitigates the vanishing gradient problem better than tanh. Softsign approaches its asymptotes polynomially, whereas tanh approaches them exponentially. Since softsign maps the inputs to values between -1 and +1, the negative values enable the LSTM gates to delete information when required.
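The softsign function is softsign(x) = x / (1 + |x|), and its derivative is 1 / (1 + |x|)^2. The short sketch below compares this derivative with that of tanh for a few hypothetical input values, to illustrate the slower decay for large |x|.

```python
import math


def softsign(x):
    """softsign(x) = x / (1 + |x|), bounded in (-1, 1)."""
    return x / (1.0 + abs(x))


def softsign_grad(x):
    """Derivative of softsign: 1 / (1 + |x|)^2."""
    return 1.0 / (1.0 + abs(x)) ** 2


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# Compare how quickly the two gradients shrink as |x| grows.
for x in (0.0, 2.0, 5.0):
    print(x, softsign_grad(x), tanh_grad(x))
```

Near zero the tanh gradient is actually the larger of the two; the advantage of softsign appears for inputs of larger magnitude, where the tanh gradient decays exponentially while the softsign gradient decays only quadratically.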

IV. EXPERIMENTAL SETUP
The CWRU bearing dataset has been widely used by researchers as a benchmark for condition monitoring and fault diagnosis. The CWRU experimental setup is shown in Fig. 5. It consists of a 2-hp (horsepower) motor, encoder, torque transducer, dynamometer, electric motor and so on. The deep groove ball bearing to be tested was mounted on the drive end of the motor to support the shaft. An accelerometer was positioned above the bearing base of the drive end to measure the vibration signals. In total, 12 bearing conditions covering Ball Fault (BF), IR, OR, and normal bearing were considered in this work, as given in Table I. One-hot coding was used to label the fault types. To ensure an adequate training size, 80% of the data was randomly chosen as the training set and 20% as the test set. For validation of the model, 10% of the training set was selected randomly to adjust the model parameters.
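The labelling and splitting procedure can be sketched as follows. The sample count of 1000, the random seed, and the function names are hypothetical and serve only to make the example self-contained; the class count of 12 and the 80/20 and 10% proportions are taken from the text.

```python
import random


def one_hot(label, num_classes=12):
    """Encode an integer class label as a one-hot vector."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def split_indices(n_samples, test_frac=0.2, val_frac=0.1, seed=42):
    """Shuffle indices, hold out test_frac for testing, then take
    val_frac of the remaining training indices for validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_samples * test_frac))
    test, train = idx[:n_test], idx[n_test:]
    n_val = int(round(len(train) * val_frac))
    val, train = train[:n_val], train[n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```

For 1000 samples this yields 720 training, 80 validation, and 200 test indices, with no index shared between the three sets.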

V. RESULTS AND DISCUSSION
In this work, vibration data provided by CWRU is collected for 12 different bearing conditions. The description of the fault types and the number of samples considered for each fault class from the experimental setup are given in Table I.

F. Parameter Settings for CNN-Bi-LSTM Model
The summary of the model's parameters set for the proposed hybrid CNN-Bi-LSTM architecture is shown in Fig. 7.
The first four convolutional layers used batch normalization and spatial dropout with a rate of 0.25, which improves the performance of the network by preventing overfitting. Softsign was used as the activation function in the Bi-LSTM layer, which had a size of 256, and the Adam optimizer was used for compilation. Categorical cross-entropy was used as the loss function with a batch size of 32, and the model was trained for 100 epochs to identify the fault state.
The trainable parameters are those learnt by the model through backpropagation in the convolution, LSTM, and fully connected layers. The non-trainable parameters belong to the batch normalization layers; these are the moving mean and variance, which are updated during training but not through backpropagation.
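The categorical cross-entropy loss named above compares a one-hot target with the predicted class probabilities. A minimal sketch is shown below; the function names, the eps guard, and the three-class example values are hypothetical (the model itself has 12 classes).

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities.

    eps guards against log(0) for classes predicted with zero probability.
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def batch_loss(targets, predictions):
    """Mean loss over a mini-batch (the text uses a batch size of 32)."""
    losses = [categorical_cross_entropy(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


# Hypothetical three-class example, kept small for brevity.
target = [0, 1, 0]
confident = [0.05, 0.90, 0.05]
uncertain = [0.40, 0.30, 0.30]
loss_confident = categorical_cross_entropy(target, confident)
loss_uncertain = categorical_cross_entropy(target, uncertain)
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability falls, which is what drives the optimizer toward confident, correct predictions.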

G. Confusion Matrix
The confusion matrix represents, in matrix form, the degree of agreement between the actual and predicted labels. The confusion matrix for the proposed 1D-CNN-Bi-LSTM model is shown in Fig. 8. The model correctly classified 4111 samples out of 4118, for a reported accuracy of 99.84%.
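Accuracy is obtained from a confusion matrix by dividing the sum of the diagonal entries by the total number of samples. The sketch below uses a hypothetical three-class matrix, since the full 12-class matrix of Fig. 8 is not reproduced here, and also evaluates the ratio of the counts quoted in the text, which comes to roughly 0.998.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
toy = [
    [50, 0, 0],
    [1, 48, 1],
    [0, 2, 48],
]
toy_accuracy = accuracy_from_confusion(toy)

# Counts quoted in the text: 4111 correct out of 4118 test samples.
reported_ratio = 4111 / 4118
```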

H. Learning Curve
A learning curve is a plot of model's learning performance over experience or time. The model is evaluated during training phase after each update based on the training and validation dataset. It gives an idea of how well the model is learning and generalizing. The learning curve for the proposed method is shown in Fig. 9. It demonstrates the training and validation accuracy versus number of epochs. A comparative analysis of the proposed hybrid model is made with other existing DL models as shown in Table II. The performance indicates that the proposed model accomplishes better results in classifying multiple faults as compared to other models.

VI. CONCLUSION
In this research, a hybrid 1D-CNN-Bi-LSTM model with Spatial Dropout is proposed for multiple fault diagnosis of roller bearings. The use of Spatial Dropout and the Softsign activation function in the proposed hybrid fault diagnosis method has improved the accuracy by performing automatic feature extraction and preventing overfitting. The 1D-CNN extracts the features from the raw signal, and the Bi-LSTM layer fuses the feature information to enhance the stability of the model and classify the faults. The efficiency of the proposed model is analysed using CWRU bearing vibration data. A comparative analysis of the proposed method is made with other existing models. The model achieved an accuracy of 99.84% in classifying 12 different fault types. Therefore, the proposed hybrid 1D-CNN-Bi-LSTM (with Softsign and Spatial Dropout) is an effective multi-class fault diagnosis method that also prevents model overfitting.