Transformer-based Neural Network for Electrocardiogram Classification

A transformer neural network is a powerful method for sequence modeling and classification. In this paper, the transformer neural network is combined with a convolutional neural network (CNN) that performs feature embedding to provide the transformer inputs. The proposed model accepts raw electrocardiogram (ECG) signals side by side with extracted morphological ECG features to boost classification performance. The raw ECG signal and the morphological features follow two independent paths with the same model architecture, and the outputs of the two transformer decoders are concatenated and passed through a final linear classifier to give the predicted class. Experiments on the PTB-XL dataset with 7-fold cross-validation show that the proposed model achieves high accuracy and F-score, with averages of 99.86% and 99.85% respectively, which indicates the robustness of the model and its suitability for industrial applications.


I. INTRODUCTION
Cardiovascular diseases (CVDs) arise from malfunctions of the heart and blood vessels. According to the World Health Organization (WHO) [1], CVDs are one of the main causes of death worldwide, accounting for 32% of global deaths in 2019. The ECG is an essential tool for the diagnosis and treatment of CVDs, and it is also required for continuous heart monitoring. As the WHO reported, 85% of CVD deaths occur in developing countries, where there is a shortage of the professional doctors needed to interpret the ECG and decide on the proper medication and treatment [2]. This motivates automating the ECG interpretation process to overcome these challenges.
An extensive amount of research has sought an efficient and sophisticated method for ECG classification. Cognitive algorithms are well suited to processing ECG signals because ECG parameters are not standard across people, and because the sequential nature of ECG signals requires models that can handle sequential data efficiently [3]. Building on classical signal processing techniques, machine-learning-based classification methods were introduced to operate on manually extracted ECG features. In particular, deep neural networks show outstanding performance thanks to the introduction of various new, well-structured datasets. This paper introduces a novel method for ECG beat classification based on transformer networks, with convolution operations used for feature embedding to prepare the model inputs. Two instances of the proposed model were trained independently, one on the raw ECG signals and one on the morphological feature R-R interval (RRI), as illustrated in Fig. 1, and the outputs of the two instances are concatenated to give the final prediction. The experiments in this paper are based on the PTB-XL dataset [4], and resampling methods were applied to overcome its imbalanced data distribution.
The remaining sections of this paper are organized as follows: Section II reviews state-of-the-art related work, Section III gives a detailed description of the proposed model, Section IV presents the experimental setup and the results, and Section V concludes with the main points of the paper and directions for future work.

II. RELATED WORK
ECG beat classification is considered frequently in recent scientific literature. There are two main approaches to solving an ECG signal classification problem. The first is to use classical machine learning approaches such as random forests [5], support vector machines (SVM) [6], ensemble SVM [7], etc.
Usually, classical machine learning algorithms are preceded by several phases to achieve acceptable performance. These phases include ECG noise elimination, which is mainly handled by digital signal processing techniques such as low- and high-pass filters. One of the most essential phases, with a considerable effect on classifier performance, is feature extraction. It can be based on signal domain transformations such as variants of the Fourier transform [8] and wavelet transforms such as the tunable Q-wavelet transform [9], the maximal overlap wavelet packet transform (MOWPT) [10], and the continuous wavelet transform (CWT) [11]; feature extraction can also be based on statistical measurements such as skewness and kurtosis [9]. Another way to extract features is to rely on the morphological characteristics of the ECG signal; such methods acquire features mainly from the QRS complex component of the ECG.
In contrast to classical machine learning algorithms, deep-neural-network-based models provide exceptional performance, especially when a large amount of data is fed to the models.
Convolutional neural networks (CNN) have been used in two ways: a 1-D CNN accepts the 1-D ECG signal directly as its input, while a 2-D CNN can be applied after a signal domain transformation generates a visual representation of the ECG signal, such as a spectrogram. Feeding these spectrograms to the 2-D CNN requires a considerable amount of computational effort.
Recurrent neural networks (RNN) can also be used because of the sequential nature of ECG data [12]. RNN variants such as gated recurrent units (GRU) [13] and long short-term memory (LSTM) [14] have been introduced to solve the ECG classification problem [15], since these architectures address the exploding and vanishing gradient problems that arise in backpropagation during network training [14]. The main disadvantages of the RNN and its variants are that they cannot handle long dependencies in sequential data and that, because of their sequential behavior, they do not benefit from parallel hardware accelerators [16] such as graphics processing units (GPU) and field-programmable gate arrays (FPGA). A hybrid architecture that combines a CNN with an RNN or one of its variants can also achieve high accuracy in ECG classification [17].

A. ECG Morphological Features Extraction
Morphological features of ECG signals have shown outstanding performance with different classifiers [18], since these features carry critical information required to recognize different types of ECG signals. The R-R interval (RRI) was chosen to be fed to the proposed model alongside the raw ECG signals to improve the model performance. The RRI was extracted with the Pan-Tompkins algorithm [19] because of its strong performance. The RRI and the raw ECG pass through two independent paths of the same model architecture, and the output of each path is concatenated before entering the classification head. The proposed model takes advantage of both the CNN and the transformer neural network [20]. The proposed classification model consists of 2 paths, each with 4 main stages, as illustrated in Fig. 2. Each main stage is explained in detail in the following subsections, and the internal structure of each main stage is illustrated in Fig. 3.
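As a rough illustration of the RRI feature, the sketch below converts R-peak positions into R-R intervals. It assumes the R peaks have already been located (for example by a Pan-Tompkins detector, which is not reproduced here); the function name `rr_intervals`, the peak indices, and the 100 Hz sampling rate are hypothetical values chosen for the example.

```python
def rr_intervals(r_peaks, fs):
    """Return R-R intervals in seconds from R-peak sample indices.

    r_peaks: increasing sample indices of detected R peaks (assumed given).
    fs: sampling rate in Hz (an assumption of this sketch).
    """
    if fs <= 0:
        raise ValueError("sampling rate must be positive")
    # Each interval is the distance between consecutive peaks, in seconds.
    return [(b - a) / fs for a, b in zip(r_peaks, r_peaks[1:])]


# Hypothetical R-peak sample indices for a signal sampled at 100 Hz.
peaks = [50, 130, 212, 290]
rri = rr_intervals(peaks, fs=100)
```

A record with N detected peaks yields N - 1 intervals, so the RRI sequence is shorter than the raw signal and forms a separate input path.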

B. Feature Embedding
The feature embedding module consists of four convolutional layers, as listed in Table I, that generate a compact ECG feature map. Each of the first two convolutional layers is followed by the Rectified Linear Unit (ReLU) activation function, which introduces non-linearity, while each of the last two layers is followed by a maximum pooling layer, which outputs a feature map that highlights the salient features of the current ECG signal. Since the input data is a sequence of one-dimensional ECG signals, the output of the discrete convolution operation is given as
$$y[n] = (x * w)[n] = \sum_{k} x[k]\, w[n-k]$$
where $x$ is the input signal and $w$ is the sliding kernel window. The applied ReLU function is computed as
$$\mathrm{ReLU}(x) = \max(0, x)$$
The size of the output of the feature embedding layer is the same as in the original transformer literature [20], namely $d_{model} = 512$. The feature embedding output must have the same size as the output of the positional encoding, described in the next subsection, so that the two can be summed and passed to the transformer encoder.
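The three operations named in this subsection (1-D discrete convolution, ReLU, and maximum pooling) can be sketched in plain Python. This is a minimal illustration, not the four-layer module of Table I: the kernel, the input values, and the pooling size are hypothetical, and only the "valid" part of the convolution is computed.

```python
def conv1d(x, w):
    """Discrete 'valid' convolution: y[n] = sum_k w[k] * x[n - k]."""
    m = len(w)
    # Start at n = m - 1 so that every index n - k stays inside x.
    return [sum(w[k] * x[n - k] for k in range(m)) for n in range(m - 1, len(x))]


def relu(v):
    """Element-wise max(0, value)."""
    return [max(0.0, a) for a in v]


def max_pool1d(v, size):
    """Non-overlapping max pooling with window and stride equal to `size`."""
    return [max(v[i:i + size]) for i in range(0, len(v) - size + 1, size)]


# Hypothetical input samples and a two-tap difference kernel.
signal = [0.0, 1.0, 3.0, 2.0, -1.0, -2.0, 0.0, 1.0]
kernel = [1.0, -1.0]
features = max_pool1d(relu(conv1d(signal, kernel)), size=2)
```

With this kernel the convolution computes first differences of the signal, ReLU keeps only the rising segments, and pooling halves the sequence length.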

C. Positional Encoding
Since the proposed model has no recurrence relation, it is necessary to provide information about the absolute and relative positions of the different timestamps in the ECG signals.
To introduce this information, the output of a positional encoding layer is added to the output of the feature embedding module, and the sum is provided as input to the transformer encoder and decoder [20].
Any periodic function is sufficient to implement the positional encoder, but in this work the positional encoder is implemented with sine and cosine functions of different frequencies, because their linear properties are easy for the model to learn [20]. The positional encoding is given by
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
where $pos$ represents the position of the sequence token, and $i$ represents the spatial location of the ECG feature.
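The sinusoidal encoding of [20] can be sketched as follows: even embedding dimensions take the sine and odd dimensions take the cosine of pos / 10000^(2i / d_model). The function name and the small sizes used in the example are illustrative assumptions; the model described here uses an embedding size of 512.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position codes."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # `two_i` plays the role of 2i in the formula.
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)
    return pe


# Small illustrative sizes; the encoding is added element-wise to the embeddings.
pe = positional_encoding(seq_len=4, d_model=8)
```

Because the table has the same shape as the embedded sequence, it can be added to the feature embedding output without any projection.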

D. Transformer Encoder
The encoder of the proposed model consists of four identical layers, each subdivided into two sublayers: a multi-head self-attention sublayer and a position-wise fully connected feed-forward neural network. Residual connections around each sublayer support gradient flow through the network during training. To accommodate the residual connections, each sublayer output and the embedding have a dimension of 512. The output of the multi-head self-attention sublayer, summed with its input via the residual connection, is normalized to stabilize the training process. The data then flows through a fully connected feed-forward network to provide a more convenient representation of the attention output.
A self-attention pooling [21] block was added to accept the output of the encoder block and reshape the attention score tensor from the dimension [length of sequence, batch size, embedding dimension] to [batch size, embedding dimension], so that it can be accepted by the multi-head attention sublayer in the decoder module. The attention function [20] can be written as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $Q$ is the query, $K$ is the key, $V$ is the value, and $d_k = d_{model}/h$, with $h$ the number of attention heads [20].
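For a single head, the scaled dot-product attention of [20] computes softmax(Q K^T / sqrt(d_k)) V. The sketch below is a plain-Python illustration on lists of row vectors; the matrices in the example are hypothetical, and multi-head projection and self-attention pooling are not shown.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# A zero query scores all keys equally, so the output is the mean of the values.
result = attention(Q=[[0.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 2.0], [3.0, 4.0]])
```

Dividing by sqrt(d_k) keeps the scores in a range where the softmax does not saturate as the key dimension grows.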

E. Transformer Decoder
The transformer decoder has the same structure as the transformer encoder, with the same sublayers, each wrapped in a residual connection and followed by normalization.
In addition to a multi-head self-attention sublayer and a position-wise fully connected feed-forward neural network, the decoder provides a multi-head attention sublayer that attends over the output of the transformer encoder module. The multi-head attention sublayer is also modified to attend only to the previous sequence tokens from the encoder output, which is done by masking the attention tensor [20].
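Masking of this kind is commonly implemented by excluding disallowed positions before the softmax. The sketch below assumes the standard lower-triangular (causal) mask of [20], in which position i may attend only to positions j <= i; the helper names are hypothetical, and every row is assumed to keep at least one allowed position.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row, keep in zip(scores, mask):
        allowed = [s for s, k in zip(row, keep) if k]
        m = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - m) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With equal scores, each row spreads its weight evenly over the allowed positions.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

Each row still sums to one, so the masked attention remains a weighted average over the visible tokens only.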

F. Linear Classifier
At this stage, the outputs of the raw ECG path and the RRI path are concatenated into one tensor, which is passed through the linear classifier to give the final prediction. The linear classifier consists of a flatten layer and two fully connected feed-forward networks separated by a dropout layer for regularization [22].

IV. EXPERIMENTS AND RESULTS
All the experiments were carried out on Google Colaboratory, and the dataset handling and the model implementation were done in Python 3.9.

A. Dataset Description
In the training process, we used the PTB-XL dataset [4]. PTB-XL is to date the largest freely accessible clinical 12-lead ECG waveform dataset. The dataset covers a broad range of diagnostic classes, including a large fraction of healthy records. A total of 21837 clinical 12-lead ECG records of 10 seconds length from 18885 patients are included in the dataset. The data is gender-balanced (52 percent male, 48 percent female) and covers ages from 0 to 95 years (median 62, interquartile range 22).
The ECG statements used for annotation conform to the SCP-ECG standard [23] and were assigned to three non-mutually exclusive categories: diag (short for diagnostic statements such as "anterior myocardial infarction"), form (related to considerable changes of particular parts within the ECG such as "abnormal QRS complex"), and rhythm (related to specific changes of the rhythm such as "atrial fibrillation"). There are 71 different statements in all, which are broken down into 44 diagnostic, 12 rhythm, and 19 form statements, four of which are also used as diagnostic ECG statements. A hierarchical classification into five coarse superclasses and 24 subclasses is also provided for diagnostic statements. As shown in Table II, we mainly classified by the 5 main classes (NORM, HYP, MI, STTC, CD) and by 14 classes (LVH, IVCD, ISC_, LAO/LAE, IMI, CRBBB, NST_, CLBBB, RAO/RAE, ILBBB, LMI, AMI, NORM, WPW).
Apart from its large nominal size, PTB-XL is notable for its diversity, both in terms of signal quality (with 77.01 percent of the highest signal quality) and in terms of a broad range of pathologies, many different co-occurring diseases, and a high proportion of healthy control samples, which is uncommon in clinical datasets. This variability is what makes PTB-XL such a valuable resource for training and evaluating algorithms in a real-world scenario, where machine learning (ML) algorithms must function reliably regardless of recording settings or potentially low-quality data.
The class distribution of the dataset is still imbalanced, so data augmentation was necessary to improve accuracy. Data augmentation allowed us to increase the diversity of the training data without having to acquire additional data.
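The text does not specify which resampling or augmentation method was applied. Purely as an illustration of how class imbalance can be reduced, the sketch below uses random oversampling with replacement, which is an assumption of this example rather than the method of this work; the function name, the record identifiers, and the seed are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match
    the size of the largest class. Illustrative only."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra samples with replacement to reach the target count.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical record identifiers with an imbalanced label distribution.
records = ["r1", "r2", "r3", "r4", "r5", "r6"]
classes = ["NORM", "NORM", "NORM", "NORM", "MI", "HYP"]
balanced_records, balanced_classes = random_oversample(records, classes)
counts = Counter(balanced_classes)
```

Resampling of this kind should be applied to the training folds only, so that duplicated records do not leak into the validation data.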

B. Classification Model Metrics
Two models were created: one operates on the ECG superclasses, while the other operates on the ECG subclasses, as shown in Table II.
Several statistical metrics were used to evaluate the classification performance of the proposed model: precision, sensitivity, F1-score, and accuracy. All of these metrics were computed according to the equations in Table III, where TP is a true positive, TN is a true negative, FP is a false positive, and FN is a false negative. The aVR lead was fed to the two models because of the amount of information it contains about the cardiac state of the individual and because it has been shown to perform well in classification and detection tasks [24].
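The four metrics can be computed directly from the confusion counts. The sketch below assumes the standard definitions (precision = TP / (TP + FP), sensitivity = TP / (TP + FN), F1 as the harmonic mean of the two, and accuracy = (TP + TN) / total); the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return precision, sensitivity, F1-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for one class treated as the positive class.
metrics = classification_metrics(tp=8, tn=10, fp=2, fn=0)
```

For a multi-class model, these counts are usually taken per class in a one-vs-rest fashion and then averaged across classes.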
Table IV shows the classification performance metrics for the superclasses model, while Table V presents the classification performance metrics for the subclasses model. The proposed model was trained for 32 epochs for the superclasses model and 35 epochs for the subclasses model; Table VI gives the average time required per epoch for each model. The cross-entropy loss function was used to compute the loss during the training and validation processes, and the loss with respect to the number of epochs is illustrated in Fig. 4. The accuracy of the training and validation processes with respect to the number of epochs is shown in Fig. 5. As illustrated in Table VII and Table VIII, the experimental results demonstrate that the proposed model offers a considerable improvement over current state-of-the-art models for ECG classification.

V. CONCLUSION
This paper introduces a novel classification model for 12-lead ECG signals based on PTB-XL. The proposed model relies on transformer neural networks for ECG sequence modeling and on a CNN for feature embedding. The model can be fed the raw ECG signals as well as the RRI feature of the ECG to achieve robust classification performance. The proposed model achieved accuracy and F-score with averages of 99.86% and 99.86% respectively, which is competitive with many recent state-of-the-art models. Given this performance, and since transformers were introduced to address the complexity challenges of sequential models, the proposed model is suitable for deployment in practical applications. In future work, this model can be integrated with wearable device technology to assist in critical cases with continuous monitoring and help save more lives.

Fig. 3. Detailed Structure of the Embedding Module and Transformer Neural Network.

TABLE II. PTB-XL DATASET SUPERCLASSES AND SUBCLASSES

TABLE III. STATISTICAL METRICS EQUATIONS

TABLE IV. CLASSIFICATION PERFORMANCE METRICS FOR SUPERCLASSES MODEL

TABLE VI. APPROXIMATE TRAINING TIME PER EPOCH

TABLE VII. CURRENT ADVANCED METHODS FOR ECG CLASSIFICATION VS PROPOSED MODEL

TABLE VIII. COMPARISON WITH THE MOST MODERN METHODS AND THE PROPOSED MODEL FOR PERFORMANCE MEASUREMENTS AND THE NUMBER OF CLASSES USED IN EACH MODEL