Evaluating Domain Knowledge and Time Series Features for Automated Detection of Schizophrenia from EEG Signals

Schizophrenia is a serious mental disorder that affects almost 21 million people globally. It alters thinking patterns, and its common symptoms include delusions, hallucinations, and disorganized speech. In this study, we use electroencephalography (EEG) signals to analyze and diagnose schizophrenia with machine learning algorithms, and we find that temporal features perform better than statistical features. EEG signals are well suited to analyzing this disorder because they are closely linked to human thinking patterns and provide information about brain activity. The present work proposes a Machine Learning (ML) model based on Logistic Regression (LR), together with two feature extraction libraries, the Time Series Feature Extraction Library (TSFEL) and the MNE Python toolkit, to diagnose schizophrenia from EEG signals. The results are analyzed for five different sampling techniques. The model was validated using leave-one-subject-out cross-validation (LOSOCV) with scikit-learn, and the temporal features achieved higher accuracy, sensitivity, specificity, macro-average recall, and macro F1 score.
Keywords: Artificial intelligence (AI); Logistic Regression (LR); SMOTE-class weight (S-CW); borderline SMOTE-class weight (BS-CW); electroencephalography; Time Series Feature Extraction Library (TSFEL)


I. INTRODUCTION
Schizophrenia is one of the most common and severe mental disorders, affecting more than 21 million people around the globe [1], and it is reported more often in men than in women [2]. If it is not treated properly, this disorder directly affects a person's thinking patterns and can lead to discrimination, stigmatization, and violation of human rights [3], [4].
Delusions, hallucinations, and disorganized speech [5] are the most common symptoms of this severe psychotic disorder.
Hallucinations are sensory experiences that appear to be real [6] but are generated by the mind. Delusions are false beliefs that contradict reality, so that the patient cannot distinguish between what is real and what is imagined [7]. Patients with this psychotic disorder have a life expectancy that is reduced by about 10 to 15 years, and the disorder raises the risk of suicide to 10% [8].
Schizophrenic patients can be diagnosed and analyzed using electroencephalogram (EEG) signals [9], [10]. EEG signals have unique characteristics, variability, and dimensionality [11]; they provide information about the electrical activity of the human brain [12], [13] and have great potential to predict whether a person is a healthy control or schizophrenic [14]-[16]. In the medical field, EEG signals have many applications, such as detecting epilepsy, coma, clinical death, and schizophrenia [17]. The scalp-based activity of EEG signals exhibits oscillations at different frequencies [18]. The main advantages of EEG are that it is non-invasive, inexpensive, and offers a high temporal resolution, which gives it a clear advantage over other techniques used to diagnose schizophrenia [19].
According to researchers, the oscillations of the human brain fall into five frequency bands. Based on their frequency ranges and locations, they are categorized as delta (δ), theta (θ), alpha (α), beta (β), and gamma (γ), as shown in Table I [20].
Schizophrenia is classified into five categories according to the Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV) [21], [22]. These five categories are further characterized by positive and negative symptoms. Delusions and hallucinations are positive symptoms, while avolition, alogia, and anhedonia are negative symptoms [23], [24].
In [25], the authors used a thirteen (13) layer Convolutional Neural Network (CNN) model to classify preictal, normal, and seizure classes. In [26], the authors used a deep learning approach that combines a CNN with a random forest, and implemented a voting layer to differentiate between individuals at high risk of schizophrenia and healthy controls. In [27], the authors used a CNN model to evaluate different partitions of EEG and to visualize unusual brain activity. In [28], the authors used EEG signals to compare real and imagined music and classified them using a CNN. In [29], the authors applied a cross-trial encoding technique to EEG signals with the aid of convolutional autoencoders; the dataset used for training was very small. In [30], the authors used a single-electrode approach to classify and diagnose schizophrenia from EEG recordings, applying a time-frequency technique to differentiate between schizophrenic patients and healthy controls. In [31], the authors proposed a model to recognize Alzheimer's disease with the help of logistic regression and achieved higher accuracy than with domain-knowledge-based handcrafted features.
The motivation behind this research is to evaluate the performance of different sampling techniques on EEG signals for the diagnosis of schizophrenia, and to assist and verify the psychiatrist's decision, because a diagnosis based on questionnaire surveys takes around 6-12 months.
The objective of this research is to develop a Machine Learning (ML) model that can validate the doctor's decision and quickly diagnose this severe mental disorder, and to compare the effect of different sampling techniques on the dataset described below.
The remainder of this paper is structured as follows. Section I gives the introduction. Section II explains the methodology and the Python toolkits used in the experiment. Section III presents the results for three data settings: filtered data with Z-score normalization, data with no Z-score normalization and no filter, and unfiltered data. Section IV presents the conclusion and discusses how this study can help to diagnose schizophrenia with machine learning algorithms. Section V outlines future work.

A. EEG Dataset and Preprocessing
The raw EEG data were recorded from fourteen (14) patients with schizophrenia, seven (7) males and seven (7) females, with average ages of 27.9 ± 3.3 and 28.3 ± 4.1 years, respectively. The experiment was carried out at the Institute of Psychiatry and Neurology in Warsaw, Poland [32]. Similarly, fourteen (14) healthy controls with no major disease, matched for gender and age group, were recruited: seven (7) males and seven (7) females. The raw data were collected with the consent of all participants. Data were collected at a sampling rate of 250 Hz using the International 10-20 system [33], as depicted in Fig. 1. The recordings were obtained while the participants were in a relaxed state with their eyes closed. The channels used to collect the data were Fp1, Fp2, F7, F3, Fz, F4, F8, T3, C3, Cz, C4, T4, T5, P3, Pz, P4, T6, O1, and O2. The acquired EEG data were partitioned into segments, which are treated as stationary signals. Each segment had a window duration of 25 seconds (6250 samples). There were 1142 EEG segments in total, each containing 6250 x 19 sample points, and the segments were normalized with the Z-score before being passed to logistic regression (LR).
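The segmentation and normalization described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the windows are taken to be non-overlapping, trailing samples that do not fill a window are dropped, and the Z-score is computed per window and per channel. The paper does not specify these details, and the function names `segment_channel` and `zscore` are hypothetical.

```python
from statistics import fmean, pstdev

FS_HZ = 250                     # sampling rate stated in the text
WINDOW_S = 25                   # window duration stated in the text
WINDOW_LEN = FS_HZ * WINDOW_S   # 6250 samples per segment


def segment_channel(samples, window_len=WINDOW_LEN):
    """Split one channel into non-overlapping windows (assumption),
    dropping any trailing samples that do not fill a whole window."""
    n_windows = len(samples) // window_len
    return [samples[i * window_len:(i + 1) * window_len]
            for i in range(n_windows)]


def zscore(window):
    """Scale a window to zero mean and unit population standard deviation.
    A constant window is mapped to zeros to avoid division by zero."""
    mu = fmean(window)
    sigma = pstdev(window, mu)
    if sigma == 0:
        return [0.0 for _ in window]
    return [(x - mu) / sigma for x in window]
```

Applied to each of the 19 channels of a recording, `segment_channel` yields the 6250-sample windows mentioned in the text, and `zscore` would then be applied to every window before feature extraction.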

B. Research Toolkits
1) Time Series Feature Extraction Library (TSFEL): TSFEL is a Python library for extracting features from time series such as EEG signals. It helps the data scientist to evaluate a variety of domain-knowledge features as well as handcrafted features. It can compute 60 distinct features from the statistical, temporal, and spectral domains [34].
2) MNE Python toolkit: The MNE Python toolkit [35] is an open-source Python package used to evaluate and analyze human neurophysiological data such as EEG, MEG, sEEG, ECoG, NIRS, and many more. This toolkit is very helpful for visualizing EEG signals. It can compute 28 univariate features and 6 bivariate features.
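To make the distinction between statistical and temporal features concrete, the sketch below computes a few hand-written descriptors of each kind. These functions are illustrative stand-ins written for this example; they are not the TSFEL or MNE implementations, and the feature selection shown here is an assumption rather than the set used in the experiments.

```python
from statistics import fmean, pstdev


def statistical_features(window):
    """Descriptors of the amplitude distribution; sample order is ignored."""
    return {
        "mean": fmean(window),
        "std": pstdev(window),
        "peak_to_peak": max(window) - min(window),
    }


def temporal_features(window):
    """Descriptors that depend on the order of the samples."""
    diffs = [b - a for a, b in zip(window, window[1:])]
    return {
        # average size of the step between consecutive samples
        "mean_abs_diff": fmean([abs(d) for d in diffs]),
        # number of sign changes between consecutive samples
        "zero_crossings": sum(
            1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0)
        ),
    }
```

Two windows that contain the same values in a different order, for example `[-1, -1, 1, 1]` and `[-1, 1, -1, 1]`, have identical statistical features but different temporal features, which is why the two feature families can lead to different classification results.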

C. Proposed Methodology
The raw EEG data of schizophrenia were downloaded from the Repository of Open Data (RepOD), Department of Methods of Brain Imaging and Functional Research of Nervous System [36]. In the first phase, a bandpass filter of 0.1 Hz to 45 Hz was applied to remove unwanted frequencies, and the data were segmented. After segmentation, features were extracted with the feature extraction libraries, namely the Time Series Feature Extraction Library (TSFEL) and the MNE Python toolkit. Logistic Regression (LR) was then used to classify schizophrenic patients and healthy controls, as depicted in Fig. 2.

III. EXPERIMENTAL RESULT
In the experimental phase, two feature extraction libraries are used: TSFEL and the MNE Python toolkit. The experimental results are divided into three phases: filtered data with Z-score normalization in the first phase, data with no Z-score normalization and no filter in the second phase, and unfiltered data in the third phase. Logistic Regression (LR) was employed as the classification algorithm.
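For reference, the classifier can be written compactly. The block below is a sketch of standard binary logistic regression with an optional per-class weight, as is relevant for the class-weighted variants named in the keywords. The symbols (coefficient vector beta, intercept b, class weights omega) and the "balanced" weighting rule are assumptions taken from common practice, the latter being the rule scikit-learn applies for class_weight="balanced". They are not details given in this paper, and any regularization term is omitted.

```latex
% Probability that feature vector x_i belongs to the schizophrenia class (y_i = 1)
p_i = \sigma\!\left(\boldsymbol{\beta}^{\top}\mathbf{x}_i + b\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

% Class-weighted log-loss over n training segments (regularization omitted)
\mathcal{L}(\boldsymbol{\beta}, b)
  = -\sum_{i=1}^{n} \omega_{y_i}
    \left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]

% Assumed "balanced" weights: n_c segments in class c, K = 2 classes
\omega_c = \frac{n}{K\, n_c}
```

With all weights equal to 1, the loss reduces to the ordinary log-loss; with balanced weights, errors on the less frequent class count proportionally more.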

1) Logistic regression on TSFEL filtered data: LR was applied to the TSFEL features extracted from filtered data with Z-score normalization, in combination with five sampling techniques: Synthetic Minority Oversampling Technique (SMOTE) with Class Weight, abbreviated as S-CW, Borderline SMOTE Class Weight (BS-CW), Random Oversampling Class Weight (ROS-CW), None-Class Weight (N-CW), and None-None (N-N). S-CW and ROS-CW each achieved an accuracy of 77.90%, which is reasonable given the small dataset, as shown in Table II.
2) Logistic regression on TSFEL with no Z score and no filter data: In Table III, S-CW and BS-CW achieved higher accuracies of 82.27% and 82.72%, respectively, when filtering and normalization were not applied, as shown in Fig. 4.
3) Logistic regression on TSFEL unfiltered data: In Table IV, LR was applied to unfiltered data, and N-CW achieved an accuracy of 79.36%, which compares well with the other sampling techniques, as shown in Fig. 5.
4) Logistic regression on MNE filtered data: In Table V, the MNE-Python toolkit was used on filtered data with Z-score normalization, and BS-CW achieved an accuracy of 77.45%, which compares favorably with the other sampling techniques used in the experiment, as shown in Fig. 6.
5) Logistic regression on MNE with no Z score and no filter data: In Table VI, LR was used with MNE features on data with no Z score and no filter, and S-CW performed better than the other oversampling techniques. S-CW achieved an accuracy of 91.63%, as shown in Fig. 7.
6) Logistic regression on MNE unfiltered data: In Table VII, LR was applied to the unfiltered data, and S-CW and N-CW performed better than the other techniques, each achieving an accuracy of 79.45%, as shown in Fig. 8.
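The oversampling step behind the S-CW variant can be illustrated with a minimal SMOTE-style sketch in plain Python. It shows only the core idea, which is creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, its parameters, and the neighbour count are hypothetical; this is not the implementation used in the experiments.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points, each lying on the line segment
    between a random minority point and one of its k nearest
    minority neighbours (squared Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # candidate neighbours: every other minority point, nearest first
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Borderline SMOTE follows the same interpolation idea but draws the base points only from minority samples that lie close to the class boundary, while random oversampling simply duplicates existing minority samples. When oversampling is combined with cross-validation, a common precaution is to resample only the training folds so that synthetic points never leak into the test data; whether this was done here is not stated.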

7) Comparison of logistic regression on TSFEL:
After applying LR to the three data settings (filtered data, data with no Z score and no filter, and unfiltered data), we found that BS-CW has the highest unweighted macro recall, 82.88%, on the data with no Z score and no filter, as shown in Table VIII.
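The evaluation measures used in this comparison can all be derived from the four confusion-matrix counts. The following sketch, with the hypothetical helper names `_ratio` and `binary_metrics`, assumes binary labels in which 1 denotes the schizophrenia class; that label encoding is an assumption made for the example.

```python
def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, specificity, macro recall, macro F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    sensitivity = _ratio(tp, tp + fn)   # recall of the positive class
    specificity = _ratio(tn, tn + fp)   # recall of the negative class
    f1_pos = _ratio(2 * tp, 2 * tp + fp + fn)
    f1_neg = _ratio(2 * tn, 2 * tn + fn + fp)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # unweighted mean of the per-class recalls
        "macro_recall": (sensitivity + specificity) / 2,
        # unweighted mean of the per-class F1 scores
        "macro_f1": (f1_pos + f1_neg) / 2,
    }
```

Because macro recall averages the two per-class recalls without weighting them by class size, it is less affected by class imbalance than accuracy, which makes it a natural measure for comparing sampling techniques.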

IV. CONCLUSION
In this paper, a machine learning model is proposed to diagnose schizophrenia from EEG signals, which carry information about the electrical activity of the human brain. According to the World Health Organization (WHO), schizophrenia affects almost 21 million people worldwide, and it is hard to diagnose: the diagnosis can take from 6 months to 1 year because doctors rely on questionnaires and surveys completed by the patients. Different studies suggest that it is found more often in men than in women. Machine learning algorithms can therefore help to diagnose this chronic mental disorder more quickly. In our proposed model, we used Logistic Regression (LR) as the classifier because it gives good results on smaller datasets. We evaluated the results in three data settings: first on filtered data with Z-score normalization, then on data with no Z-score normalization and no filter, and finally on unfiltered data. We used five sampling techniques: SMOTE Class Weight (S-CW), Borderline SMOTE Class Weight (BS-CW), Random Oversampling Class Weight (ROS-CW), None Class Weight (N-CW), and None-None (N-N). We observed that the results obtained with no Z score and no filter have the highest unweighted macro recall, which may be because the EEG recordings obtained from the 14 schizophrenia (SZ) and 14 healthy control (HC) subjects do not contain artifacts. We also observed that the performance of the ML model decreased when filtering was applied.
The ML model was validated with leave-one-subject-out cross-validation using scikit-learn, and the results were evaluated in terms of macro recall, macro F1 score, sensitivity, specificity, and accuracy.
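The leave-one-subject-out scheme can be sketched as a small generator in plain Python. In every fold, all segments from one subject form the test set and the segments from all remaining subjects form the training set, so no subject contributes to both. The function name and the subject identifiers below are hypothetical; scikit-learn offers `LeaveOneGroupOut` for this kind of split, although the exact class used in the experiments is not stated.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold
    per distinct subject; subject_ids[i] is the subject of segment i."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Usage: six segments recorded from three (hypothetical) subjects
segment_subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_subject_out(segment_subjects))
```

With the 28 subjects of this dataset, the procedure would produce 28 folds. Splitting by subject rather than by segment keeps segments of the same person out of both sides of a fold, which would otherwise tend to inflate the reported scores.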

V. FUTURE WORK
This research still has some limitations. The proposed model can distinguish schizophrenic patients (SP) from healthy controls (HC) using EEG signals, but it cannot predict disease severity. In addition, this experiment was carried out on a small dataset; it should be repeated on larger datasets to verify the model's accuracy.