Machine Learning Application for Predicting Heart Attacks in Patients from Europe

Heart attacks still affect a large number of people and have already claimed numerous lives worldwide. To examine the main components of this problem in an objective and timely manner, we chose a methodology that relies on learning from real, existing data, which are used to train and test predictive models. Other methodologies exist in parallel, but they do not quite fit the model of this work. Data were collected from the "Center for Machine Learning and Intelligent Systems", which holds records of patients who have suffered a cardiovascular attack and of patients who have never suffered the disease, all of them selected from different medical institutions. The data were cleaned, prepared, and used for training to obtain a logistic regression machine learning model that predicts whether or not a person may suffer a cardiovascular attack. The model reached an accuracy of 87% for people who suffered a heart attack and 81% for people who would not suffer from this disease. Such a prediction can help reduce the mortality rate due to infarction, because it reveals the condition of people who are unaware of their health situation so that appropriate measures can be taken.
Keywords—Prediction; machine learning model; logistic regression; heart attack


I. INTRODUCTION
Nowadays it is increasingly common to talk about people prone to cardiac arrest [1]. The lifestyle of the general population has changed so drastically that people have started to develop cardiovascular problems frequently [2]. Several factors must be considered, one of the most obvious being the type of food that people choose to eat, such as junk food [3].
To analyze the main factors of this problem in an objective and timely manner, we chose to work with a meta-analysis methodology [4]. This consists of taking and studying existing test data and sorting them to obtain data beneficial to our research [5]. Other research methodologies are not completely suited to the model of our research, such as the Design Thinking methodology, which is a trial-and-error model, or the ethnographic method, which is a controlled study of a sample of the population. That is why the meta-analysis methodology is ideal for the objective of our research [6].
As the main study sample in this work, we took data from various medical institutions in Europe to compare and analyze why and how cardiovascular diseases have been growing. We took data from the "Center for Machine Learning and Intelligent Systems" and acquired a CSV file with the corresponding information [7]. With it, we were able to structure the information and obtain statistical tables that help to understand the problem [8].
The main objective is to study heart attacks in vulnerable patients in depth and to help prevent them, in order to reduce the mortality rate due to these diseases through concrete statistics obtained using machine learning.
The structure of the article is as follows: in Section II we will see the methodology, in Section III we will see the detailed case study through statistical tables, in Section IV we will present the conclusions and recommendations of the research and, finally, in Section V we have the references.

A. Information Gathering
The first step in investigating heart attacks and arriving at a solution was to obtain as much information on the subject as possible, drawn from real cases in which people were affected.
It should be noted that the information obtained is not yet data, since it still has to go through a strict filtering process; by applying parameters, we obtain the data already separated and grouped as appropriate [9].
The sources from which the information is acquired must be reliable; we cannot resort to any page of dubious information, since this can be detrimental to the investigation [9]. We need truthful data that does not corrupt the real and specific objective we have.

B. Parameter Configuration
In this stage, we import the libraries for training on our data, which is organized into the following processes [10].

1) Input data: A collection of records containing features important to the heart attack problem. These data are used during training to set up the model so that it makes accurate predictions about new instances of similar data; the values in the input data are a direct part of the model [11].
2) Parameters: These are the variables that the selected machine learning technique uses to fit the data [10]. The parameters and the model are optimized and tuned through the training process: the data are run, the accuracy is evaluated, and the values are adapted until the best ones are found.
3) Validation data: Part of the data is held back rather than used for training. The model is trained on the remaining data and adjusted against the validation data, and finally it is evaluated according to the percentage achieved on the test data [12]. The data are divided into three parts, and the information is prepared without nulls or gaps. A small volume of data will not be efficient for training.
www.ijacsa.thesai.org
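The division of the data into three parts can be sketched as follows. This is only an illustration: the synthetic records, the 60/20/20 proportions, and the variable names are assumptions, since the text does not state how the parts were sized.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 records, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Hold out 20 records as the test set, then 20 of the remaining 80 as validation.
# The resulting 60/20/20 proportions are an assumption, not taken from the paper.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=20, random_state=0)

split_sizes = (len(X_train), len(X_val), len(X_test))
print(split_sizes)
```

The model would be fitted on the training part, tuned against the validation part, and scored once on the test part.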

C. Data Preparation
In this stage, we extract the data that are important for the quality of our result, so that we obtain better results and a high rate of correct predictions [13].
This is the stage in which the data are obtained or collected from different sources, such as databases, blogs, websites, spreadsheets, etc. [14]. The data obtained are then cleaned, and the result is a file in the format we adopted, CSV.

D. Model Approach
The problem concerns the frequency of cardiac problems in different groups of people, the main one being the heart attack. For this problem we propose the logistic regression machine learning methodology. The logistic regression model is a supervised algorithm used for classification. It is used when the objective is to forecast the probability of a certain event occurring or not [15].
This helps us to classify the data and perform supervised machine learning. With this logistic regression model, we predicted heart attack patterns; the steps performed are the following.
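To make the idea of forecasting the probability of an event concrete, the following minimal sketch shows how a logistic regression model turns a weighted sum of features into a probability and then into a 0/1 class. The weights, bias, input vector, and 0.5 threshold are hypothetical values chosen for illustration, not parameters of the model trained in this work.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, weights, bias):
    """Probability of the positive class for one feature vector x."""
    return sigmoid(np.dot(weights, x) + bias)

def predict_label(x, weights, bias, threshold=0.5):
    """Class 1 if the probability reaches the threshold, otherwise class 0."""
    return int(predict_proba(x, weights, bias) >= threshold)

# Hypothetical parameters and input (assumptions for illustration only).
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([1.0, 2.0])

p = predict_proba(x, weights, bias)  # linear score is 0.8 - 1.0 + 0.1 = -0.1
label = predict_label(x, weights, bias)
print(p, label)
```

Training consists of finding the weights and bias that make these probabilities agree with the labeled examples.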

A. Information Gathering
This process used a dataset of available heart disease data. It was collected from different databases of the aforementioned medical institutions and contains 76 attributes.
In particular, the database of the Cleveland institution, which holds information on heart disease in patients, was used to carry out our research. The work concentrated on simply trying to distinguish the presence of heart disease [16].

B. Parameter Configuration
In this stage, the libraries to be used in the model were defined, as displayed in Fig. 1.
• NumPy: Provides functions for vector and matrix creation, especially mathematical operations.
• Pandas: Data handling, manipulation, and analysis.
• Matplotlib: Library for chart creation and data visualization.
• Scikit-learn: Library that supports the creation and training of the machine learning model.
• Seaborn: A matplotlib-based library that provides a simple interface for the creation of graphs.

C. Data Preparation
At this stage, the data from the institutions indicated in point (A) were used. For this purpose, the CSV file was imported into our working directory, which in this case is stored in Google Drive; this facilitates access to our data when running the predictive model. We used the Google Colab platform, a cloud-hosted virtual machine environment based on Jupyter Notebooks, where we do the Python coding, as shown in Fig. 2.
The df.head() function of the pandas library was used to show the first 5 rows of the dataframe, as displayed in Fig. 3. As can be seen in Fig. 3, the dataset is displayed with a header divided into columns, which helps us to know what the values found in each column mean. In Table I, the meaning of the columns of the dataset is shown in more detail.
The next step is data cleaning and preparation. The pandas function df.isna() detects missing values and returns Boolean values that mark them; chaining sum() returns the count of these values, which is 0 if there are no missing values, as can be seen in Fig. 4. We then verify that the dataframe contains no duplicate rows with the function df.duplicated(), which returns a Boolean result indicating whether each row is a duplicate; sum() gives the number of duplicated rows. In this case, there was one duplicate row, as seen in Fig. 5. The duplicate rows are removed from the dataframe with the function df.drop_duplicates(inplace=True); as Fig. 6 shows, the duplicate data were deleted. We then check again for duplicate rows with the previous function df.duplicated().sum().
Fig. 7 shows the degree of heart attack risk in older people who have higher blood pressure, higher cholesterol levels, and a lower maximum heart rate under a thallium stress test. One way to quantify the degree of risk is to measure the discrepancy between the disease and non-disease distributions, based on logistic regression theory. The creation of four graphs helped us to examine the most important characteristics that can appear before a heart attack.
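The cleaning steps described in this subsection (inspecting the first rows, counting missing values, counting and dropping duplicate rows) can be sketched as follows. The small dataframe is a made-up stand-in for the CSV file, and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy dataframe standing in for the heart dataset; the values are made up.
df = pd.DataFrame({
    "age": [63, 37, 41, 37],
    "chol": [233, 250, 204, 250],
    "output": [1, 1, 1, 1],
})

first_rows = df.head()                       # first 5 rows (here, all 4)
missing_per_column = df.isna().sum()         # count of missing values per column
n_duplicates = int(df.duplicated().sum())    # rows that repeat an earlier row

df.drop_duplicates(inplace=True)             # remove the repeated rows in place
remaining_duplicates = int(df.duplicated().sum())

print(missing_per_column)
print(n_duplicates, remaining_duplicates, len(df))
```

By default, duplicated() marks every repeat of an earlier row and leaves the first occurrence unmarked, so drop_duplicates() keeps one copy of each row.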
Using the matplotlib library, the lines of code displayed in Fig. 8 produce the four plots shown in Fig. 9, from which useful information is collected on the relationships "slp"-"output", "thalachh"-"output", "cp"-"output", and "oldpeak"-"output". The graphs indicate that:
• Patients more likely to have heart attacks tend to have higher heart rates.
• Patients with non-anginal chest pains are more likely to have a heart attack.
• The oldpeak distributions for the two patient groups complement each other.
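A numerical counterpart to these plots is to compare each feature across the two values of "output". The sketch below only illustrates the mechanics with pandas; the column names follow the dataset, but every value is invented, so the printed means do not reproduce the findings reported here.

```python
import pandas as pd

# Invented values (assumption), used only to show how the comparison is computed.
df = pd.DataFrame({
    "thalachh": [170, 165, 150, 120, 130, 110],
    "oldpeak": [0.0, 0.5, 0.2, 2.0, 1.5, 2.5],
    "output": [1, 1, 1, 0, 0, 0],
})

# Mean of each feature for output = 0 and output = 1.
group_means = df.groupby("output")[["thalachh", "oldpeak"]].mean()
print(group_means)
```

On the real dataset, the same grouping could feed histograms or box plots per class, which is the kind of comparison the four graphs provide.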

D. Model Approach
As shown in Fig. 10, we prepare the data by dividing it between training and testing. We declare two variables, the features (x) and the output (y), and the function train_test_split() gives us four variables: x_train and y_train for training, and x_test and y_test for testing. We also set the random_state parameter, given here as 2, which is applied to the division of the data. We then instantiate the logistic regression in a variable, in this case named ClassiFier, train it on the training set with the function fit(x_train, y_train), and obtain the predictions with predict(). From these we obtained the report shown in Fig. 11, which summarizes precision, recall, f1-score, and support for each class; here 0 means that the patient is not likely to have a heart attack, and 1 means that the patient is likely to have one.
As shown in Fig. 12, a function was created inside which we perform the prediction and preparation with the new data. We build the confusion matrix for y_test and y_pred, take the index values and the columns of the data from the confusion matrix, and put them in a graph for better appreciation.
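The split, training, and report steps can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions: the data come from make_classification rather than the heart dataset, test_size=0.2 and max_iter=1000 are assumed settings not given in the text, and the variable is named classifier for readability.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart dataset (assumption): 300 records, 13 features.
x, y = make_classification(n_samples=300, n_features=13, random_state=0)

# test_size=0.2 is an assumed proportion; random_state=2 fixes the random split.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=2
)

# Instantiate the model, fit it on the training set, and predict the test set.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Text summary of precision, recall, f1-score, and support per class.
report = classification_report(y_test, y_pred)
print(report)
```

Fixing random_state makes the split reproducible, so repeated runs report the same figures.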
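Building the confusion matrix and labeling its rows and columns can be sketched as follows. The short y_test and y_pred lists and the row and column labels are hypothetical; the plotting step is omitted, so only the labeled table is produced.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (assumption), six test cases.
y_test = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are the actual classes, columns the predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Wrap the matrix in a labeled table, ready to be passed to a plotting routine.
cm_table = pd.DataFrame(
    cm,
    index=["actual 0", "actual 1"],
    columns=["predicted 0", "predicted 1"],
)
print(cm_table)
```

In scikit-learn's layout for binary labels 0 and 1, the top-left cell counts true negatives and the bottom-right cell counts true positives.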
Next, Fig. 13 shows that for class 0 there was an accurate prediction of 83% and for class 1 of 84%. In our confusion matrix, the model produced 25 true positives, 26 true negatives, 6 false positives, and 4 false negatives.
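As a worked example, the standard per-class metrics can be derived directly from the four counts of the confusion matrix. The sketch assumes that class 1 is the positive class; the formulas are the usual definitions of accuracy, precision, and recall.

```python
# Counts reported for the confusion matrix (class 1 assumed to be the positive class).
tp, tn, fp, fn = 25, 26, 6, 4

total = tp + tn + fp + fn            # number of test cases
accuracy = (tp + tn) / total         # share of all cases predicted correctly

precision_1 = tp / (tp + fp)         # of those predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)            # of those truly 1, how many were found
precision_0 = tn / (tn + fn)         # of those predicted 0, how many are truly 0
recall_0 = tn / (tn + fp)            # of those truly 0, how many were found

for name, value in [
    ("accuracy", accuracy),
    ("precision (1)", precision_1),
    ("recall (1)", recall_1),
    ("precision (0)", precision_0),
    ("recall (0)", recall_0),
]:
    print(f"{name}: {value:.3f}")
```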

A. About the Case Study
To gauge the precision of the research, it was compared with two other works: the first used a hybrid machine learning system [16] and the second a metaphorical machine learning system [17].
Fig. 14 shows the percentage of precision obtained in the results when applying the machine learning methodology: blue reflects the percentage of our research, orange the percentage of the hybrid research, and gray the metaphorical work. The value taken from each investigation was the level of precision of the analysis, expressed as a percentage, which yields comparable figures. The metaphorical system uses simpler data, so its measurement is faster to carry out. The hybrid system combines different processes, which results in a more complete analysis. Our work is a direct machine learning implementation, so our data turn out to be more reliable and truthful compared to the two other investigations.

B. About the Methodology
The method used, machine learning, is based on learning automatically and provides us with tools that help us make decisions according to the case analyzed. The logistic regression model is used, where the data are collected and analyzed, followed by the configuration, the data preparation, and finally the problem statement.
This type of methodology has advantages and disadvantages in all its phases; Table II shows them in more detail.
TABLE II. ADVANTAGES AND DISADVANTAGES

Advantages | Disadvantages
Management of the methodology allows us to take into account large numbers of variables. | Cost and implementation time. The investment in Artificial Intelligence is very high as they are complex machines with a high cost in maintenance and repair.
Models provide a quick competitive advantage of calibration and re-estimation. | Increase in unemployment. The replacement of humans by machines is leading many people to unemployment on a large scale.
Machine learning favors innovation and the search for new solutions thanks to the interpretation of data. | There is no creativity. Machines do not think, they work within parameters, so the creative capacity remains absent.
Optimized logistics processes. It will also help us to improve the organization's logistics systems and processes, since it will have a solid database for decision making. | As effective as this technology is, it is not a human being, and it lacks feelings. Thus, as we mentioned earlier, it has no limits and ignores the moral barrier. A circumstance which, if not put on the brakes, can be very dangerous.
Machine Learning and Deep Learning both mimic the human brain's way of learning. Their main difference is, therefore, the type of algorithms used in each case, although Deep Learning is more similar to human learning because it functions like neurons. Machine Learning tends to use decision trees, while Deep Learning uses neural networks, which are more evolved. In addition, both can learn in a supervised or unsupervised way.

V. CONCLUSION AND FUTURE WORK
In conclusion, the present research work collected accurate information from medical institutions on patients who have suffered heart attack problems and on patients who have never suffered such disease. Libraries were imported for data preparation, data cleaning, and the development of the machine learning model, which in the present case was of the logistic regression type; it gives a result of 1 when a heart attack is likely and 0 when it is not. The data analysis method used was machine learning, which automated the construction of our logistic regression model. As a result of the implementation of the model, we had a response of 87% accuracy for people likely to suffer a heart attack and 81% for patients who would not suffer the disease.
As future work, it is suggested to implement data preprocessing techniques such as SMOTE (Synthetic Minority Over-sampling Technique) to address data imbalance. It is also suggested to include in future experiments other variables or characteristics that can facilitate the prediction of the proposed model, in order to optimize it.