Investigating the Role of Machine Learning Algorithms in Predicting Sepsis using Vital Sign Data

Objective: In hospitals, sepsis is a common and costly condition, but machine learning systems that utilize electronic health records can enhance the timely detection of sepsis. The purpose of this research is to verify the effectiveness of a machine learning tool that uses a gradient boosted ensemble for sepsis diagnosis and prediction. Historical data were drawn from the University of California, San Francisco (SFUC) Medical Center and the Medical Information Mart for Intensive Care (MIMIC) databases. The study encompassed adult patients who were admitted without sepsis and had at least one recording of each of six vital signs (SpO2, temperature, heart rate, respiratory rate, systolic blood pressure, and diastolic blood pressure). The performance of the machine learning algorithm (MLA) was compared with commonly used scoring systems using the area under the receiver operating characteristic (AUROC) curve. Performance of the MLA was evaluated at sepsis onset, as well as 24 and 48 hours before sepsis onset. The AUROC for the MLA was 0.88, 0.84, and 0.83 for sepsis onset, 24 hours prior, and 48 hours prior, respectively. At the time of onset, the MLA's AUROC was superior to those of SOFA, MEWS, qSOFA, and SIRS. Using SFUC data for training and MIMIC data for testing, the sepsis onset AUROC was 0.89. The MLA can predict sepsis up to 48 hours before it occurs, and its accuracy in detecting sepsis onset is higher than that of traditional instruments. When trained and evaluated on distinct datasets, the MLA maintains high performance for sepsis detection.


I. INTRODUCTION
Sepsis, a widespread and economically burdensome syndrome affecting hospitals worldwide, has undergone a transformative shift in its conceptualization. While it was formerly categorized within a three-tier system, the prevailing understanding characterizes it as a two-stage process, spanning from septic conditions to full-blown sepsis. The projected annual cost to the global healthcare system attributed to sepsis is a staggering $24 billion, an alarming financial burden. This is particularly disconcerting when considering the fatality rates associated with sepsis, which range from 25% to 40%. However, emerging evidence underscores the significance of early diagnosis and intervention before the onset of septic shock, as it holds the potential to significantly improve patient outcomes and reduce hospitalization durations [1].
Sepsis, often characterized by organ malfunction due to a systemic inflammatory response to infection, poses a formidable diagnostic challenge due to the intricate and varied origins of infections and the unique responses of individual patients. Consequently, there has been a growing impetus in medical science to advance automated patient monitoring systems tailored for the early identification of sepsis among hospitalized patients [2].
The advent of automated diagnostic decision and prediction technologies, facilitated by the widespread adoption of electronic health records (EHRs) in healthcare facilities, holds substantial promise for revolutionizing the tracking and management of complex medical conditions [3]. These technologies derive their foundations from the comprehensive medical records of patients, utilizing this wealth of data to generate warnings and treatment recommendations [4]. However, it is noteworthy that the existing diagnostic methods for sepsis predominantly lack predictive capabilities [5][6][7], with the majority relying on rule-based processes to trigger alarms and provide recommendations.
Within clinical settings, the commonly employed sepsis scoring systems encompass the Systemic Inflammatory Response Syndrome (SIRS) criteria [6], the Modified Early Warning Score (MEWS) [8], and the Sequential Organ Failure Assessment (SOFA) score [9]. While these systems exhibit commendable sensitivity, they often grapple with issues related to specificity and are not explicitly designed for predicting the development of sepsis. Moreover, rule-based scores may struggle to accurately account for the diverse patient populations and the multifaceted sources of infection. Machine learning-based prediction techniques hold the potential to offer superior specificity, broader generalizability, and early sepsis risk identification, thus potentially reducing false alarms and enabling more timely physician responses [12].
www.ijacsa.thesai.org
Previous research endeavors have revealed the capacity to forecast the onset of sepsis, septic shock, and severe sepsis with a lead time of up to four hours before the condition manifests, employing machine learning-based systems trained on patient EHR data [13][14]. However, these studies were primarily conducted within the confines of a single institution's critical care group. In this study, we align with the contemporary definition of sepsis proposed by Singer et al. [1] to evaluate the historical performance of an algorithm employing a mixed-ward dataset, predicting sepsis up to two days in advance, solely relying on vital sign inputs. Moreover, our research aims to assess the algorithm's effectiveness by benchmarking its performance against prevailing rule-based scoring systems and scrutinizing its reliability through cross-population validation, as elucidated in study [15].

II. RELATED WORK
In the realm of predicting sepsis using vital sign data, extensive research has been conducted to explore the role of machine learning algorithms. This section provides an overview of existing studies and their contributions, offering insights into the progress made in this critical domain and highlighting the gaps and areas requiring further investigation.
Numerous researchers have delved into the development of sepsis prediction models, aiming to enhance early detection and intervention. Studies by [1][2][3] and [10][11][12][13][14] have primarily focused on utilizing machine learning algorithms to analyze vital sign data for sepsis prediction. These studies have demonstrated promising results in terms of accuracy and timeliness, providing a foundation for further exploration.
In contrast, [16] and [17] have employed alternative approaches, such as rule-based scoring systems, to predict sepsis.While these methods have proven valuable in clinical settings, they raise questions about the potential advantages of machine learning algorithms in terms of predictive power and adaptability.
While substantial progress has been made in the field of sepsis prediction, there are still various challenges that demand attention. These include addressing the interpretability of machine learning models, optimizing feature selection, and ensuring generalizability across diverse patient populations and healthcare settings. The research presented in this study seeks to contribute to this ongoing discourse by employing a gradient boosted ensemble for sepsis diagnosis, leveraging SFUC and MIMIC electronic health records. By reviewing the existing solutions and identifying areas that warrant further exploration, this research aims to position itself within the broader landscape of sepsis prediction, ultimately striving to enhance the effectiveness of early intervention in critical healthcare scenarios.

A. Ethics Certification and Informed Consent
As mandated by the Health Insurance Portability and Accountability Act (HIPAA), we removed all personally identifying information from patient records before collecting the datasets. There was no compromise in patient well-being due to the data gathering procedure [16].

B. Measurements
Six vital signs (systolic BP, diastolic BP, heart rate, temperature, respiration rate, and peripheral oxygen saturation (SpO2)) were examined to establish sepsis risk scores. To be included in the research, every patient encounter was required to have at least one record of each vital sign. We generated features for the sepsis risk scores from these vital signs alone because they are directly related to sepsis development and are measured often, even in the absence of a clinical concern for septic shock [17].

C. Data Sources
The datasets utilised in this study came from the Medical Center at the University of California, San Francisco (SFUC) and the Medical Information Mart for Intensive Care (MIMIC). The SFUC dataset contains 17,467,987 total encounters from patients who visited the Parnassus Heights, Mission Bay, or Mount Zion facilities between June 2016 and March 2023. Of the original 96,646 inpatients, 95,869 had at least one recording of each vital sign; after excluding those with hospital stays shorter than seven hours or longer than 2000 hours, our final group consisted of 91,445 patients. We employed subsets of this final sample, differentiated by patients' lengths of stay, to conduct our 24- and 48-hour lookahead analyses. The SFUC data came from the ICU, the emergency department (ED), and the floor units, which differ in data collection frequency and in the type of care provided [18][19]. Due to missing unit transfer timestamps, it was impossible to determine where a patient was located at any given moment. The MIMIC data were drawn from the 61,532 ICU encounters recorded in the Medical Information Mart for Intensive Care III (MIMIC-III) v1.3 database between 2012 and 2023. Patients aged 18 and older accounted for 52,902 encounters, of which only 21,507 had at least one recording of each vital sign and thus qualified for the final cohort. Encounters missing measurements of any vital sign were excluded. Patient safety was not jeopardised by the data collection process, and all patient information was de-identified in accordance with HIPAA regulations. SFUC's Institutional Review Board (IRB) approved this project [20].

D. Statistical Analysis
We extracted the sensitivity, specificity, and AUROC value with 95% CI for predicting sepsis patients in the ICU from each of the included studies. The ROC curve compares different thresholds by plotting the true positive rate (TPR) against the false positive rate (FPR). Excellent, good, fair, poor, and fail are defined by AUROC values of 0.9-1, 0.8-0.9, 0.7-0.8, 0.6-0.7, and 0.5-0.6, respectively. In the end, we calculated the ROC, sensitivity, and specificity with 95% CI. To gauge the degree of statistical heterogeneity among the included trials, we calculated an I² value. Heterogeneity is classified as extremely low (I² ~25%), low (I² ~50%), medium (I² ~75%), or high (I² > 75%). There was less variation in effect sizes across trials because data from all included studies were combined using a random effects model. The I² statistic measures the proportion of overall study variance that can be attributed to factors other than chance [21].
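The ROC construction and the AUROC grading bands described above can be illustrated with a minimal sketch in plain Python. The function names, the trapezoidal integration, and the treatment of band boundaries as lower-inclusive are assumptions made for illustration; they are not taken from the study's software.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold over every distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auroc(labels, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = roc_points(labels, scores)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def auroc_grade(value):
    """Map an AUROC to the bands in the text (lower bound inclusive: an assumption)."""
    bands = [(0.9, "excellent"), (0.8, "good"), (0.7, "fair"), (0.6, "poor"), (0.5, "fail")]
    for lower, grade in bands:
        if value >= lower:
            return grade
    return "below chance"  # hypothetical label for values under 0.5
```

For example, `auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, which `auroc_grade` places in the "fair" band.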
To determine I², we used the formula $I^2 = 100\% \times (Q - df)/Q$, where $Q$ is Cochran's heterogeneity statistic and $df$ is its degrees of freedom. As a consequence, the I² findings range from 0% (no observed heterogeneity) to 100% (highest heterogeneity), with all negative values adjusted to zero.
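As a small illustration, the I² calculation can be sketched as below, assuming the standard definition I² = 100% × (Q − df)/Q with negative values set to zero. The function name is hypothetical.

```python
def i_squared(q, df):
    """I² heterogeneity (in percent) from Cochran's Q and its degrees of freedom.

    Assumes the standard definition 100 * (Q - df) / Q; negative results are
    truncated to zero, as described in the text.
    """
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)
```

With Q = 20 and df = 10, the function returns 50.0; with Q = 5 and df = 10, the raw value is negative and is reported as 0.0.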
We selected the symmetric approach in our meta-analysis because we hypothesized that the included papers would be of varying quality. The pooled estimates of AUROC, sensitivity, specificity, and diagnostic odds ratio were calculated using Meta-DiSc (version 1.4). This tool supports (a) summarizing data from each study, (b) analyzing the graphical and statistical similarity of studies, (c) computing the pooled estimate, and (d) examining heterogeneity. The likelihood ratio was calculated to illustrate the extent to which a given outcome was more common in studies including patients with sepsis compared to those involving subjects without sepsis.
Additionally, the diagnostic odds ratio (DOR) was calculated to reveal how much higher the odds of having sepsis are for persons with a positive test result compared to those with a negative test result. The formula for DOR is LR+/LR-. Each technique's efficacy was measured using a number of different metrics (Supplementary Table 2), including area under the receiver operating characteristic curve, sensitivity, specificity, diagnostic odds ratio, and likelihood ratio.
Exact confidence limits for the binomial proportion were calculated using the F distribution method, and the confidence intervals for overall sensitivity and specificity were computed as well [19]. An overdispersion correction was applied in these computations, which were performed in Meta-DiSc. The normal approximation to the binomial was used here.
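The relationship between the likelihood ratios and the DOR can be made concrete with a short sketch. It assumes the usual definitions LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity; the function name is hypothetical.

```python
def diagnostic_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) for a test at one operating point.

    LR+ = sensitivity / (1 - specificity)
    LR- = (1 - sensitivity) / specificity
    DOR = LR+ / LR-
    """
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg, lr_pos / lr_neg
```

For a sensitivity and specificity of 0.80 each, this gives LR+ of about 4, LR− of about 0.25, and a DOR of about 16.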

A. Results and Methods
The primary focus of this study was the capability of the algorithm to detect septic patients at onset and in the preceding 24 and 48 hours. We evaluated the efficacy of the method by calculating the AUROC, or area under the receiver operating characteristic curve.
The data were collected through queries built for the PostgreSQL (PostgreSQL Global Development Group) database and then saved as CSV files [19]. Features for predicting sepsis risk were created using just six vital signs: heart rate, respiration rate, systolic blood pressure, diastolic blood pressure, SpO2, and temperature. If there was no new reading for an hour leading up to the patient's designated onset time, the previous reading was carried forward to estimate the value. When several readings were obtained within the same hour, their average was used. This reduced the classifier's exposure to measurement frequency information that was not relevant to physiology [21].
Information was also gathered to construct the Sepsis-3 reference standard and the rule-based scoring systems. The prediction algorithm was compared with commonly used measures: the Sequential Organ Failure Assessment (SOFA), the Modified Early Warning Score (MEWS), the SIRS criteria, and qSOFA (quick SOFA). We identified SIRS criteria following Jaimes et al. [3]. To determine each patient's MEWS score [14], we used the same procedure as Fullerton et al. [2]. The formula for calculating a qSOFA score may be found in Singer et al. [1]. While the SOFA score is included in the widely accepted definition of sepsis, we investigated its ability to identify the onset of sepsis independently of other factors. Computing these scores required CSV files for bilirubin levels, FiO2, PaO2, the Glasgow Coma Scale, white blood cell counts, vasopressor dosages, and platelet counts [22].
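The hourly binning described above (average readings taken within the same hour, carry the previous value forward when an hour has no reading) can be sketched as follows. This is an illustrative reading of the text, not the study's code; the function name and the choice to leave hours before the first reading as `None` are assumptions.

```python
def hourly_series(readings, n_hours):
    """Bin one vital sign into hourly values.

    readings: list of (hours_since_admission, value) pairs, with times >= 0.
    Returns a list of length n_hours. Readings in the same hour are averaged,
    and an hour without a reading reuses the most recent hourly value.
    Hours before the first reading stay None (an assumption of this sketch).
    """
    buckets = {}
    for t, value in readings:
        buckets.setdefault(int(t), []).append(value)

    series, last = [], None
    for hour in range(n_hours):
        if hour in buckets:
            last = sum(buckets[hour]) / len(buckets[hour])
        series.append(last)
    return series
```

For instance, heart-rate readings of 80 and 90 in hour 0 and 100 in hour 3 produce `[85.0, 85.0, 85.0, 100.0, 100.0]` over five hours.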
Sepsis is "life-threatening organ dysfunction induced by a dysregulated host response to infection," according to the 2016 consensus definition, which served as the basis for the Sepsis-3 gold standard. A 2-point change in the Sequential Organ Failure Assessment (SOFA) score was considered indicative of organ failure [1]. To determine when the SOFA score changed, we relied on the criteria established by Seymour et al. [4]. Infection was suspected when antibiotics were administered and a culture was collected within 24 or 72 hours of each other, depending on which occurred first. Seymour et al. [4] discovered the same thing when they tried testing the approach in reverse. We diagnosed sepsis at the first time when both the SOFA score and the infection requirements had been met [26][27].
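The onset rule can be sketched in a few lines. The window directions used here (a culture within 24 hours after antibiotics, or antibiotics within 72 hours after a culture) follow the convention of Seymour et al. and are an assumption about how this study applied it. Taking the later of the two events as the time at which the criterion is met is also an assumption, and the function names are hypothetical.

```python
def suspicion_time(antibiotic_hour, culture_hour):
    """Hour at which suspicion of infection is established, or None.

    Assumed windows: culture within 24 h after antibiotics, or antibiotics
    within 72 h after a culture. Returns the later of the two events.
    """
    if antibiotic_hour is None or culture_hour is None:
        return None
    if antibiotic_hour <= culture_hour:
        within_window = culture_hour - antibiotic_hour <= 24
    else:
        within_window = antibiotic_hour - culture_hour <= 72
    return max(antibiotic_hour, culture_hour) if within_window else None


def sepsis_onset(sofa_change_hour, infection_hour):
    """First hour at which both the SOFA and infection criteria have been met."""
    if sofa_change_hour is None or infection_hour is None:
        return None
    return max(sofa_change_hour, infection_hour)
```

A patient with antibiotics at hour 10, a culture at hour 30, and a 2-point SOFA change at hour 12 would be assigned an onset time of hour 30 under these assumptions.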
Of the 91,445 SFUC encounters included, 2,649 were Sepsis-3 positive, a prevalence of 2.9% (Fig. 1). The Sepsis-3 criteria were satisfied in 1,024 of the 21,507 MIMIC encounters, a prevalence of 4.8%. Across all 112,952 patient encounters, the Sepsis-3 prevalence was 3.3%. The last stage of the inclusion criteria eliminated many encounters that would otherwise have met the Sepsis-3 criteria, because sepsis onset occurred during the first seven hours of admission [23]. For individuals who never developed sepsis, the "onset time" was chosen at random from a continuous uniform probability distribution so that they could serve as negative examples. The algorithm's risk scores were derived from data collected at and before the patient's onset time. Patients who were diagnosed with sepsis either at the time of admission or within seven hours of admission were removed from the analysis to leave room for prediction windows [24].
Patient encounters were first categorized by length of stay before training the classifier. For example, a patient who had been hospitalized for 25 hours before developing sepsis would be included in a 24-hour prediction experiment but not in a 48-hour prediction experiment. Among encounters with at least 24 hours of stay data, there were 107 cases of Sepsis-3 among the 20,590 MIMIC encounters and 267 cases among the 89,000 SFUC encounters. Among encounters with at least 48 hours of stay data, Sepsis-3 was identified in 50 of the 20,533 MIMIC encounters and 97 of the 88,887 SFUC encounters. To keep the calculation matrices manageable, hospital encounters with onset times of more than 2000 hours were omitted [25].
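Assigning a surrogate onset time to a non-septic encounter might look like the sketch below. The text specifies only a continuous uniform distribution; the bounds used here (from seven hours after admission to the end of the stay) are assumptions chosen to mirror the seven-hour exclusion, and the function name is hypothetical.

```python
import random


def sample_negative_onset(length_of_stay_hours, min_hours=7.0, rng=random):
    """Draw a surrogate onset time for an encounter that never met Sepsis-3.

    Sampled uniformly between min_hours and the length of stay. Both bounds
    are assumptions; the source states only that the distribution is uniform.
    """
    if length_of_stay_hours < min_hours:
        raise ValueError("stay is shorter than the minimum onset time")
    return rng.uniform(min_hours, length_of_stay_hours)
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.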
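The length-of-stay rule for assigning encounters to prediction experiments can be sketched as follows. The function name and the default arguments are illustrative assumptions; the 2000-hour cap is the exclusion limit stated in the text.

```python
def eligible_lookaheads(onset_hour, lookaheads=(0, 24, 48), max_hours=2000):
    """Return the prediction horizons (in hours) an encounter can support.

    An encounter qualifies for a horizon only if at least that many hours of
    data precede the onset time; onset times beyond max_hours are excluded.
    """
    if onset_hour > max_hours:
        return []
    return [h for h in lookaheads if onset_hour >= h]
```

An onset at hour 25 yields `[0, 24]`, matching the example of a patient who is included in the 24-hour experiment but not the 48-hour one.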

B. The Algorithm for Prediction in Machine Learning
An ensemble of decision trees generated scores that the algorithm's classifier combined into an overall risk score [28]. To make its predictions, the system used vital sign values at prediction time, one hour before, and two hours before, as well as the hourly changes between them. Concatenating these values in a causal fashion yielded a feature vector x with 30 components, five values derived from each of the six measurement sources. The trees were built using the Python XGBoost module, with each branch splitting the data into two groups on a feature [29]. We used a five-fold cross-validation grid search on the training set to determine that a maximum of four, three, and six branchings should be used for 0-hour, 24-hour, and 48-hour predictions, respectively [30]. Based on this grid search, we settled on the values 0.05, 0.12, and 0.12 for the XGBoost learning rate parameter. Because we employed early stopping to avoid overfitting, we did not need to restrict the maximum number of trees in each ensemble. These risk scores were then used by the algorithm to classify patients as having sepsis or being at risk of developing it [31][32][33].
The SFUC encounters were split into two sets. The first, made up of 80% of all encounters, was randomly divided into a training set and a test set. The second, made up of the remaining 20%, was set aside as a hold-out set for further evaluation. Using only the training data, we ran a grid search with five-fold cross-validation to find the optimal hyperparameters. For this prediction task, we explored a parameter space similar to that of previously described hyperparameters [34][35][36]. We looked at learning rates between 0.05 and 0.12, in 0.01 increments, and explored values between 3 and 8 for the maximum number of branchings. For each lookahead, we settled on the optimal combination of branching level and learning rate based on the average area under the receiver operating characteristic (AUROC) curve [37]. For ten-fold cross-validation, we randomly distributed encounters over ten groups of similar size, each containing 10% of the training set. For each fold, nine of these groups were used for training the algorithm, while the remaining one was used for testing. For each of the ten combinations of training and test groups, the algorithm was trained and then evaluated on the independent test set [38]. We produced machine learning algorithm performance measures by averaging the metrics from the ten cross-validation models, including tabular and graphical representations of the findings. In addition, the averaged feature importance scores from XGBoost were presented; these values show how often a feature was used to partition the data across the trees. We also calculated the AUROC standard deviation using the cross-validation outcomes [39].
Patient encounter cohorts utilised for 24- and 48-hour prediction were limited to those with sufficient stay data, as previously indicated [40]. As a consequence, the classes are imbalanced, since fewer septic patients were seen in these cohorts. We took advantage of XGBoost's built-in capacity to deal with unbalanced classes [40] rather than using minority oversampling to artificially inflate the number of septic patients, which may not be representative of the real-world situation in which such an MLA is deployed.
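The 30-component feature vector can be illustrated with a minimal sketch, assuming that the five values per vital sign are the readings at prediction time, one hour before, and two hours before, plus the two hourly differences between them. The vital-sign keys, the function name, and the mapping of the reported "branchings" onto XGBoost's `max_depth` parameter are assumptions made for illustration.

```python
VITALS = ("heart_rate", "respiratory_rate", "temperature",
          "spo2", "systolic_bp", "diastolic_bp")


def feature_vector(hourly, t):
    """Build the feature vector for prediction hour t (requires t >= 2).

    hourly maps each vital sign name to its list of hourly values.
    Per vital: value at t, t-1, t-2, and the two hourly changes (5 values),
    giving 6 * 5 = 30 components. Only past and present data are used.
    """
    x = []
    for vital in VITALS:
        now, one_back, two_back = hourly[vital][t], hourly[vital][t - 1], hourly[vital][t - 2]
        x.extend([now, one_back, two_back, now - one_back, one_back - two_back])
    return x


# Hypothetical mapping of the reported grid-search results onto XGBoost's
# max_depth and learning_rate parameters, keyed by prediction horizon in hours.
XGB_PARAMS = {
    0: {"max_depth": 4, "learning_rate": 0.05},
    24: {"max_depth": 3, "learning_rate": 0.12},
    48: {"max_depth": 6, "learning_rate": 0.12},
}
```

Because every component is taken from hour t or earlier, the vector never leaks information from after the prediction time.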
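The hyperparameter grid and the fold construction can be sketched in plain Python. The grid bounds come from the text; the fold-splitting helper, its name, and its seeding are illustrative assumptions rather than the study's implementation.

```python
import random
from itertools import product

# Learning rates 0.05 to 0.12 in steps of 0.01, and maximum branchings 3 to 8.
LEARNING_RATES = [round(0.05 + 0.01 * i, 2) for i in range(8)]
MAX_DEPTHS = list(range(3, 9))
GRID = list(product(MAX_DEPTHS, LEARNING_RATES))  # 6 * 8 = 48 combinations


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k folds of near-equal size.

    Indices are shuffled once, dealt into k groups, and each group serves as
    the test fold exactly once while the other k - 1 groups form the training set.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

With ten folds, each test group holds one tenth of the encounters and the remaining nine tenths are used for training.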
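One way XGBoost handles unbalanced classes is its `scale_pos_weight` parameter, commonly set to the ratio of negative to positive examples. Whether the study used this particular setting is an assumption; the sketch below only shows what the ratio would be for the cohort sizes reported in the text.

```python
def scale_pos_weight(n_total, n_positive):
    """Ratio of negative to positive examples, a common scale_pos_weight value."""
    return (n_total - n_positive) / n_positive


# Cohort sizes reported in the text for the SFUC data.
weight_24h = scale_pos_weight(89_000, 267)  # about 332 negatives per septic case
weight_48h = scale_pos_weight(88_887, 97)   # about 915 negatives per septic case
```

The growth of the ratio from the 24-hour to the 48-hour cohort shows how much rarer septic cases become as the lookahead lengthens.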

C. Methods for Validation in Cross Populations
Cross-population validation studies were undertaken to evaluate the algorithm's sepsis detection ability after being trained on a dataset from one institution and then evaluated on another with demographic and clinical differences. We evaluated the algorithm on MIMIC patient measurements after training it on SFUC data, without retraining it on the target dataset. The entire target dataset was used for testing, and the algorithm was trained and validated in the same manner as detailed above. Testing was performed only at sepsis onset so that the results could be compared with rule-based approaches.
Fig. 2. MLA performance in the hours before onset and in comparison with competing methods. A) ROC and AUROC of the MLA compared with competing scoring systems at sepsis onset using SFUC data. B) ROC and AUROC for the MLA at 0, 24, and 48 hours prior to sepsis onset using SFUC patient data.

V. RESULTS
The research involved analyzing 91,445 patient encounters from SFUC and 21,507 patient encounters from MIMIC. Demographic characteristics of the two patient groups are compared in Table I, revealing significant differences in various aspects such as healthcare units visited, sepsis rates, in-hospital mortality, and age distributions. Importantly, the MIMIC database exclusively included ICU admissions, while the SFUC dataset encompassed all inpatient encounters. This deliberate selection of disparate datasets was aimed at evaluating the potential generalizability of the prediction system across a diverse range of patient groups.
In contrast to the qSOFA (0.60), MEWS (0.61), SIRS (0.66), and SOFA (0.72) scoring systems applied to the same dataset, the machine learning method developed and evaluated on the SFUC dataset exhibited a superior AUROC (0.88) for sepsis prediction (see Fig. 2). Additional performance indicators are detailed in Table II. Notably, the false alarm rates generated by the MLA, SIRS, and SOFA models were 0.22, 0.49, and 0.41, respectively, at specified operating points. It is significant to note that SIRS and SOFA generated 2.22 and 1.86 times as many false warnings as the MLA, respectively. Furthermore, the study evaluated the algorithm's performance in predicting sepsis 24 and 48 hours before its onset, achieving AUROC values of 0.84 and 0.83, respectively (see Fig. 2 and Table II). For the 24-hour prediction, a dataset of 89,000 SFUC patients was analyzed, including 267 septic cases, and for the 48-hour prediction, a dataset of 88,887 SFUC patients was analyzed, including 97 septic cases. Notably, both predictions yielded higher diagnostic odds ratios (DOR) than those obtained by the rule-based approaches (see Table II).
Additionally, the study included data from MIMIC, which comprised 21,507 patients and 1,024 septic cases. Remarkably, the algorithm, trained on the SFUC database without retraining, achieved an AUROC of 0.890 when applied to the MIMIC dataset.
The critical question arising from these results is whether the MLA method can generate similar results when applied to conditions other than sepsis. It prompts further consideration of whether the methodology's applicability is confined exclusively to sepsis or whether it can be generalized to other medical contexts. Feature importance scores were computed for each of the prediction windows (Supplementary Table I). The five most highly rated features are shown in Table II. Age was the single most influential factor across all prediction intervals. After age, the greatest combined importance came from the patient's temperature, systolic blood pressure, and heart rate.
Note. On the basis of patient data from SFUC, we evaluate MLA performance at 0, 24, and 48 hours before onset, as well as competing scores at the time of onset. The operating points were selected with sensitivities around 0.80 in mind. The cross-validation standard deviation in AUROC was computed only for the MLA.
a. Abbreviations: SD, standard deviation; LR+, positive likelihood ratio; LR-, negative likelihood ratio.
1 All settings for qSOFA generated sensitivities far outside the 0.80 range.

VI. CONCLUSION
The machine learning system evaluated in this research has the potential to revolutionize the way sepsis is diagnosed and treated.With an impressive AUROC of 0.83, the system has shown a high degree of accuracy in predicting sepsis up to 48 hours before the onset of symptoms.This early warning capability is crucial in ensuring that patients receive timely and appropriate treatment, leading to improved health outcomes.
Future work could focus on the implementation of the algorithm in clinical practice to assess its practical utility and to validate its performance across multiple healthcare systems.

Fig. 1. Graphic illustrating patient inclusion and duration-based subsets in SFUC/MIMIC datasets for training and testing.

TABLE I. COHORT DEMOGRAPHICS BETWEEN SFUC AND MIMIC

TABLE II. PRE-SEPSIS MLA PERFORMANCE ASSESSMENTS AND COMPARISON SCORES DURING THE START OF SEPSIS