Improving the Computational Complexity of the COOL Screening Tool

Abstract: An autoimmune disorder, such as celiac disease or type 1 diabetes, is a condition in which the immune system attacks body tissues by mistake. This might be triggered by an abnormality in the development of biomarkers such as autoantibodies, which are generated by unhealthy beta cells. Therefore, screening for such biomarkers is crucial for early diagnosis of autoimmune diseases. However, one of the fundamental questions of screening is when to screen subjects who might be at a higher risk of autoimmune disorder. Answering it requires an exhaustive search for the optimal screening ages in retrospective cohorts. Very recently, a comprehensive tool was developed for screening in autoimmune disease. In this paper, we improve the computational time of the algorithm used in the screening tool. The new algorithm is more than 100 times faster than the original one. This improvement should increase the utility of the tool among clinicians and research scientists in the community.


I. INTRODUCTION
An autoimmune disorder is a condition in which the immune system mistakenly attacks healthy body tissues in different organs of the body. For example, in type 1 diabetes, the immune system destroys the insulin-producing cells of the endocrine pancreas, which leads to insulin deficiency [1]. In celiac disease, eating gluten (a protein found in wheat, rye, and barley) causes the immune system to damage the small intestine [2]. Many factors are involved in causing such diseases, including genes, environmental factors, drugs, and chemicals. However, an autoimmune disorder is often associated with a few circulating autoantibodies, which are abnormal antibodies generated by pathogenic β-cells when targeting a tissue [3]. Autoantibodies often precede the onset of the disease and are therefore considered a clinical biomarker of the autoimmune disorder. In type 1 diabetes and celiac disease, there are four or five autoantibodies that are often used to assess the risk of developing the disease [4], [5].
Screening for autoantibodies (a group of serum tests to assess the presence of autoantibodies) is usually performed to detect the disease as early as possible so that a proper treatment or intervention can be administered. Therefore, frequent screening is of utmost importance to detect potential autoimmune disorder in subjects in an apparently healthy population [6], [7]. Although frequent screening is beneficial for detecting subjects who are at a higher risk of the disease, it is not cost efficient and may also harm those who do not have the disease by increasing the risk of overdiagnosis [8]. Therefore, one needs to find the optimal ages for screening in order to balance the benefit and the harm of multiple screenings.
To find a proper screening schedule, one needs to run cross-sectional experiments on a retrospective cohort to find the optimal ages for screening. The authors of a recent paper [9] proposed a tool, called the Collaborative Open Outcomes tooL (COOL), that computes the quality performance of a proposed screening schedule according to measures that balance the benefit and the harm of the schedule. However, computing these measures for a given schedule is very time consuming. In this paper, we propose to make these computations much faster. This enhancement increases the utility of the tool, because multiple schedules can be compared much faster to find the optimal screening schedule (according to the given measures).

A. Data Structure
We explain the structure of the data used for defining the screening schedule. The data contain biomarker information for each subject. Each subject may visit the clinic multiple times, and each time a blood sample is taken from the subject to assess the development of biomarkers. The value of each biomarker is either positive (the autoantibody has developed) or negative. It is worth mentioning that each subject may have a different number of visits.
Notations: We use an upper-case letter to denote a matrix (a two-dimensional array), e.g. $X$, a boldface letter to denote a vector, e.g. $\mathbf{x}$, and an italic letter to denote an entry or element, e.g. $x$. $\mathbf{x}[i]$ represents the entry $i$ of the vector $\mathbf{x}$. $X[i]$ represents the $i$-th row of the matrix $X$, and $X[i][j]$ represents the entry in row $i$ and column $j$.
Mathematically, let us define the data for a subject $i$ as the sequence of visits $(t_i^j, \mathbf{x}_i^j)$, $j = 1, \dots, T_i$, where $T_i$ is the number of visits for the subject $i$, $t_i^j$ is the subject's age at the visit $j$, and $\mathbf{x}_i^j \in \{0, 1\}^M$ is the list of $M$ biomarkers for the visit $j$. In addition, the information about whether and when the disease was developed is recorded. For simplicity, we assume that each subject either developed the disease within a predefined period of time from birth or has been observed for the full period without developing the disease. If the subject has developed the disease, the subject is not followed afterwards. $y_i$ is the age when the disease was developed and $-1$ otherwise.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022

Example II.1. Subject 1 has a sequence of $T_1 = 4$ visits. Each visit has measurements for $M = 4$ biomarkers. The first visit was measured at age $t_1^1 = 1$ year, and the second and the fourth biomarkers were positive while the other two biomarkers were negative. The second visit was sampled at age $t_1^2 = 2.3$ years. We can see that the fourth biomarker turned negative in the second visit while the first biomarker became positive. The subject developed the disease at age $y_1 = 9$ years. The second subject has $T_2 = 3$ visits at ages 2.4, 6, and 9.2 years and has not developed the disease, i.e. $y_2 = -1$. We clearly see that each subject may have a different number of visits and that these visits might be sampled at different ages. A graphical representation of these data is shown in Fig. 1.

One simple data structure that can be used to store the data for all subjects is a 3-dimensional array, where the first dimension is the number of subjects $N$, the second dimension is the maximum number of visits $S = \max_i \{T_i\}$, and the third dimension is the number of biomarkers $M$, i.e. $\mathbb{R}^{N \times S \times M}$. However, there are two challenges in storing the data in a three-dimensional array.
The first challenge is that each subject may have a different number of visits. The second challenge is the irregularity in the biomarkers collections. As seen from the example, the biomarkers are collected at different and irregular time stamps. These two issues pose a challenge to store the data for all subjects in a 3-dimensional array, which assumes that the data are time-aligned. A better data structure for storing such information would be a 2-dimensional array (matrix) with a special structure.
Let us assume that $T = \sum_{i=1}^{N} T_i$ is the total number of visits across all $N$ subjects. We construct a matrix $X$ with dimensions $T \times (M + 2)$, where each row represents one visit for a particular subject. The first column in the matrix represents the subject index, the second column is the age of the subject at the current visit, and the other $M$ columns are the values of the biomarkers. Data is sorted in ascending order by subject index and age. An additional array $\mathbf{y}$ stores the age at which the subject developed the disease, i.e. $\mathbf{y}[i]$ is the age when the subject $i$ developed the disease and $-1$ otherwise.
Example II.2. In the matrix $X$ for the data in Example II.1, subject 1 has 4 rows, representing its 4 visits, and subject 3 has two rows.
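To make the row layout concrete, the following minimal Python sketch stores visit-level data in the format of the matrix $X$ using plain tuples. Only the values stated in Example II.1 are taken from the text; the rows marked "assumed" are invented for illustration, and the label array is represented as a dictionary keyed by subject index, which is a convenience choice rather than the paper's implementation.

```python
# Each row: (subject index, age at visit, biomarker 1, ..., biomarker M).
# Values marked "assumed" are hypothetical and not taken from the paper.
X = [
    (2, 6.0, 0, 0, 0, 0),   # biomarker values assumed
    (1, 1.0, 0, 1, 0, 1),   # from Example II.1
    (1, 2.3, 1, 1, 0, 0),   # from Example II.1 (biomarker 3 assumed unchanged)
    (1, 4.0, 1, 1, 0, 0),   # assumed
    (1, 7.5, 1, 1, 1, 0),   # assumed
    (2, 2.4, 0, 0, 0, 0),   # biomarker values assumed
    (2, 9.2, 0, 1, 0, 0),   # biomarker values assumed
    (3, 1.9, 0, 0, 1, 0),   # assumed
    (3, 5.1, 0, 0, 1, 1),   # assumed
]

# Sort in ascending order by subject index, then by age
X.sort(key=lambda row: (row[0], row[1]))

# Label array: age at disease onset, or -1 if the disease was not developed
y = {1: 9.0, 2: -1, 3: -1}   # label of subject 3 assumed

T = len(X)            # total number of visits
M = len(X[0]) - 2     # number of biomarkers

# Number of rows (visits) per subject
visits = {}
for row in X:
    visits[row[0]] = visits.get(row[0], 0) + 1
```

Because every visit is one row, subjects with different numbers of visits and irregular ages need no padding or time alignment.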

B. Single-Age Screening
Problem 1 (Single-age screening). At which age $a$ are subjects with a positive test at that age likely to develop the disease within the observation period?
The objective of screening at a single age is to assess the likelihood that a subject has the disease. Let us assume that the screening test is whether any biomarker is positive. The question is then how likely subjects with any positive biomarker at a given age are to develop the disease within the observation period (e.g. within 10 years from birth). In order to compute the quality performance of the screening at a single age, we need to compute the counts in Table I.

TABLE I

Screening test    Developed the disease           Not developed the disease
Positive          # true positives (TP)           # false positives (FP)
Negative          # false negatives (FN)          # true negatives (TN)
No test           # no test and positives (NP)    # no test and negatives (NN)

Each subject will be placed in one of these six cells. If the subject tested positive and developed the disease, the subject is counted in the TP cell. FP is the number of subjects who tested positive and have not developed the disease. Similarly, FN (TN) is the number of subjects who tested negative and developed (did not develop) the disease. Finally, since not all subjects necessarily have a visit at a particular age, some subjects may have no screening test and will therefore be missing from the screening test. This is accounted for in the last row of Table I.

Using the information provided in Table I, the screening test is usually evaluated using the sensitivity and the specificity measures [11]. The sensitivity is the probability that the screening test is positive among those who have the disease. Specificity is the probability that the screening test is negative among those who do not have the disease [12]. These two measures can be computed as:

$$\text{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{1}$$

$$\text{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \tag{2}$$

Example II.3. If the sensitivity is 80%, it means that 80% of diseased subjects are identified as diseased (have a positive test). If the specificity is 90%, it means that 90% of non-diseased subjects have a negative test (correctly identified as non-diseased).
These two measures are important as they measure the percentage of diseased individuals who have positive test results and the percentage of non-diseased individuals who have negative test results, respectively. Nevertheless, these two measures assume that the test result for each subject is known, i.e. they do not account for subjects with missing tests. Cumulative sensitivity (CSen) and dynamic specificity (DSpc) address this issue [13]:

$$\mathrm{CSen} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{NP}} \tag{3}$$

$$\mathrm{DSpc} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP} + \mathrm{NN}} \tag{4}$$

As can be seen, CSen and DSpc are computed over all subjects who do and do not have the disease, respectively, including those with no test. However, from the subject's perspective, these two measures do not give insight into the likelihood of developing the disease if the test result is positive or negative. Positive predictive value (PPV) and negative predictive value (NPV) answer this question:

$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \tag{5}$$

$$\mathrm{NPV} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FN}} \tag{6}$$

PPV is the probability of having the disease among those subjects who tested positive. NPV is the probability of not having the disease among those subjects who tested negative.
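The four measures can be computed directly from the six counts of Table I. The sketch below is a minimal illustration, assuming that CSen and DSpc add the untested subjects (NP and NN) to the denominators of sensitivity and specificity; the function name `screening_measures` and the example counts are hypothetical.

```python
def screening_measures(TP, FP, FN, TN, NP, NN):
    """Return (CSen, DSpc, PPV, NPV) from the six counts of Table I."""
    def ratio(num, den):
        # Guard against empty groups (e.g. nobody tested positive)
        return num / den if den > 0 else float("nan")

    csen = ratio(TP, TP + FN + NP)   # all diseased subjects, tested or not
    dspc = ratio(TN, TN + FP + NN)   # all non-diseased subjects, tested or not
    ppv = ratio(TP, TP + FP)         # diseased among positive tests
    npv = ratio(TN, TN + FN)         # non-diseased among negative tests
    return csen, dspc, ppv, npv


# Hypothetical counts for illustration only
csen, dspc, ppv, npv = screening_measures(TP=8, FP=4, FN=1, TN=80, NP=1, NN=16)
```

With these made-up counts, one diseased subject who missed the test lowers CSen to 8/10, whereas the plain sensitivity would be 8/9.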
So, in order to evaluate the performance of a screening test, we need to compute four measures: CSen, DSpc, PPV, and NPV. Algorithm 1 evaluates the performance of a screening at a given age $a$ by computing these four measures.
Algorithm 1 takes as parameters the age $a$ at which the screening will be evaluated, the data matrix $X$ that encodes the age and the biomarker information, and the label array $\mathbf{y}$ that encodes the age at which the disease was developed. The algorithm utilizes an array found to mark whether the subject has been tested at the given age. In line 1, the algorithm initializes the boolean array found with false. In line 2, it initializes all counts with zero. Then, it loops over all rows in the data matrix $X$ (line 3), and for each row it checks whether the age of the current visit is within a specified window of 6 months around the given age (line 7). If yes, it marks that the subject has been tested (line 8) and checks the result of the screening test (line 9) using the function IsPos. If the test result is positive (line 10), the algorithm calls the function PositiveTest in Algorithm 2, which updates the number of true positives if the subject has developed the disease and the number of false positives otherwise. If the test result is negative (line 11), the algorithm calls the function NegativeTest, which updates either the false negatives if the subject developed the disease or the true negatives if the subject has not developed the disease.
Finally, after iterating over the entire matrix $X$, the algorithm iterates over the found array (line 13) to find those who have not been tested at the given age (line 14) and calls the function MissingTest in line 15 to compute the number of subjects who missed the screening test and developed (NP) or did not develop (NN) the disease. After computing the TP, TN, FP, FN, NP, and NN counts, the algorithm uses equations (3)-(6) to compute CSen, DSpc, PPV, and NPV for the single-age screening at age $a$.

Example II.4. We compute the quality performance of screening for any biomarker (if any biomarker is positive, the result of the test is positive) at age 2 using the data provided in Example II.2. The summary statistics of screening at age 2 are given in Table II.

Screening at a single age might not perform well, as some subjects might miss the screening test, which reduces the sensitivity and/or specificity of the test. To increase the quality performance of a screening, one can screen twice so that those subjects who missed the first screening can be covered by the second screening. This is discussed in the next section.
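The single-age procedure described above can be sketched in Python as follows. This is an illustrative reading of the description, not the paper's listing: the half-width of 0.5 years for the age window, the rule that only a subject's first visit inside the window is counted, the dictionary-based label array, and the small dataset are all assumptions.

```python
def single_age_screening(a, X, y, half_width=0.5):
    """Count TP, FP, FN, TN, NP, NN for screening at age a.

    X: rows (subject id, age, biomarker_1, ..., biomarker_M).
    y: dict mapping subject id to onset age, or -1 if no disease.
    half_width: assumed half-width of the age window, in years.
    """
    found = {sid: False for sid in y}          # has the subject been tested?
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0, "NP": 0, "NN": 0}

    for sid, age, *biomarkers in X:
        if found[sid] or abs(age - a) > half_width:
            continue                            # outside window or already counted
        found[sid] = True
        positive = any(biomarkers)              # test: any positive biomarker
        diseased = y[sid] != -1
        if positive:
            counts["TP" if diseased else "FP"] += 1
        else:
            counts["FN" if diseased else "TN"] += 1

    # Subjects with no visit in the window missed the screening test
    for sid, tested in found.items():
        if not tested:
            counts["NP" if y[sid] != -1 else "NN"] += 1
    return counts


# Hypothetical data (not the paper's Example II.2)
X = [
    (1, 1.0, 0, 1, 0, 1),
    (1, 2.3, 1, 1, 0, 0),
    (2, 2.4, 0, 0, 0, 0),
    (2, 6.0, 0, 0, 0, 0),
    (3, 5.1, 0, 0, 1, 0),
]
y = {1: 9.0, 2: -1, 3: -1}
counts = single_age_screening(2, X, y)
```

Every call scans all rows of `X`, which is the cost the improved algorithms later avoid.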

C. Two-Age Screening
Problem 2 (Two-age screening). How likely are subjects with a positive test at either one of a pair of ages $a$ and $b$ to develop the disease within the observation period?
The screening test can be performed at the first age $a$. If the result is positive, then there is no need to screen again and the final result is positive. If the screening test is negative or the subject missed the first screening, then another screening is required at the second age, and the result of the second screening determines the final result. If the subject missed both screenings, then it will be counted in either NN or NP depending on whether the subject developed the disease. The two-age screening can be visualized as in Fig. 2.

Algorithm 3 describes the two-age screening process. The algorithm takes a pair of ages $a$ and $b$, where $a < b$, to compute the screening results. For each row in $X$, it tests whether the current visit is within the window of 6 months around age $a$ (line 7). If the subject has a visit within that window, the algorithm applies the screening test (line 8), and if the result is positive, it marks that the subject id has a positive test at age $a$ (line 9) and then updates TP and FP in line 10. If the result is negative, it marks that the subject has a negative first screening (line 12).
If the current visit is not within the window of 6 months around $a$, the algorithm checks for the second screening (note that the matrix $X$ is sorted in ascending order by age). If the visit is within the window of 6 months around the second age $b$ and the subject has no positive result in the first screening (line 13), then it checks the result of the screening at the second age (line 14). If the screening at the second age is positive, the algorithm marks that the second screening is positive (line 15) and updates the counts TP and FP (line 16). If the second screening is negative, it marks that the second screening is negative (line 18).
After iterating over all rows in $X$, the algorithm iterates over all subjects to find those who missed both the first and the second tests and counts them in NP or NN.

Although the time complexity of Algorithm 3 is $O(T \cdot M + N) \approx O(T)$, the actual running time is very large, especially if the algorithm needs to be executed multiple times. For example, in almost all cases in a medical context, a confidence interval for each measure (sensitivity, specificity, PPV and NPV) is required. To compute the confidence interval [14], the algorithm needs to be run thousands of times on different samples of the matrix $X$. In addition, to compare different screening schedules, we compute the confidence interval for each schedule and compare them to see how statistically significant the difference between the screening schedules is [15]. Therefore, the algorithm that computes the quality performance of the screening needs to be fast enough that all these experiments can be run in a reasonable time.
To do that, we perform a data preprocessing step that needs to be done only once, and then we revise Algorithm 3 so that it can be run multiple times and return the results much faster than the original version.
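For reference, the original row-scanning two-age procedure described in this section can be sketched as below. It is a simplified illustration under stated assumptions: the window half-width of 0.5 years is assumed, counts are tallied after the scan rather than inside the loop, a subject who is negative at $a$ and untested at $b$ is treated as negative, and the data are hypothetical.

```python
def two_age_screening(a, b, X, y, half_width=0.5):
    """Count TP, FP, FN, TN, NP, NN for screening at ages a < b."""
    # Per-subject result at each age: None = missed, True/False = test result
    first = {sid: None for sid in y}
    second = {sid: None for sid in y}

    for sid, age, *biomarkers in X:
        if abs(age - a) <= half_width:
            if first[sid] is None:
                first[sid] = any(biomarkers)
        elif abs(age - b) <= half_width and first[sid] is not True:
            # Second screening only matters without a positive first result
            if second[sid] is None:
                second[sid] = any(biomarkers)

    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0, "NP": 0, "NN": 0}
    for sid, onset in y.items():
        diseased = onset != -1
        if first[sid] is None and second[sid] is None:
            key = "NP" if diseased else "NN"     # missed both screenings
        elif first[sid] or second[sid]:
            key = "TP" if diseased else "FP"     # positive at either age
        else:
            key = "FN" if diseased else "TN"     # negative final result
        counts[key] += 1
    return counts


# Hypothetical data for illustration
X = [
    (1, 1.0, 0, 1, 0, 1),
    (1, 2.3, 1, 1, 0, 0),
    (2, 2.4, 0, 0, 0, 0),
    (2, 6.0, 0, 0, 0, 0),
    (3, 5.8, 0, 0, 1, 0),
    (4, 8.0, 1, 0, 0, 0),
]
y = {1: 9.0, 2: -1, 3: -1, 4: 8.5}
counts = two_age_screening(2, 6, X, y)
```

Each evaluation of a pair of ages walks all $T$ rows, so repeating it thousands of times for confidence intervals multiplies that cost.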

D. Improved Two-Age Screening
We start with the improved algorithm for the two-age screening which can be easily modified for single-age screening. To improve the computational time of the two-age screening algorithm, we preprocess the data in a different data structure so that the computation becomes faster. The preprocessing step needs to be executed only once for the data and then each application of the two-age screening uses the preprocessed data and returns the results faster than the original algorithm.
For now, let us assume that we have already constructed a matrix $B$ that contains the biomarker information, which will be used by the screening schedule algorithm (the construction of this matrix is explained in Section II-E). $B \in \mathbb{Z}^{N \times A}$, where $A$ is the number of all possible distinct ages in the data at which the screening is to be evaluated, and $N$ is the number of subjects. The entry $B[id][a] \in \{-1, 0, 1, 2, \dots, 2^M - 1\}$ holds the encoding of the biomarkers for the subject $id$ at age $a$. Since the biomarkers are binary, the number of all possible cases of biomarker values is $2^M$ (note that the number of biomarkers is usually small in these applications, as explained in the introduction section). The value $-1$ indicates that the subject $id$ missed the test at age $a$.
Note that all ages are rounded given the window of interest. For example, visits at age 2.4 are considered to be at age 2 (this is similar to line 6 in Algorithm 1).

Given the matrix $B$, the improved algorithm for two-age screening is rewritten in Algorithm 4. The algorithm iterates over all subjects (line 1), and for each subject $id$ it checks whether the screenings at age $a$ and age $b$ are both missing (line 2), in which case it marks the final result as missing (line 3). If one of the tests is positive (line 4), it marks the final result as positive (line 5). Otherwise, it marks the final result as negative (line 7).

Algorithm 4: Improved Two-Age Screening (ITS)
Input: Ages $a$, $b$ where $a < b$, biomarkers matrix $B$, label array $\mathbf{y}$
Return: CSen, DSpc, PPV, and NPV.
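The improved two-age procedure can be sketched in Python as follows. This is a minimal illustration, not the paper's listing: it assumes that ages are column indices of $B$, that code 0 means all biomarkers are negative (so the "any positive biomarker" test is `code > 0`), that subjects are indexed from 0, and it returns the six counts from which equations (3)-(6) follow. The example matrix is hypothetical.

```python
def improved_two_age_screening(a, b, B, y):
    """Count TP, FP, FN, TN, NP, NN using the encoded matrix B.

    B[sid][age]: -1 if the test is missing, otherwise the biomarker code.
    y[sid]: onset age, or -1 if the disease was not developed.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0, "NP": 0, "NN": 0}
    for sid, codes in enumerate(B):
        code_a, code_b = codes[a], codes[b]
        diseased = y[sid] != -1
        if code_a == -1 and code_b == -1:
            key = "NP" if diseased else "NN"     # both screenings missing
        elif code_a > 0 or code_b > 0:
            key = "TP" if diseased else "FP"     # positive at either age
        else:
            key = "FN" if diseased else "TN"     # negative final result
        counts[key] += 1
    return counts


# Hypothetical matrix with N = 4 subjects and A = 10 ages (0..9)
B = [[-1] * 10 for _ in range(4)]
B[0][1], B[0][2] = 10, 3
B[1][2], B[1][6] = 0, 0
B[2][6] = 4
B[3][8] = 1
y = [9.0, -1, -1, 8.5]
counts = improved_two_age_screening(2, 6, B, y)
```

The loop touches two entries per subject instead of scanning every visit row, so one evaluation costs $O(N)$ rather than $O(T \cdot M + N)$.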

E. Data Preprocessing for ITS
We preprocess the data only once to construct the biomarker encoding matrix $B$, which makes the algorithm run faster, as is evident from our experiments. The algorithm for constructing the matrix $B$ is shown in Algorithm 5. The algorithm iterates over all rows of the matrix $X$ (line 2). For each row, it maps the age to the closest age (line 6), encodes the biomarkers (line 7), and stores the value in the matrix $B$ (line 8). To encode the biomarker information into one integer value (line 9), we multiply the biomarker vector by the encoding vector (line 10) to obtain the code value (line 11).
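The preprocessing step can be sketched as below. The encoding vector is assumed here to hold powers of two, so each binary biomarker vector maps to a unique integer between 0 and $2^M - 1$; the half-up rounding of ages to whole years, the 0-based subject indices, and the example rows are likewise assumptions made for illustration.

```python
def encode(biomarkers, weights):
    """Dot product of the biomarker vector and the encoding vector."""
    return sum(b * w for b, w in zip(biomarkers, weights))


def build_B(X, n_subjects, n_ages):
    """Build the N x A matrix of biomarker codes; -1 marks a missing test."""
    M = len(X[0]) - 2
    weights = [2 ** k for k in range(M)]      # assumed encoding vector
    B = [[-1] * n_ages for _ in range(n_subjects)]
    for sid, age, *biomarkers in X:
        col = int(age + 0.5)                  # map to the closest whole age
        # A later visit mapped to the same age overwrites the earlier one
        B[sid][col] = encode(biomarkers, weights)
    return B


# Hypothetical rows: (subject id, age, four binary biomarkers)
X = [
    (0, 1.0, 0, 1, 0, 1),
    (0, 2.3, 1, 1, 0, 0),
    (1, 2.4, 0, 0, 0, 0),
    (1, 6.0, 0, 0, 0, 0),
    (2, 5.8, 0, 0, 1, 0),
]
B = build_B(X, n_subjects=3, n_ages=10)
```

Because the code for "all biomarkers negative" is 0 under this encoding, a screening test such as "any biomarker positive" reduces to a single integer comparison on the stored value.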

F. Improved Single-Age Screening
The improved algorithm for single-age screening is shown in Algorithm 6. Its final step (line 8) computes CSen, DSpc, PPV, and NPV using equations (3)-(6).

III. EXPERIMENTS
We evaluated the performance of the SS (Algorithm 1), TS (Algorithm 3), ISS (Algorithm 6) and ITS (Algorithm 4) algorithms on datasets with different numbers of subjects and visits. The description of the datasets is shown in Table III. The experiments were run on a Mac laptop with a 2.7 GHz Quad-Core Intel Core i7 processor and 16 GB of memory. The screening test used for these experiments is to test for any positive biomarker, i.e. if any biomarker is positive, the result of the screening test is positive. The code is written in Python [16]. Python has a data structure called the pandas dataframe [17], which can be used to store the information in the matrix $X$. Using the dataframe, the SS and TS algorithms can run even faster if we first filter the dataframe to the rows where the age is within the 6-month window of the given age $a$. This is done with a single dataframe filtering command, which we used in all our experiments for the SS and TS algorithms.

A. Single Age Screening
We compared the running time of the single-age screening algorithms SS and ISS on different datasets. The experiments were run multiple times, and the median and quartiles of the running times are reported in Fig. 3. It is clear that the running time of the SS algorithm increases linearly with the dataset size. It takes about 80 seconds for Algorithm 1 to compute the quality performance of screening at a single age on data with about 19,000 subjects, while the improved algorithm ISS takes only a fraction of a second to get the results. The running time for the ISS algorithm is shown in Fig. 4.

B. Two Ages Screening
We compared the running time of the TS and ITS algorithms to compute the performance of the two-age screening. The results are shown in Fig. 5. A very similar behavior is observed. The TS algorithm scales linearly with the dataset size. The ITS algorithm is much faster than the TS algorithm. The running time for the ITS algorithm is shown in Fig. 6. ITS takes only 0.1 seconds to compute the quality performance of screening at a given pair of ages, while TS takes 175 seconds.

C. Data Preprocessing for ISS and ITS
The additional overhead that the improved algorithms add on top of the original algorithms is the data preprocessing, i.e. the construction of the biomarker encoding matrix $B$. This step is required only once for each dataset. The running time for Algorithm 5 is shown in Table IV.

IV. CONCLUSION
Screening of biomarkers is of utmost importance to assess the risk of developing autoimmune diseases such as type 1 diabetes and celiac disease. To improve the quality performance of the screening test, screening more than one time is required. Algorithms to compute the quality performance of a screening schedule were developed as part of a screening tool. However, the running time of these algorithms is large, which hinders the utility of the tool in large applications. We improved the running time of the screening algorithms by more than 800 times at the additional cost of preprocessing the data only once. We evaluated the running time of these screening algorithms on datasets of different sizes.