DNA Profiling: An Investigation of Six Machine Learning Algorithms for Estimating the Number of Contributors in DNA Mixtures

DNA (deoxyribonucleic acid) profiling analyzes the sequences of individual or mixed DNA profiles to identify the persons to whom these profiles belong. It is used in important applications such as paternity testing and, in forensic science, the identification of persons at a crime scene. Estimating the number of contributors in a DNA mixture is a major task in DNA profiling, and it is made difficult by allele dropout, stutter, blobs, and noise. Existing methods for estimating the number of contributors (unknowns) in a DNA mixture suffer from high computational complexity and limited accuracy. Machine learning has recently received attention in this area, but with limited success, and more work is needed to improve the robustness and accuracy of these methods. Our research aims to advance the state of the art in this area. Specifically, in this paper, we investigate the performance of six machine learning algorithms, namely K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), Stochastic Gradient Descent (SGD), and Gaussian Naïve-Bayes (GNB), applied to a publicly available dataset called PROVEDIt that contains mixtures with up to five contributors. We evaluate the algorithms using confusion matrices and four performance metrics: Accuracy, F1-Score, Recall, and Precision. The results show that LR provides the highest Accuracy, 95%, for mixtures with up to five contributors.
Keywords: Machine learning; DNA profiling; DNA mixtures


I. INTRODUCTION
Most of the genome is identical between different individuals. The differences that do exist are the basis of Deoxyribonucleic acid (DNA) profiling, the process that exploits these differences to distinguish between individuals [1]. DNA profiling analyzes DNA sequences referred to as genetic markers, the most commonly used of which are Short Tandem Repeats (STRs) [1]. DNA profiling is used in important applications such as paternity testing and, in forensic science, the identification of persons at a crime scene [2]. Determining the number of contributors is one of the essential stages in DNA profiling. This task is often not straightforward because of allele dropout, stutter, blobs, and noise [3], [4].
The current methods for finding the number of unknowns in DNA mixtures can be divided into three types [5]. The first type comprises the basic methods, which are compute-intensive, slow, and have accuracy issues (e.g., [6]). The second type comprises high-performance computing (HPC) methods, which are faster but remain highly compute-intensive, and whose accuracy requires significant improvement (e.g., [7]). The third type comprises machine learning methods, which are faster but whose classification accuracy and robustness still need to be improved (e.g., [8]).
Recent years have seen rapid and considerable growth in the use of machine learning in different fields, with promising results [9]. However, few researchers have examined machine learning for inferring the number of contributors in a DNA profile mixture. To the best of our knowledge, there are three works to date [8], [10], [11], and each one approaches the problem from a different perspective. The research on machine learning based DNA profiling is in its infancy, and more work is needed to improve the diversity and accuracy of the machine learning methods. Our research aims to advance the state of the art in the DNA profiling domain. Specifically, in this paper, we investigate the performance of six machine learning algorithms, namely K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), Stochastic Gradient Descent (SGD), and Gaussian Naïve-Bayes (GNB), applied to a publicly available dataset called PROVEDIt. The dataset contains DNA mixtures with up to five contributors.
We have investigated the performance of these algorithms in detail using four performance metrics, namely Accuracy, F1-Score, Recall, and Precision. The performance of each algorithm has been analyzed using confusion matrices and graphs of the metrics for each of the five classes: One-Person, Two-Person, Three-Person, Four-Person, and Five-Person.
For KNN, the highest values of the F1-Score, Recall, and Precision metrics were all achieved for the Five-Persons class, at 68%, 62%, and 75%, respectively. For the RF algorithm, the highest values of the F1-Score, Recall, and Precision metrics were achieved for the Five-Persons class at 86%, the One-Person class at 88%, and the Five-Persons class at 90%, respectively.
The rest of the paper is organized as follows. Section II briefly reviews the related works. Section III describes the methodology of the proposed work. Section IV presents the results and their analyses for the six machine learning algorithms. Section V contains the conclusion and future work.

II. RELATED WORK
The methods for estimating the number of contributors in a DNA mixture can be divided into three types. These are basic methods, HPC methods, and machine learning-based methods. The basic methods and tools include, among others, Maximum Allele Count (MAC) [6], Total Allele Count (TAC) [11], MLE [12], DNA Mixtures [13], Lab Retriever [14] and DNA MIX [15]. The parallel or HPC methods include Euroformix [16], LikeLTD [17] and NOCIt [4], [5], [18]. To the best of our knowledge, only three works have used machine learning to determine the number of contributors in a DNA profile. Since machine learning is the focus of our research, these three methods are reviewed below in some detail.
Marciano and Adelman [8] evaluated five machine learning algorithms and selected the SVM, which reached 98% accuracy in the training stage and 97% accuracy in the testing stage for mixtures with up to four contributors. Note that this 97% accuracy was obtained on a dataset with up to four contributors; with five contributors, the accuracy would typically be lower because of the larger number of classes. Their data consisted of 1405 profiles from 20 individuals. Benschop et al. [11] examined ten machine learning algorithms and selected the RFC model with 19 features. They used 590 profiles ranging from single-person to five-person mixtures and removed both the Amelogenin and Y-chromosomal markers. There were more than 250 features for each profile, but they chose only the best 50 features. They reported an Accuracy of 83%. Kruijver et al. [10] used decision trees on 766 profiles from the GlobalFiler multiplex with a 25-second injection, and reported Accuracy between 77.9% and 85.2%.
The research on machine learning based DNA profiling is in its infancy, and more work is needed to improve the diversity and accuracy of the machine learning methods. Our research aims to advance the state of the art in the DNA profiling domain. Specifically, in this paper, we investigate the performance of six machine learning algorithms.

III. METHODOLOGY AND DESIGN
This section presents the proposed methodology for this work, depicted in Fig. 1. Section III.A briefly describes the dataset, Section III.B explains the ML models used in this work, and Section III.C presents the evaluation metrics.

A. The Dataset
The DNA profiles were taken from the public dataset PROVEDIt [19]. This dataset contains more than 25,000 STR profiles of DNA mixtures ranging from one to five contributors, generated with more than one kit and with different cycle numbers and injection times. Fig. 2 shows the number of profiles that we took from this dataset. We took 156 profiles for each of the five classes, giving 780 DNA profile mixtures and therefore 18,720 samples (780 profiles × 24 markers). When collecting the data, we made sure that it covered different injection times and cycle numbers.
We encountered several challenges in the preprocessing stage, including handling empty cells and OL (off-ladder) values, and deleting unwanted markers. All of these challenges were addressed during preprocessing in order to prepare the dataset for the classification stage.
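The class-balanced selection described above can be sketched as follows. This is only an illustration: the paper does not state how the 156 profiles per class were chosen, so the random draw, the `balanced_sample` helper, and the toy profile identifiers are all assumptions.

```python
import random
from collections import defaultdict

MARKERS_PER_PROFILE = 24  # value quoted in the text


def balanced_sample(profiles, per_class, seed=0):
    """Draw `per_class` profiles at random from every contributor class.

    `profiles` is a list of (profile_id, number_of_contributors) pairs.
    Random selection is an assumption, not the authors' documented procedure.
    """
    by_class = defaultdict(list)
    for profile_id, n_contributors in profiles:
        by_class[n_contributors].append(profile_id)

    rng = random.Random(seed)
    selected = []
    for n_contributors in sorted(by_class):
        chosen = rng.sample(by_class[n_contributors], per_class)
        selected.extend((pid, n_contributors) for pid in chosen)
    return selected


# Toy pool with hypothetical identifiers: 200 profiles for each of the 5 classes.
pool = [(f"profile_{noc}_{i}", noc) for noc in range(1, 6) for i in range(200)]
subset = balanced_sample(pool, per_class=156)

print(len(subset))                        # 780 profiles
print(len(subset) * MARKERS_PER_PROFILE)  # 18720 samples
```

Fixing the seed keeps the selection reproducible, which matters when several classifiers are compared on the same subset.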
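One possible shape for this preprocessing step is sketched below. The paper does not say how empty cells and OL values were treated or which markers were deleted, so dropping the affected peaks, the row layout, and the `MARKER_TO_DROP` placeholder are assumptions made purely for illustration.

```python
OFF_LADDER = "OL"
# Hypothetical placeholder: the paper does not list the deleted markers.
UNWANTED_MARKERS = {"MARKER_TO_DROP"}


def preprocess(rows, unwanted_markers=UNWANTED_MARKERS):
    """Clean per-marker rows of an STR profile.

    Each row is a dict with keys "sample", "marker", "alleles", "heights"
    (an assumed layout). Peaks with an empty or OL allele call, or with a
    missing height, are dropped; rows for unwanted markers are removed.
    """
    cleaned = []
    for row in rows:
        if row["marker"] in unwanted_markers:
            continue  # delete unwanted markers
        peaks = [
            (allele, float(height))
            for allele, height in zip(row["alleles"], row["heights"])
            if allele not in ("", None, OFF_LADDER) and height not in ("", None)
        ]
        cleaned.append(
            {"sample": row["sample"], "marker": row["marker"], "peaks": peaks}
        )
    return cleaned


rows = [
    {"sample": "S1", "marker": "D3S1358",
     "alleles": ["15", "OL", ""], "heights": ["1200", "80", ""]},
    {"sample": "S1", "marker": "MARKER_TO_DROP",
     "alleles": ["X"], "heights": ["900"]},
]
print(preprocess(rows))
```

Other treatments, such as encoding OL as a separate category or filling empty cells with zeros, would fit the same structure.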

B. Machine Learning Methods
In this paper, we examined six different machine learning algorithms that are introduced below.
K-Nearest Neighbors (KNN) is considered one of the simplest classification algorithms. It assigns a sample to the class that is most common among the training samples closest to it [8].
Random Forest (RF) is an algorithm used for both classification and regression. As the name implies, it is an ensemble of multiple decision trees. The training data are divided into random subsets, and a decision tree is built for each of them. Each decision tree may give a different decision, and the majority result is taken [20].
Support Vector Machine (SVM) is a widely used algorithm for classification problems. When there is more than one way of drawing the line (boundary) that separates the classes, one solution is to measure the distance (margin) between the boundary and the data points closest to it (the support vectors). SVM chooses the boundary that maximizes this margin [8].
Stochastic Gradient Descent (SGD) is a suitable choice when the dataset is large and the computational budget is limited. At each step, it picks a single sample at random, computes the error on that sample, and then updates the weights [21].
Logistic Regression (LR) models the probability of each class as a logistic function of a linear combination of the independent variables. Unlike linear regression, which minimizes the sum of squared errors between the actual and predicted data points, LR fits its coefficients by minimizing the log loss between the predicted probabilities and the actual classes [8].
Gaussian Naïve-Bayes (GNB) models the features of each class with Gaussian distributions. It is suitable when the inputs are high-dimensional. It applies Bayes' theorem and assumes that each feature is independent of the other features [22].
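A minimal sketch of the nearest-neighbor idea is given below, assuming Euclidean distance and a majority vote among the k closest training samples. The two-dimensional toy features and the `knn_predict` helper are hypothetical and are not the features used in this paper.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to `query`."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy features (hypothetical): two numeric summaries per profile.
train_X = [(2.0, 1.0), (2.2, 1.1), (1.9, 0.9), (6.0, 3.0), (6.2, 3.1), (5.8, 2.9)]
train_y = [1, 1, 1, 3, 3, 3]  # number of contributors

print(knn_predict(train_X, train_y, (2.1, 1.0)))  # 1
print(knn_predict(train_X, train_y, (6.1, 3.0)))  # 3
```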
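To make the SVM margin idea concrete, the sketch below compares two candidate linear boundaries on toy data and reports the distance from each boundary to its closest point. It only evaluates given boundaries and does not train an SVM; the `geometric_margin` helper and the data are illustrative assumptions.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from the boundary w.x + b = 0 to any point.

    Labels are +1 or -1. A positive result means every point lies on the
    correct side of the boundary.
    """
    norm = math.sqrt(sum(w_i * w_i for w_i in w))
    return min(
        y * (sum(w_i * x_i for w_i, x_i in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [-1, -1, +1, +1]

# Two boundaries that both separate the classes: x1 = 2 and x1 = 1.5.
print(geometric_margin((1.0, 0.0), -2.0, points, labels))  # 1.0
print(geometric_margin((1.0, 0.0), -1.5, points, labels))  # 0.5
```

Both boundaries classify the toy points correctly, but the first one has the larger margin and is the one a maximum-margin classifier would prefer.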
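The single-sample update can be sketched as below. For brevity, the example fits a one-variable linear model with a squared-error loss, which is an assumption made for illustration; the paper does not specify the loss function or hyperparameters used with SGD.

```python
import random


def sgd_fit(xs, ys, lr=0.1, steps=5000, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))       # pick a single sample at random
        error = (w * xs[i] + b) - ys[i]  # prediction error on that sample
        w -= lr * error * xs[i]          # gradient step for the weight
        b -= lr * error                  # gradient step for the bias
    return w, b


xs = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
ys = [2.0 * x + 1.0 for x in xs]   # noiseless toy target
w, b = sgd_fit(xs, ys)
print(round(w, 3), round(b, 3))
```

Each update touches only one sample, so the cost per step does not grow with the size of the dataset.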
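A minimal sketch of how a multinomial logistic regression turns linear scores into class probabilities, and how the log loss scores those probabilities, is given below. The weights, the two toy features, and the helper names are hypothetical; no fitted coefficients from this paper are reproduced.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Class probabilities from one linear score per class."""
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)


def log_loss(probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_index])


# Hypothetical weights for five classes (one to five contributors), two features.
weights = [[-2.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
biases = [0.0] * 5
probs = predict_proba(weights, biases, (1.0, 0.0))
print(probs.index(max(probs)) + 1)  # predicted number of contributors: 5
print(log_loss(probs, true_index=4))
```

Training consists of adjusting the weights and biases so that the average log loss over the training samples is as small as possible.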
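The Gaussian and independence assumptions can be sketched as follows: each class gets a per-feature mean and variance, and a new sample is assigned to the class with the highest log-posterior. The toy data, the small variance floor, and the helper names are illustrative assumptions.

```python
import math
from collections import defaultdict


def fit_gnb(X, y):
    """Estimate a log-prior and per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for features, label in zip(X, y):
        grouped[label].append(features)

    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor (an assumption) avoids division by zero variance.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gnb(model, x):
    """Return the class with the highest log-posterior for sample `x`."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        # Independence assumption: per-feature log-likelihoods simply add up.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x_i - mean) ** 2 / (2 * var)
            for x_i, mean, var in zip(x, means, variances)
        )
        return log_prior + log_likelihood

    return max(model, key=log_posterior)


X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
y = [1, 1, 1, 4, 4, 4]
model = fit_gnb(X, y)
print(predict_gnb(model, (1.1, 2.1)))  # 1
print(predict_gnb(model, (5.1, 5.9)))  # 4
```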

C. Evaluation
In this work, we used four performance metrics: Accuracy, F1-Score, Recall, and Precision. They are calculated as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ \text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

where TP is True Positive, TN is True Negative, FN is False Negative, and FP is False Positive.
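As a worked check, the standard definitions of these metrics can be applied to TP, FP, and FN counts quoted in Section IV. The function names are illustrative; the counts are the ones reported for the KNN and SVM models.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all decisions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that are found."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# KNN, Five Persons class: TP = 626, FP = 206.
print(round(100 * precision(626, 206)))     # 75

# SVM, Five Persons class: TP = 972, FP = 49, FN = 42.
print(round(100 * precision(972, 49)))      # 95
print(round(100 * recall(972, 42)))         # 96
print(round(100 * f1_score(972, 49, 42)))   # 96
```

The rounded values agree with the percentages reported for these classes.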

IV. RESULTS AND ANALYSIS
This section presents the performance of the six algorithms. KNN, RF, SVM, SGD, LR, and GNB are analyzed in Sections IV.A to IV.F, respectively. Section IV.G compares the six algorithms, and Section IV.H provides a brief descriptive comparison of our work in this paper with the earlier related works.

A. KNN
For the KNN model, the highest score is the Precision of the Five Persons class (75%) because, referring to Fig. 3, TP for the Five Persons class is 626 and FP is 206. The lowest score is the Precision of both the Three Persons and Two Persons classes (45%) because TP for the Three Persons class is 558 and FP is 679, and TP for the Two Persons class is 506 and FP is 617. For F1-Score, the highest score is for the Five Persons class (68%), and the lowest is for the Two Persons class (47%). For Recall, the highest score is for the Five Persons class (62%), and the lowest is for the Two Persons class (49%). For Precision, the highest score is for the Five Persons class (75%), and the lowest is for both the Three Persons and Two Persons classes (45%).

B. RF
Fig. 6 shows the F1-Score, Recall, and Precision for the RF model. The highest score is the Precision of the Five Persons class (90%) because, referring to Fig. 5, TP for the Five Persons class is 840 and FP is 96. The lowest score is the Precision of the Two Persons class (73%) because TP for the Two Persons class is 822 and FP is 302. For F1-Score, the highest score is for the Five Persons class (86%), and the lowest is for the Two Persons class (76%). For Recall, the highest score is for the One Person class (88%), and the lowest is for the Three Persons class (79%). For Precision, the highest score is for the Five Persons class (90%), and the lowest is for the Two Persons class (73%).

C. SVM
For the SVM model, the highest score is the F1-Score and Recall of the Five Persons class (96%) because, referring to Fig. 7, TP for the Five Persons class is 972, FP is 49, and FN is 42. The lowest score is both the Recall of the Four Persons class (87%) and the Precision of the Three Persons class (87%) because TP for the Four Persons class is 911 and FN is 137, and TP for the Three Persons class is 932 and FP is 141. For F1-Score, the highest score is for the Five Persons class (96%), and the lowest is for the Four Persons, Three Persons, and Two Persons classes (89%). For Recall, the highest score is for the Five Persons class (96%), and the lowest is for the Four Persons class (87%). For Precision, the highest score is for the Five Persons class (95%), and the lowest is for the Three Persons class (87%).

D. SGD
Fig. 9 shows the confusion matrix for the SGD model. Fig. 10 shows the F1-Score, Recall, and Precision for the SGD model. The highest score is the Recall of both the Five Persons and One Person classes (100%) because, referring to Fig. 9, TP for the Five Persons class is 1009 and FN is 5, and TP for the One Person class is 1026 and FN is zero. The lowest score is the Recall of the Two Persons class (11%) because TP for the Two Persons class is 118 and FN is 913. For F1-Score, the highest score is for the Five Persons class (93%), and the lowest is for the Two Persons class (20%). For Recall, the highest score is for both the Five Persons and One Person classes (100%), and the lowest is for the Two Persons class (11%). For Precision, the highest score is for the Five Persons class (88%), and the lowest is for the Three Persons class (45%).

E. LR
Fig. 11 shows the confusion matrix for the LR model. For the LR model, the highest score is the Recall of the Five Persons class (98%) because, referring to Fig. 11, TP for the Five Persons class is 989 and FN is 25. The lowest score is the Recall of the Four Persons class (91%) because TP for the Four Persons class is 958 and FN is 90. For F1-Score, the highest score is for the Five Persons class (96%), and the lowest is for the Four Persons, Three Persons, and Two Persons classes (94%). For Recall, the highest score is for the Five Persons class (98%), and the lowest is for the Four Persons class (91%). For Precision, the highest score is for the One Person class (97%), and the lowest is for the Three Persons class (93%).

F. GNB
For the GNB model, the highest score is the Precision of the Five Persons class (100%) because, referring to Fig. 13, TP for the Five Persons class is 213 and FP is 0. The lowest score is the Recall of the One Person class (7%) because we know that TP for the One Person class is 70 and FN is 31. For F1-Score, the highest score is for the Three Persons class (71%), and the lowest is for the One Person class (12%). For Recall, the highest score is for the Three Persons class (83%), and the lowest is for the One Person class (7%). For Precision, the highest score is for the Five Persons class (100%), and the lowest is for the Two Persons class (43%).

G. Comparison of the Six Algorithms
Fig. 15 compares the six ML algorithms in terms of Accuracy. The x-axis shows the model names, and the y-axis shows the Accuracy percentage. The results show that LR has the highest score (95%), followed by SVM (91%), RF (82%), SGD (65%), KNN (55%), and finally GNB (52%).
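The per-class TP, FP, and FN values quoted in this section are read from confusion matrices, and the overall Accuracy is the share of samples on the diagonal. The sketch below shows that bookkeeping on a toy 3-class matrix, assuming that rows hold the true class and columns the predicted class; the matrix values and helper names are made up for illustration and are not results from this paper.

```python
def per_class_counts(cm):
    """Return (TP, FP, FN) for each class of a square confusion matrix.

    Assumes cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    counts = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # rest of row k
        fp = sum(cm[i][k] for i in range(n)) - tp  # rest of column k
        counts.append((tp, fp, fn))
    return counts


def overall_accuracy(cm):
    """Correctly classified samples divided by all samples."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Toy matrix (not from the paper).
cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
print(per_class_counts(cm))   # [(8, 1, 2), (7, 3, 3), (9, 2, 1)]
print(overall_accuracy(cm))   # 0.8
```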

H. Comparison with Related Works
Among all the earlier works in the literature on the use of machine or deep learning for estimating the number of unknowns, only Benschop et al. [11] and Kruijver et al. [10] estimated the number of unknowns for DNA mixtures with up to five contributors. The best Accuracy performance for Benschop et al. [11] was reported for the RF algorithm at 83%. The best Accuracy performance for Kruijver et al. [10] was reported for the Decision Trees algorithm at 85%. Comparing these results with our work presented in this paper, we have clearly achieved a better performance, i.e., for the LR algorithm at 95% Accuracy.

V. CONCLUSIONS AND FUTURE WORK
DNA profiling is considered one of the most challenging problems in forensic science. In the near future, forensic science labs will receive more profiles, many of them challenging to interpret, which shows the need for tools that help analysts in their work. In the coming years, machine learning will become an essential component in many fields.
This study evaluated six machine learning algorithms with four performance metrics: F1-Score, Recall, Precision, and Accuracy. The results show that the highest score for KNN is the Five Persons class Precision (75%); for RF, the Five Persons class Precision (90%); for SVM, the Five Persons class F1-Score and Recall (both 96%); for SGD, the Recall of both the Five Persons and One Person classes (100%); for LR, the Five Persons class Recall (98%); and for GNB, the Five Persons class Precision (100%). Across all algorithms, the highest F1-Score is 97%, for LR on the One Person class; the highest Recall is 100%, for SGD on the One Person and Five Persons classes; and the highest Precision is 100%, for GNB on the Five Persons class. In terms of Accuracy, the highest score is for LR (95%). Compared with all other related works in the literature, we have achieved a better performance, i.e., 95% Accuracy for the LR algorithm.
This paper provides an investigation into the performance of machine learning methods for DNA profiling. Further evaluation of machine learning methods is needed and will form our future work. We will use feature engineering methods to improve the performance of these machine learning methods, and we will investigate tuning them. Moreover, we will use deep learning to improve classification performance. The major themes of our research are smart cities and societies [23]-[25], big data [26]-[28], high performance computing [29], [30], healthcare [31]-[33], information systems [34], [35], system integration [36], [37], and artificial intelligence [38], [39]. Future work on DNA profiling will also look into developing new smart applications for DNA profiling and integrating them with other smart city systems.