Optimality Assessments of Classifiers on Single and Multi-labelled Obstetrics Outcome Classification Problems

It is indisputable that clinicians cannot state the outcome of pregnancies exactly through conventional knowledge and methods, even as human knowledge continues to surge. Hence, several computational techniques have been adapted for precise pregnancy outcome (PO) prediction. Obstetric datasets for PO determination exist as single label learning (SLL), multi-label learning (MLL) and multi-target prediction (MTP) problems. There is, however, no single classifier recommended to optimally satisfy the needs of all the classification types. This work therefore identifies six widely used PO classifiers and investigates their performances in all three classification categories to find the best performing classifier. An obstetric dataset, exposed to input rank analysis via Principal Component Analysis, produced thirteen (13) significant features for the experiment. Accuracy, F1-measure and build/test time were used as evaluation metrics. Decision tree (DT) had an average accuracy and F1 score of 89.23% and 88.23% respectively, with 1.0 average rank. Under the MLL configuration, average accuracy (91.71%) and F1 score (94.28%) were highest in the random forest (RF), which had a 1.0 average test time rank. Using MTP, DT had an average accuracy of 88.80% and an average F1 score of 71.13%, while the multi-layered perceptron (MLP) had the best time cost with an average rank value of 2.0. From the results, RF is optimal in terms of accuracy and average rank value, while DT is the most efficient in terms of time cost. The comparative analysis of global averages of the six base classifiers shows that RF is the best performing algorithm, with an average accuracy of 87.3% across all three data setups in the study. MLP, on the other hand, had an unexpectedly high time cost, making it unsuitable for similar data classifications if time is the main criterion.
It is recommended that the choice of classifier should be either RF or DT, depending on the application domain and whether or not time cost is a major consideration.

Keywords: Pregnancy outcome; random forest; multi-label learning; comparative analytics; machine learning algorithms; single label learning; maternal outcome prediction; decision tree


I. INTRODUCTION
Machine Learning (ML), a fast-rising branch of artificial intelligence (AI), encompasses computer science, engineering, mathematical sciences, cognitive science and many more disciplines [1]. The advancement and wide application of ML is largely due to the availability of enormous data repositories and the satisfaction and reliability of its performance in terms of accuracy and computational cost. It equips systems with the cognitive capability of understanding the concepts of their environments through the building of models and functions, and the communication of their experiences with patterns. These models and patterns are built and implemented through the process called ML. There are two key classes of ML: supervised ML (SML) and unsupervised ML (UML) [1]. Both UML and SML draw inferences by learning; however, UML utilizes datasets with input features only, while SML depends on datasets having both input and target attributes for mapping and extraction of relationships between input and output feature spaces. Any dataset with target or desired output variable(s) is referred to as a labelled dataset. Unlabelled datasets lack response variables and therefore do not support the model training activity needed by SML techniques [2][3][4]. In labelled datasets, every record has predefined class label(s), and such datasets support two broad types of data mining applications: regression and classification [5]. In regression tasks, the target variable(s) is in continuous numeric form, whereas classification requires class labels or categorical variables as the target. Classification is the most common and widely applied SML approach. It is aimed at identifying and assigning a membership class to a new record from a set of already defined classes [4,6]. Classification approaches are sub-divided into two groups according to the number of labels: single-label and multi-label.
The conventional single-label classification approach deals exclusively with disjoint classes, where each record belongs to exactly one class, whereas in multi-label classification the labels are intertwined and each record is associated with two or more class labels [7]. In single-label problems, the categories may comprise two labels (binary class) or more than two labels (multi-class). For example, in medical diagnosis, a laboratory test result might confirm the presence or absence of causative organisms in the tested patient's sample (a single-label, binary case), while the same patient can concurrently suffer from two or more diseases (a multi-label case).
In maternal healthcare (MHC), obstetricians are confronted with the task of ensuring the safety of both the mother and baby throughout pregnancy, during delivery, and within a specified period after delivery. This is achieved by providing specialized medical care services while the mother is expectant, during child delivery and after delivery: antenatal, neonatal and post-natal care services. They are therefore required to obtain clinical factors for the realization of the safety of the mother throughout pregnancy and birth, and of the newborn, in a bid to minimize mortality and morbidity. These involve simultaneous predictions of multiple outcomes regarding mother and neonatal status using common baseline risk factors. Maternal outcome, mother's status during and after delivery, neonatal physiological status, conditions and overall state, among others, are central in MHC management. Hence, multiple target prediction, multi-label and multi-class predictions are essential tasks in the obstetric healthcare domain. However, these maternal decisions are repeatedly made based on doctors' perceptions and experience without utilizing the pieces of vital knowledge concealed in the huge data repositories [8,9]. The authors in [10] state that only about 30% of pregnancy outcomes classified by gynecologists and obstetricians concerning pathological fetus or pregnancy turn out to be true. This limitation in current medical practice has led to several complications in deliveries and avoidable deaths from the over 130 million deliveries per year globally. It is therefore expected that a robust computational technique for accurate pregnancy outcome determination will be available to assist medical personnel.
Although solutions from data mining and computational models are laudable and widely accepted methods for medical predictions, none is confirmed as a universal and best-performing model for prediction of diverse maternal outcomes, individually or in a combined target setup. This paper aims at assessing the performance and suitability of classification algorithms on an obstetrics dataset under varying maternal outcome target configurations, given that they comprise binary, multi-class and multi-labeled target features. The remaining sections are structured as follows: Section 2 gives a review of related works on medical diagnoses regarding maternal health care management. In Section 3, the dataset acquisition, preprocessing and description are presented, while the methodology of the comparative analytics is described in Section 4. The predictive results, along with the evaluations of their performances and discussions, are described in Section 5, while conclusions and further directions are given in Section 6.

A. Single Label Learning
Classification tasks are broadly categorized into single-label learning (SLL) and multi-label learning (MLL) based on the nature of association existing between target labels and input patterns [11,12]. The goal of SLL is to build a model for the prediction of a distinct class label from a set of non-overlapping labels using input samples. It deals solely with disjoint classes and comprises two types: binary (or filtering, in the textual and web-data domain) [13] and multi-class classification [11]. Binary classification has two unique class labels and involves the mapping of input features to only one of the two classes based on an explicit assessment criterion. Examples include disease diagnosis (positive or not), gender discrimination (male or female), email spam detection (spam or not), quality control (pass or fail) and maternal status after delivery (alive or dead), among others. Some of the famous binary classification datasets are the adult dataset (adult.csv), used to predict whether a person's earnings per annum exceed $50,000; the titanic dataset (whose target indicates whether passengers survived); the diabetes dataset (positive or negative diabetic status); the Cleveland heart disease dataset; the ionosphere dataset; and the banknote authentication dataset (authentic or fake). Logistic Regression, k-Nearest Neighbors (KNN), decision trees (DT), support vector machine (SVM), Naive Bayes (NB) and neural networks (NN) are some notable binary classification algorithms. Unlike binary learning problems, which have two class labels, multi-class learning is applied to problems involving three or more disjoint class labels. It relies on the assumptions that 1) each observation is assigned to only a single label, and 2) each class label is independent of the others [6]. For example, a fruit can be only one of apple, mango, orange or pear, and a student can graduate with only one class of degree.
Iris, zoo, waveform, dermatology, sport, MNIST, glass and wine datasets are some examples of widely used multi-class datasets that are available in data repositories and widely reported in the literature. SVM, DT, multinomial logistic regression and multi-layered perceptron are suitable algorithms for multi-class tasks. Widely adopted methodologies for multi-class tasks include: 1) decomposing the target label space via one-vs-all, all-vs-all or error-correcting codes; 2) arranging the classes in a tree-like structure (hierarchical method); and 3) adapting and extending binary classifiers to perform multi-class classification tasks [11,14,15].
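The one-vs-all decomposition named in the first methodology can be illustrated with a minimal sketch. The function names and the neonatal-weight example labels below are illustrative assumptions, not part of the study's implementation; the sketch only shows how one multi-class target becomes several binary targets and how per-class scores are recombined.

```python
from typing import Dict, Hashable, List, Sequence


def one_vs_all_targets(labels: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Decompose a multi-class target into one binary target per class."""
    classes = sorted(set(labels), key=str)
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def combine_scores(scores: Dict[Hashable, float]) -> Hashable:
    """Return the class whose binary model produced the highest score."""
    return max(scores, key=lambda c: scores[c])


# Hypothetical labels, for illustration only.
y = ["low", "normal", "normal", "overweight", "low"]
binary_targets = one_vs_all_targets(y)
print(binary_targets["normal"])  # [0, 1, 1, 0, 0]

# Hypothetical scores from three binary models for one new record.
print(combine_scores({"low": 0.2, "normal": 0.7, "overweight": 0.1}))  # normal
```

Each binary target can then be handed to any binary classifier; the all-vs-all and error-correcting-code variants differ only in how the binary sub-problems are formed and decoded.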

B. Multi-label Learning
In real-world scenarios, the same set of input features is often used to concurrently predict more than one target variable. The target features may consist of binary labels, categorical values or continuous values. For binary target features the type of classification is MLL, while problems with real-valued target variables are referred to as multi-target regression. However, when the target features are categorical, it becomes a multi-target prediction problem. The MLL problem is a special kind of multi-target learning (MTL) (also called multi-dimensional or multi-objective learning): in MTL each label can take multiple values, as opposed to the binary labels of MLL, which have two values depicting relevance (1) or otherwise (0). Recently, MLL has progressively attracted the attention of researchers, especially in ML communities, and has been extensively applied to solving many problems including image and video analysis, text, bioinformatics, web mining, rule mining, information retrieval, medical diagnosis and prediction, and many more [16]. Techniques advanced for MLL classification problems include the algorithm adaptation approach (AAA), problem transformation methods (PTMs) [11,12,17] and ensemble methods [11,18]. The PTMs transform the original MLL problem into multiple SLL (binary or multi-class) or regression tasks, while AAAs adapt the base learning algorithms themselves to solve MLL problems rather than transforming them. PTMs adopt the basic SLL classifiers to accomplish the classification task after the transformation stage and thereafter combine the results into an MLL solution. In consideration of the flexibility of the PTMs [12,17], this work performs MLL using the classifier chain (CC), Bayesian classifier chain (BCC), RAndom k-labEL sets (RAkEL) and Pruned Set (PS) methods, and the MTL variant of PS, Nearest Set Replacement (NSR).
CCs provide a means of combining several binary classifiers into a single multi-label model that is capable of exploiting correlations among targets. The CC approach is based on the binary relevance (BR) [12,17,19] approach; it overcomes the weaknesses of BR with improved performance while inheriting the strengths of BR, especially low time complexity. The main idea of CC is to incorporate label dependency into BR [7,20]. The BCC [21] uses many classifiers, one per class, linked in a chain to find a joint distribution of the classes $C = (C_1, C_2, \ldots, C_d)$ given the attributes $X = (x_1, x_2, \ldots, x_n)$. In BCC settings, a CC can be constructed by first inducing the classifiers that do not depend on any other class and then proceeding with their descendants, according to the dependence structure, which can be represented as a Bayesian network. It is an alternative method for MLL that integrates class dependencies while preserving the computational efficiency of the BR technique [21]. The RAkEL algorithm iteratively constructs an ensemble of Label Powerset (LP) classifiers. An LP classifier transforms a multi-label problem into one multi-class classification problem where the possible values of the transformed class attribute are the distinct subsets of labels present in the original training data. Each LP classifier is trained by relying on label correlations, and the labels are ranked by averaging the zero-one predictions of each model per considered label. RAkEL offers the following advantages [13]: 1) it is computationally less expensive due to the resulting subsets of SLL tasks; 2) it improves the class-imbalance ratio of the dataset, thereby enhancing the accuracy of minority labels; 3) it collates multiple predictions for the same label from the different LP models.
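The chaining idea behind CC can be sketched as follows. This is a minimal illustration under stated assumptions, not the MEKA implementation used in this work: the base learner is a toy 1-nearest-neighbour stand-in for any binary classifier, and the function names and the two-label toy data are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], int]


def fit_1nn(X: List[List[float]], y: List[int]) -> Model:
    """Toy 1-nearest-neighbour base learner (stand-in for any binary classifier)."""
    def predict(x: Vector) -> int:
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X]
        return y[dists.index(min(dists))]
    return predict


def fit_chain(X: List[List[float]], Y: List[List[int]]) -> List[Model]:
    """Train one binary model per label; model j also sees labels 0..j-1 as inputs."""
    models = []
    for j in range(len(Y[0])):
        X_aug = [list(x) + [float(v) for v in labels[:j]] for x, labels in zip(X, Y)]
        models.append(fit_1nn(X_aug, [labels[j] for labels in Y]))
    return models


def predict_chain(models: List[Model], x: Vector) -> List[int]:
    """Predict labels in chain order, feeding earlier predictions forward."""
    preds: List[int] = []
    for model in models:
        preds.append(model(list(x) + [float(p) for p in preds]))
    return preds


# Hypothetical data: one input feature, two binary labels.
X_train = [[0.0], [1.0], [2.0], [3.0]]
Y_train = [[0, 0], [0, 1], [1, 1], [1, 0]]
chain = fit_chain(X_train, Y_train)
print(predict_chain(chain, [0.9]))  # [0, 1]
print(predict_chain(chain, [2.9]))  # [1, 0]
```

Dropping the label augmentation in `fit_chain` and `predict_chain` would reduce the sketch to plain BR, which makes the difference between the two methods easy to see.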
The PS method leverages the most significant label relationships within a multi-label dataset by eliminating insignificant and noisy label sets which might distort the performance of the classification. This reduces the complexity originating from the label dependencies without significant information loss [20,22]. The work in [20] reports, from experimental evidence, that the PS approach outperforms LP and other baseline methods and is highly recommended for data sets with diverse concept drifts. The NSR method is the MTL version of PS where the closest sets replace outliers, rather than using subsets.
Researchers have built and used a variety of multi-labeled datasets in disparate formats and have made them available in notable multi-label data repositories including MULAN [13], Multi-label/Multi-target Extension to Weka (MEKA), Library for SVM (LibSVM) [23], Knowledge Extraction based on Evolutionary Learning (KEEL) (Alcala-Fdez et al., 2011) and R Ultimate Multilabel dataset repository (RUMDR), each one using two base file formats: comma-separated values (.CSV) and attribute-relation file format (.ARFF). MULAN, scikit-multilearn, MEKA and the Multi-labelled dataset in R (mldr) package provide exploratory analysis of MLL datasets. While MEKA is a general-purpose MLL software, the mldr package is limited to exploratory analysis only [24]. This work therefore adopts MEKA for the MLL comparative analytics of obstetric outcome. The degree to which samples in a dataset have more than one label (multi-labelness) is estimated with two basic parameters: label cardinality (LC) (1) and label density (LD) (2) [24]. LC indicates the mean number of labels of the records in the dataset, while LD is equivalent to LC divided by the number of labels [14,24].
$$LC = \frac{1}{n} \sum_{i=1}^{n} |Y_i| \qquad (1)$$
$$LD = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i|}{k} \qquad (2)$$
where $n$ represents the number of samples in the dataset, $Y_i$ the label set of the $i$th instance, and $k$ the total number of labels in the dataset. The LC level is directly proportional to the number of active labels per sample. Several classifiers have been developed and adapted for binary, multi-class and multi-label classification problems, but no classifier is recommended to optimally satisfy the needs of all of these classification problems. This work investigates the performances of widely used classifiers on all three types of classification with a view to finding the best performing (most suitable) one.
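As a worked illustration of LC and LD, the sketch below computes both quantities from a binary indicator matrix in which each row is a record and each column a label. The function names and the small matrix are assumptions made for the example only.

```python
from typing import List


def label_cardinality(Y: List[List[int]]) -> float:
    """Mean number of active labels per record (LC)."""
    return sum(sum(row) for row in Y) / len(Y)


def label_density(Y: List[List[int]]) -> float:
    """LC divided by the total number of labels k (LD)."""
    return label_cardinality(Y) / len(Y[0])


# Hypothetical indicator matrix: 3 records, k = 3 labels.
Y_demo = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
]
lc_demo = label_cardinality(Y_demo)  # (2 + 1 + 3) / 3
ld_demo = label_density(Y_demo)      # LC / 3
print(lc_demo, round(ld_demo, 2))
```

A single-label dataset always yields LC = 1, so its LD is simply the reciprocal of the number of classes.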

C. Classification Approaches for Medical Diagnostic Problems
Classification is a fundamental and pivotal task of ML and data mining (DM) applications. It is encountered in various areas, such as medicine, where it supports the identification of a patient's disease, the prediction of the effectiveness of surgical procedures and medical tests, and the discovery of relationships among clinical and diagnosis data. The classification of health care data (HCD) for accurate diagnosis and appropriate prescriptions is a rising application area of DM that is attracting the attention of researchers [25,26]. Several works have utilized various classification methods for disease diagnosis and prediction. The proper utilization of classification algorithms significantly improves analysis, disease prediction and severity level determination, in addition to ensuring early detection and effective prevention mechanisms. Over the years, analysis of morbidity and mortality data in maternal-related care has evolved from traditional to intelligent research approaches with the aim of improving the efficiency of mother and child care during pregnancy. Nonetheless, effective analytical approaches that breed intelligent decisions are dependent on the availability of reliable data collected from the healthcare domain for the purpose of extracting knowledge for informed decision-making. This process is supported by classifiers implemented in binary, multi-class or multi-label approaches. However, a universal classification approach and a multi-label classification approach with Extreme Learning Machine (ELM), each capable of performing the functions of the three aforementioned classifier types, were proposed by [11] and [14], respectively. The survey conducted by [27] provided information about association rules, classification and cluster analysis as useful tools in the identification and discovery of risk in maternal care.
These tools are developed using a few underlying algorithms that have been used for mining maternal-related care data, such as DT, NB, KNN, ANN, SVM, RF and Gaussian NB [28][29][30]. ML algorithms comprising Logistic Regression (LR), SVM, DT, BPNN, XGBoost and RF have been used in building predictive models for early pregnancy loss after in vitro fertilization (IVF) embryo transfer with fetal heart rate. Each of the models was applied to the features associated with on-going pregnancy and early pregnancy loss samples. RF stood out with a high performance of 97% for recall ratio, F1 and area under the curve (AUC), in addition to an accuracy of 99%, especially for those within 10 weeks after embryo transfer. In [31], MLL was performed by adapting and extending three SLL algorithms. The comparative analysis was conducted on the Genbase, Yeast and Scene datasets, which were evaluated in terms of LD and LC. The Genbase dataset, which had 27 labels, depicts the greatest multi-labelness with LD of 0.05 and LC of 1.35. Four base ML algorithms (SMO, KNN, C4.5 and NB) were used to develop a predictive model, which revealed SMO as the best algorithm. However, inclusion of more well-known datasets would have helped in the comparative analysis.
The work in [28] adopted a Gaussian NB classifier-based methodology with four variables obtained from INEGI: gender, gestational age, maternal age and fetuses. The classification recorded 96% for each of precision, recall and F1-score. Similarly, the NB classifier was compared against physician-based classification for 21,000 child and adult deaths in India, South Africa and Bangladesh. This comparative study evaluated the classifier between two different datasets without evaluating the performance of any other existing analytical method. To detect gestational diabetes mellitus (GDM) in pregnant women without a visit to the hospital, a decision support system was developed based on MLP with newly designed input [50]. The identification of predictors of in-hospital maternal mortality among women attending referral hospitals in Mali and Senegal was addressed by [51]. In [12], BR, LP and CC methods with different base classifiers were used for classification. Although that work was limited to the phonemes of the Tamil language only, the procedure for evaluation is useful in the classification of maternal care problems. The work in [32] compared SVM and Logistic Regression (LR) to determine their performance efficiency in pregnancy outcome prediction on an anonymized dataset of 420 different pregnancy details. Four output categories were defined, and the results show that the average specificity of SVM across the categories is at least 1% higher than that for LR, except in the case of underweight infant prediction, where LR had a higher specificity. On the other hand, the average sensitivity of LR was at least 10% higher than that of SVM. The study did not compute the classification accuracies of the designed models, although LR was adjudged the better model. The authors in [49] performed a study on the cardiotocography (CTG) dataset of the University of California Irvine machine learning repository.
They compared ten machine learning algorithms, focusing on their predictive precision, recall and F1 scores. The work submits that, during training, DT learnt best while NB had the lowest learning accuracy. Conversely, among the MLP, RF, SVM and NB algorithms, RF had the best result with an accuracy of 92%. This is followed by MLP with 84% accuracy, then 83% for the SVM classifier with linear kernel and 77% for NB. Moreover, the work reported in [33] compared the classification ability of NB, RF, DT and SVM on the CTG dataset using the Minimum Redundancy Maximum Relevance technique for feature extraction. Their measurement metrics comprised accuracy, precision, recall and F1 score. After experiments, they report that SVM had the best classification ratings, followed by RF, with 96%, 88.3%, 91% and 89.3% respectively. However, the work did not consider the MLP classifier even though it has been widely used with interesting results in the literature for pregnancy outcome (PO) prediction. The work reported in [10] proposed an ensemble of One Dimensional Convolutional Neural Network (1DCNN) and MLP for abnormal birth outcome detection. The study performed trace segmentation on the CTU-UHB intrapartum cardiotocography dataset with 552 trace observations for class distribution equalization, and used 1DCNN for learning and automatic feature extraction from the segmented CTG data. Classification results from the proposed model were compared with SVM, RF and MLP models trained with random weight initialization. The model evaluation using sensitivity, specificity and AUC showed that the conventional MLP classifier out-performed SVM and RF in two measures, except that it had the lowest specificity. The RF algorithm, on the other hand, had higher specificity (69%) and AUC (67%) scores. SVM had 68%, 56% and 62% in sensitivity, specificity and AUC respectively, at a batch size of 500.
Considering the sensitivity (80%), specificity (79%) and AUC (86%) of the proposed model, the authors concluded that the other models evaluated in the study failed to produce better classification results than the proposed ensemble 1DCNN.

III. DATA ACQUISITION AND FEATURE SELECTION
Data were acquired from secondary health facilities in Uyo, Nigeria. A total of one thousand six hundred and thirty-two (1,632) records were obtained from archives of retrospective observations of pregnant women recorded when they enrolled for antenatal care, with an input feature space of forty-two (42) features excluding the target variable. A subset of the attributes includes maternal age, number of children delivered, previous medical history, abortion, miscarriage, prematurity, previous illness, number of attendances to antenatal care, modal mode of delivery, antenatal registration and mode of delivery, amongst other features. Cleaning, aggregation and pruning of attributes with only a single domain value were performed. The outcome is a dataset with thirty-five (35) input features, which were exposed to input rank analysis [34,35] via PCA in WEKA software. The selection criterion was based on eigenvalue (EV) scores not less than unity [35], with PO as the target variable. This produced thirteen (13) significant features with a cumulative effect of 67.13%. The distribution of the variance for each factor and rank, given in Table I, shows that average maternal blood pressure topped the list with an EV of 3.86 (11.7% of variance), followed by average maternal weight (EV = 2.77, proportion = 8.39%). The 13th ranked attribute, average ascorbic acid level, accounted for 3.17% of the variation with an EV score of 1.05. The target feature description is also represented in Table I, PO consists of four Death=0) and Neonatal weight (NW) assumes low, normal or overweight as possible values.
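The eigenvalue-not-less-than-unity selection rule described above can be sketched as follows. This is a minimal illustration, not the WEKA procedure used in the study: the function name is hypothetical, the eigenvalues are made-up numbers rather than those of Table I, and the proportion of variance is taken as each eigenvalue divided by the sum of all eigenvalues.

```python
from typing import List, Tuple


def select_by_eigenvalue(
    eigenvalues: List[float], threshold: float = 1.0
) -> Tuple[List[float], List[float], float]:
    """Keep eigenvalues >= threshold; return them with their variance shares (%)."""
    total = sum(eigenvalues)
    ranked = sorted(eigenvalues, reverse=True)
    kept = [ev for ev in ranked if ev >= threshold]
    proportions = [100.0 * ev / total for ev in kept]
    return kept, proportions, sum(proportions)


# Illustrative eigenvalues only (assumed, not taken from the study).
kept, proportions, cumulative = select_by_eigenvalue([0.5, 2.0, 1.0, 0.3, 1.2])
print(kept)                  # [2.0, 1.2, 1.0]
print(round(cumulative, 2))  # 84.0
```

Under this rule the number of retained factors is decided by the data rather than fixed in advance, which is how a threshold of unity can yield thirteen factors from thirty-five inputs.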

A. Predictive Analytic Models
Widely used, high-performing SML algorithms, namely the NB, SVM, DT, KNN, RF and MLP classifiers, are compared. The experiment aims to observe which algorithm is capable of classifying PO in all the classification learning scenarios.
 KNN is a supervised classification technique aimed at predicting the target variable given a set of features [36]. It is a type of instance-based, or lazy, learning approach in which the approximation of functions is performed locally. KNN is based on the principle of determining a fixed number of training examples closest in distance (usually Euclidean distance) to an unknown point, and predicting the label from them. KNN is simple and flexible and does not require categories to be linearly separable. However, although it is very fast in the training phase, it is computationally costly at prediction time, and the optimal value of k is arduous to estimate [5,15].
 NB is a classifier based on Bayes' theorem. Results from different classification and prediction studies suggest its strength and dynamism. The NB algorithm computes the posterior probability of a hypothesis given observed data. Given an observation $d$ and a hypothesis $h$, NB helps determine the possibility of having $d$ as a component of $h$, using (3):
$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d)} \qquad (3)$$
where $P(d \mid h)$ is the likelihood of finding $d$ in $h$, $P(h)$ is the prior probability of the hypothesis $h$, while $P(d)$ is the probability of observing the data, irrespective of the specified hypothesis. The NB algorithm can often outperform more sophisticated classification methods and ranks among the most successful algorithms for text document classification. It implicitly assumes that all the attributes are mutually independent, which is often violated in real-world scenarios, and it performs poorly on data comprising highly correlated features. It exhibits high accuracy and speed when applied to large databases and generalizes well even with limited training samples.
 SVM is a non-parametric supervised learning classifier that finds the trade-off between minimizing the training set error and maximizing the margin for optimal classification. It is reported to have very good generalization ability and to be resistant to overfitting [37]. It is a machine learning approach efficient for solving classification and regression problems. It relies on supervised learning models which are trained by learning algorithms, and is very effective at identifying patterns when confronted with a large number of training samples. It is one of the most powerful ML algorithms for optimization, prediction and classification tasks [38,39]. Its efficiency in the prediction of weather, power output and stock market dynamics, and in bioinformatics, voice and handwriting recognition, image and video analysis, and medical diagnosis, among others, has been demonstrated in the literature.
The major strengths of the SVM include: 1) relatively easy training and moderate scaling even with high dimensional data; 2) the trade-off between model complexity and error is easily controlled; 3) it can handle both continuous and categorical data and can capture nonlinear relationships in the data; 4) assumptions regarding data structure are not required because it is a non-parametric technique; 5) it provides good generalization performance with high accuracy. Some of its weaknesses include: 1) the comprehensibility of its results largely depends on the interpretability of the input features; 2) it is computationally costly and needs a good kernel function; 3) it lacks transparency in its results because it is a non-parametric method.
 DT is a method for approximating discrete-valued functions, in which the learned function is represented by a decision tree. Mathematically, the $i$th C4.5 DT classifier solves the following problem that yields the $i$th decision function, as presented in (4).
 DT adopts a hierarchical design to implement the divide-and-conquer approach. It is a non-parametric technique used for both classification and regression without functional form specification. It can be directly converted to a set of simple if-then rules to enhance human comprehensibility, thereby minimizing the ambiguity of complicated decisions. DTs are effective at outlier and missing-value detection [5]. Because DTs tend to overfit the data, additional pruning tasks (pre-pruning and post-pruning) are required; DTs can also be computationally expensive. Their performance largely depends on the characteristics of the dataset.
 RF consists of a combination of classifiers where each classifier contributes a single vote for the assignment of the most frequent class to the input vector (x) [40]. RF is an efficient model for averaging multiple deep DTs that have been trained on different parts of the same training set when the goal is to reduce variance in the result. Trees constructed with fixed training data are prone to be overly adapted to the training data. The averaging function of the RF algorithm is described in (5).
$$\hat{y}(x) = \frac{1}{N} \sum_{n=1}^{N} T_n(x) \qquad (5)$$
where $N$ is the total number of trees created in random subspaces, $T_n$ is the $n$th classification tree, $x$ represents the instance to be classified, and $n$ is a count of the sub-trees which ranges from 1 to $N$.
 MLP consists of multiple layers of simple, bi-state, sigmoid processing nodes or neurons that interact using weighted connections [41]. The MLP classifier is a neural network that utilizes backpropagation in prediction, based on threshold functions comprising a linear combination of weight, bias and input data, as defined by (6). Each perceptron has an activation threshold, below which the perceptron is inactivated.
$$y = \varphi(W^{T}X + b) \qquad (6)$$
where $W$ denotes the cumulative vector of weights, $X$ is the vector of cumulated inputs, $b$ is the bias and $\varphi$ is the nonlinear activation function.
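The decision rules of three of the classifiers described above (the NB posterior, the RF vote and the output of a single sigmoid perceptron) can be sketched in a few lines. All function names, the stub "trees" and the numeric inputs are hypothetical and serve only to illustrate the computations; they are not the WEKA implementations used in the experiments.

```python
import math
from collections import Counter
from typing import Callable, Hashable, List, Sequence


def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Posterior of a hypothesis via Bayes' theorem: likelihood * prior / evidence."""
    return likelihood * prior / evidence


def forest_vote(
    trees: List[Callable[[Sequence[float]], Hashable]], x: Sequence[float]
) -> Hashable:
    """Each tree casts one vote; the most frequent class is returned."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def perceptron_output(
    weights: Sequence[float], x: Sequence[float], bias: float
) -> float:
    """Sigmoid activation applied to the weighted sum of the inputs plus the bias."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical one-rule "trees" standing in for trained decision trees.
stub_trees = [
    lambda x: "high" if x[0] > 140 else "low",
    lambda x: "high" if x[1] > 90 else "low",
    lambda x: "high" if x[0] > 160 else "low",
]
print(forest_vote(stub_trees, (150.0, 95.0)))            # high (two votes to one)
print(perceptron_output([0.5, -0.5], [1.0, 1.0], 0.0))   # 0.5
print(round(bayes_posterior(0.8, 0.1, 0.2), 3))          # 0.4
```

The vote in `forest_vote` is the classification counterpart of averaging tree outputs: with class labels, the most frequent prediction plays the role of the mean.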

B. Problem Formulation and Dataset Modelling
The dataset on maternal outcome is modeled in three main data setups: 1) single label/single target; 2) multi-label; and 3) multi-target. The single label/single target setup has two variants: single-target binary class (ST-BC), where each observation is associated with only a single binary class label, for modeling the MS target attribute; and single-target multi-class (ST-MC), where each instance is associated with a single target with multiple class labels (the PO and NW target attributes). A record may be associated with more than two binary class labels in the multi-label (MLL) data configuration, while in multi-target (MTP), every label can assume many values (nominal attributes). The input vector space $X$ consists of $m$ input variables representing pregnancy risk factors for PO prediction. The target feature space $Y$ has $q$ target variables for the multi-target problem. An instance $(x, y)$, where $x = (x_1, \ldots, x_m)$ is the input feature vector and $y = (y_1, \ldots, y_q)$ is the target vector, has its components drawn from $X$ and $Y$ respectively. For $n$ instances, the input vector space is given in (7) while (8) defines the multi-target arrangement.
$$X = \begin{bmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{bmatrix} \qquad (7)$$
$$Y = \begin{bmatrix} y_{11} & \cdots & y_{1q} \\ \vdots & \ddots & \vdots \\ y_{n1} & \cdots & y_{nq} \end{bmatrix} \qquad (8)$$
The task is to predict variants of both single-label and multi-labelled data setups. This is followed by an assessment of the weighted accuracies and computational costs of all strategies, to support optimal predictive decision making in the domain of obstetric management. Table II gives the specifications of the dataset configurations. The same input feature dimension is used in all classification learning types, while the target vector for each of the SLL settings (MS, PO and NW) is a column vector. In MLL and MTL, the dimensionality of the target vector is 9 and 3 respectively, with 9 labels each. All variants of the SLL setup have a label cardinality (LC) of unity, with a label density (LD) of 0.5 for MS, 0.25 for PO and 0.33 for NW. However, MLL and MTL have the same LC (3.00) and LD (0.33).
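The LC and LD figures quoted for the dataset configurations follow the usual definitions: LC is the mean number of active labels per instance, and LD is LC divided by the number of labels. The sketch below assumes a binary indicator matrix as input; `label_cardinality`, `label_density` and the toy matrices are hypothetical and are not taken from the study's data.

```python
from typing import List, Sequence


def label_cardinality(Y: Sequence[Sequence[int]]) -> float:
    """Mean number of active (1) labels per instance."""
    return sum(sum(row) for row in Y) / len(Y)


def label_density(Y: Sequence[Sequence[int]]) -> float:
    """Label cardinality normalised by the number of labels."""
    return label_cardinality(Y) / len(Y[0])


# Toy multi-label matrix: 9 labels, three active per instance.
toy_mll: List[List[int]] = [
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 0, 0, 0, 1],
]

# Toy single-label, four-class target written in one-hot form.
toy_four_class: List[List[int]] = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
]

if __name__ == "__main__":
    print(label_cardinality(toy_mll), round(label_density(toy_mll), 2))
    print(label_cardinality(toy_four_class), label_density(toy_four_class))
```

With nine labels and three active per instance the toy matrix gives an LC of 3 and an LD of one third, matching the pattern reported for the MLL setup; a one-hot target with four classes gives an LC of 1 and an LD of 0.25.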

C. Empirical Setup
The empirical evaluation was performed on several experimental setups of the obstetrics outcome dataset. The configurations were based on SLL and multi-labelled classification types. The single-labelled data configuration comprises ST-BC (where the input features are associated with one of the two class labels of the MS target) and ST-MC (where the input features are mapped to one of the more than two class labels of the PO and NW targets, respectively). All base classifiers were implemented in WEKA [42] for the SLL scenario and in the MEKA framework [43] for the multi-labelled setting, running in a Java JDK 1.7 environment. The base classifiers SVM, RF, DT, MLP, KNN and NB were used separately as internal classifiers in the WEKA (for the ST-BC and ST-MC configurations) and MEKA (for the MLL and MTL datasets) environments. Each configuration of the dataset was evaluated with 10-fold cross validation [9] as the train/test mode, repeated for 20 runs with each classifier-algorithm pair, on a 64-bit machine with 8 GB of RAM running the Windows 10 operating system.
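The evaluation protocol (10 folds, repeated for 20 runs) can be sketched as an index generator. This is a plain, unstratified illustration; the `repeated_kfold` helper, the seed and the instance count of 103 are hypothetical, and WEKA's own cross-validation routine may shuffle and stratify differently.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(
    n_instances: int, k: int = 10, runs: int = 20, seed: int = 0
) -> Iterator[Tuple[int, int, List[int], List[int]]]:
    """Yield (run, fold, train_indices, test_indices) for every split."""
    rng = random.Random(seed)
    for run in range(runs):
        indices = list(range(n_instances))
        rng.shuffle(indices)  # fresh shuffle for each repetition
        folds = [indices[i::k] for i in range(k)]  # k near-equal parts
        for fold_id, test in enumerate(folds):
            train = [
                idx
                for other_id, fold in enumerate(folds)
                if other_id != fold_id
                for idx in fold
            ]
            yield run, fold_id, train, test


if __name__ == "__main__":
    splits = list(repeated_kfold(103))
    print(len(splits))  # one split per (run, fold) pair
```

Each classifier would be trained on `train` and scored on `test` for every split, and the scores averaged over all splits to give the mean and standard deviation of a metric.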
The WEKA/MEKA default parameters were adopted to implement the base classifiers in both SLL and MLL settings, with a batch size of 100. MLP used a learning rate of 0.3 and a momentum of 0.2, while the maximum training time was 500 seconds for each iteration. KNN used a single neighbor with linear search and no distance weighting. A confidence factor of 0.25 was set for the C4.5 DT. John Platt's sequential minimal optimization (SMO) algorithm was adopted for training the SVM classifier, with an RBF kernel function and a fixed epsilon value. The NB classifier adopted unsupervised discretization without a kernel estimator. The MLL and MTL setups adopted the following PTMs: classifier chains (CC), random k-label sets (RAkEL) and Bayesian classifier chains (BCC) [14], for optimality evaluations of the six base algorithms. MEKA default parameters were also adopted for the chosen PTMs and base classifiers, including a batch prediction size of 100. The BCC employed CC for creating maximum spanning trees based on marginal label dependence, with NB as the base classifier [21]. The RAkEL method [31] builds ensembles of Label Powerset (LP) classifiers. The training of LP classifiers relied on label correlations produced through the averaging of zero-one predictions of each model per considered label.
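The Label Powerset transformation on which RAkEL builds can be sketched briefly: every distinct combination of labels becomes one class of an ordinary single-label problem. The helper names `label_powerset`, `decode_class` and `project_labels` are hypothetical, and the sketch shows the transformation only, not MEKA's RAkEL ensemble.

```python
from typing import Dict, List, Sequence, Tuple

LabelSet = Tuple[int, ...]


def label_powerset(Y: Sequence[Sequence[int]]) -> Tuple[List[int], Dict[LabelSet, int]]:
    """Map each distinct label vector to one class id (first-seen order)."""
    class_of: Dict[LabelSet, int] = {}
    classes: List[int] = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)
        classes.append(class_of[key])
    return classes, class_of


def decode_class(class_id: int, class_of: Dict[LabelSet, int]) -> List[int]:
    """Recover the label vector that a predicted class id stands for."""
    for labels, cid in class_of.items():
        if cid == class_id:
            return list(labels)
    raise KeyError(class_id)


def project_labels(Y: Sequence[Sequence[int]], subset: Sequence[int]) -> List[List[int]]:
    """Keep only the label columns in `subset` (one RAkEL-style k-labelset)."""
    return [[row[j] for j in subset] for row in Y]


if __name__ == "__main__":
    Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
    classes, mapping = label_powerset(Y)
    print(classes, decode_class(classes[0], mapping))
```

In a RAkEL-style ensemble, one such LP problem would be built for each random subset of k labels (via `project_labels`), and the zero-one predictions of the member models averaged per label.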

A. Single Labelled Learning Results
The results for the SLL settings are presented in Tables III and IV. They give the mean values, the standard deviation (stdev) and the rank in brackets. A rank of 1 is the best and corresponds to the highest performance indicator value, while a rank of 6 is the lowest performance rank.
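The ranking convention can be illustrated with a short sketch in which rank 1 goes to the largest value and tied values share the mean of the positions they occupy. The tie rule is a common convention assumed here, not one stated in the text. The three tied accuracies of 0.964 and the value 0.896 are the ST-BC figures quoted for DT, RF, SVM and NB; the values for MLP and KNN are placeholders.

```python
from typing import List, Sequence


def rank_descending(values: Sequence[float]) -> List[float]:
    """Rank 1 = largest value; ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def average_ranks(score_rows: Sequence[Sequence[float]]) -> List[float]:
    """Average each classifier's rank over several dataset configurations."""
    per_row = [rank_descending(row) for row in score_rows]
    n_classifiers = len(score_rows[0])
    return [sum(r[j] for r in per_row) / len(per_row) for j in range(n_classifiers)]


if __name__ == "__main__":
    # Order: DT, RF, SVM, MLP (placeholder), KNN (placeholder), NB
    st_bc_accuracy = [0.964, 0.964, 0.964, 0.950, 0.930, 0.896]
    print(rank_descending(st_bc_accuracy))
```

Under this tie rule three classifiers tied for first place each receive rank 2, so no classifier in that configuration holds rank 1.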
From Table III, the computed results show that ST-BC has the highest mean accuracy (0.950 ± 0.219) and mean F1 score (0.964 ± 0.006). This implies that the base classifiers used in this experiment performed better in terms of accuracy and F1 score in the ST-BC configuration. In terms of classifiers, DT, RF and SVM show the same mean accuracy (0.964), with a slight upward variation in the stdev of DT. The fourth-ranked classifier is MLP, while NB produced the lowest mean accuracy (0.896 ± 0.023). The F1 score produced by DT in ST-BC (0.982 ± 0.001) is ranked first, while MLP yielded the smallest F1 score.
For the build and test costs (Table IV), all classifiers showed significant improvements in the testing time. DT and MLP are the top performers with average ranks of 1.00 and 1.67 respectively, while RF had the highest execution time and earned a rank of 5.33. The ranking of the classifiers based on accuracy and F1 score (Fig. 1) shows that DT is the best ranked classifier (rank=2) in both accuracy and F1 score, while SVM has a rank of 3 in both metrics. The other classifiers have an average rank greater than 3.0 in both metrics, except for the accuracy of RF, which has an average rank of 2.33. NB is the lowest ranked classifier based on accuracy and the second lowest based on F1 score. The average rankings based on train and test time (Fig. 2) differ for every classifier. NB, KNN and RF ranked higher in training than in testing, while DT yielded the best average ranking.

B. Multi-labelled Learning Performance
The distribution of average accuracy and F1 score across the PTMs and classifiers (Table V) shows that NB earned the lowest accuracy and F1 score (rank=5.50), while RF produced the best performance in both accuracy and F1 score.
It is observed that the rank of each classifier across the PTMs is the same in both metrics, with only marginal variation in their values. Similar results (Table VI) show that the top classifiers regarding accuracy earned lower ranks for time cost. Although MLP is ranked 6 with outstandingly high build time values, it competes favourably with other classifiers in test time. KNN and DT were the best performers in build and test times respectively, while the highest execution time is exhibited by KNN, followed by RF.
In the MTP scenario, Tables VII and VIII give the accuracy/F1 score and build/test time values, respectively. The F1 scores are the lowest of all dataset configurations and classification types, with the CC approach producing the highest average performance. The top performers are KNN, SVM and NB, in that order, with RF having the highest average F1 score and a rank of 1.25. NB earned the lowest rank in both accuracy (5.25) and F1 score (6.0). In terms of accuracy, DT earns the highest rank (1.5), which is slightly better than that of RF.
For build and test costs, KNN uses a negligible amount of time during model build but is the most expensive algorithm during model execution. The reverse is the case with MLP, although the average rank of KNN is better. The ranking of DT is average in both test and build metrics, while RF ranks 4.00 and 4.50 in the test and build phases, respectively.
A summary of the ranks of classifiers across the datasets and classification types is given in Table IX and Fig. 3. The result shows that the ranks of classifiers vary across learning types, especially between SLL and the others. RF earned its best rank in MLL, followed by MTP and SLL, with an overall best rank of 1.78 for accuracy, while showing the worst rank in terms of time cost. DT is the second best ranked classifier regarding accuracy but is ranked the best regarding time cost, while SVM is the second top classifier when considering time cost. In terms of optimality, this implies that RF is capable of producing high accuracy across datasets and classification types, although it is computationally expensive. This corroborates the findings reported in [9].
In terms of both metrics, DT is optimal for consideration, followed by KNN. It is therefore necessary to choose between RF and DT depending on the application domain and whether or not time cost should be given consideration. An analysis of the results via statistical significance evaluation is presented in the subsequent sections.

C. Statistical Significance and Rank Validation
The main goal is to ascertain whether there are any base classifiers whose performance is significantly different from the others, and also to perform multiple comparison analysis. This was achieved by applying non-parametric procedures [44,45] individually to each of the four categories of dataset-target setups for informed statistical inferences. The Friedman test, a non-parametric variant of the repeated-measures analysis of variance, was used to test the null hypothesis that there is no significant difference in the performances (accuracies and time costs) of the classifiers. It compares the average rankings of the six classifiers across each of the four dataset configurations, calculating a test statistic from which the probability of the observed rankings under the null hypothesis is estimated. Nemenyi's test and Bergmann-Hommel's post-hoc procedure, implemented in R, produced pairwise comparisons of all algorithms. The results are presented in the following subsections.
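The Friedman statistic can be sketched from a matrix of ranks, with one row per dataset configuration and one column per classifier. The sketch uses the standard chi-squared form without a tie correction; whether the R routines used in the study apply a correction is not stated here. The `friedman_chi2` helper and the toy rank matrices are hypothetical.

```python
from typing import List, Sequence, Tuple


def friedman_chi2(rank_rows: Sequence[Sequence[float]]) -> Tuple[float, int, List[float]]:
    """Return (chi-squared statistic, degrees of freedom, average ranks).

    rank_rows[i][j] is the rank of classifier j on configuration i.
    Formula: 12N / (k(k+1)) * (sum_j R_j^2 - k(k+1)^2 / 4),
    where R_j is the average rank of classifier j.
    """
    n = len(rank_rows)
    k = len(rank_rows[0])
    avg = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12.0 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4.0)
    return chi2, k - 1, avg


if __name__ == "__main__":
    # Toy case: three configurations that rank three classifiers identically.
    consistent = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
    print(friedman_chi2(consistent))
    # Toy case: rankings that cancel out, giving equal average ranks.
    balanced = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
    print(friedman_chi2(balanced))
```

When every configuration ranks the classifiers identically the statistic reaches its maximum of N(k-1); when the average ranks are all equal it is zero, so the null hypothesis is not contradicted.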

1) SLL Analysis:
The Friedman test on the performances of the classifiers reveals that there was no statistically significant difference in the accuracies (χ² = 10.071, df = 5, p = 0.0732) or time costs (χ² = 8.8571, df = 5, p = 0.1149) of the six classifiers at the 95% confidence level (CL). The null hypothesis, that there is no statistically significant difference between the performances of the classifiers in terms of accuracy and time cost for the SLL dataset setups, therefore cannot be rejected. The Nemenyi test (Fig. 4) compared all classifiers to each other and obtained a critical difference (CD) value of 3.2853 for both accuracy and time. As shown in Fig. 4, none of the distances separating any two classifiers in terms of accuracy or time is greater than the CD value, which confirms that no pair of classifiers differs statistically in performance. In both cases DT is the best performing classifier, while RF has an average rank (AR) of 2.67 and 4.33 on accuracy and time cost respectively. Although NB has the lowest accuracy, with an average rank of 5.17, it earned an AR of 2.67 for cost, while SVM is the most computationally expensive classifier in the SLL scenario.
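The reported p-values and CD comparisons can be checked with a short sketch. It evaluates the chi-squared survival function in closed form for small integer degrees of freedom and applies the Nemenyi decision rule that two classifiers differ when their ARs are further apart than CD. `chi2_sf` and `significantly_different` are hypothetical helper names; the numbers passed to them are those reported for the SLL setup.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-squared variable, x > 0.

    Uses Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / gamma(a + 1) with h = x / 2,
    starting from Q(1/2, h) = erfc(sqrt(h)) for odd df and
    Q(1, h) = exp(-h) for even df.
    """
    if x <= 0 or df < 1:
        raise ValueError("x must be positive and df a positive integer")
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def significantly_different(ar_a: float, ar_b: float, cd: float) -> bool:
    """Nemenyi rule: average ranks further apart than CD differ significantly."""
    return abs(ar_a - ar_b) > cd


if __name__ == "__main__":
    print(chi2_sf(10.071, 5))   # SLL accuracy statistic
    print(chi2_sf(8.8571, 5))   # SLL time-cost statistic
    # RF versus NB on accuracy (ARs 2.67 and 5.17) with CD = 3.2853
    print(significantly_different(2.67, 5.17, 3.2853))
```

Both statistics give tail probabilities above 0.05, and the RF-NB gap of 2.50 on accuracy is smaller than the CD, in line with the conclusion that no pair of classifiers differs significantly in the SLL setup.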
2) MLL Analysis: The results for accuracy (χ² = 36.464, df = 5) and time cost (χ² = 10.929, df = 5, p = 0.05281) for the MLL target configurations signify a statistically significant difference in the accuracies of the classifiers, while the average time used by each classifier does not vary significantly at 95% CL. A CD of 2.7924 (Fig. 5) is returned for both accuracy and time cost. The top three performing algorithms regarding accuracy (RF, MLP and DT) do not show a statistically significant difference between each other, while the bottom performing classifiers (KNN, SVM and NB) are statistically similar. NB is the lowest ranked classifier in terms of classification accuracy and is significantly different from RF, MLP and DT, since its distance from each of them is greater than the CD (2.7924).
Although RF is the best performing algorithm, as evidenced by its accuracy, it is the most time-consuming algorithm with an AR of 4.75, while DT consumed the smallest amount of time in all dataset configurations, followed by NB.

3) MTL Analysis:
In the MTL setting, the differences in the accuracy of the classifiers are statistically significant at a CL of 95%, while the time costs across classifiers do not vary significantly. This is indicated by the respective chi-squared values for accuracy (χ² = 35.125, df = 5) and time (χ² = 5.9107, df = 5); the latter lies below the 5% critical value of 11.07 for five degrees of freedom. The CD diagram (Fig. 6) depicts the results of the Nemenyi test, showing the statistical comparison of all classifiers against each other by ARs based on accuracy and time. Classifiers that are not connected by a bold line of length equal to CD have significantly different ARs at 95% CL. In the case of accuracy, the values of NB are significantly different from those of RF, DT and MLP. Using accuracy, RF has the best AR (1.44), followed by DT (2.12) and MLP (2.69), while DT (2.62) and RF (4.25) stand out as the best and worst algorithms respectively when considering computation time.

4) Multiple Comparison of Classifiers on all targets setups:
Results of the multiple comparison analysis on the combined accuracies and time costs obtained from the classifiers in the four dataset settings are discussed in this section. The Friedman test on aggregated values of the adopted metrics produces accuracy values (χ² = 70.019, df = 5) and time values (χ² = 23.123, df = 5), which depict a statistically significant difference in the performance metrics at the adopted significance level. The CD diagram (Fig. 7) obtained from the comparisons for accuracy and time shows that the accuracy of NB differs significantly from the accuracies of the other classifiers, while the performance of KNN differs significantly from that of DT and RF. The accuracy of SVM is, however, equivalent to the others except RF and NB. RF is the highest ranked (1.61) and best performing algorithm based on accuracy, followed by DT (2.36). MLP earned an AR of 3.0 and is the third-ranked classifier, while the accuracy of NB is the worst. In terms of time cost (Fig. 7b), the worst performing classifier is RF with an AR of 4.45; its time cost is similar to that of the other classifiers except NB and DT. DT is the best classifier in terms of computational cost, closely followed by NB and KNN. This implies that RF yields the highest accuracy across all classification types (dataset configurations) while being the most computationally expensive algorithm. The p-values obtained from the Friedman test indicate that the null hypothesis (that all the algorithms perform the same) is rejected. This serves as the justification for conducting the post-hoc test. Bergmann-Hommel's test procedure is the most powerful, best performing and most suitable when the number of algorithms is less than nine (9) [46][47][48], although it is complex and computationally expensive. Statistical pairwise comparisons of the six algorithms based on average accuracies and time cost are given in Table X.
As shown in Table X, there are four major heterogeneous pairwise groupings of classifiers based on accuracy. RF and DT are outstanding and individually significantly different from the rest of the classifiers except MLP, while NB shows a statistically significant difference from all other classifiers. The KNN-SVM and RF-DT pairs each produced a p-value > 0.05 and are therefore statistically equivalent.
The time cost of RF is significantly different from that of DT (p < 0.05) and NB (p < 0.05), while it is statistically equivalent to that of MLP and SVM (p = 1.0). Although the time used by DT is not statistically different from that of KNN (p = 0.383), it exhibits a significant difference when compared with MLP (p = 0.0110) and SVM (p = 0.0110), in addition to RF. Pairwise comparisons involving KNN yielded no statistically significant difference, and neither did SVM compared with RF and MLP respectively. The summary of the Bergmann-Hommel corrected average values (accuracy and time) of each algorithm over all the datasets is given in Table XI and Fig. 8. The results confirm that RF (accuracy=87.3%) is the best performing algorithm based on accuracy, followed by DT (accuracy=86.3%), while NB is the least expensive algorithm across all datasets and classification types. The ranking of classifiers considering both performance metrics reveals DT (rank=2.0) as the most optimal classifier, followed by RF (rank=3.0), while MLP (rank=4.5) shows the worst performance.

VI. CONCLUSION
Over the years, the analysis of morbidity and mortality data in maternal-related care has evolved from traditional to intelligent research approaches, with the aim of improving the efficiency of mother and child care during pregnancy. For intelligent automated predictive solutions, ML and statistical approaches have been the most popular techniques in the literature, following the increasing clinical and administrative interest in PO determination. Results from both methods have contributed to research on PO prediction, preconception counseling, antenatal assessment, intrapartum care, postpartum management and reproductive health education, among others. In this paper, six ML-based classifiers (SVM, RF, DT, MLP, KNN and NB) were identified as widely used and highly successful in obstetric outcome prediction. The performance and suitability of these techniques for obstetrics dataset classification under varying maternal outcome target configurations were assessed, given that these comprise binary, multi-class and multi-labelled target features. Performance efficiency was assessed by empirical evaluation, with non-parametric procedures applied individually to SLL, MLL and MTP to enable informed statistical inferences. Using SLL, three configurations (MS, PO and NW) were defined, whereas the MLL and MTP evaluations both used the CC, BCC, RAkEL and PS/NSR PTMs to evaluate performance efficiency. The dataset, obtained from the archives of secondary healthcare facilities in Uyo, Nigeria, was reduced to a feature dimension of 13 × 1632. From the results, in the SLL setup, DT had the best accuracy, F1 score and test time, with an average rank of 1.0. This was followed by RF in accuracy and SVM in F1 score, while MLP had the second best time cost. NB had the worst accuracy and F1 values, while the worst test time was observed in RF. In MLL, DT was the least expensive in terms of time cost, whereas KNN was the most expensive.
RF performed best, with the highest accuracy and F1 scores, and was followed by DT and MLP for the accuracy and F1 measures, respectively. The accuracy and F1 values obtained for NB suggest that it is the worst performing classifier in the MLL setup. With an average rank of 1.50, DT had the highest accuracy in the MTP setup. This was followed by RF, while NB had the worst performance. For the F1-measure evaluation, RF, DT and NB had the best, second best and worst performances respectively. The comparative analysis of global averages of the six base classifiers shows that RF is the most optimal algorithm, with an accuracy of 87.3% given all three data setups in the study. The leading position of RF in terms of accuracy agrees with the findings in [49] (Hoodbhoy et al., 2019), which compared ten machine learning algorithms on PO determination and observed that RF had an accuracy of 92%, compared to lower scores obtained by MLP, SVM and NB. It also corresponds with the result obtained in [33], where the accuracy of RF was best with a score of 96%, and with the work of [9]. In terms of time cost, NB is the least expensive algorithm even though it has the poorest global accuracy score. MLP, on the other hand, had an unexpectedly high time cost, making it unsuitable for similar data classification if time is the main criterion. Finally, from the comparative analysis, it is recommended that the choice of classifier should be either RF or DT, depending on the application domain and whether or not time cost is a major consideration. As further research, the parameters of the base classifiers will be tuned using evolutionary computing in order to improve performance in terms of accuracy and computational cost.