Machine Learning Model for Prediction and Visualization of HIV Index Testing in Northern Tanzania

Human Immunodeficiency Virus/Acquired Immunodeficiency Syndrome (HIV/AIDS) remains a threatening disease in Tanzanian society. Various strategies have been used to increase the number of people who know their HIV status. Among these strategies, HIV index testing has proven to be the best modality for identifying contacts who might be at risk of contracting HIV from an HIV-positive person. However, the current HIV index testing process is manual, which makes it error-prone, time-consuming, and expensive to operate. Therefore, this paper presents the results of a Machine Learning model to predict and visualise HIV index testing. The development process followed the Agile software development methodology. The data were collected from the Kilimanjaro, Arusha and Manyara regions in Tanzania. A total of 6346 samples and 11 features were collected. The dataset was then divided into a training set of 5075 samples and a testing set of 1270 samples (80/20). Random Forest (RF), XGBoost, and Artificial Neural Network (ANN) algorithms were trained on the data. Evaluation by Mean Absolute Error (MAE) gave RF 1.1261, XGBoost 1.2340, and ANN 1.1268, so RF performed best of the three algorithms. Data visualisation shows that 17.4% of males and 82.6% of females had been notified. In addition, the Kilimanjaro region had more cases of people learning their HIV status through their partners. Overall, this study improved our understanding of the significance of ML in the prediction and visualisation of HIV index testing. The developed model can assist decision-makers in designing a suitable intervention strategy towards ending HIV/AIDS in our societies. The study recommends that health centres in other regions use this model to simplify their work.
Keywords—Index testing; machine learning; random forest; XGBoost; artificial neural network


I. INTRODUCTION
Index testing refers to a case-finding strategy that aims to reach the exposed contacts of HIV-positive individuals and offer them HIV testing services. It is also known as partner notification [1]. The HIV-positive individual is known as the index client. Healthcare workers and counsellors ask index clients to list all their partners, including sexual partners and/or injecting-drug partners, and their children. The process is voluntary and confidential. Each partner and child is contacted, informed of the exposure to HIV, and offered voluntary testing. The purpose of index testing is to break the chain of HIV transmission. In addition, health workers provide HIV testing to the people who have been exposed to HIV [2]. If the result is positive, they are linked to treatment, and if the result is negative, they are given prevention services.
There are various HIV testing modalities, such as Voluntary HIV Counselling and Testing (VCT), community VCT, and home-based, mobile, and outreach testing. However, home-based and mobile outreach testing is costly. Therefore, index case testing was introduced to increase the number of people who know their status, and it has been a promising strategy for maximising HIV case detection [3]. The current HIV index client testing process is not automated, and the data are collected manually. It is therefore challenging to analyse the data and predict HIV index testing. In addition, the process requires experts for data entry and data analysts to do the work, which adds cost and time to obtaining the intended results. Moreover, human errors are unavoidable.
Other researchers have applied machine learning in health care, specifically to HIV/AIDS, to solve problems such as HIV case finding, identification of HIV predictors, prediction of patient-specific current CD4 count, and prediction of new HIV infections using internet data. However, the methods used for HIV index testing were descriptive statistics, estimated index cascade and chi-square tests. Therefore, this paper presents the results of a Machine Learning model that can help experts make predictions and produce up-to-date data visualisation that is readable and understandable. In addition, the developed model can predict the number of HIV index tests using partner notification information to identify people who are at risk of contracting HIV/AIDS. It can thereby help decision-makers design a good intervention strategy towards ending HIV/AIDS in our societies.
The paper consists of five parts namely: introduction, literature review, material and methodology, results and discussion, conclusion and recommendation.

A. Overview of Literature Survey
HIV is an infectious disease that threatens public health globally. According to the World Health Organization (WHO), 38 million people are living with HIV globally, and 19% of them do not know their status [4]. Many people living with HIV are located in low- and middle-income countries, with an estimated 68% in sub-Saharan Africa.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 2, 2022, p. 392, www.ijacsa.thesai.org
The WHO strategy for 2016 to 2021 addresses human rights and equity, aiming for a radical decline in new HIV infections and a reduction in deaths. The global target was to reduce new infections to fewer than 500,000 by 2020 and to end HIV as a public health threat by 2030. The current target is 90-90-90, meaning that 90% of people living with HIV know their status, 90% of those diagnosed receive sustained treatment and care, and 90% of those on treatment achieve viral suppression [5] [6].
In the southern part of Africa, various countries have made substantial progress toward the target of ensuring that 90% of people living with HIV know their status. HIV testing and counselling is the crucial step towards achieving the Joint United Nations Programme on HIV/AIDS (UNAIDS) target of 90-90-90. The target for 2025, however, is 95-95-95 [7].

B. HIV Trends in Tanzania
HIV statistics for Tanzania show that 1.7 million people live with HIV, with 77,000 new HIV infections and 27,000 AIDS-related deaths [8]. The new strategic plan reviewed by the Ministry of Health for 2018 to 2022 makes index testing and partner notification services one of the national strategies for identifying People Living with HIV (PLHIV) [9]. Index case testing will support Tanzania in maximising HIV case detection to achieve the first target of 90 (2017-2022) and the next target of 95 by 2025 for males, adolescents, and children.

C. Machine Learning in Health Care
Machine learning (ML) is the use and development of computer systems that can learn and adapt without following explicit instructions, by using algorithms to analyse and draw inferences from patterns in data [10]. Machine learning algorithms depend on domain knowledge of the data to create the features that make these algorithms work. ML has been used in various domains where data are available, including computer vision, automatic speech recognition, business analysis, natural language processing, and health care. However, feature selection demands considerable time and effort, and the features must capture relevant information from vast and diverse data to produce the best outcome.
Machine learning techniques can provide accurate predictions in various applications, such as drug discovery and disease diagnosis, especially when quality data are available. In health care, machine learning has attracted interest for cancer diagnosis, diabetes, and autism subtyping [11]. ML has also been used to predict cholera [12].
Machine learning has been applied to HIV/AIDS as follows: identifying HIV predictors for screening [13]; predicting patient-specific current CD4 cell count to determine the progression of human immunodeficiency [14]; predicting new HIV infections in China using internet search data [15]; predicting default from HIV services in Mozambique [16]; and improving HIV case finding [17].
Other related works have studied HIV index testing using different methods. One study used index and targeted community testing to optimise HIV case finding among men, applying descriptive statistics, an estimated index cascade, and the chi-square test [18]. Another reported sustained high HIV case finding through index testing, using service registers analysed in Microsoft Excel [19]. A further study applied machine learning to HIV/AIDS diagnosis and therapy planning [20].
Therefore, this study aims to use machine learning techniques to predict HIV index testing and to visualise it by age, sex, location, and relationship, in order to strengthen the ability to plan, prioritise, and implement effective interventions.

A. Materials
The study area selected was the northern part of Tanzania, which comprises the Tanga, Kilimanjaro, Arusha, and Manyara regions. Kilimanjaro, Arusha, and Manyara were chosen to represent this part of the country. The dataset used in this study came from different health centres and community sites in Arusha, Kilimanjaro, and Manyara. The client information dataset consists of 6346 samples and 11 features, and the index-client dataset consists of 7226 samples with 13 features.
Python was the programming language used in this study. It was chosen because it offers a wide variety of open-source libraries that support machine learning.

B. Methods
1) Knowledge discovery from data: Knowledge Discovery from Data (KDD) is the process of extracting knowledge from vast quantities of data. We selected this approach because of its application in data mining with different algorithms and its clearly defined phases [21]. The study followed iterative refinement at each step; Table I explains the stages of KDD. The selected dataset was cleaned by ignoring features with no values, filling missing values with the most frequently occurring value of the feature, and identifying and removing duplicated records. The dataset used to make predictions was the client information. Later, the two datasets were combined for data visualisation.
Data reduction removed the following features: client id, date, residence, contact number, and CTC number. Client id, contact number, and CTC number were removed because they had no impact on the target, and date and residence were removed because they had no values. Lastly, the data were transformed into a suitable format for model development: the categorical data were encoded as 1 and 0.
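The cleaning and reduction steps described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the study's actual code: the column names, the toy records, and the coding of sex (female = 1) are assumptions made for the example, and in practice a library such as pandas would normally be used.

```python
from collections import Counter

# Hypothetical column names: the paper names the dropped features
# but does not give the exact headers used in the dataset.
DROP_COLUMNS = {"client_id", "date", "residence", "contact_number", "ctc_number"}


def preprocess(rows):
    """Deduplicate, drop unused columns, impute by mode, encode sex as 1/0."""
    # 1. Remove exact duplicate records while preserving order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 2. Data reduction: keep only the columns that are not in the drop list.
    kept = [{k: v for k, v in row.items() if k not in DROP_COLUMNS} for row in unique]

    # 3. Fill missing values (None) with the most frequent value of the column.
    for column in kept[0]:
        present = [r[column] for r in kept if r[column] is not None]
        mode = Counter(present).most_common(1)[0][0]
        for r in kept:
            if r[column] is None:
                r[column] = mode

    # 4. Encode a binary categorical column as 1/0 (assumed coding: female = 1).
    for r in kept:
        r["sex"] = 1 if r["sex"] == "F" else 0
    return kept


# Toy records for illustration only (not the study's data).
rows = [
    {"client_id": 1, "date": None, "sex": "F", "age": 30, "marital_status": "married"},
    {"client_id": 1, "date": None, "sex": "F", "age": 30, "marital_status": "married"},
    {"client_id": 2, "date": None, "sex": None, "age": 25, "marital_status": "single"},
    {"client_id": 3, "date": None, "sex": "M", "age": 41, "marital_status": None},
    {"client_id": 4, "date": None, "sex": "F", "age": 30, "marital_status": "married"},
]
clean = preprocess(rows)
```

Duplicates are removed before the identifier columns are dropped, so that two different clients who happen to share the same attributes are not collapsed into one record.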

3) Data visualization:
Data visualization refers to the graphical presentation of the analysed data so that a user can gain insight from it and make decisions [22] [23]. Data exploration was done using Python.

4) Machine learning algorithms:
There are different ways of solving ML problems. ML can be divided into three major categories: supervised, unsupervised, and reinforcement learning. Each category may apply different algorithms depending on the dataset and the intended results [24]. Machine learning models are designed to classify things, predict outcomes, find patterns and make informed decisions.
In this study, three algorithms were selected, based on the literature, for performance comparison to determine the best algorithm for predicting the number of HIV index tests. These algorithms were XGBoost, Random Forest (RF), and Artificial Neural Network (ANN). They are explained below.
a) XGBoost: XGBoost is an ensemble algorithm based on gradient boosting that has been shown to be an efficient and reliable machine learning technique [25]. It is an open-source library that performs well in terms of speed, performance, and parameter setup [26]. XGBoost is used in classification and regression predictive modelling problems, and it is a popular choice in Kaggle competitions [27].
b) Random Forest: Random forest is an ensemble learning technique that uses a collection of decision trees. Breiman proposed it in 2001. It is used for classification and regression [28] [29]. The random forest technique combines various randomised decision trees and is applied to large-scale problems. Random sampling helps to reduce the overfitting problem [24]. Randomly sampled subsets of the data are used to train the decision trees in the ensemble, and each decision tree produces its own output. Fig. 1 shows how a random forest algorithm is formed.
c) Artificial Neural Network: The artificial neural network, usually called a neural network, is defined as an interconnection of nodes called neurons [30]. It works in a way similar to the human brain. A collection of neurons is created and connected so that they can send messages to each other. The network is asked to solve a problem repeatedly; connections that lead to success are strengthened, and those that lead to failure are weakened. The input variables from the data are passed to each neuron as a linear combination, and the value by which each input variable is multiplied is called a weight. A nonlinear function is then applied to this linear combination, which gives the network the ability to model nonlinear relationships. It is used in both classification and regression problems. The artificial neural network is trained using stochastic gradient descent (SGD) and the backpropagation algorithm [31]. Fig. 2 shows the structure of an artificial neural network. Each neuron in the input layer represents a column in the input data. Input data are fed to a set of neurons, and each produces an output. Each of these outputs is fed to other neurons, which produce further outputs that are in turn fed to the output layer. The error is calculated at this final output layer and sent back through the network to further refine the output of each neuron. The process is repeated until the minimal error is obtained.
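The forward pass, error calculation, and backward refinement described for the neural network can be illustrated with a very small network written in plain Python. This is only a sketch under stated assumptions: the toy data, the single hidden layer of three tanh neurons, the learning rate, and the number of epochs are all illustrative choices, not the architecture used in the study.

```python
import math
import random

random.seed(0)

# Toy regression data (illustrative, not the study's dataset): y = x1 + 2 * x2.
data = [((x1, x2), x1 + 2 * x2) for x1 in (0.0, 0.5, 1.0) for x2 in (0.0, 0.5, 1.0)]

N_IN, N_HID, LR = 2, 3, 0.05  # assumed sizes and learning rate
# Hidden-layer weights and biases, then the weights and bias of a linear output neuron.
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b_hid = [0.0] * N_HID
w_out = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
b_out = 0.0


def forward(x):
    """Weighted sums pass through tanh in the hidden layer, then a linear output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hid, b_hid)]
    return hidden, sum(w * h for w, h in zip(w_out, hidden)) + b_out


def mean_squared_error():
    return sum((forward(x)[1] - y) ** 2 for x, y in data) / len(data)


loss_before = mean_squared_error()

for epoch in range(500):
    random.shuffle(data)  # stochastic: one sample at a time, in random order
    for x, y in data:
        hidden, y_hat = forward(x)
        err = y_hat - y  # gradient of 0.5 * (y_hat - y)^2 with respect to y_hat
        for j in range(N_HID):
            # Backpropagate the error through the output weight and tanh derivative.
            grad_h = err * w_out[j] * (1.0 - hidden[j] ** 2)
            w_out[j] -= LR * err * hidden[j]
            for i in range(N_IN):
                w_hid[j][i] -= LR * grad_h * x[i]
            b_hid[j] -= LR * grad_h
        b_out -= LR * err

loss_after = mean_squared_error()
```

Comparing `loss_before` with `loss_after` shows the effect of repeating the forward and backward passes: the error on the toy data shrinks as the weights are refined.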

5) Experimental procedures:
The development of the models involved four major tasks: acquiring datasets, preprocessing, feature engineering, and model selection. Fig. 3 summarises how the experimental procedures were carried out.

6) Evaluation metric:
Model evaluation refers to choosing the best-performing model for the data and determining how well the model will work on unseen data. There is a wide variety of evaluation metrics for regression models [32]. According to the literature review, the most used metrics are Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and MAE. The MSE is calculated as the average of the squared differences between the actual and predicted target values in a dataset. RMSE is the square root of the MSE. The MAE is calculated as the average of the absolute errors. In this study, the metric used was MAE because it is simple and easy to interpret.
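The three metrics can be written out directly from their definitions. The sketch below uses made-up actual and predicted values purely for illustration; the function names are chosen for this example, although scikit-learn offers an equivalent `mean_absolute_error` in `sklearn.metrics`.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same unit as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Illustrative values only: actual vs. predicted number of notified contacts.
y_true = [1, 2, 3, 5]
y_pred = [1, 3, 3, 2]

mae = mean_absolute_error(y_true, y_pred)       # (0 + 1 + 0 + 3) / 4 = 1.0
mse = mean_squared_error(y_true, y_pred)        # (0 + 1 + 0 + 9) / 4 = 2.5
rmse = root_mean_squared_error(y_true, y_pred)  # sqrt(2.5)
```

The example also shows why MAE is easier to read: a single large miss (3 contacts) dominates the MSE, whereas the MAE stays in the original unit and weights every error linearly.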

A. Results
This subsection explains the results obtained in developing the HIV index-testing model.

1) Feature engineering:
The feature engineering experiment showed that having no knowledge of HIV had the strongest coefficient (0.5), followed by marital status (married) (0.175), age (0.15), female gender (0.1), and position, meaning the influence of someone in society (0.1); the remaining features had coefficients of less than 0.1. Fig. 4 visualises the features extracted using the random forest algorithm. Table II explains in detail the components selected for model development.

2) Data visualization:
This section presents insights into the data from different angles. Fig. 5 shows the number of HIV index contacts per client id. Fig. 6 illustrates the number of HIV index contacts by status and site. Fig. 7 visualises the HIV index versus HIV status and type of relationship. Fig. 8 and Fig. 9 show the number of HIV index contacts in each region and their distribution in terms of age.

3) Model development and evaluation:
The results obtained from model development using the three algorithms, shown in Table III, indicate that Random Forest performed best compared with the other two, having the smallest MAE value: the smaller the value, the better the model.
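The selection rule, choosing the model with the smallest MAE, can be expressed in a few lines. The values below are the MAE figures reported in the abstract of this paper; the variable names are illustrative.

```python
# MAE values as reported in the abstract of this paper.
reported_mae = {"Random Forest": 1.1261, "XGBoost": 1.2340, "ANN": 1.1268}

# The best model is the one with the smallest error.
best_model = min(reported_mae, key=reported_mae.get)

# Rank all models from smallest to largest MAE.
ranking = sorted(reported_mae.items(), key=lambda item: item[1])

# Margin between the best and the second-best model.
margin = round(ranking[1][1] - ranking[0][1], 4)
```

Note that the margin between Random Forest and ANN is only 0.0007, so the two models perform almost identically on this metric.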

B. Discussion
The domain knowledge was studied thoroughly, and various methods were used to solve the problem within the specified domain. The features that contributed most to the target value were identified. Lack of HIV knowledge contributed most to a client having many notifications (35%), followed by the position of the client in society (15%), marital status (14%), age (10%), and sex (8%).
Data visualisation shows that most clients referred only one person, followed by two and three, while a few referred between 12 and 20 people. The Kilimanjaro region had high returns compared with the other two regions and a good HIV index, particularly at the Hai site. Sexual partner notification accounted for the highest percentage of notifications, followed by biological children.
The best-performing algorithm was random forest, which had the smallest Mean Absolute Error (MAE) of 1.1261. The result remained unchanged after tuning the model with the best parameters found by GridSearchCV. Lastly, the model was saved, ready for deployment.
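What an exhaustive grid search with cross-validation does can be sketched without any library. In the sketch below, a small nearest-neighbour regressor stands in for the random forest so that the example stays short and self-contained; the parameter grid, the toy data, and the four folds are assumptions for illustration and are not the settings used in the study. GridSearchCV in scikit-learn follows the same idea: try every combination, score each by cross-validation, and keep the best.

```python
from itertools import product


def knn_predict(train, x, k, weighted):
    """Predict from the k nearest training points (single numeric feature)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / k
    # Closer neighbours get larger weights.
    weights = [1.0 / (abs(xi - x) + 1e-9) for xi, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def cv_mae(data, n_folds, **params):
    """Average MAE over n_folds held-out folds for one parameter setting."""
    fold_scores = []
    for f in range(n_folds):
        test = data[f::n_folds]
        train = [d for i, d in enumerate(data) if i % n_folds != f]
        errors = [abs(knn_predict(train, x, **params) - y) for x, y in test]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / n_folds


# Toy data (illustrative): the target grows roughly linearly with the feature.
data = [(float(x), 0.5 * x + (1 if x % 2 else -1) * 0.3) for x in range(20)]

# Hypothetical parameter grid; every combination is evaluated.
param_grid = {"k": [1, 3, 5], "weighted": [False, True]}
results = {}
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    results[tuple(values)] = cv_mae(data, n_folds=4, **params)

# Keep the combination with the smallest cross-validated MAE.
best_key = min(results, key=results.get)
best_params = dict(zip(param_grid.keys(), best_key))
```

If the best cross-validated score is no better than the score of the default settings, the tuned model is effectively unchanged, which matches the observation reported above.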

A. Conclusion
Machine learning is an essential skill today. It is widely used in health care in many ways, such as decision support, developing medical care guidelines, and detecting diseases. This paper used machine learning to predict the HIV index and visualisation to help decision-makers develop a suitable intervention strategy to end HIV/AIDS as a health threat to society.
However, in pursuing its objectives, the study encountered the following limitations: missing information in the health care data, and a lack of relevant information such as socio-economic and social behaviour data. This information could have had an impact on the results. The model was therefore developed using only the data that were collected.

B. Recommendation
Tanzania is one of the sub-Saharan countries with a high rate of people living with HIV. Therefore, client partner notification is vital and can help to achieve the 95-95-95 target. The study recommends that more research and development be carried out to capture all the data required for better results.
Given the limitations observed, the study recommends that the health care system, especially the units dealing with HIV/AIDS, use an automated system and review the data to be collected by both hospitals and stakeholders to facilitate quality data collection. In addition, HIV awareness education should be provided continuously to communities of all ages and in all areas.
ACKNOWLEDGMENT
I want to extend my special thanks to the Nelson Mandela African Institution of Science and Technology for granting me the opportunity to pursue a master's degree, to the Center of Excellence in ICT in East Africa (CENIT@EA) for supporting my studies, and to the Soft Med company, where the internship took place.