White Blood Cells Detection using YOLOv3 with CNN Feature Extraction Models

Leukemia is one of several types of blood cancer. It is caused by a problem in the production of leukocytes, or white blood cells (WBCs), in the bone marrow. Detection at an early stage is important so that the patient can receive proper treatment. The conventional detection and blood count method is inefficient because it is performed manually by a pathologist, which leads to long waits for results and delays treatment. A faster detection procedure would have a high impact on real-time diagnostics. These problems can be overcome by automating the blood test procedure, and one such effort is the development of deep learning for WBC detection and classification. In computer-aided WBC detection, platforms based on You Only Look Once (YOLO) have shown promising outcomes. However, the optimal YOLO structure remains unclear. This paper investigates deep learning based WBC detection using You Only Look Once version 3 (YOLOv3) with different pretrained Convolutional Neural Network (CNN) models as the feature extractor. The models tested are Alexnet, Visual Geometry Group 16 (VGG16), Darknet19, and the existing YOLOv3 feature extraction model, Darknet53. The architecture consists of bounding box prediction, class prediction, feature extraction, and additional convolutional layers. It was trained with 242 WBC images from the Leukocyte Images for Segmentation and Classification (LISC) dataset. The final outcome shows that the YOLOv3 architecture with Alexnet as its feature extractor produced the highest mean average precision, 98%, and performed better than the other models.

Keywords—Alexnet; darknet19; darknet53; detection; VGG16; white blood cells; YOLO


I. INTRODUCTION
The human immune system depends on the condition of the white blood cells (WBCs). Normal WBCs consist of basophils, eosinophils, lymphocytes, monocytes, and neutrophils. The immune system is affected by any abnormality in the WBCs. Blood cancer such as leukemia is the most common WBC abnormality. Patients with this type of cancerous cell have a weakened immune system and difficulty fighting viruses and bacteria [1] [2]. It also affects the production of red blood cells and platelets in the bone marrow. There are several types of leukemia. Acute Lymphoblastic Leukemia (ALL) is one of them, characterized by abnormal lymphocytes, a type of leukocyte. Early detection of this condition is crucial in order to establish proper treatment for the patient. A typical diagnostic method is the WBC count, which provides data on the immune system and any blood-related disease [3]. The conventional methods for leukemia diagnosis are bone marrow biopsy, lymph node biopsy, flow cytometry, lumbar puncture, lab tests, and imaging tests, which are very challenging. Current automated systems, in turn, depend on image processing, segmentation, feature extraction, and finally classification steps. Unfortunately, this approach has an optimization issue [4]. A simple automated blood count method has been developed using a Convolutional Neural Network (CNN) architecture, in which the microscopic images are fed directly into the architecture for classification to produce an output result [5]. In addition, neural network (NN) methods have evolved over the years and become the basis of faster detection methods.
The YOLOv3 detection method implemented in this project builds on the fundamentals of neural networks. It uses deep learning for the localization step through bounding box prediction, instead of the common sliding window search [6]. The training process uses a sum of squared error loss for the box coordinates and logistic regression to predict an objectness score for each box. The target score is 1 if the bounding box prior overlaps a ground truth object more than any other bounding box prior. For class prediction, independent logistic classifiers and a cross-entropy loss are used. Predictions are made at three different scales. For feature extraction, Darknet-53 is used, followed by a few additional convolutional layers. Finally, the network produces a labeled output image.
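As an illustration of the bounding box prediction step, the following minimal sketch decodes raw network outputs into a box, following the parameterization described for YOLOv3 [9]: the centre offsets pass through a logistic function and the width and height rescale a box prior. The function names, argument names, and sample values are assumptions made for this example and are not taken from the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function applied to the raw centre offsets."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, prior_w, prior_h):
    """Convert raw outputs into a box centre and size, in grid-cell units."""
    b_x = sigmoid(t_x) + cell_x    # centre stays inside the predicting cell
    b_y = sigmoid(t_y) + cell_y
    b_w = prior_w * math.exp(t_w)  # the prior (anchor) width is rescaled
    b_h = prior_h * math.exp(t_h)  # the prior (anchor) height is rescaled
    return b_x, b_y, b_w, b_h


# Hypothetical values: all-zero outputs in cell (3, 4) with a 2 x 3 prior.
print(decode_box(0.0, 0.0, 0.0, 0.0, 3, 4, 2.0, 3.0))  # (3.5, 4.5, 2.0, 3.0)
```

With all-zero outputs the box sits at the centre of its cell and keeps the size of the prior, which is why well-chosen priors give the network a reasonable starting point.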
This project investigated the use of different pretrained models (Alexnet, VGG16, Darknet-19) as the YOLOv3 feature extractor. Images from the LISC dataset were used to train and test the system. The finding of this research is the effect of different feature extractors on the detection rate and the detection average precision. Alexnet as the feature extractor showed the highest detection mean average precision. Thus, this model can improve existing blood smear detection methods.

II. YOU ONLY LOOK ONCE (YOLO)
You Only Look Once (YOLO) was first introduced by J. Redmon et al. [7]. The YOLO detection system is illustrated in Fig. 1. In essence, the system first resizes the image, then runs it through the convolutional network, and lastly applies non-max suppression. The outcome is the labeled image with detections. In the paper, object detection is framed as a regression problem to spatially separated bounding boxes and associated class probabilities. From a full image, a single neural network predicts both the bounding boxes and the class probabilities in one evaluation. The base model runs at 45 frames per second. Fast YOLO, a smaller version, runs faster at 155 frames per second and still attains a higher mean average precision than other real-time detectors.
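The non-max suppression step mentioned above can be sketched as a greedy filter: boxes are visited from the highest score to the lowest, and a box is kept only if it does not overlap an already kept box too strongly. This is a minimal sketch, not the authors' code. The corner-based box format, the 0.5 overlap threshold, and the sample boxes are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not strongly overlap any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover the same cell.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

In a multi-class detector this filter is usually applied per class, so that overlapping cells of different types do not suppress each other.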
The convolutional layers of the YOLO detection network are illustrated in Fig. 2. The network consists of 24 convolutional layers followed by two fully connected layers, with alternating 1x1 convolutional layers that reduce the feature space from the preceding layers. The overall YOLO model is trained jointly on a loss function that directly corresponds to detection performance.
The authors presented another paper in the same year on YOLOv2, an upgrade of the previous YOLO model [8]. It is also named YOLO9000 due to its ability to detect more than 9000 object categories in real time. Darknet-19 was used in the architecture as the classifier. It consists of 19 convolutional layers and five max pooling layers. The model can process different image sizes while balancing speed and accuracy. After that, the authors updated YOLOv2 to YOLOv3 [9]. In this new version, Darknet-19 was upgraded to Darknet-53 as the feature extractor. Its improvements include shortcut connections in the extractor, feature map upsampling, and concatenation.
Many types of pretrained models can be used in a machine learning architecture. Models such as Alexnet, VGG16, and Darknet have been trained on ImageNet, and their weights are kept ready for reuse. Most of these models are CNN based.
Alexnet consists of five convolutional layers (only layers 1, 2, and 5 are followed by max pooling) and three fully connected layers [10]. All the inner layers use the Rectified Linear Unit (ReLU) as the activation function, and the final output layer uses the softmax activation function. Compared with the tanh activation function, ReLU sped up training by up to six times on the Canadian Institute for Advanced Research 10 (CIFAR-10) dataset. Alexnet also allowed training on multiple graphics processing units (GPUs), which makes it possible to train bigger models and shortens the training time. Conventional CNN pooling summarizes the outputs of neighboring groups of neurons without overlapping. Alexnet instead introduced overlapping pooling, which reduced the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, and made the model less likely to overfit. The Alexnet architecture is summarized in Table I and illustrated in Fig. 3.
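The difference between conventional and overlapping pooling can be made concrete with a short sketch. It assumes the settings reported for Alexnet [10], a 3x3 window with stride 2, against a non-overlapping 2x2 window with stride 2, and a 55x55 first-layer feature map. The helper names are hypothetical.

```python
def pool_output_size(n, window, stride):
    """Side length after pooling an n x n feature map without padding."""
    return (n - window) // stride + 1


def window_spans(n, window, stride):
    """Inclusive (start, end) index span covered by each pooling window."""
    count = pool_output_size(n, window, stride)
    return [(i * stride, i * stride + window - 1) for i in range(count)]


# Overlapping scheme (window 3, stride 2): neighbours share one index.
print(window_spans(7, 3, 2))  # [(0, 2), (2, 4), (4, 6)]
# Non-overlapping scheme (window 2, stride 2): no shared indices.
print(window_spans(7, 2, 2))  # [(0, 1), (2, 3), (4, 5)]

# Both schemes map a 55 x 55 feature map to 27 x 27.
print(pool_output_size(55, 3, 2), pool_output_size(55, 2, 2))  # 27 27
```

Because both schemes give the same output size, overlapping pooling changes what each output summarizes without changing the shapes of the later layers.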
The VGG16 model has more layers than the Alexnet structure. The VGG16 architecture is illustrated in Fig. 4. It was introduced in 2014 as an improvement on Alexnet [11]. The large kernel-sized filters in Alexnet are replaced with multiple 3x3 kernel-sized filters. The model achieved 92.7% top-5 test accuracy on ImageNet. The structure is 16 layers deep, and the VGG16 model is summarized in Table II. The input to the first convolutional layer is a 224x224 RGB image. It passes through two convolutional layers with 3x3 kernels and 64 feature maps, followed by a max pooling layer, which produces a smaller output of dimension 112x112x64. This is followed by two more convolutional layers with 3x3 kernels and 128 feature maps, and then max pooling of the same size. The fifth to seventh layers are convolutional layers with 3x3 filters and 256 feature maps, followed by max pooling. The eighth to thirteenth layers form two groups of three convolutional layers with 3x3 filters and 512 feature maps, each group followed by a max pooling layer. The final reduced output is 7x7x512. The output of the convolutional layers is then flattened and passed through the fully connected layers, and lastly through a softmax layer. All the hidden layers use the ReLU activation function.
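The shrinking feature map sizes described above can be checked with a short trace. This sketch assumes the standard VGG16 layout from [11]: five blocks of 3x3 convolutions with padding 1, so the spatial size is unchanged inside a block, each followed by 2x2 max pooling with stride 2. The constant and function names are hypothetical.

```python
# (number of 3x3 convolutional layers, feature maps) for each block.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_vgg16_shapes(input_size=224):
    """Return the (height, width, channels) shape after each pooling stage."""
    size = input_size
    shapes = []
    for _num_convs, channels in VGG16_BLOCKS:
        # The padded 3x3 convolutions keep the spatial size unchanged,
        # so only the 2x2 max pooling with stride 2 halves each side.
        size //= 2
        shapes.append((size, size, channels))
    return shapes


for shape in trace_vgg16_shapes():
    print(shape)

# 13 convolutional layers plus 3 fully connected layers give the depth of 16.
conv_layers = sum(num_convs for num_convs, _ in VGG16_BLOCKS)
print(conv_layers + 3)  # 16
```

The trace runs from 112x112x64 after the first block down to the 7x7x512 output that is flattened for the fully connected layers.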
Another CNN-based model is Darknet-19, which was first used in YOLOv2 [8]. The model mostly uses 3x3 filters and doubles the number of channels after each pooling step. Global average pooling is used to make predictions, and 1x1 filters compress the feature representation between the 3x3 convolutions. Batch normalization is used to stabilize training, speed up convergence, and regularize the model. Darknet-19 consists of 19 convolutional layers and five max pooling layers. The model is summarized in Table III.

V. RELATED WORKS
The detection and classification of WBCs has been studied widely in both the medical and engineering fields. One study presented a Regional Convolutional Neural Network (RCNN) trained by transfer learning using Alexnet, VGG16, Googlenet, and Resnet50 [12]. The Resnet50 transfer learning showed the highest performance.
The same author also tested another CNN-based architecture that used different pretrained models as feature extractors with an Extreme Learning Machine (ELM) classifier [13]. The outputs of these feature extractors were combined, and the Minimum Redundancy Maximum Relevance (MRMR) method was used to select the most efficient features before the ELM was applied. Three studies were carried out. First, the Alexnet-ELM method, which used ELM to classify the features at the fully connected layers of each CNN model, obtained an accuracy rate of 95.29%. Second, the performance of the classifiers was tested, and the Resnet model achieved a 95.2% accuracy rate. Lastly, the paper studied the CNN-MRMR-ELM method on the WBC data, in which the MRMR feature selection algorithm was used to combine features at the last layers of the tested models. This method achieved the highest accuracy rate, 96.03%.
Another project demonstrated that Alexnet performs best as a feature extractor for WBC type classification in comparison with the Lenet and VGG16 architectures [14]. Discrete Transform (DT), quadratic discriminant analysis (QDA), linear discriminant analysis (LDA), support vector machine (SVM), and k-nearest neighbors (kNN) classifiers combined with Alexnet were also compared with a softmax classifier, and the highest accuracy, 97.78%, was obtained by the QDA-Alexnet combination.
Beyond WBC detection, Alexnet has also been applied to the detection and classification of red blood cells (RBCs) [15]. The designed framework was able to classify 15 types of RBCs. The results obtained were 95.92% accuracy, 77% sensitivity, 98.82% specificity, and 90% precision.
A different comparison was made between VGG16 and Resnet50 for WBC classification [16], in which Resnet50 achieved 88.29% accuracy. One project that used VGG16 is by M. Shahzad [17]. The framework starts by feeding the original images and ground truth images to the preprocessing stage, which includes pixel-level labeling and RGB-to-grayscale conversion. VGG16 is then used in the system as the feature extractor, and the training process begins. The system accuracies are 97.45% for RBCs, 93.34% for WBCs, and 85.11% for platelets. Meanwhile, another paper compared CNN models and found that Alexnet performed better than GoogleNet and Resnet-101 [18]. Another paper used an image processing algorithm for preprocessing together with VGG16 as its classifier [19], achieving 95.89% accuracy. In addition, one paper used the concept of capsules for its WBC classification model [20]. The developed model was shown to have higher precision than the Resnet and VGG models. A paper by K. Almezhghwi also applied VGG16, Resnet, and Densenet. It studied generative adversarial networks (GAN) and image transformation operations for data augmentation, together with deep neural networks for feature extraction [21]. The highest accuracy, 98.8%, was achieved using Densenet-169 as the feature extractor.

VI. METHODOLOGY
The project starts with preparing the hardware, software, and datasets. The hardware used for this experiment is an Intel® Core™ i5-5200U CPU @ 2.20GHz processor. The Spyder software by Anaconda is used for all programming activities, including training, detection, and result analysis.

A. Datasets
This project worked with the public LISC dataset. There are five image categories: Basophil, Eosinophil, Lymphocyte, Monocyte, and Neutrophil. The images are divided into training and testing sets as in Table IV. The numbers of training images are 53, 39, 52, 48, and 50, respectively. The testing set has five images for each category plus eight images containing multiple WBC types. Dataset preparation starts with annotating the images. The labeling is done in Microsoft's Visual Object Tagging Tool (VoTT) software. After the datasets are prepared, the experiment continues with the training and detection processes.

B. YOLOv3
The main structure of this project is YOLOv3 image detection. It starts with forming the bounding boxes, followed by class prediction, prediction across scales, feature extraction, and the convolutional layers. As shown by the YOLOv3 architecture in Fig. 5, it uses a multi-scale method for detecting multiple targets. The finer grid cells enable the detection of smaller targets. In every grid cell, three bounding boxes are predicted. The network then predicts the three categories and five basic parameters, which are x, y, w, h, and c.
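To make the multi-scale grid concrete, the sketch below computes the output grid sizes and the number of values predicted per grid cell. Several values are assumptions not stated in this paper: a 416x416 input, the strides of 32, 16, and 8 commonly used with YOLOv3 [9], and one class score for each of the five WBC types in the dataset. The function name is hypothetical.

```python
def yolo_head_shapes(input_size=416, strides=(32, 16, 8),
                     boxes_per_cell=3, num_classes=5):
    """Return the (grid, grid, depth) output shape for each detection scale."""
    # Each box carries x, y, w, h, c plus one score per class.
    depth = boxes_per_cell * (5 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]


shapes = yolo_head_shapes()
for shape in shapes:
    print(shape)  # (13, 13, 30), then (26, 26, 30), then (52, 52, 30)

# Total number of candidate boxes over the three scales.
total_boxes = sum(grid * grid * 3 for grid, _, _ in shapes)
print(total_boxes)  # 10647
```

The 52x52 grid has the smallest cells, which is the scale that supports the detection of smaller targets.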
Then, multilabel classification predicts the classes in each bounding box, using the binary cross-entropy loss during training. The three scales then go through feature extraction. In addition, the Darknet-53 model is supported by additional convolutional layers. The feature map from two layers previous is upsampled by a factor of two and concatenated with a feature map from earlier in the network, which yields more meaningful semantic information and finer-grained information. Fig. 6 shows the process flow of YOLOv3. The dataset was split into a training set and a testing set. The training set then underwent the bounding box step, class prediction, and prediction across scales. Later, the features were extracted by the Darknet-53 model before passing through the convolutional layers to produce the output. The steps were repeated with the extractor replaced by different CNN models. This project tested the Alexnet, VGG16, and Darknet-19 models, as shown in Fig. 7.
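The class prediction loss can be illustrated with a small sketch of binary cross-entropy over independent logistic outputs, one per class. This is not the authors' training code. The function names, the class order, and the sample logits are assumptions made for this example.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over independent per-class outputs."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(logits)


# Hypothetical class order: Basophil, Eosinophil, Lymphocyte, Monocyte, Neutrophil.
targets = [1, 0, 0, 0, 0]
uninformed = binary_cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], targets)
confident = binary_cross_entropy([4.0, -4.0, -4.0, -4.0, -4.0], targets)
print(uninformed)  # ln(2), about 0.693
print(confident < uninformed)  # True
```

Since each class has its own logistic output, the scores are not forced to sum to one as they would be with a softmax.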

C. Evaluation of the Classifier Test Data
The performance of the architectures was determined based on their accuracy, precision, and sensitivity ratios as in equations (1), (2), and (3), respectively.
Then, the mean average precision (mAP) value is obtained by using (4).
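As a hedged illustration of these measures, the sketch below uses their standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), with mAP taken as the mean of the per-class average precision values. It is assumed, not confirmed by the text, that equations (1) to (4) follow these definitions. All counts and AP values in the example are made up and are not results from this paper.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """TP / (TP + FN), also known as recall."""
    return tp / (tp + fn)


def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision (AP) values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Made-up counts for illustration only.
print(accuracy(tp=8, tn=9, fp=2, fn=1))  # 0.85
print(precision(tp=8, fp=2))             # 0.8
print(sensitivity(tp=8, fn=1))

# Made-up per-class AP values for two hypothetical classes.
print(mean_average_precision({"class_a": 1.0, "class_b": 0.5}))  # 0.75
```

The functions do not guard against empty denominators, so a class with no predictions or no ground truth objects would need separate handling.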

VII. RESULTS AND DISCUSSION
The

VIII. CONCLUSION
The YOLOv3 architecture originally uses Darknet-53 as its feature extractor. This paper investigated the effect of using different feature extractors (Alexnet, VGG16, and Darknet-19). The outcome of the experiment shows that YOLOv3 with Alexnet as the feature extractor obtained the highest mean average precision, 98%. It also had the lowest losses and the shortest testing time. As a result, it demonstrated higher performance than the other models. Further improvements could be made in future work, such as training on a larger dataset, testing on datasets from several different sources, and using a GPU to increase the training speed.