Lung Cancer Detection using Segmented 3D Tensors and Support Vector Machines

Abstract: Cancer is currently the second most prevalent cause of mortality, and its prevalence is expanding rapidly. The development of pulmonary nodules inside the lungs is suggestive of lung cancer. In this study, cancer is detected from nodules found in computed tomography (CT) images obtained from the LUNA16 dataset, using the Python library "PyTorch". A three-dimensional model is used to train on and extract the nodular segments from CT-scan images, referred to as CT-scan chunks. Chunks are used because handling a whole CT scan is impractical due to its size. The chunks are then transformed into PyTorch tensors. The tensors are input into a deep learning model to extract features, which are then passed to a series of machine learning classifiers: Support Vector Machines, Multi-layer Perceptron, Random Forest Classifier, Logistic Regression, K Nearest Neighbor, and Linear Discriminant Analysis. Our research shows that extracting chunks from CT-scan images and creating tensors from the segmented CT scans significantly improves the accuracy of various machine learning algorithms. This approach also reduces computation time at runtime. In our study, Support Vector Machines yielded the highest accuracy, 99.68%. The findings of this study may be valuable for the practical implementation of real-time lung nodule identification applications.


I. INTRODUCTION
Cancer is a fatal illness that kills millions of people worldwide. Lung cancer is responsible for around 350 deaths per day, making it the second leading cause of cancer-related deaths. Roughly 103,000 of the approximately 127,000 lung cancer deaths in 2023 (81%) are directly attributable to smoking. The remaining approximately 20,500 lung cancer deaths unrelated to smoking would by themselves rank as the eighth leading cause of cancer mortality [1]. Diagnosing pulmonary nodules is valuable for identifying and treating cancer in its initial stages, and therefore contributes significantly to saving lives [2]. Doctors face significant challenges when dealing with a substantial volume of CT-scan data. Computer-aided detection (CAD) systems have been shown to reduce the burden on radiologists while improving diagnostic accuracy for lung cancer [3]. The CDAM technique was used to visualize features and highlight the decision-making process inside Zhang's CAD (computer-aided diagnosis) system for lung cancer diagnosis [4]. Recent research on lung nodule identification demonstrates the exceptional performance of deep learning (DL)-based CAD systems [5].
DL techniques have shown exceptional efficacy in lung cancer diagnosis, particularly in the identification of pulmonary nodules. One study used Faster R-CNN and U-Net-like encoder-decoder models to analyze the LUNA16 dataset, resulting in an accuracy of 0.842 [6]. Duo et al. [7] employed a 3D CNN model for lung cancer detection. They divided their approach into two components, candidate screening and false positive reduction, and attained a sensitivity of 90.06% and a FROC score of 0.839. Khosravan et al. proposed S4ND, a procedure for end-to-end learning whose architecture is built from densely connected CNN units, and obtained a FROC score of 0.897 [8]. Wang et al. used three slices to make a 3D-RGB image for nodule detection and achieved an accuracy of 0.968 [9].
Nodule detection accuracy needs to improve before it can be used effectively in real-time scenarios. Moreover, three-dimensional convolutional neural network (3D-CNN) architectures have more potential than two-dimensional CNN (2D-CNN) designs for capturing the distinctive attributes of malignant nodules. Thus far, only a limited number of publicly available publications address the use of 3D CNNs for the identification of lung cancer. This study combines three-dimensional CT scans, 3D-CNN features, and machine learning classifiers, implemented with specialized Python libraries. Various state-of-the-art ML techniques have been proposed in the literature for identifying lung nodules in the context of lung cancer detection. However, there is ongoing debate over the efficacy of current methods for the timely identification of lung cancer. We therefore advocate a comprehensive methodology for the detection of lung cancer. The main contributions of this research are outlined below:
• Nodules are identified in images obtained from the publicly accessible LUNA16 dataset.
• Images are preprocessed by isolating the lung region, creating slices, and then segmenting the CT-scan slices. The nodule segment from the segmented slices is then transformed into CT-scan chunks, from which a 3D tensor is created.
• Features are extracted using a 3D deep learning model and then classified with several machine learning classifiers: Support Vector Machine, Random Forest, Multi-layer Perceptron, Logistic Regression, Linear Discriminant Analysis, and K Nearest Neighbors.
After this introductory section, Section II reviews the existing literature on lung cancer detection. Section III outlines our proposed technique. Section IV presents the findings of the proposed methodology, compares them with the outcomes of state-of-the-art algorithms, and analyzes the methodology used in the study. The concluding section summarizes the research effort and its contributions.

II. BACKGROUND STUDY
According to the data shown in Fig. 1, lung cancer has emerged as the second most prominent cause of mortality. Lung cancer can be diagnosed efficiently by detecting nodules in CT scan images, as shown in Fig. 2. Currently, pulmonary nodules are mostly diagnosed using various forms of chest imaging, such as MRI, X-ray, and CT. Because of its high accuracy and low cost, the CT scan has become the gold standard in diagnostics [10]. Owing to its high sensitivity, CT has also become the preferred method of screening for lung cancer. Classifying pulmonary nodules, however, remains a laborious task. Furthermore, the size of CT images is continually growing, adding more work to the routine diagnostic procedure for radiologists [11].
Manual feature engineering is often employed in conjunction with machine learning for medical image classification. Support vector machines (SVM) are a well-known machine learning approach for both binary and multi-class classification. Nodule detection has been performed using an adaptive morphology-based strategy and SVM [12]. Ren et al. applied manifold classification regularization together with a deep neural network (DNN) for nodule identification [13]. Image classification and object recognition both benefited greatly from the DNN's automatic feature extraction. In one study, a convolutional neural network (CNN) was used to detect and classify lung nodules [14]. CNN-based Generative Adversarial Networks have also been applied to classifying nodules in CT scan images [15]. Jiang et al. designed a pulmonary nodule classification system that improved deep features using contextual attention and located the region of interest (ROI) using spatial attention; an ensemble was employed to enhance the performance of the classifier [16]. A CNN was used along with a kernel-based non-Gaussian method for cancer detection. The authors applied adaptive histogram equalization in the preprocessing step to segment the region of interest and achieved an accuracy of 87% [17]. In another study, the deep learning model employed for nodule identification is a receptive field regularized V-net [18]. Lung nodules have also been classified with a hybrid approach that includes the SqueezeNet and ResNet models. In addition to convolutional neural networks (CNN), the authors used a bio-inspired approach called the "Whale optimization algorithm with adaptive particle swarm optimization" (WOA-APSO) for detecting lung cancer [19], achieving an accuracy of 97.18%. Rani et al. [20] used histogram equalization, the top-hat approach for noise removal, and the Boosted deep convolutional neural network (BDCNN) to classify lung cancer. The noise was filtered out using weighted mean histogram analysis, and features were extracted using a hybrid dual-tree complex. Ultimately, a deep CNN was used for lung cancer detection [21].
Guo et al. devised a sequential technique for identifying lung cancer in CT lung images [22]. The approach used both a CNN-based classifier and a feature-based classifier: the CNN-based classifier was applied first to a lung cancer detection dataset, and the feature-based classifier was then used for further analysis. Hesamian et al. segmented irregularly shaped nodules [23]. To tackle the issue of low contrast, the authors applied deep learning to synthetic images generated from novel color schemes, using a modified version of the DL-based U-Net model. Chen et al. used a dense neural network [24].
Batch normalization and dropout were used in conjunction with this dense architecture, and its performance on the LUNA16 dataset was commendable. Basal et al. and Xu et al. used ensemble learning to identify lung cancer [25], [26] and combined YOLOv3 with a CNN for malignant nodule detection. Fractal networks have also been employed to classify lung nodules [27]; the FractalNet model was evaluated on the LUNA16 dataset and reached an accuracy of 94.7%. Using ensemble learning, Muzammil et al. improved accuracy to 96.89% by integrating the SVM and AdaBoostM2 algorithms with linear discriminant analysis (LDA) [28]. Another study [29] used deep learning to build a segmentation technique for identifying lung cancer. Numerous other recent studies [30]-[33] covered a wide range of topics and highlighted additional real-world issues. The main objective of this research is to investigate lung cancer screening. The CT scans were segmented into chunks and then analyzed using a three-dimensional convolutional neural network (3D-CNN) model. Deep learning-based imaging algorithms have significantly enhanced the accuracy and efficacy of lung nodule segmentation, identification, and classification. These algorithms use previously acquired medical images to achieve superior performance in these tasks. The system's ability to engage in reinforcement learning facilitated this outcome. An assortment of unsupervised deep learning algorithms, such as CNN, Faster R-CNN, Mask R-CNN, and U-Net, have been used to create convolutional networks that aim to identify lung cancer and minimize the occurrence of false positives. This contrasts with the conventional approach of using supervised and reinforcement learning techniques. Previous research has shown that computed tomography (CT) is the most often used imaging modality in computer-aided detection (CAD) systems for lung cancer. 3D-CNN architectures show more promise than 2D-CNN designs in capturing the distinctive characteristics of malignant nodules. So far, only a limited number of research articles using 3D convolutional neural networks (CNNs) for lung cancer detection have been made publicly accessible.

III. METHOD
This section describes the proposed strategy. Fig. 8 provides a block diagram of the proposed technique. It shows the detection of nodules in the images, followed by the segmentation of the CT-scan image into distinct sections, as seen in Fig. 7. The segmentation and classification tasks were accomplished using deep learning techniques. In this work, a 3D tensor is assembled and then used in conjunction with machine learning algorithms to identify lung cancer.

A. Dataset
The data used in this research was obtained from Kaggle, a well-known data repository. The Lung Nodule Analysis 2016 Challenge dataset, often known as LUNA16, is publicly available and was updated in 2020. The dataset comprises CT scans from a total of 888 patients. Every CT scan has a slice thickness of no more than 2.5 millimeters. The data is saved in the MetaImage format, specifically as mhd/raw files: every .mhd file is accompanied by a separate binary file that contains the raw pixel data. The dataset has a total of 754,975 candidates, categorized into two groups: malignant and benign. Malignant nodules are labeled "1" and benign nodules are labeled "0". Fig. 3 depicts the distribution of lesion sites among the patients.

B. Preprocess
During the preprocessing phase, the following steps were conducted:
1) Extract CT-Scan images: The CT-scan files were read from the dataset using the "SimpleITK" package.
2) Unique CT scans: There are 888 unique CT scans for the 754,975 candidates.
3) Missing values: The data were examined for missing values. A total of 443 missing entries were identified, indicating that the corresponding CT scans were absent. These entries were removed.

4) Histogram:
The pixel values are converted to Hounsfield units (HU), and their histograms are then plotted, as seen in Fig. 4. The histograms span -2000 to 1000 HU. The histograms of all the CT-scan images exhibit a similar distribution.

5) Pulmonary region:
The pulmonary area is marked out by using several threshold values, as seen in Fig. 5.
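As an illustration of steps 4) and 5), the following minimal sketch converts raw voxel values to Hounsfield units, builds a histogram over the stated range of -2000 to 1000 HU, and marks candidate lung voxels by thresholding. The function names, the default rescale slope and intercept, the bin count, and the threshold values of -950 and -320 HU are assumptions made for this example; the thresholds actually used in the study are not stated.

```python
import numpy as np

def to_hounsfield(raw, slope=1.0, intercept=0.0):
    """Linear rescale of raw voxel values to HU (slope/intercept are assumed defaults)."""
    return raw.astype(np.float32) * slope + intercept

def hu_histogram(hu, bins=60, hu_range=(-2000, 1000)):
    """Histogram of HU values over the range reported in the text."""
    counts, edges = np.histogram(hu, bins=bins, range=hu_range)
    return counts, edges

def lung_mask(hu, lower=-950.0, upper=-320.0):
    """Boolean mask of voxels whose HU lies between two assumed thresholds.

    Air outside the body (about -1000 HU) and soft tissue (above about -320 HU)
    are excluded; the exact cut-offs are illustrative only.
    """
    return (hu > lower) & (hu < upper)

# Tiny synthetic example: air, lung parenchyma, soft tissue.
volume = to_hounsfield(np.array([[-1000, -700, 40]], dtype=np.int16))
counts, edges = hu_histogram(volume)
mask = lung_mask(volume)
```

In practice the volume would come from the scan read in step 1); a simple threshold mask of this kind is usually refined further, for example with connected-component filtering.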

C. Process
The data is divided into training and validation sets. The input images are transformed in the data preprocessing step, and noise and outliers are filtered out. Analysis of the data shows that the candidate file has a flag for each mass in the CT scans, and that "seriesuid" is a unique identifier for each CT scan. The class label is 0 if the candidate is not a nodule and 1 if it is a nodule (either malignant or benign). There are over 750,000 candidates. The candidates belonging to the same CT scan are grouped together, after which the diameters dictionary is used to get the diameter of each candidate. The candidate list is then sorted in reverse order so that all candidates that are nodules are positioned at the top of the list. Candidates without a corresponding CT scan are excluded from the dataset: of the roughly 750,000 candidates, only about 37,000 have a corresponding CT scan. A chunk is obtained by generating a list of three slices, one for each direction. The chunks are then transformed into PyTorch tensors for CT-scan analysis. Every tensor in the dataset has a shape of 1x10x18x18. These PyTorch tensors are fed into the deep learning model for feature extraction. Next, a training and testing dataset is created. The 3D CNN model used for feature extraction is detailed in Table III. A flatten layer converts the output of the max-pooling layer, which is a multidimensional tensor, into a one-dimensional representation. The network has two fully connected layers, fc1 and fc2, positioned after the flatten layer. Fully connected layers are often used for feature extraction; in this instance, the fc1 layer is used.
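To make the feature-extraction step concrete, the sketch below shows one possible 3D CNN in PyTorch that accepts 1x10x18x18 chunks, flattens the max-pooling output, and exposes the fc1 activations as features. It is not the architecture of Table III: the class name, the number of convolutional layers, the channel counts, and the feature size of 64 are all hypothetical choices made for this example.

```python
import torch
import torch.nn as nn

class ChunkFeatureNet(nn.Module):
    """Hypothetical 3D CNN; layer sizes are assumptions, not those of Table III."""

    def __init__(self, n_features=64, n_classes=2):
        super().__init__()
        self.conv1 = nn.Conv3d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(8, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(2)
        self.relu = nn.ReLU()
        self.flatten = nn.Flatten()
        # Two 2x poolings: (10, 18, 18) -> (5, 9, 9) -> (2, 4, 4)
        self.fc1 = nn.Linear(16 * 2 * 4 * 4, n_features)
        self.fc2 = nn.Linear(n_features, n_classes)

    def features(self, x):
        """Return fc1 activations, used as the feature vector for ML classifiers."""
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        return self.relu(self.fc1(self.flatten(x)))

    def forward(self, x):
        """Full forward pass through fc2, used when training the network itself."""
        return self.fc2(self.features(x))

net = ChunkFeatureNet()
batch = torch.randn(4, 1, 10, 18, 18)  # four chunks of shape 1x10x18x18
with torch.no_grad():
    feats = net.features(batch)         # one 64-dimensional vector per chunk
```

Calling `features` rather than `forward` stops at fc1, so the resulting vectors can be converted to NumPy arrays and passed to a separate classifier.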

IV. RESULTS AND DISCUSSION
In this study, a deep learning model is used for feature extraction. Beforehand, the data is preprocessed and transformed into three-dimensional CT-scan chunks that serve as input. The machine learning classifiers then use the extracted features for classification. Transfer learning has been used in our study. The LUNA16 [48] benchmark dataset is used for training and validating this model. LUNA16 is drawn from a collection of 1,018 CT scans, of which 888 are retained. The training set accounts for 80% of the data, while the remaining 20% is allocated for validation. The proposed model was evaluated using accuracy, a commonly used performance metric.
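The classification stage can be sketched with scikit-learn as follows: extracted features are split 80/20 and each of the six classifier families named in this paper is fitted and scored by validation accuracy. The function name, the default hyperparameters, the stratified split, and the random seed are assumptions for this example, and the features here are synthetic, so the scores say nothing about the results reported in this study.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def compare_classifiers(features, labels, seed=0):
    """Fit each classifier on 80% of the data; return validation accuracy per model."""
    x_train, x_val, y_train, y_val = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    models = {
        "SVM": SVC(),
        "MLP": MLPClassifier(max_iter=500, random_state=seed),
        "RF": RandomForestClassifier(random_state=seed),
        "LR": LogisticRegression(max_iter=1000),
        "KNN": KNeighborsClassifier(),
        "LDA": LinearDiscriminantAnalysis(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        scores[name] = accuracy_score(y_val, model.predict(x_val))
    return scores

# Synthetic stand-in for fc1 feature vectors (not real LUNA16 features).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 8))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)
scores = compare_classifiers(features, labels)
```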
Accuracy is defined as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
The variables TP and TN in equation (1) are the numbers of true positives and true negatives in the confusion matrix, and FP and FN are the numbers of false positives and false negatives. A confusion matrix is shown in Fig. 9. In this work, data is preprocessed, features are extracted, and machine learning algorithms are applied at the end of the pipeline. The performance of the proposed model is evaluated and compared with other contemporary approaches, as shown in Table II. A more comprehensive assessment of the robustness of the proposed model may be obtained by referring to Fig. 10. Fig. 11 illustrates the accuracy of all the classifiers used in this investigation.
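For reference, accuracy is the fraction of correct predictions, (TP + TN) / (TP + TN + FP + FN), and can be computed directly from the four cells of a binary confusion matrix. The function name and the counts in the example are illustrative only and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total

# Illustrative counts only: 95 correct predictions out of 100.
example = accuracy(tp=50, tn=45, fp=3, fn=2)
```

Because candidates that are not nodules greatly outnumber nodules in this kind of data, accuracy is usually read alongside the confusion matrix itself.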
The findings demonstrate that the proposed model surpasses the performance of existing algorithms, attaining an accuracy of 99.68%. Hence, this model has the potential to be used in real-time situations for identifying lung malignancies. One limitation of this work is that the proposed model is built using just one dataset. To assess the robustness of the model, it will be applied to several datasets. The abbreviations used in this study are shown in Table IV.
This study presents a methodology for the identification of lung cancer using publicly accessible data. The data was acquired from Kaggle, a well-known data source. The dataset, referred to as the Lung Nodule Analysis 2016 Challenge dataset or LUNA16, is publicly available and was revised in 2020. It consists of computed tomography (CT) images obtained from a total of 888 individuals. The data is stored in the MetaImage format, specifically as mhd/raw files: each .mhd file is accompanied by a separate binary file that contains the raw pixel data. The dataset comprises a total of 754,975 candidates, classified into two categories: malignant and non-cancerous. In the preprocessing stage, the CT-scan files were read from the dataset using the "SimpleITK" package. There were a total of 888 distinct CT scans among the 754,975 candidates included in our analysis (Fig. 6). The dataset was examined for missing values. A total of 443 missing entries were detected, indicating that the corresponding CT scans were absent, and these entries were removed. The pixel values are converted into Hounsfield units (HU), and their histograms are then plotted, as seen in Fig. 4. The histograms span -2000 to 1000 HU and show a comparable distribution across all the CT-scan images. The pulmonary region is delineated using several threshold values, as seen in Fig. 5.
The dataset is partitioned into a training set and a validation set. During data preparation, the input images are transformed, and noise and outliers are removed. The candidate file contains a flag for each mass seen in the CT images, together with a "seriesuid" that serves as a unique identifier for each CT scan. The class label is 0 when the candidate is not a nodule and 1 when it is a nodule, including both malignant and benign cases. There are more than 750,000 candidates in total. The candidates belonging to the same CT scan are grouped together, after which the diameters dictionary is used to get the diameter of each candidate. The candidate list is then sorted in reverse order so that all candidates that are nodules are placed at the top of the list. Candidates without a corresponding CT scan are removed from the dataset. Segmentation is of considerable significance in diagnostics.
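The candidate bookkeeping described here (dropping candidates whose scan is unavailable, looking up a diameter from a per-scan dictionary, and sorting so that nodules come first) might look like the sketch below. The record layout, the function name, and the rule that matches a candidate to an annotation when their centres differ by at most a quarter of the diameter on every axis are assumptions for this example and are not taken from the study.

```python
from collections import namedtuple

# Field order matters: sorting in reverse puts nodules, then larger diameters, first.
Candidate = namedtuple("Candidate", "is_nodule diameter_mm series_uid center_xyz")

def build_candidate_list(rows, diameters, available_uids):
    """rows: iterable of (series_uid, (x, y, z), class_label).
    diameters: dict mapping series_uid -> list of ((x, y, z), diameter_mm).
    available_uids: set of series_uid values whose CT scan is present.
    """
    candidates = []
    for series_uid, center, label in rows:
        if series_uid not in available_uids:
            continue  # no corresponding CT scan, so the candidate is dropped
        diameter = 0.0
        for ann_center, ann_diameter in diameters.get(series_uid, []):
            # Assumed matching rule: centres agree within diameter / 4 per axis.
            if all(abs(c - a) <= ann_diameter / 4 for c, a in zip(center, ann_center)):
                diameter = ann_diameter
                break
        candidates.append(Candidate(bool(label), diameter, series_uid, center))
    candidates.sort(reverse=True)
    return candidates

rows = [("a", (0, 0, 0), 0), ("a", (10, 10, 10), 1), ("b", (1, 1, 1), 1)]
diameters = {"a": [((10.5, 10, 10), 8.0)]}
result = build_candidate_list(rows, diameters, available_uids={"a"})
```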
CT scan slices are extracted from the images in order to eliminate any unwanted portions. The lung images are then segmented. Three-dimensional segments that include nodules are extracted from the segmented slices and fed into our model. Each chunk measures 10 units in length, 18 units in breadth, and 18 units in height. A chunk is obtained by constructing a list of three slices, one for each direction. The chunks are then converted into PyTorch tensors to facilitate CT-scan processing. The PyTorch tensors are used as inputs to 3D deep learning models for feature extraction. Following this, a dataset is constructed for training and testing, and a variety of machine learning classifiers are compared in order to choose the one with the highest performance. The results indicate that the proposed model outperforms current algorithms, achieving an accuracy of 99.68%. Therefore, this model has the potential to be used in real-time scenarios for detecting lung cancer.
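Chunk extraction as described, a list of three slices with one per direction, can be sketched with NumPy as follows. The function name, the (depth, height, width) axis order, and the choice to shift the window inward at the volume border are assumptions for this example; the centre is taken to be a voxel index that has already been converted from patient coordinates.

```python
import numpy as np

CHUNK_SHAPE = (10, 18, 18)  # chunk size stated in the text; axis order is assumed

def extract_chunk(volume, center, shape=CHUNK_SHAPE):
    """Crop a fixed-size chunk around `center` and add a channel axis.

    Assumes the volume is at least as large as `shape` along every axis.
    """
    slices = []
    for axis, size in enumerate(shape):
        start = int(round(center[axis] - size / 2))
        # Shift the window so it stays inside the volume near the borders.
        start = max(0, min(start, volume.shape[axis] - size))
        slices.append(slice(start, start + size))
    chunk = volume[tuple(slices)]
    return chunk[np.newaxis].astype(np.float32)  # shape (1, 10, 18, 18)

volume = np.zeros((120, 512, 512), dtype=np.int16)  # synthetic stand-in for a CT scan
chunk = extract_chunk(volume, center=(60, 256, 256))
edge_chunk = extract_chunk(volume, center=(119, 511, 511))
```

An array produced this way can be handed to PyTorch with `torch.from_numpy(chunk)`, giving the 1x10x18x18 tensor shape reported for this dataset.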

V. CONCLUSION
This study presents a novel approach that uses segmented 3D tensors, a 3D convolutional neural network, and SVM to identify lung nodules in computed tomography (CT) data. The proposed model is abbreviated as Seg3DT_SVM. The model is built to accept data chunks in the form of PyTorch tensors. The chunks are the segments of CT slices that contain the nodule; cross-sectional slices are the basic element of a CT scan. The proposed technique consists of an initial phase of data preparation, as described earlier, followed by the use of machine learning algorithms. The experiments used the publicly available LUNA16 dataset and were conducted using the PyTorch framework. The model achieves a considerably higher accuracy (99.68%) than existing state-of-the-art approaches on the same dataset. The method has the potential to assist healthcare workers in promptly identifying cases of lung cancer in real-time situations.

Fig. 1. Cancer incidence rates by gender in the USA from 1975 to 2019. Rates are adjusted for reporting lag times and age standardization to the US standard population of 2000 [1].

Fig. 3. The spatial coordinates of the lesions are situated in the x, y, and z dimensions. Red colour is used to indicate benign lesions, whereas black colour is used to indicate malignant tumours [26].

Fig. 10. Comparison of state-of-the-art accuracies with the proposed method.

TABLE I. SUMMARY OF LUNG CANCER DETECTION ARTICLES
Table I provides a review of recent research that has advanced our understanding. It illustrates the substantial amount of scholarly literature dedicated to lung cancer diagnosis, nodule detection, and classification.

TABLE II. COMPARISON OF PROPOSED MODEL WITH STATE-OF-THE-ART ALGORITHMS

TABLE IV. ABBREVIATIONS AND THEIR EXPLANATIONS USED IN THIS STUDY