Feature Selection Model Development on Near-Infrared Spectroscopy Data: Case Study of Beef Freshness Quality Prediction

This study develops a feature selection model for Near-Infrared Spectroscopy (NIRS) data. The object studied is beef, with six quality parameters: color, drip loss, pH, storage time, Total Plate Colony (TPC), and water moisture. The prediction model is a Random Forest Regressor (RFR) with default parameters. Feature selection is carried out by mapping the spectroscopic data into line form. The collection of lines is reduced to a single line by taking the mean value. A line simplification method based on angle elimination is then applied, starting from the smallest angle and proceeding to the largest. Each iteration eliminates one corner, which removes one column from the corresponding dataset. The R2 value of the prediction is collected at each iteration, and the column set with the highest value is taken as the best feature selection. Without feature selection, RFR yields the following R2 values: color 0.597, drip loss 0.891, pH 0.797, storage time 0.889, TPC 0.721, and water moisture 0.540. After applying the feature selection model, the R2 values for all parameters increase: color 0.877, drip loss 0.943, pH 0.904, storage time 0.917, TPC 0.951, and water moisture 0.893. Across the six parameters, the average increase in R2 is 0.1749 (17.49%). The feature selection method based on line simplification with angle elimination therefore provides very good results.


INTRODUCTION
People have consumed a lot of meat in the last few decades, and consumption has increased in recent years [1]. Beef is an alternative commodity that is widely consumed to meet the need for protein in many countries [2]. However, meat products spoil quickly, especially under conditions that accelerate microbial growth [3]. Many cases of consumer illness are caused by microbes present in beef in large numbers or above the permitted standard [4].
The condition of meat can change quickly, so a method is needed that can determine the current quality of beef. One fast method for determining meat quality is near-infrared (NIR) technology [5]. NIR spectrometers can predict meat quality parameters, including chemical parameters, technological parameters, quality traits, fatty acids, and many mineral contents [6] [7] [8].
To speed up and simplify the process of determining meat quality, a portable or handheld NIR device can be built that can be taken anywhere [9] [10]. Such a portable NIR device can be applied in industries that need to determine the characteristics of meat products [11] [12].
Machine learning can be used to predict meat quality parameters [13], including color, tenderness, juiciness, and flavor [14]. The random forest (RF) algorithm works well and produces high accuracy in classifying cattle breeds [15]. The Random Forest Regressor (RFR) performs well in predicting pH values of beef in real time in a beef freshness monitoring system [16].
In this study, we propose a feature selection model for spectroscopic data to increase the accuracy of meat quality predictions. The beef quality parameters considered are color, drip loss, pH, storage time, total plate colony (TPC), and water moisture. The algorithm used is RFR with its default parameters. Based on the experimental results, the proposed method increases accuracy in predicting the freshness quality of beef compared with results obtained without feature selection.

A. Feature Selection
Feature selection involves exploring algorithms designed to decrease the dimensionality of the data, thereby enhancing the performance of machine learning. In datasets with N features and M dimensions, the primary goal is to reduce M to M′, where M′ is less than or equal to M [17]. Feature selection often results in improved learning outcomes, such as increased accuracy, reduced computational expense, and enhanced interpretability of the model.
Recently, experts in fields such as computer vision and text mining have introduced numerous feature selection methods. Through both theoretical frameworks and practical experiments, they have demonstrated the effectiveness of their approaches [18]. Feature selection using established line simplification methods, such as the Ramer-Douglas-Peucker algorithm and the Visvalingam algorithm, on NIRS data encountered problems in determining epsilon values, making it difficult to produce optimal datasets. This study compares the prediction accuracy of the proposed feature selection model with that of an established method from Scikit-Learn, namely SelectFromModel [19]. The comparison covers prediction accuracy on the six meat quality parameters, using the R-squared (R2) value to evaluate both the SelectFromModel feature selection model and the proposed feature selection model.

B. Random Forest Regressor
Random forests consist of a collection of tree predictors where each tree relies on a randomly sampled vector, independent and identically distributed across all trees within the forest. As the number of trees in the forest increases, the generalization error for forests gradually converges to a limit. This error is influenced by both the effectiveness of the individual trees in the forest and the level of correlation among them [20].
The advantages of feature selection and its relevance to enhancing the effectiveness and interpretability of machine learning algorithms are well established. Feature selection can also be incorporated into a Random Forest setup: one method integrates hypothesis testing with an estimate of the anticipated impact of an irrelevant feature while constructing a Random Forest [21].
R2 is used for model assessment in this research because it is more informative and reliable than SMAPE and does not have the interpretational limitations of metrics such as MSE, RMSE, MAE, and MAPE [22]. The R2 value is easy to interpret because it is a decimal number that normally lies between 0 and 1. A value of 0 indicates poor model performance, and a value of 1 or close to it indicates good model performance. The formula for calculating the R2 value can be seen in Eq. (1) [22].
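As an illustration of the metric, the following minimal Python sketch computes R2 from its standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical pH readings and predictions, for illustration only.
y_true = [5.4, 5.6, 5.9, 6.3]
y_pred = [5.5, 5.6, 5.8, 6.2]
print(round(r_squared(y_true, y_pred), 3))
```

A model that always predicts the mean of the observed values scores 0 under this definition, and a perfect model scores 1.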
C. Near-Infrared Spectroscopy
Near-infrared spectroscopy (NIRS) holds a distinct position within the realm of bioscience and its associated fields, differing in characteristics and potential applications from infrared (IR) or Raman spectroscopy. This type of vibrational spectroscopy uncovers molecular details within a sample by detecting absorption bands arising from overtones and combined excitations [23].
A previous study evaluated the capacity of near-infrared reflectance spectroscopy (NIRS) to differentiate between Normal and DFD (dark, firm, and dry) beef and to forecast quality characteristics in 129 Longissimus thoracis (LT) samples derived from three Spanish pure breeds: Asturiana de los Valles (AV; n = 50), Rubia Gallega (RG; n = 37), and Retinta (RE; n = 42). Partial least squares-discriminant analysis (PLS-DA) successfully differentiated Normal from DFD meat samples for AV and RG (with sensitivity exceeding 93% for both and specificity of 100% and 72%, respectively), whereas results for RE and for the overall sample set were less accurate. Soft independent modeling of class analogies (SIMCA) gave 100% sensitivity for DFD meat across all sample sets (total, AV, RG, and RE) and over 90% specificity for AV, RG, and RE, but notably lower specificity for the total sample set (19.8%). NIRS quantitative models built with partial least squares regression (PLSR) allowed dependable prediction of color parameters (CIE L*, a*, b*, hue, chroma). The findings from both the qualitative and quantitative analyses hold promise for early decision-making in the meat production process to prevent financial losses and food waste [24].
Growing concern about contaminated meat has spurred the industry to explore novel, non-invasive techniques for swift and precise assessment of meat quality. The primary chromophores in meat (such as myoglobin, oxy-myoglobin, fat, water, and collagen) exhibit similar absorption patterns within the visible to near-infrared (NIR) spectral range. Consequently, variations in the structure and composition of meat can result in proportional differences in light absorption [25].

A. Sample Preparation
This study used fresh beef obtained from traditional markets in Bogor City, West Java, Indonesia, and brought to the laboratory in an ice box, as shown in Fig. 1. The part of the carcass used is the tenderloin. The main sample weighs 1 kg, as shown in Fig. 2. It is then cut into eight large samples weighing 17±2 grams and eight small samples weighing 3±1 grams, as shown in Fig. 3. All samples were placed on the laboratory table in Petri dishes and supported by wire gauze, as shown in Fig. 4. The large samples were used for testing color, pH, water-holding capacity (WHC), water content, and NIR, while the small samples were used for TPC measurements. Eight samples were used per day, and the procedure was repeated for ten days to obtain 80 samples.

B. Data Acquisition
This study uses six beef freshness parameters, each measured with a different tool. The six freshness parameters are color, drip loss, pH, storage time, TPC, and water moisture. Color data are collected with a chromameter, as shown in Fig. 5.
The drip loss value is obtained by measuring the change in sample weight between the initial and final times. Measurements are taken at one-hour intervals over a total span of seven hours, producing eight drip loss values from hour 0 to hour 7. Sample weight is measured with a laboratory scale, as shown in Fig. 6. The pH of the sample was measured with an electronic pH meter, as shown in Fig. 8. The storage time value is obtained by simply keeping the sample on the table from hour 0 to hour 7, with a measurement break every hour. The water moisture value is determined by the drying (thermogravimetric) method in an oven at 105 degrees Celsius for 16 hours. Laboratory data collection was carried out over ten days, resulting in a dataset of 80 data rows, as shown in Table I.
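The two weight-based parameters can be illustrated with a short Python sketch. The formulas below are the conventional definitions (weight lost relative to the starting weight), which is an assumption here because the study does not state its exact formulas. The function names and the sample weights are hypothetical.

```python
def drip_loss_percent(initial_weight_g, current_weight_g):
    """Weight lost since the first weighing, as a percentage of the initial weight."""
    return (initial_weight_g - current_weight_g) / initial_weight_g * 100.0

def moisture_percent(wet_weight_g, dry_weight_g):
    """Water removed by oven drying, as a percentage of the wet weight."""
    return (wet_weight_g - dry_weight_g) / wet_weight_g * 100.0

# Hypothetical hourly weights (g) of one sample from hour 0 to hour 7.
weights = [17.0, 16.9, 16.8, 16.7, 16.6, 16.5, 16.4, 16.3]
drip_loss = [drip_loss_percent(weights[0], w) for w in weights]  # eight values
```

Under this convention the value at hour 0 is zero by construction, and the eight hourly weighings give the eight drip loss values described above.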

C. NIRS Data and Dataset
NIRS data are obtained by scanning samples with a NIRS sensor. The sensor used is the NeoSpectra Development Kit [26], as shown in Fig. 9 and Fig. 10. Data are acquired wirelessly using a notebook running the software provided by the sensor manufacturer, as shown in Fig. 11. The measurement results are saved in a spreadsheet file. An example of the averaged data can be seen in Table II. The total NIRS data collected comprise 720 data rows and 136 columns, one column per wavelength value from the sensor; the original spectra are plotted in Fig. 12. The wavelengths range from 1346.61 to 2556.24 nanometers (nm).
The feature selection process goes through four stages, as follows:

A. Mean and Single Data Line
In this stage, the mean of all spectra is computed to produce a single line that represents all of the spectral data, as shown in Fig. 14. The line is then divided into separate segments, each drawn in a different color, as shown in Fig. 15. Each segment is paired with its adjacent segment, and the angle between the two adjacent segments is calculated, as illustrated in Fig. 16.
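The mean-line and angle computation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the angle at each interior point is taken as the turning angle between the two adjacent segments (0 degrees for a perfectly straight continuation), and both axes are used in their raw units. The study does not specify either choice, and the function names and toy data are illustrative.

```python
import numpy as np

def mean_line(spectra):
    """Collapse a (rows x wavelengths) matrix into a single mean spectrum."""
    return np.asarray(spectra, dtype=float).mean(axis=0)

def turning_angles(x, y):
    """Turning angle in degrees at each interior point of the polyline (x, y)."""
    points = np.column_stack([x, y]).astype(float)
    incoming = points[1:-1] - points[:-2]   # segment arriving at each interior point
    outgoing = points[2:] - points[1:-1]    # segment leaving each interior point
    dot = np.sum(incoming * outgoing, axis=1)
    norms = np.linalg.norm(incoming, axis=1) * np.linalg.norm(outgoing, axis=1)
    cosine = np.clip(dot / norms, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cosine))

# Toy example: three spectra sampled at five hypothetical wavelengths.
wavelengths = np.array([1350.0, 1360.0, 1370.0, 1380.0, 1390.0])
spectra = np.array([[1.0, 2.0, 3.0, 3.0, 5.0],
                    [3.0, 4.0, 5.0, 5.0, 7.0],
                    [2.0, 3.0, 4.0, 4.0, 6.0]])
line = mean_line(spectra)                    # one value per wavelength
angles = turning_angles(wavelengths, line)   # one angle per interior wavelength
```

A line with n points has n - 2 interior points, so a spectrum with 136 wavelength columns yields 134 angles. Because the angles depend on how the two axes are scaled, a different scaling choice would change the elimination order.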

B. Iterative Line Simplification
Calculating the angles along the whole line produces 134 angle values. The angle values are then sorted in ascending order. Each iteration eliminates the corner with the smallest remaining angle, and the column corresponding to the eliminated corner is removed as well. At this stage, the set of remaining data columns is stored, indexed by iteration number. The elimination stages are depicted in Fig. 17.
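A minimal Python sketch of the elimination order follows. It assumes that the angles are computed once and then removed in ascending order without recomputation, which is one reading of the description, and that interior angle i belongs to column i + 1 because the two end columns have no angle. The function name `elimination_schedule` is hypothetical.

```python
import numpy as np

def elimination_schedule(angles, n_columns):
    """Return the indices of the columns still kept after each elimination step."""
    order = np.argsort(angles, kind="stable")   # smallest angle first
    kept = np.ones(n_columns, dtype=bool)
    schedule = []
    for vertex in order:
        kept[vertex + 1] = False                # interior vertex i maps to column i + 1
        schedule.append(np.flatnonzero(kept))   # snapshot of the surviving columns
    return schedule

# Toy example: five columns, hence three interior angles.
schedule = elimination_schedule([30.0, 10.0, 20.0], n_columns=5)
for step, columns in enumerate(schedule):
    print(step, columns.tolist())
```

Each entry of the returned list is one candidate column set, so the list position doubles as the iteration index used later to retrieve the best set.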

C. Random Forest Regressor
Each iteration produces a new dataset with one fewer column. The new dataset then enters the machine learning process to produce an R2 value for that iteration. At this stage, the RFR parameters are left at their default settings, as shown in Table III. The data are split 70% for training and 30% for testing. The R2 result of each iteration is stored in an array. The highest R2 value and its position in the array are then found and used to identify the best set of column indices.
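The evaluation loop can be sketched with scikit-learn as below. This is an illustrative sketch, not the authors' code: the `random_state` argument, the function name `score_column_sets`, and the synthetic data are assumptions added for reproducibility, while the default RandomForestRegressor settings and the 70/30 split follow the description in this section.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def score_column_sets(X, y, column_sets, random_state=0):
    """Train a default RFR on each column subset and collect the test R2."""
    scores = []
    for columns in column_sets:
        X_train, X_test, y_train, y_test = train_test_split(
            X[:, columns], y, test_size=0.3, random_state=random_state
        )
        # Only random_state is set (an assumption); everything else is default.
        model = RandomForestRegressor(random_state=random_state)
        model.fit(X_train, y_train)
        scores.append(r2_score(y_test, model.predict(X_test)))
    return np.array(scores)

# Synthetic stand-in data: 80 rows, 6 columns, target driven by column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=80)
column_sets = [np.array([0, 1, 2, 3, 5]), np.array([0, 1, 3, 5]), np.array([0, 3, 5])]
scores = score_column_sets(X, y, column_sets)
```

Fixing the split seed across iterations keeps the comparison between column sets from being affected by different train/test partitions. Whether the study fixed the seed is not stated.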

D. The best result of the selected feature
The final stage is to determine the largest value among the R2 values collected in the array. In this study, the highest R2 value is found with the numpy.max() function from the NumPy library [27]. The iteration position of the highest R2 value is found with numpy.argmax() [28], which returns the index of the highest R2 value.
In summary, the R2 value of each iteration is stored, the index of the highest value in the array is located, and that index is used to retrieve the corresponding set of columns.
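The selection step amounts to a few lines of NumPy. In the sketch below, the scores and column sets are made-up placeholders standing in for the arrays built during the iterations.

```python
import numpy as np

# Hypothetical R2 scores, one per elimination iteration.
r2_scores = np.array([0.61, 0.74, 0.88, 0.83])
# Hypothetical column index sets stored at the same iterations.
column_sets = [[0, 1, 2, 3, 5], [0, 1, 3, 5], [0, 3, 5], [0, 5]]

best_r2 = np.max(r2_scores)                 # highest R2 value
best_iteration = int(np.argmax(r2_scores))  # position of that value in the array
best_columns = column_sets[best_iteration]  # column set stored at that iteration
```

If several iterations share the same highest score, `numpy.argmax` returns the first of them, which here corresponds to the earliest iteration and therefore the largest of the tied column sets.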

A. Results from the Original Dataset and the Proposed Model
Feature selection and the machine learning model are applied to each of the six meat quality parameters, namely color, drip loss, pH, storage time, TPC, and water moisture. The prediction results using the original dataset are then compared with those using the dataset that has gone through the feature selection stage. The comparison of R2 values can be seen in Table IV.
Based on the R2 values in Table IV, the line simplification-based feature selection model with the corner elimination method succeeded in improving the performance of the RFR algorithm in predicting all beef quality parameters. The increase in R2 is satisfactory for all parameters: the smallest increase, 0.028, is in the prediction of storage time, and the largest, 0.353, is in the prediction of water moisture. The average increase across all parameters is 0.1749, or 17.49%.
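The arithmetic behind these figures can be checked with a short Python sketch that uses the rounded R2 values reported in the abstract. The variable names are illustrative.

```python
# R2 values reported in the abstract, before and after feature selection.
before = {"color": 0.597, "drip loss": 0.891, "pH": 0.797,
          "storage time": 0.889, "TPC": 0.721, "water moisture": 0.540}
after = {"color": 0.877, "drip loss": 0.943, "pH": 0.904,
         "storage time": 0.917, "TPC": 0.951, "water moisture": 0.893}

gains = {name: after[name] - before[name] for name in before}
mean_gain = sum(gains.values()) / len(gains)
smallest = min(gains, key=gains.get)   # parameter with the smallest increase
largest = max(gains, key=gains.get)    # parameter with the largest increase

print(f"mean gain: {mean_gain:.3f}, smallest: {smallest}, largest: {largest}")
```

With these rounded inputs the mean gain comes out at about 0.175, in line with the reported 0.1749. The small difference presumably comes from rounding in the published values.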

B. Results from the Original Dataset and SelectFromModel
For comparison, the feature selection model from Scikit-Learn, SelectFromModel, was also tested. The experimental results of applying SelectFromModel can be seen in Table V. Based on the R2 values in Table V, feature selection with SelectFromModel also improves the performance of the RFR algorithm. The smallest improvement, 0.026, is in the prediction of drip loss, while the largest, 0.2836, is in the prediction of water moisture. Because R2 increases for all parameters, feature selection with SelectFromModel can be said to work very well. The average increase in R2 across all parameters is 0.1417, or 14.17%.
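For reference, a minimal sketch of how SelectFromModel is typically used with a random forest is shown below. The synthetic data and the choice of RandomForestRegressor as the wrapped estimator are assumptions for illustration; the study does not state which estimator or threshold it passed to SelectFromModel.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 80 rows, 20 columns, target driven by columns 2 and 7.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(scale=0.1, size=80)

# Wrap an unfitted estimator; the selector fits it and ranks features by importance.
selector = SelectFromModel(RandomForestRegressor(random_state=0))
selector.fit(X, y)

mask = selector.get_support()        # boolean mask of the kept columns
X_selected = selector.transform(X)   # dataset reduced to the kept columns
print(X_selected.shape[1], "of", X.shape[1], "columns kept")
```

When no threshold is given and the estimator exposes feature importances, SelectFromModel keeps the features whose importance is at least the mean importance. Unlike the proposed method, the number of kept columns is therefore set by the threshold rather than by searching for the highest R2.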

A. R 2 Score Comparison between the Proposed Model and SelectFromModel
In this stage, the R2 values from the proposed feature selection model and from SelectFromModel are compared. The comparison also lists the number of features selected at the highest R2 value, as shown in Table VI. Based on Table VI, the R2 values of the proposed feature selection model are higher than the SelectFromModel results, except for the prediction of storage time. Overall, the average increase in R2 for the proposed model, 17.49%, is higher than the 14.17% obtained with SelectFromModel.

B. Overview of Selected Features
The result of the feature selection model is a set of features in the form of data columns; in this study, the column names are wavelength values in nanometers (nm). The number of features produced by the proposed model and by SelectFromModel is smaller than the number of columns in the original dataset, so feature selection also reduces data dimensionality. To show the differences in the number and identity of the selected features, a visualization was created for each parameter, as shown in Fig. 18 to Fig. 23.
Each color represents one feature selection model. Red represents the mean of all NIRS data. Green represents the mean of the data columns selected by the proposed model, while blue represents the results from SelectFromModel.

VII. CONCLUSION AND FUTURE WORK
This study has produced a feature selection model based on line simplification by eliminating angles in the average spectrum of beef NIRS data. The proposed feature selection model increases RFR performance by 17.49% on average. This result is higher than the average increase in R2 of 14.17% produced by SelectFromModel. Apart from increasing prediction accuracy, the model also reduces data dimensionality, and a smaller dataset requires less time in the machine learning process.
For further work and development, the proposed feature selection model can be applied to deep learning algorithms. It can also be combined with RFR and hyperparameter tuning. The combination with hyperparameter tuning may require a longer time to find the solution set with the highest accuracy. However, RFR with hyperparameter tuning is expected to produce better accuracy than RFR with default parameters.

Fig. 4. The samples are placed on a petri dish with a wire mesh base.

Fig. 6. Weight measurement process.

TPC parameter values were obtained from other laboratories and measured professionally by a third party. The samples sent are the same pieces used for measuring other parameters. Examples of samples are shown in Fig. 7.

Fig. 11. Acquisition using NeoSpectraKit in progress, covered by a box.

TABLE I
Note: L, lightness; a, red/green value; b, yellow/blue value; Wdish, the weight of the empty dish; Wsample, the weight of the current sample; Wt, weight changes.

TABLE II. EXAMPLE OF NIRS DATA

TABLE IV. R2 SCORE USING THE ORIGINAL DATASET AND USING FEATURE SELECTION

TABLE V. R2 SCORE USING THE ORIGINAL DATASET AND SELECTFROMMODEL

TABLE VI. COMPARISON OF R2 SCORES BETWEEN THE PROPOSED MODEL AND SELECTFROMMODEL