Towards a Machine Learning-based Model for Automated Crop Type Mapping

Abstract: In smart farming, automated crop type mapping is a challenging task that supports fast and automatic management of the agricultural sector. With the emergence of advanced technologies such as artificial intelligence and geospatial technologies, new concepts have been developed to provide realistic solutions for precision agriculture. This study presents a machine learning-based model for automated crop type mapping with high accuracy. The proposed model uses both optical and radar satellite images to classify crop types with machine learning algorithms. Two classifiers, Random Forest and Support Vector Machine, were employed to classify time series of vegetation indices. Several indices were calculated from both optical and radar data. Harmonic modeling was also applied to the optical indices: each time series was decomposed into harmonic terms, which were used to calculate its fitted values. The proposed model was implemented using the geospatial processing services of Google Earth Engine and tested on a case study with about 147 satellite images. The results show the annual variability of crops and allow crop type classification and mapping with an accuracy that exceeds that of existing models.


I. INTRODUCTION
Agriculture has long been a challenging economic sector and a vital pillar of development in many countries. The first challenge is to ensure food self-sufficiency and respond to the increasing requirements of a growing population. The global report of the Food and Agriculture Organization of the United Nations [1] highlighted food insecurity and its accelerating upward trend. The report suggested that, to prevent severe hunger, bold transformation must be carried out in agri-food systems. Many recent techniques and systems aim to ensure the sustainable development of agriculture [2], and new concepts such as precision agriculture have been developed [3]. The recent revolution in digital technologies, including geospatial technologies, remote sensing data [4], and artificial intelligence [5], [6], has radically changed agricultural management. This study proposes a new model to show the contribution of machine learning algorithms to the identification of crop types using both optical and radar satellite images with time series of different vegetation indices. The general aim is to establish a method for improving crop type mapping accuracy and to demonstrate the contribution of optical and radar data and their complementarity.
The paper proceeds as follows. Section II reviews related works. Section III introduces the proposed model. The case study is presented in Section IV. Section V presents and discusses the results, and Section VI concludes and outlines intended future work.

II. RELATED WORKS
Crop-type mapping using remote sensing data and machine learning techniques is the subject of many studies [7]- [10]. Optical data has been a reference for crop mapping studies because phenological development and biological properties of crops are easy to link with optical acquisitions in order to differentiate crop species [11]- [13]. Optical data can also identify the growing stages of a single crop, such as rice in the case study of [14]. The study in [11] used Sentinel-2 time series data with a gap-filling method to overcome data discontinuity caused by cloud cover. Interpolation techniques were also used by [11] in the DATimes software to capture seasonal vegetation dynamics. Different optical sensors were also combined to increase the temporal frequency of the time series and to capture field-level phenologies [15]. Both optical and radar data were used by [16] to detect paddy rice fields using phenological variations and a textural-based strategy. Radar images alone were investigated by [17] to detect winter wheat phenological stages by analyzing the temporal variations of the Sentinel-1 time series as a function of the different phenological phases. The study in [18] also used Sentinel-1 time series to conduct classification considering spatiotemporal phenological information.
Different machine learning algorithms have been employed to produce accurate crop maps. The support vector machine (SVM) and random forest (RF) classifiers have been the most popular in recent years for the classification of satellite images [19]. Many papers reported better performance of SVM [8], [20]- [22], as well as of decision tree (DT) and RF algorithms, compared to other techniques.
Crop mapping has been approached in the literature using different methods, techniques, and datasets. Although the reported results show good accuracies, geospatial techniques continue to evolve: the integration of Google Earth Engine allows large-scale processing to be performed almost instantly, which can be exploited to identify crops early in the season. The accuracies obtained in the state of the art can still be improved by exploiting the complementarity between radar and optical time series and their indices.
This research proposes a new machine learning method that combines time-series indices extracted from optical and radar satellite images with machine learning classifiers to perform automated, high-accuracy crop-type mapping.

III. PROPOSED MODEL
The proposed model employs time series from both optical and radar data. The model was implemented using the geospatial processing services of Google Earth Engine (Fig. 1).

A. Preprocessing
Cloud masking: An important pre-processing step for the image collection is to remove the disturbance caused by clouds and shadows from the imagery. Cloud masking was performed using the cloud probability band created with the Sentinel-2 cloud detector library. The maximum cloud probability was limited to 25. The gaps in each masked image were then filled with the previous interpolated image.
Speckle filtering: Radar satellite images are usually affected by speckle noise. Multiple statistical methods have been developed to remove speckle while preserving image details. The study conducted by [23] compared different filtering methods dedicated to speckle suppression in SAR images and found that the Lee-Sigma and Gamma-MAP filters preserve detail relatively better than other filter types. The author in [24] also concluded that the Gamma-MAP filter is reliable, based on a comparison between the Lee, Frost, and Gamma-MAP filters. The present model uses the Gamma-MAP filter, which is based on the Bayesian analysis of image statistics and uses the Maximum A Posteriori (MAP) estimation method. With this filter, a Gamma distribution is assumed for both the underlying image and its speckle noise. The filter therefore works best for geospatial images containing homogeneous areas such as oceans, forests, and fields.
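The masking and gap-filling logic can be sketched for a single pixel as follows. This is a minimal illustration in plain Python, not the Google Earth Engine implementation used in the study; the function names, the sample values, and the choice to keep observations at or below the threshold are assumptions.

```python
from typing import List, Optional

MAX_CLOUD_PROBABILITY = 25  # threshold stated in the text (percent)


def mask_cloudy(values: List[float], cloud_prob: List[float],
                max_prob: float = MAX_CLOUD_PROBABILITY) -> List[Optional[float]]:
    """Replace observations whose cloud probability exceeds max_prob with None.

    Keeping values exactly at the threshold is an assumption of this sketch.
    """
    return [v if p <= max_prob else None for v, p in zip(values, cloud_prob)]


def fill_from_previous(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each gap with the most recent valid (or already filled) value.

    Leading gaps stay None because no earlier observation exists.
    """
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical index values and cloud probabilities for one pixel.
ndvi = [0.21, 0.35, 0.90, 0.52, 0.61]
prob = [3, 10, 80, 20, 60]
masked = mask_cloudy(ndvi, prob)
print(masked)                      # cloudy dates become None
print(fill_from_previous(masked))  # gaps take the previous value
```

In an image-based workflow the same two steps would be applied per pixel across the whole collection.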
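As an illustration of the Gamma-MAP principle, the sketch below applies the commonly published closed-form MAP estimate to each pixel of a small intensity image. It is a simplified version under stated assumptions: the input is in linear power units (not dB), the number of looks `looks` and the 3x3 window are hypothetical settings, and the function names are not taken from the study.

```python
import math
from typing import List


def gamma_map_pixel(window: List[float], center: float, looks: float) -> float:
    """Gamma-MAP estimate for one pixel from the values in its local window."""
    mean = sum(window) / len(window)
    if mean <= 0:
        return center
    var = sum((v - mean) ** 2 for v in window) / len(window)
    ci = math.sqrt(var) / mean      # local coefficient of variation
    cu = 1.0 / math.sqrt(looks)     # coefficient of variation of the speckle
    cmax = math.sqrt(2.0) * cu      # upper threshold
    if ci <= cu:
        return mean                 # homogeneous area: plain averaging
    if ci >= cmax:
        return center               # point target or strong edge: keep pixel
    # Heterogeneous area: root of the MAP equation under Gamma assumptions.
    alpha = (1.0 + cu ** 2) / (ci ** 2 - cu ** 2)
    b = alpha - looks - 1.0
    d = (mean * b) ** 2 + 4.0 * alpha * looks * mean * center
    return (b * mean + math.sqrt(d)) / (2.0 * alpha)


def gamma_map_filter(image: List[List[float]], size: int = 3,
                     looks: float = 4.0) -> List[List[float]]:
    """Apply the per-pixel estimate with a size x size window clipped at edges."""
    half = size // 2
    n_rows, n_cols = len(image), len(image[0])
    out = []
    for r in range(n_rows):
        out_row = []
        for c in range(n_cols):
            window = [image[i][j]
                      for i in range(max(0, r - half), min(n_rows, r + half + 1))
                      for j in range(max(0, c - half), min(n_cols, c + half + 1))]
            out_row.append(gamma_map_pixel(window, image[r][c], looks))
        out.append(out_row)
    return out


flat = [[2.0] * 4 for _ in range(4)]
print(gamma_map_filter(flat))  # a perfectly homogeneous image is unchanged
```

The three branches reflect the behaviour described above: averaging in homogeneous areas, preservation of strong targets, and a statistical compromise in between.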

B. Training Sample
1) Indices calculation and time series composites:
Optical data provide information in multiple bands that can reveal valuable information about the state of vegetation. To capture spatiotemporal variation in photosynthetically active vegetation, multiple optical indices have been developed in the literature to characterize and monitor the development of crops [19], [25]. The author in [25] calculated the EVI and NDVI indices from the time series to extract metrics for crop discrimination. In this study, different indices were calculated for each image in the image collection (Table I). The main bands used are Red (R), Green (G), Near InfraRed (NIR), and Short-Wavelength InfraRed (SWIR) from the optical images, and both VV and VH polarizations from the radar images.
2) Harmonic modelization of optical time series:
Time series of the optical indices depend on the phenological cycle of crops throughout the year. Their variations are analyzed by applying harmonic modeling, also called Fourier analysis. The analysis consists of decomposing the time-dependent periodic signal into a series of sinusoidal functions, each with a phase and an amplitude value. The general equation of a time series is presented by [26] in eq. (6).
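A minimal sketch of harmonic fitting is given below, assuming a model with a constant, a linear trend, and one annual cosine/sine pair fitted by ordinary least squares. This particular form, the time unit (fractions of a year), and all function names are assumptions for illustration and may differ from the formulation in [26].

```python
import math
from typing import List, Sequence, Tuple


def design_row(t: float, n_harmonics: int = 1) -> List[float]:
    """Regressors for time t (in years): constant, trend, cos/sin pairs."""
    row = [1.0, t]
    for k in range(1, n_harmonics + 1):
        row += [math.cos(2 * math.pi * k * t), math.sin(2 * math.pi * k * t)]
    return row


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_harmonic(times: Sequence[float], values: Sequence[float],
                 n_harmonics: int = 1) -> List[float]:
    """Least-squares coefficients via the normal equations."""
    rows = [design_row(t, n_harmonics) for t in times]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(p)]
    return solve(xtx, xty)


def fitted_values(times: Sequence[float], coeffs: Sequence[float],
                  n_harmonics: int = 1) -> List[float]:
    """Evaluate the fitted harmonic model at the given times."""
    return [sum(c * x for c, x in zip(coeffs, design_row(t, n_harmonics)))
            for t in times]


def amplitude_phase(cos_coef: float, sin_coef: float) -> Tuple[float, float]:
    """a*cos(x) + b*sin(x) = A*cos(x - phi), A = hypot(a, b), phi = atan2(b, a)."""
    return math.hypot(cos_coef, sin_coef), math.atan2(sin_coef, cos_coef)


# Synthetic NDVI-like series with known coefficients (hypothetical values).
times = [i / 24 for i in range(24)]
truth = [0.4, 0.05, 0.2, -0.1]
values = fitted_values(times, truth)
coeffs = fit_harmonic(times, values)
amplitude, phase = amplitude_phase(coeffs[2], coeffs[3])
print([round(c, 4) for c in coeffs], round(amplitude, 4))
```

The fitted values give a smoothed version of the index curve, and the amplitude and phase summarize the strength and timing of the seasonal cycle.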

3) Training set selection:
To guarantee that each class is well represented, the training samples should cover every class while taking its spatial distribution into consideration. 20% of the samples were set aside for validation and accuracy calculations, and 80% were used for training and for extracting features from the formulated time series.
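One way to keep every class represented in both subsets is a per-class (stratified) random split, sketched below. Stratifying by class is an assumption of this illustration rather than a documented step of the study, and the sample structure and names are hypothetical.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Sample = Tuple[int, str]  # (sample id, crop label); hypothetical structure


def stratified_split(samples: List[Sample], train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle each class separately and cut it at train_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[1]].append(sample)
    train, validation = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train += group[:cut]
        validation += group[cut:]
    return train, validation


# Hypothetical reference points: 50 corn, 30 soybeans, 20 alfalfa.
labels = ["corn"] * 50 + ["soybeans"] * 30 + ["alfalfa"] * 20
samples = list(enumerate(labels))
train, validation = stratified_split(samples)
print(len(train), len(validation))
```

Each class then contributes 80% of its points to training and 20% to validation, so rare crops are not lost from either subset.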

C. Classification and Validation
In the literature, different classification methods are employed for land cover and land use mapping. This study employed the two classifiers reported as the best performing [8], [20]- [22].

1) Machine learning classification:
The Random Forest (RF) classifier is based on building multiple trees from samples of the training data. Each tree is built using a different subset of the original training variables. Its advantage is that the algorithm can handle a huge amount of input data. The class assigned to a sample is determined by the majority vote of the trees.
The Support Vector Machine (SVM) is a supervised nonparametric statistical technique. Classes are separated by calculating the hyperplane that maximizes the margin between them. The separation between data points depends on the applied kernel function (linear, polynomial, Gaussian radial basis function (RBF), or sigmoid), which determines the efficiency of the classification.
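Two building blocks mentioned above, the majority vote of a forest and the RBF kernel of an SVM, can be sketched in a few lines. This is only an illustration of the underlying arithmetic; the function names and example inputs are assumptions, and ties in the vote are resolved in favour of the label seen first.

```python
import math
from collections import Counter
from typing import List, Sequence


def majority_vote(tree_predictions: List[str]) -> str:
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical votes from three trees and two 2-D feature vectors.
print(majority_vote(["corn", "soybeans", "corn"]))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

A larger gamma makes the kernel decay faster with distance, so each training sample influences a smaller neighbourhood of the feature space.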
2) Accuracy assessment: Two performance criteria were used to assess the accuracy of the results. The main one, Cohen's kappa index, is a statistical measure of inter-rater reliability for categorical variables. It takes into account the possibility of agreement occurring by chance (eq. 7):
$$\kappa = \frac{p_0 - p_c}{1 - p_c} \qquad (7)$$
where $p_0$ is the observed accuracy, i.e., the sum of the relative frequencies on the diagonal of the error matrix, and $p_c$ is the chance agreement.
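The F1 score, also a measure of a model's accuracy, can be interpreted as the harmonic mean of the precision and recall derived from the confusion matrix. It is calculated per class for a multiclass classification problem (eq. 8):
$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (8)$$
where $\mathrm{recall} = \frac{TP}{TP + FN}$ and $\mathrm{precision} = \frac{TP}{TP + FP}$, with $TP$, $FP$, and $FN$ the numbers of true positives, false positives, and false negatives for the class.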
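Both criteria can be computed directly from a confusion matrix, as in the sketch below. It assumes that rows hold reference classes and columns hold predicted classes; that orientation, the function names, and the example matrix are assumptions made for illustration.

```python
from typing import List


def cohen_kappa(matrix: List[List[int]]) -> float:
    """Kappa = (p0 - pc) / (1 - pc) from a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    p0 = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    pc = sum(sum(matrix[i]) * sum(row[i] for row in matrix)
             for i in range(n)) / total ** 2
    return (p0 - pc) / (1 - pc)


def f1_per_class(matrix: List[List[int]]) -> List[float]:
    """F1 score of each class (rows = reference, columns = predicted)."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


# Hypothetical two-class confusion matrix.
cm = [[45, 5],
      [10, 40]]
print(round(cohen_kappa(cm), 3))
print([round(s, 3) for s in f1_per_class(cm)])
```

For this example the observed accuracy is 0.85 and the chance agreement is 0.5, so kappa is 0.7.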

IV. CASE STUDY

A. Study Area
To evaluate the radar and optical indices using a supervised classification method, the proposed model was tested in an agricultural zone in the state of Minnesota in the United States (Fig. 2). Minnesota is located in the western part of the Great Lakes region and ranks fifth in the United States for total crop sales. Its major crops are corn, soybeans, sugar beets, and dry beans.

B. Crop Inventory Data
The reference data were collected from the Cropland Data Layer (CDL) produced by the United States Department of Agriculture. The layer contains annual crop information derived from extensive agricultural ground truth at 30 m spatial resolution. Training started with a random selection of points chosen to cover the whole study area and all the agricultural types. 80% of the data, representing 5989 points, were selected for training the model, and 20% for validation. The selected zone contains 14 types of crops, 4 of which are major types.

C. Optical and Radar Data
Both radar (Sentinel-1) and optical (Sentinel-2) images were used in this study. The Sentinel-1 mission provides C-band Synthetic Aperture Radar (SAR) data. The Sentinel-1 image catalog provides preprocessed images that are terrain corrected and radiometrically calibrated. A total of 29 SAR scenes acquired between 01-01-2019 and 30-12-2019 were used. The images were restricted to the single-polarization VV and VH bands. The active sensor makes it possible to acquire data in cloudy weather, which allows better monitoring of vegetation evolution.
The Sentinel-2 mission provides multispectral high-resolution imagery with 12 spectral bands. The image collection contains 145 optical images covering the entire study period.

D. Analysis Platform
The development of the remote sensing field has brought a large amount of data from different sensors, and several platforms have been developed to handle geospatial analysis and processing. Google Earth Engine was introduced as a multi-petabyte catalog and cloud computing platform with high-performance computation capabilities, and has been used in land cover studies [27] and agricultural studies [28], [29].
The entire proposed process was performed on the Google Earth Engine (GEE) platform, from the Sentinel image selection to the validation. Thanks to its high-performance cloud computation, the GEE platform made it possible to process dense image collections for pixel-based image analysis and to run the classification algorithms.

E. Time Series Formulation, Training, and Machine Learning Algorithms
The first steps of processing the time series were conducted as detailed in the previous section. After preprocessing, the optical and radar indices were calculated for each image.
SAR indices were extracted from the single-polarization bands. The Normalized Ratio Procedure between Bands (NRPB) and the ratio index were estimated using the equations in Table I, where σVH and σVV are the backscatter in VH and VV polarization. In the same way, the NDVI, NDMI, and NDWI optical indices were calculated, and harmonic modeling was then applied to their time series.
Training was then applied to the formulated input: the training set was selected randomly from the time series stack generated from all the calculated SAR and optical indices. The input features were then fed to the machine learning-based classifiers.
Multiple parameters were tested to obtain the best results. The final SVM classifier used the Radial Basis Function (RBF) kernel, with gamma set to 0.5 and cost set to 10. Random Forest was applied using 800 trees and 20 variables per split.
Among the SAR indices, the NRPB and ratio index time series are presented in Fig. 3 and Fig. 4 for a selection of 4 major crops. Both time series can monitor vegetation changes, and the temporal signatures of the considered crop types behave differently. The NRPB time series of corn, alfalfa, and soybean show significant variations, whereas fallow parcels respond smoothly and monotonously to changes over the year. The variations are a function of soil surface conditions, moisture, roughness, and the biomass development of crops. The NRPB index was also used by [30] to generate metrics for the input set of a model predicting NDVI. That study highlighted the similarity between the NDVI and the σVH/σVV ratio over crops and found that including the NRPB variable in machine learning models such as RF gives better results.
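The index calculations can be sketched per pixel as follows. The formulas are the commonly used definitions and are assumptions here; the study's exact expressions are those of its Table I. Backscatter is assumed to be in linear units rather than dB, and all function names and sample values are hypothetical.

```python
from typing import Optional


def normalized_difference(a: float, b: float) -> Optional[float]:
    """(a - b) / (a + b), or None when the denominator is zero (no data)."""
    return (a - b) / (a + b) if (a + b) != 0 else None


def ndvi(nir: float, red: float) -> Optional[float]:
    """Normalized Difference Vegetation Index."""
    return normalized_difference(nir, red)


def ndmi(nir: float, swir: float) -> Optional[float]:
    """Normalized Difference Moisture Index."""
    return normalized_difference(nir, swir)


def ndwi(green: float, nir: float) -> Optional[float]:
    """Normalized Difference Water Index (green/NIR form, an assumption)."""
    return normalized_difference(green, nir)


def nrpb(sigma_vh: float, sigma_vv: float) -> Optional[float]:
    """Normalized Ratio Procedure between Bands from VH and VV backscatter."""
    return normalized_difference(sigma_vh, sigma_vv)


def vh_vv_ratio(sigma_vh: float, sigma_vv: float) -> Optional[float]:
    """Ratio index sigma_VH / sigma_VV, or None when sigma_VV is zero."""
    return sigma_vh / sigma_vv if sigma_vv != 0 else None


# Hypothetical reflectance and linear backscatter values for one pixel.
print(ndvi(nir=0.45, red=0.05))
print(nrpb(sigma_vh=0.02, sigma_vv=0.08))
print(vh_vv_ratio(sigma_vh=0.02, sigma_vv=0.08))
```

Applying these functions to every date of the image collection yields one time series per index and per pixel, which together form the classification input.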

A. Time Series of Vegetation Indices
Time series of the NDVI allow the characterization of each crop. Since the NDVI is well suited to describing the chlorophyll activity of crops, dense and healthy vegetation is indicated by a high NDVI value approaching 1, while in the opposite case the value approaches 0. The time series thus presents the phenological cycle of each crop. The harmonic model is suitable for smoothing the spectral curves and allows crops to be distinguished. Corn, dry beans, and soybeans present a unimodal periodic pattern, with the highest values for corn. The resulting phenological cycles match the phenological calendar provided by the USDA National Agricultural Statistics Service. Corn is planted in late April and harvested in early November, while the phenological cycle of soybean starts with planting in early May and ends with harvest in late October. Alfalfa presents the highest amplitude values and a curve different from those of the other crops because of agricultural practices: it is harvested repeatedly during the growing season, from early April to late October. The obtained phenological cycles are compatible with the crop calendar given by the USDA for the region of Minnesota.

B. Classification Results
SVM and Random Forest have demonstrated their value for classifying agricultural cover maps. The classification results were validated by calculating the confusion matrix, using the 20% of random samples set aside for validation. Both classifiers gave good results, with an advantage for the Random Forest classifier, which reached a kappa index of 0.95 against 0.85 for SVM. Table II presents the accuracy metrics, which confirm the advantage of the RF classifier. Other studies have demonstrated the complementarity of optical and radar data [20], [31]- [34]. The authors in [35] found an overall accuracy of 93.83% from combined inputs. The authors in [8] found an overall accuracy between 73% and 95%, depending on the input dataset, using the SVM classifier. In addition to the kappa index, the F1-score and the producer's accuracy were calculated and are presented in Table II. The producer's accuracy represents the probability that a sample of a given class is classified correctly. The most correctly attributed classes are barley, corn, soybeans, and winter wheat.
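The overall and producer's accuracies can be derived from the confusion matrix as sketched below, assuming rows hold reference classes and columns hold predicted classes. The matrix values and function names are hypothetical and are not the results reported in Table II.

```python
from typing import List


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Share of all validation samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def producers_accuracy(matrix: List[List[int]]) -> List[float]:
    """Per class: correctly classified samples / reference samples of the class."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Hypothetical two-class confusion matrix (rows = reference classes).
example = [[45, 5],
           [10, 40]]
print(overall_accuracy(example))
print(producers_accuracy(example))
```

A class with a high producer's accuracy is rarely omitted from the map, which is the sense in which some crops are described as the most correctly attributed.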

VI. CONCLUSION
This study deals with the problem of crop type identification and proposes a machine learning-based model for automated crop type mapping. The novelty of the model is that it improves crop type mapping accuracy by using time series of vegetation indices extracted from both optical and radar images. The model presents several advantages. First, it demonstrates the complementarity between optical and radar satellite images for crop type mapping studies. Second, the results point to the advantage of the Random Forest classifier over SVM. The resulting accuracy, with a kappa index of 0.95, outperforms existing models in the state of the art.
The proposed model was implemented using Google Earth Engine and tested on a specific case study in an agricultural zone in the state of Minnesota in the United States. Future work will assess and compare the performance of deep learning and machine learning algorithms for crop-type mapping.

ACKNOWLEDGMENT
We are very thankful to the United States Department of Agriculture (USDA) and the National Agricultural Statistics Service (NASS) for providing the Cropland Data Layer from which crop types were identified.