Arabic Semantic Similarity Approach for Farmers' Complaints

Semantic similarity is applied in many areas of Natural Language Processing, such as information retrieval, text classification, and plagiarism detection. Many researchers have used semantic similarity for English texts, but few have used it for Arabic because of the ambiguity of Arabic concepts in both sense and morphology. Therefore, the first contribution of this paper is a semantic similarity approach for Arabic sentences. The world currently faces the global problem of coronavirus disease. Under these circumstances and the distancing they impose, it is difficult for farmers to meet agricultural experts in person to obtain advice and find suitable solutions for their agricultural complaints. In addition, most farmers still use traditional practices. Thus, our second contribution is helping farmers solve their Arabic agricultural complaints using the proposed approach. Latent Semantic Analysis is applied to retrieve the problem most semantically related to a farmer's complaint and to find the corresponding solution for the farmer. Two weighting schemes are used for data representation: Term Frequency and Term Frequency-Inverse Document Frequency. The proposed model also classifies the big agricultural dataset and the submitted farmer complaint by crop type using MapReduce Support Vector Machine to improve the semantic similarity results. The proposed approach with Term Frequency-Inverse Document Frequency-based Latent Semantic Analysis outperformed its counterparts, with an F-measure of 86.7%.
Keywords: Semantic similarity; latent semantic analysis; big data; MapReduce SVM; COVID-19; agriculture farmer's complaint


I. INTRODUCTION
Semantic analysis plays an essential role in text analytics research. Measuring the semantic similarity between sentences is a long-standing problem in Natural Language Processing (NLP) [1], [2]. With the growth of text data over time, NLP has attracted increasing attention from Artificial Intelligence (AI) experts [3], [4]. Semantic similarity is used in several NLP fields, such as information retrieval, text summarization, plagiarism detection, question answering, document clustering, text classification, and machine translation [5], [6]. It is defined as determining whether two concepts are similar in meaning [7]. The concepts may be words, sentences, or paragraphs. Each concept is assigned a score, and a high score indicates high similarity or semantic equivalence to another concept [8]. Concepts can be similar in two ways: lexically or semantically. Concepts are lexically similar if their words have similar character sequences; this is measured with string-based algorithms. Concepts are semantically similar if their words are related according to information acquired from massive corpora, even if they have different lexical structures. Semantic similarity can be computed with corpus-based or knowledge-based algorithms [9], [10]. Several semantic similarity studies have been developed for English sentences. In contrast, few have addressed Arabic, because Arabic is a morphologically complex language [11]. However, Arabic is considered the fifth most spoken language in the world. It is also among the most important foreign languages, with over 300 million speakers and, according to [12], a wide range of functionalities not found in other languages. Therefore, this paper applies a semantic similarity approach to an Arabic dataset.
Currently, the world faces a huge disaster: the global coronavirus disease (COVID-19) pandemic. COVID-19 has caused destructive economic, political, and social crises in every country. All fields have been affected, especially agriculture. Agriculture plays a critical role in the economy. It is a source of livelihood and contributes to national revenue, the food supply, and marketable surplus. Moreover, it provides job opportunities for a large percentage of the population and supplies the country with an important portion of its foreign exchange through agricultural exports. COVID-19 therefore compounds pre-existing vulnerabilities in the agriculture sector in Egypt. Initial analyses of the epidemic have shown disrupted access to agricultural inputs, including employment, extension and advisory services, and produce markets for farmers. Most major countries became deserted as people stayed indoors, either by choice or by government order, to reduce the spread of the pandemic. Consequently, the curfew and distancing imposed because of COVID-19 cause many problems for farmers: it is difficult for them to communicate and interact with agricultural experts to present their complaints and find suitable solutions. It is therefore essential to find an appropriate way to help solve farmers' complaints. The Agriculture Research Center (ARC) and the Virtual Extension and Research Communication Network (VERCON) [13] in Egypt provide a large collection of farmers' complaints and their solutions in Arabic, deployed on a public cloud. Agricultural experts have resolved these complaints. Thus, under the difficult circumstances of the spread of COVID-19, this paper aims to develop an approach that helps and supports farmers in finding the most suitable solution for their agricultural complaints.
The proposed approach is based on latent semantic analysis (LSA), which measures the semantic similarity score between an Arabic farmer's complaint and the complaints in the Arabic agricultural dataset and then retrieves the related solution for the farmer. As an example, the farmer's complaint is " ‫ف‬ ‫انثغؿٛى‬ ‫نذقٕل‬ ‫انقطٍ‬ ‫ٔعق‬ ‫صٔصِ‬ ‫يٓاجًح‬ ْٙ ‫ًا‬ ‫"انًقأيّ؟‬ and its English equivalent is "Cotton leafworm attacks alfalfa fields; how can it be resisted?". After applying the proposed model, the recommended solution is " ‫ٔعق‬ ‫صٔصج‬ ‫ذقأو‬ ‫الَٛد‬ ‫يثم‬ ‫تٓا‬ ‫انًٕصٗ‬ ‫انًثٛضاخ‬ ‫ادض‬ ‫تاؿرشضاو‬ ‫انثغؿٛى‬ ‫فٗ‬ ‫انقطٍ‬ 00 % ‫تًؼضل‬ 300 ‫تًؼضل‬ ‫تانـٕالع‬ ‫انغٖ‬ ‫ثى‬ ‫جى/ف‬ 200 ‫نرغ‬ ‫نهفضاٌ‬ " and its English equivalent is "The cotton leafworm is resisted in alfalfa by using one of the recommended pesticides such as Lannet 90% at a rate of 300 g/f, then irrigation with diesel at a rate of 200 liters per feddan." To improve the performance of the semantic similarity approach, we use classification. Text Classification (TC) is an active research field and an essential part of information retrieval technology [14]. It aims to classify text documents into one or more predefined categories. TC is applied in many applications, such as sentiment analysis, sentence classification, and document classification [15], [16]. TC can use many methods, such as Decision Trees, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes (NB), and K-Nearest Neighbor (KNN) [17], [18].
SVM is a powerful classifier in machine learning and is widely used in text classification [19]. Moreover, it is fast and easy to implement. Therefore, we applied SVM to the agricultural dataset to classify Arabic complaints by crop. However, SVM alone did not achieve sufficiently accurate results. Thus, to improve performance in classifying our Arabic agricultural dataset, we resorted to a parallel programming model, MapReduce. This paper therefore performs classification with MapReduce SVM on Hadoop to classify the Arabic agricultural dataset according to crop type.
Most previous works applied Arabic semantic similarity to small datasets and achieved low accuracy. Moreover, few of them were tested on agricultural datasets, and they did not use classification.
Thus, the main objectives of this paper are as follows:
1) Applying the proposed model to a big agricultural dataset of real complaints facing Egypt's farmers.
2) Helping farmers, especially under the circumstances of COVID-19, by providing advice and finding appropriate solutions for their complaints to enhance agricultural productivity.
3) Developing a semantic similarity model between Arabic complaints and obtaining better results.
4) Using a parallel programming model, MapReduce based on SVM, to classify the agricultural dataset and improve the performance of the semantic similarity model.
5) Testing and validating the performance of the proposed model by running multiple experiments and applying previous models to our Arabic agricultural dataset.
The remainder of the paper is structured as follows: Section II presents related work; Section III describes the materials and methods; Section IV covers the discussion; Section V shows the experimental results; and Section VI concludes the paper.

II. RELATED WORK
This section introduces related work on Arabic text classification and Arabic semantic similarity. Mostafa et al. [20] proposed two models to classify Arabic farmers' complaints according to the diseases that may affect crops. The first model classifies an Arabic complaint into its respective crop and then into a specific disease. The second model classifies the complaint directly into diseases. Each preprocessed complaint is represented as a binary vector using the vector space model with the help of a crop lexicon. Experiments were conducted on the dataset with SVM and KNN classifiers, varying the training percentage over many trials. The results showed that the proposed model performed on par with a human expert and is applicable to real-time operation. Al-khurayji and Sameh [21] presented an approach based on a Kernel Naive Bayes classifier to solve the non-linearity problem of Arabic text classification. First, they applied preprocessing techniques to Arabic datasets, namely tokenization, stop word removal, and light stemming. Then, they used TF-IDF for feature extraction to convert Arabic words into the vector space. Experimental results showed that the proposed approach achieved good accuracy and running time compared with other classifiers. Abutiheen et al. [22] proposed the Master-Slaves (MST) technique to classify Arabic texts. The approach consists of two phases. In the first phase, Arabic corpus text files are collected and manually classified into five categories. In the second phase, four classifiers are applied to the collected Arabic corpus: NB, KNN, multinomial logistic regression, and maximum weight. The NB classifier acts as the Master and the others as Slaves; the results of the slave classifiers are used to adjust the probabilities of the NB (Master) classifier. Each document in the corpus is represented as a vector of weights. MST achieved a good improvement in accuracy compared with the other techniques.
Schwab et al. [23] presented a technique based on word embedding for measuring semantic relations among Arabic sentences. The technique relies on the semantic properties of words in the word embedding model and applies three methods: no weighting, Inverse Document Frequency (IDF) weighting, and part-of-speech (POS) weighting. The no-weighting method sums the word vectors of each sentence. To improve the results, the IDF weighting method calculates an IDF weight for each word and sums the IDF-weighted word vectors of each sentence. Similarly, the POS weighting method assigns a weight to each POS tag, determines the POS of each word, and sums the POS-weighted word vectors of each sentence. The technique was evaluated on a small dataset. The paper demonstrated how IDF weighting and POS tagging help to determine the most descriptive words in a sentence, and both weighting techniques achieved better results. Amine et al. [24] proposed an Arabic search engine method based on MapReduce. The method finds the semantic similarity between an Arabic query and a large corpus of existing documents in the Hadoop Distributed File System (HDFS) and retrieves the most relevant documents. It uses two measures in MapReduce: the Wu and Palmer (WP) measure and the Leacock and Chodorow (LC) measure. The results showed that WP and LC obtained better results than existing semantic similarity approaches. Mahmoud et al. [25] suggested a semantic similarity technique for paraphrase identification in Arabic. The technique combines several NLP methods, such as TF-IDF and the word2vec algorithm. TF-IDF eases the identification of highly descriptive terms in each sentence, and word2vec provides distributed word vector representations. Word2vec can also reduce computational complexity and optimize the likelihood of word prediction when producing a sentence vector model. The paper measured similarity using various comparison metrics, such as cosine similarity and Euclidean distance. Finally, the proposed technique was tested on the Open-Source Arabic Corpus (OSAC) and achieved a reasonable rate. The authors of [26] used a semantically reduced-dimension vector to represent high-dimensional Arabic text. This was accomplished by extending the standard vector space model (VSM) to enhance the text representation with linguistic and semantic properties from Arabic WordNet and named-entity (NE) gazetteers. When synonyms and similar terms derived from the same root are grouped into clusters, the vector size is reduced, and the shorter NE represents the chosen cluster members. Word similarity is also determined using distributional similarity to collect similar terms into clusters. The results demonstrated that the amount of reduction depends on the size and form of the analysis window and on the nature and category of the text. The authors of [27] suggested a method for finding the semantic similarity between two Arabic texts. The approach uses a hybrid of three similarity measures: an edge-counting semantic approach, cosine similarity, and N-gram similarity. The edge-counting semantic approach is applied first with a threshold value. If the similarity result of this first approach is lower than the threshold, cosine similarity is applied. If the cosine similarity value is also unsatisfactory compared with the predefined threshold, N-gram similarity is used. This hybrid approach addresses writing mistakes such as repeated, missing, and substituted characters. The hybrid similarity results outperform those of any of the three measures used individually.

III. MATERIAL AND METHODS
The proposed model aims to measure the semantic similarity score between the current farmer's complaint and the available historical agricultural problems in order to provide an adequate solution to the farmer's complaint. The model was applied to and tested on the dataset of agricultural problems and their solutions provided by ARC and VERCON. The dataset contains complaints with various causes, such as harmful weeds, fungal diseases, and other diseases that affect plants, together with their solutions. It includes complaints from 31 governorates and their directorates and contains more than 10,000 complaints. The complaints are written in Arabic in an unstructured, poorly formatted form. They relate to different crops, such as rice, okra, wheat, corn, cotton, and beans, and are associated with four categories: Administrative, Productivity, Marketing, and Environmental.
Due to the variety of crops in the agricultural dataset, the proposed model applies a classification method to classify the farmer complaints dataset according to crop type. Table I shows some examples of the dataset's complaints related to different crops and their English translations.
TABLE I. EXAMPLES OF ARABIC FARMERS' COMPLAINTS AND THEIR ENGLISH EQUIVALENTS

Complaints in the Arabic Language
Yellow spots on the leaves of onion plants.

‫انًٍ.‬ ‫تذلغج‬ ‫انقًخ‬ ‫اصاتح‬
The presence of spots on the upper surface of okra leaves, with the appearance of a spider thread on the lower surface.
The lack of water in the Arimon canal for more than a month exposes the existing winter crops, such as clover, to fallow.

‫اكثغ‬ ‫يُظ‬ ‫اعًٌٕٚ‬ ‫ترغػح‬ ‫يٛاج‬ ‫ٔجٕص‬ ‫ػضو‬ ‫انلرٕٚح‬ ‫انًذاصٛم‬ ‫ٚؼغض‬ ‫يًا‬ ‫كٓغ‬ ‫يٍ‬ ‫انثغؿٛى.‬ ‫يثم‬ ‫نهثٕاع‬ ‫انقائًح‬
The proposed model consists of four phases, as shown in Fig. 1: preprocessing, MapReduce SVM classification, Latent Semantic Analysis, and ranking and selection. The last phase chooses the solution most semantically relevant to the farmer's complaint. The following sections explain the phases of the proposed model in detail.

A. Preprocessing Phase
The preprocessing phase is an essential step in Natural Language Processing (NLP) tasks. It transforms the input text into a form better suited to the subsequent steps [28]. The meaning of the complaints is often difficult to understand and interpret, since farmers typically write complaints without following Arabic grammar rules.
Data preprocessing includes five operations: tokenization, stop word removal, complaint auto-correction, normalization, and lemmatization, as shown in Fig. 1.
• Tokenization: A method for breaking text into tokens. Words are separated from their neighboring words by delimiters such as white space, periods, commas, semicolons, and quotation marks [29]. For example, the Arabic complaint is ‫فٗ‬ ‫يـذٕقٗ‬ ‫يظٓغ‬ ‫نٓا‬ ‫صفغاء‬ ‫تقغ‬ ‫"ٔجٕص‬ ‫انقًخ"‬ ‫ٔعقّ‬ ‫ػهٗ‬ ‫طٕنّٛ‬ ‫صفٕف‬ .
• Stop Word Removal: The most common undesirable terms are punctuation marks and stop words. They are eliminated from complaints since they carry no meaning or indication of the content. We used an online Arabic stop word list for elimination [30]. English equivalents of such unimportant Arabic words include (whenever, first, about, where, as long as, which, from, to, on, above, below, etc.). In addition, all symbols such as (@, #, &, %, and *) are eliminated.
• Auto-correction: Farmers may write their complaints with spelling errors. For example, the crop name for tomatoes ‫"طًاطى"‬ may be written in a slang form as ‫"أطّ"‬ , the crop name for corn ‫"طعِ"‬ may be incorrectly written as ‫"ػعِ"‬ , and the crop name for rice ‫"أعػ"‬ may be incorrectly written as ‫"عػ"‬ [20]. Therefore, we use auto-correction to solve these problems by substituting the correct word for the incorrect one.
• Normalization: Text normalization transforms the input text into a canonical (standard) form. It is critical for noisy texts, such as social media comments and text messages, in which abbreviations and misspellings are common. Moreover, it concentrates on removing inconsistent language variations. For example, in English the word "croooop" can be transformed into its canonical form "crop", and in Arabic ‫"يذصٕٔٔٔل"‬ is normalized into ‫"يذصٕل"‬ [31], [32]. We applied several normalization methods: the letters ‫إ"‬ ، ‫أ‬ ، ‫"آ‬ are converted into the single form ‫"ا"‬ ; the letter ‫"ج"‬ is replaced by "ِ"; the letter "٘" is converted to "ٖ"; and diacritics are removed from words, so that " ‫اع‬ َ ‫ْج‬ ‫ك‬ َ ‫"أ‬ becomes ‫"أكجاع"‬ . Furthermore, extra spaces between words are removed. We also convert numbers into words, so that "15" becomes ‫ػلغ"‬ ‫"سًـّ‬ .
• Lemmatization: An essential preprocessing step and a significant component of many natural language processing applications. It finds the base form of a word. For example, the Arabic word ‫)ثًاع(‬ has the root ‫"ثًغ"‬ , and the English word "fruits" has the root "fruit". We used the online Farasa lemmatizer [33].
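As a rough illustration of how the normalization, tokenization, and stop word removal steps could fit together, the following Python sketch uses only the standard library. It is not the authors' implementation: the stop word set is a hypothetical four-word sample, the ta marbuta to ha mapping and the tatweel removal are assumptions based on common Arabic normalization practice, and auto-correction, number-to-word conversion, and Farasa lemmatization are left out.

```python
import re

# Hypothetical mini stop-word list (fi, min, ala, an); the paper takes its
# list from an online source [30].
STOP_WORDS = {"\u0641\u064A", "\u0645\u0646", "\u0639\u0644\u0649", "\u0639\u0646"}

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # tashkeel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
ELONGATION = re.compile(r"([\u0621-\u064A])\1{2,}")  # an Arabic letter repeated 3+ times


def normalize(text):
    """Map a complaint to a canonical form (assumed rules, see lead-in)."""
    text = DIACRITICS.sub("", text)            # drop diacritics and tatweel
    text = ALEF_VARIANTS.sub("\u0627", text)   # unify alef variants to bare alef
    text = text.replace("\u0629", "\u0647")    # ta marbuta -> ha (assumption)
    text = ELONGATION.sub(r"\1", text)         # collapse elongated letters
    return re.sub(r"\s+", " ", text).strip()   # remove extra spaces


def tokenize(text):
    """Split on anything that is not a word character (spaces, punctuation, symbols)."""
    return re.findall(r"\w+", text)


def preprocess(text):
    """Normalize, tokenize, then drop stop words."""
    return [tok for tok in tokenize(normalize(text)) if tok not in STOP_WORDS]
```

Calling `preprocess` on a raw complaint returns the list of cleaned tokens that later phases would consume; a real system would plug a full stop word list and a lemmatizer into the same pipeline.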

B. MapReduce SVM Classification Phase
Text classification is the process of assigning each document to its labeled class [34]. MapReduce is a popular programming model developed by Google. It can process massive datasets in parallel and achieves high performance [35], [36]. The main idea of MapReduce comes from divide-and-conquer algorithms, which divide a large problem into smaller subproblems. We therefore apply MapReduce SVM to classify the big dataset of preprocessed agricultural complaints according to crop type, such as rice, wheat, and okra. MapReduce SVM uses the Hadoop framework to distribute the classification across many machines, with HDFS storing both the preprocessed agricultural complaints to be classified and the classification results. The MapReduce model is divided into two tasks, Map and Reduce [37]. It divides the dataset into smaller chunks and assigns each chunk to a single map task, so the number of map tasks equals the number of data chunks and the chunks are processed in parallel. The model then shuffles and sorts the Map outputs and transfers them to the Reduce tasks. The Reduce task is a summarization step in which all associated records are processed together by a single entity. The Map and Reduce tasks are mathematically represented in (1) and (2), respectively [38].
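The Map, shuffle, and Reduce flow described above can be sketched in plain Python. This is only an in-process illustration of the programming model, not the Hadoop job or the SVM used in the paper: `keyword_classifier` is a hypothetical stand-in for a trained classifier, the sample complaints are invented English placeholders, and the job simply counts complaints per predicted crop.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, n_chunks):
    """Divide the dataset into at most n_chunks pieces, one per map task."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_task(chunk, classify):
    """Map: emit an intermediate (crop, 1) pair for every complaint in the chunk."""
    return [(classify(complaint), 1) for complaint in chunk]


def shuffle(pairs):
    """Shuffle and sort: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce: summarize all values that share a key."""
    return key, sum(values)


def keyword_classifier(complaint):
    """Hypothetical stand-in for the trained SVM: match a crop name in the text."""
    for crop in ("wheat", "rice", "okra"):
        if crop in complaint:
            return crop
    return "other"


complaints = [
    "yellow spots on wheat leaves",
    "worms in okra",
    "wheat rust",
    "rice blast",
    "canal has no water",
]
chunks = split_into_chunks(complaints, 2)
mapped = chain.from_iterable(map_task(chunk, keyword_classifier) for chunk in chunks)
counts = dict(reduce_task(key, values) for key, values in shuffle(mapped).items())
```

On a real cluster each `map_task` call would run on a different node against a chunk stored in HDFS; here the calls run sequentially so the data flow is easy to follow.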
After this phase, each complaint is classified according to its crop type. Table II shows the number of Arabic agricultural complaints in each crop after applying the MapReduce SVM model.

C. Latent Semantic Analysis Phase
In this phase, LSA is applied to measure the semantic similarity between the agricultural dataset and the farmer's complaint. LSA is a technique that represents documents as vectors. It helps to find the similarity between agricultural complaints by calculating the distance between their vectors.
There are three main steps in the LSA-based algorithm:
• Creating the input matrix (term-sentence matrix).
• Applying reduced singular value decomposition (RSVD) to the created matrix.
• Calculating the semantic similarity score between the farmer's complaint and the complaints document.
These steps are explained in detail in the following sections.

1) Term-sentence matrix:
In this step, an input matrix is created for the farmer's complaint query and the classified complaints document. Each row of the matrix represents a word or term in the farmer's complaint or the classified complaints document [39], and each column represents a complaint. The cell value is the weight of the term in the complaint at their intersection. Two weighting schemes are used to fill the cell values: Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF).
In TF-based LSA, the cells are filled with the term frequency ($TF_i$) of the terms in the complaint statement ($C_j$), as in (3).

$$W(t_{ij}) = tf_{ij} \quad (3)$$

Where $W(t_{ij})$ is the weight of term (i) in complaint statement (j), and $tf_{ij}$ is the frequency of term (i) in complaint statement (j).
In TF_IDF-based LSA, the cells are filled with the TF_IDF weight of term (i) in complaint statement ($C_j$), as shown in (4) and (5).

$$W(t_{ij}) = TF\_IDF_{ij} = TF_{ij} \times IDF_{i} \quad (4)$$

Where $TF_{ij}$ is the frequency of term (i) in complaint statement (j), and $IDF_{i}$ reflects the importance of the term among all sentences.

$$IDF_{i} = \log\left(\frac{N}{ComplaintFreq_{i}}\right) \quad (5)$$

Where N represents the number of complaints in the collection, and $ComplaintFreq_{i}$ is the number of complaints containing the term.
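To make the two weighting schemes concrete, the sketch below builds a small term-sentence matrix with rows as terms and columns as complaints. It is a minimal illustration under stated assumptions: the complaints are already tokenized, the tokens are invented English placeholders, and IDF is taken as the natural logarithm of N divided by the number of complaints containing the term, which is a common form but not necessarily the exact variant used in the paper.

```python
import math


def term_sentence_matrix(complaints, scheme="tf"):
    """Build a term-by-complaint weight matrix.

    complaints: list of token lists, one per complaint.
    scheme: "tf" for raw term frequency, "tfidf" for TF times IDF.
    Returns (terms, matrix) where matrix[i][j] is the weight of terms[i]
    in complaints[j].
    """
    terms = sorted({tok for complaint in complaints for tok in complaint})
    n = len(complaints)
    # Number of complaints that contain each term (ComplaintFreq).
    complaint_freq = {t: sum(1 for c in complaints if t in c) for t in terms}

    matrix = []
    for term in terms:
        row = []
        for complaint in complaints:
            tf = complaint.count(term)
            if scheme == "tf":
                row.append(float(tf))
            else:
                # Assumed IDF form: log(N / ComplaintFreq).
                row.append(tf * math.log(n / complaint_freq[term]))
        matrix.append(row)
    return terms, matrix


docs = [["worms", "okra"], ["spots", "okra", "okra"], ["worms", "wheat"]]
terms, tf_matrix = term_sentence_matrix(docs, "tf")
_, tfidf_matrix = term_sentence_matrix(docs, "tfidf")
```

With this IDF form, a term that occurs in every complaint gets weight zero, which is how TF-IDF downweights words that do not help tell complaints apart.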
2) Reduced singular value decomposition: Singular value decomposition (SVD) is an algebraic method that plays an essential role in text mining and natural language processing. SVD is used to improve the term-sentence matrix, remove noise, and determine the relationships between terms and complaint statements [40]. It decomposes the term-sentence matrix into three matrices that capture all of the matrix's important properties and features.
Equation (6) shows the SVD decomposition of the m × n matrix.

$$SVD = U S V^{T} \quad (6)$$
Where U is the m-dimensional matrix, V is the n-dimensional matrix, and S is the diagonal matrix of singular values.
Moreover, RSVD is applied to improve and enhance the performance of SVD and reduce the matrix dimensionality.
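The dimensionality reduction can be written out as follows. This is a sketch under an assumption: RSVD is read here as the usual rank-k truncation that keeps only the k largest singular values, and neither the symbol X for the term-sentence matrix nor the value of k is given in the text.

```latex
X \approx X_k = U_k S_k V_k^{T},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
S_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k},
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k > 0
```

Under this reading, complaint j is represented by the j-th row of $V_k S_k$, a k-dimensional vector, and the similarity between complaints is computed on these reduced vectors instead of the original m-dimensional columns.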
3) Semantic similarity score: After applying RSVD, RSVD results are used to calculate semantic similarity between the farmer query and classified complaints document. The semantic score is calculated using the most common similarity method, which is the cosine similarity. Equation (7) represents the calculation of cosine similarity.

$$Sim(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \quad (7)$$
Where $Sim(A, B)$ is the similarity score between the farmer query and the complaints document, A is the vector of term weights in the query, and B is the vector of term weights in the complaint statement.
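A minimal Python sketch of the cosine similarity computation follows. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration; the input vectors stand for the query and complaint representations produced by the previous step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty (all-zero) vectors
    return dot / (norm_a * norm_b)
```

Because the score depends only on the angle between the vectors, a long complaint and a short query that use terms in the same proportions still receive a score close to 1.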

D. Ranking And Selection Phase
In this phase, the complaints are ranked according to their semantic scores, and the complaint with the highest score is selected. Finally, the solution of that complaint is retrieved as the answer to the farmer query.
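The ranking and selection step amounts to a sort followed by picking the top entry, as in the sketch below. The tuple layout and function names are hypothetical, and the scores in the usage note are invented for illustration.

```python
def rank_complaints(scored):
    """Sort (score, complaint, solution) tuples by score, highest first."""
    return sorted(scored, key=lambda item: item[0], reverse=True)


def select_solution(scored):
    """Return the solution attached to the highest-scoring complaint, if any."""
    ranked = rank_complaints(scored)
    if not ranked:
        return None
    return ranked[0][2]
```

For example, `select_solution([(0.41, "c1", "s1"), (0.93, "c2", "s2")])` returns `"s2"`, the solution stored with the complaint most similar to the query.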

IV. DISCUSSION
F-measure is used to evaluate the performance of the proposed classification approach and the proposed semantic similarity approach.
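For reference, the sketch below computes Precision, Recall, and F-measure from raw counts. It assumes the balanced F-measure (F1, the harmonic mean of Precision and Recall), since the text does not state which variant is used; the function name and the zero-division convention are choices made for this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and balanced F-measure from counts.

    tp: relevant items retrieved, fp: irrelevant items retrieved,
    fn: relevant items missed. Ratios with a zero denominator are set to 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Averaging these per-crop values over all crops would give figures of the kind reported in the evaluation tables.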

A. Classification Evaluation
The performance of the MapReduce SVM classifier on Hadoop is evaluated. We also compared our classification results with the previous classification works of Mostafa et al. [20] and Mohammad et al. [41]. The authors of [20] applied two classifiers, SVM and KNN, to the same agricultural dataset to classify agricultural complaints by crop.
Moreover, the authors of [41] used two classifiers, the Naive Bayes algorithm (NB) and the Hybrid Naive Bayes with Multilayer Perceptron network (NB-MLP), to classify their dataset into positive or negative sentiment. We therefore applied the NB and NB-MLP algorithms to our agricultural dataset to classify complaints by crop. Table III shows a comparison between our MapReduce SVM evaluation results and those of the previous works that used the NB, NB-MLP, SVM, and KNN classifiers. We evaluated the results on four crops (wheat, rice, cotton, and beans), following the authors of [20].
In conclusion, MapReduce SVM achieved better evaluation results than the previous NB, NB-MLP, SVM, and KNN classifiers.

B. Semantic Similarity Evaluation
The proposed semantic similarity approach is tested and evaluated using both TF-based LSA and TF_IDF-based LSA, and its results are then compared with those of previous models. The tests were applied to twenty-five crops: Okra, Mandarin, Watermelon, Wheat, Rice, Cotton, Beans, Tomatoes, Potato, Peach, Apricot, Lentils, Onions, Clover, Apples, Eggplant, Grapes, Orange, Banana, Guava, Peas, Cowpea, Cabbage, Garlic, and Lettuce. Table V shows some examples of Arabic complaint queries related to different crops and their English translations.
For TF-based LSA, the average Precision, Recall, and F-measure values for the twenty-five crops are shown in Table VI. For TF_IDF-based LSA, we apply the same Arabic queries in Table IV to each of the twenty-five crops; the resulting average Precision, Recall, and F-measure values are shown in Table VII.
In conclusion, comparing the evaluation results of the TF-based LSA approach with those of the TF_IDF-based LSA approach shows that TF_IDF-based LSA achieved the best results. This is because TF-IDF measures how important a term is across complaints and gives high weight to important terms, while TF only counts the number of times a term appears in a complaint. Finally, we compared our results with the previous models of Schwab et al. [23]. They applied three methods: no weighting, IDF weighting, and POS weighting, and concluded that the IDF and POS weighting methods achieved better performance. We therefore applied these previous models (the IDF and POS weighting methods) to our agricultural dataset, using the same twenty-five crops as for our proposed TF-based LSA and TF_IDF-based LSA approaches. Table VIII shows the average Precision, Recall, and F-measure values for the previous models.
Finally, Fig. 2 shows the F-measure evaluation results of our two proposed models, TF_IDF-based LSA and TF-based LSA, compared with the previous IDF and POS weighting methods.
The comparison figure shows that the proposed TF_IDF-based LSA achieves better results than the proposed TF-based LSA and the previous IDF and POS weighting methods. The LSA-based proposed model is applied to retrieve the most relevant complaint and its solution for the farmer query. Consider the example in Table IX for an Arabic farmer query. As shown in Table IX, the Arabic farmer query is " ‫ٔجٕص‬ ‫انثايٛـــــــا‬ ‫فٗ‬ ‫"صٚضاااٌ‬ and its English equivalent is "Presence of worms in okra". The model is followed step by step as follows:
Firstly, the preprocessing steps are applied to the Arabic farmer query and all complaints in the dataset. Table X shows the preprocessing steps applied to the Arabic farmer query.
Secondly, classification with the MapReduce SVM approach on Hadoop assigns the Arabic farmer query to the crop it belongs to. The Arabic farmer query ‫تايٛا"‬ ‫صٚضاٌ‬ ‫"ٔجٕص‬ is classified into the ‫تايٛا‬ "okra" crop.
Thirdly, the LSA steps are applied: the input matrix is created for the Arabic farmer query and all complaints that belong to the okra crop, and RSVD is then applied to the input matrix.
The output of RSVD is then used to measure the semantic similarity score between the farmer query and the complaints document.
For the two proposed LSA methods, TF-based LSA and TF_IDF-based LSA, Table XI shows the resulting semantic similarity scores between the farmer query and the complaints document.
As shown in Table XI, the results of the semantic similarity score using TF_IDF-based LSA are better than the semantic similarity score using TF-based LSA since TF-IDF shows how important a term is in complaints while TF shows the number of times that a term appears in a complaint.
Fourthly, the complaints are ranked according to the semantic similarity scores of TF-based and TF_IDF-based LSA, as shown in Table XII and Table XIII, respectively, and the complaint with the highest score is selected. As shown in Table XII, after ranking the semantic similarity scores, the highest score is 0.9998 and its complaint is ‫انثايٛا"‬ ‫ثًاع‬ ‫صاسم‬ ‫صغٛغِ‬ ‫صٚضاٌ‬ ‫"ٔجٕص‬ , which is the nearest complaint to the farmer query. As shown in Table XIII, after ranking the semantic similarity scores, the highest score is 0.994 and its complaint is " ‫انثايٛا‬ ‫يذصٕل‬ ‫دفع‬ ‫كٛفّٛ‬ ‫ػٍ‬ ‫انًؼاعع‬ ‫ؿأل‬ ‫دفظٓا‬ ‫طغٚقح‬ ٔ ‫فٛٓا.‬ ‫يرٕفغِ‬ ‫انغٛغ‬ ‫انٕقد‬ ‫فٗ‬ ‫"الؿرؼًانٓا‬ , which is not the nearest complaint to the farmer query.
By comparing the results of both the TF_IDF-based LSA approach and the TF-based LSA approach, we conclude that the TF_IDF-based LSA approach is the best method for measuring the semantic similarity score.

VI. CONCLUSION
Agriculture has an important role in the economy of every country. It not only supplies food for the whole population but also helps to connect and interact with all related industries in the country. Due to the world's current conditions resulting from the spread of COVID-19, the imposition of curfews, and distancing between citizens, all fields are affected, especially agriculture. Farmers may have problems and complaints related to the agricultural process and crop productivity. It is difficult for them to communicate with agricultural experts to find appropriate solutions for their complaints. To address these issues, a semantic similarity approach for farmers' agricultural complaints was developed. The approach is based on LSA to measure the semantic similarity between the farmer query and the complaints document. The proposed model applies MapReduce SVM on Hadoop to classify the big agricultural dataset and the farmer complaint by crop type, which improves the performance of the proposed approach. The results were evaluated on twenty-five crops, with 25% of different complaint queries tested on each crop. The evaluations were applied to our two proposed models, TF-based LSA and TF_IDF-based LSA, and to previous work methods. The developed approach with TF_IDF-based LSA achieved better results than TF-based LSA and the previous methods, with an F-measure of 86.7%.