Producing Standard Rules for Smart Real Estate Property Buying Decisions based on Web Scraping Technology and Machine Learning Techniques

Purchasing real estate is a stressful and time-consuming activity, regardless of whether the individual in question is a buyer or a seller. It is also a major financial decision that can lead to serious consequences if taken hastily. A prospective buyer is therefore encouraged to invest time and effort in research relating to price trends, property type, location, and so on. Assessing which real estate property is the best one to buy can be a difficult task. The key idea of the current research study is to create a set of standard rules, based on web scraping technology and machine learning techniques, that can be followed to make a smart real estate buying decision.

Keywords—Web scraping technology; HtmlAgilityPack; machine learning; C4.5 decision tree; Weka-J48


I. INTRODUCTION
Any decision relating to a property purchase or sale is a vital one. To say that it is difficult to make up one's mind in that circumstance is an understatement [1]. That is not to say it is impossible, however, as there are technological means available that allow the modern buyer to make the best decision. One such route is to take the assistance of web scraping technology, which allows the user to gather online real estate property advertisements from different web sources [2]. The individual will therefore have a much better idea of what sort of decision to make when selling or buying real estate. Furthermore, combining the scraped data with machine learning techniques such as the C4.5 decision tree [3] makes it even easier to reach a sound decision.
II. WEB SCRAPING USING HTML AGILITY PACK

The term "web scraping", also referred to as "screen scraping" or "web data extraction", describes a program that mines large volumes of data from an internet source, extracts the relevant information, and saves it to a local file or database on a computer, typically in a spreadsheet-style table [4].
The data exhibited on many internet sources can only be observed through a web browser, so the only manual option is to copy and paste the information. This is a very monotonous task that can take hours or even days to complete. Web scraping automates this process: instead of manually copying data from a source, the software performs the same task quickly.

A. HTML Agility-Pack
This is a responsive HTML parser written in C# that builds a read/write DOM and supports plain XPATH and XSLT [5]. It is a .NET code library that permits you to parse "out of the web" HTML documents. In this study, the HTML Agility Pack is used to implement the scraping of several web pages on the internet [6].
• HTML Parsing: HTML parsing is the process of taking in HTML code and extracting relevant data such as the page title, subsections of the page, links, bold text, and so on.

• Document Object Model: The Document Object Model (DOM) is a programming API for HTML and XML documents. It defines the logical structure of a document and the way the document is accessed and manipulated [7].
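As an illustration of HTML parsing, Python's standard-library `html.parser` can extract a page title and link targets from markup. This is a hedged sketch, not the paper's C# implementation, and the markup below is a made-up ad listing:

```python
from html.parser import HTMLParser

class TitleAndLinkParser(HTMLParser):
    """Collects the page title and all hyperlink targets."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data.strip()

# Illustrative markup standing in for a real ad page.
html = ("<html><head><title>Property Ads</title></head>"
        "<body><a href='/ad/1'>2-bed flat</a><a href='/ad/2'>House</a></body></html>")
parser = TitleAndLinkParser()
parser.feed(html)
print(parser.title)   # Property Ads
print(parser.links)   # ['/ad/1', '/ad/2']
```

The same event-driven pattern extends to prices, locations and posting dates by matching the tags and attributes that the target site uses.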

B. HTML Agility Pack Installation
Steps:
1) First, install the NuGet package.
2) Under the "Package Manager" section, copy the installation command. For example, if the statement reads "PM> Install-Package HtmlAgilityPack -Version 1.5.1", the text following "PM>" should be copied.
3) Afterwards, open the Visual Studio application and click on the "Tools" menu in the menu bar.
4) From the drop-down menu, go to "NuGet Package Manager" and then "Package Manager Console."
5) At the bottom of the application, the "Package Manager Console" opens with the cursor blinking.
6) Paste the command copied in step 2 using the hotkeys Ctrl+V.
7) Press Enter and the package will install automatically.
C. Steps To Load DOM Using HTML Agility Pack
1) Add a DLL reference: open the Visual Studio application and click on the "Solution Explorer" in the sidebar.
2) Right-click on the project references and then click on "Add Reference" in the context menu.
3) In the "Reference Manager" window, click on the "Browse" button, navigate to the HAP DLL and select it.
4) Press OK, go back to the code area of the Visual Studio application, and insert the desired code.
5) Inside the Main function, write the code in which the HTML Agility Pack is used to load the HTML document, i.e. the "GetMetaInformation" method definition.

6) After saving the code, click on the "Start" button, place the cursor on the line doc = web.Load("https://technologycrowds.com");, and inspect "DocumentNode" and then "InnerHtml."
7) Click on the search icon and a new window will pop up. The new window will show all the DOM contents, i.e. the HTML content.
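For illustration, the load-and-inspect workflow above can be mimicked with Python's standard library. This is a hedged sketch, not the paper's C# code: the local page string stands in for the URL fetch performed by web.Load, and `xml.dom.minidom` plays the role of HtmlAgilityPack's DOM:

```python
from xml.dom.minidom import parseString

# Stand-in for HtmlAgilityPack's web.Load(url): the page is a local string
# here; with network access it would be fetched from the URL instead.
page = "<html><body><h1>Technology Crowds</h1><p>Welcome</p></body></html>"
doc = parseString(page)

# Rough equivalent of doc.DocumentNode and its InnerHtml property:
# serialize every child node of the root element.
document_node = doc.documentElement
inner_html = "".join(child.toxml() for child in document_node.childNodes)
print(inner_html)   # <body><h1>Technology Crowds</h1><p>Welcome</p></body>
```

Note that `minidom` requires well-formed markup; HtmlAgilityPack is more forgiving with real-world HTML, which is one reason the paper uses it.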

III. RESEARCH METHODOLOGY
In the first step, we briefly list the URL addresses of the best online real estate advertising web sources, then pass all the URLs to HtmlAgilityPack to extract the real estate ad data (e.g. property locations, prices, publication dates of the ads, etc.) from the various web sources. In the next step, with the help of linear regression, we find the average future growth rate of the price of each real estate property. Finally, from the current average property prices and the estimated average future growth rates, we create a set of standard rules for making real estate buying decisions. Fig. 1 shows the steps of the research methodology. Table I shows the average prices of the most popular housing areas over different periods of time.
The future price growth rate of a real estate property is a very significant factor in making real estate buying decisions. Using the average prices for the different time intervals shown in Table I, we apply the linear regression technique to estimate the average growth rate of future real estate prices.
Fig. 1. Steps of the research methodology:
Step 1: Select URLs of the best real estate advertising websites.
Step 2: Pass all URLs to HtmlAgilityPack to extract real estate advertising data (for instance, property locations, prices and ad posting dates).
Step 3: Find each property's average future price growth rate with the help of linear regression.
Step 4: Using Weka J48, generate the C4.5 decision tree.
Step 5: Produce standard rules for making smart real estate property buying decisions.

V. SIMPLE LINEAR REGRESSION
Simple linear regression establishes the connection between a target variable and an input variable by fitting a line, called the regression line [8]. The line is generally represented by the linear equation

y = mx + b

where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept.

Machine learning uses the following form of the same equation:

y = w0 + w1x

where w denotes the parameters, x is the input, and y is the target variable. Changing the values of w0 and w1 gives different lines, as seen in Fig. 2.
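A minimal least-squares sketch of this fit in Python, using illustrative period/price pairs rather than the Table I figures:

```python
# Least-squares fit of y = w0 + w1*x on (time period, average price) pairs.
# The figures below are illustrative, not taken from Table I.
xs = [0, 1, 2, 3]                      # time periods
ys = [100.0, 110.0, 121.0, 133.0]      # average prices (illustrative units)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))   # slope
w0 = mean_y - w1 * mean_x                     # intercept

# Predicted price for the next period, and growth rate vs. the latest price.
next_price = w0 + w1 * n
growth_rate = (next_price - ys[-1]) / ys[-1] * 100
print(round(w1, 2), round(w0, 2), round(growth_rate, 2))   # 11.0 99.5 7.89
```

The growth-rate percentage computed this way per property is what feeds the later categorisation step.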
Based on the linear regression analysis, Table II presents the estimated average future property values over different lengths of time, along with the price growth rate percentage.
As part of pre-processing, the continuous estimated real estate records shown in Table II are converted to categorical form by equal-width binning over the preferred intervals, as shown in Table III. Table IV shows the projected real estate property data set after conversion into categorical form.
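A minimal Python sketch of such equal-width discretisation; the prices, bin count and labels are illustrative, not the Table II/III values:

```python
# Equal-width binning: convert continuous values into categorical labels.
# Assumes max(values) > min(values).
def equal_width_bins(values, k, labels):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    out = []
    for v in values:
        idx = min(int((v - lo) / width), k - 1)   # clamp top value into last bin
        out.append(labels[idx])
    return out

prices = [50, 120, 180, 260, 300]   # illustrative price figures
result = equal_width_bins(prices, 3, ["Low", "Medium", "High"])
print(result)   # ['Low', 'Low', 'Medium', 'High', 'High']
```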
Next, the categorical data is given as input to the C4.5 decision tree (Weka J48).

VI. DECISION TREE C4.5

A decision tree is a supervised classification method: a structure in which the non-terminal nodes represent tests of one or more features and the terminal nodes represent the result of the decision [9]. Studies show that the basic ID3 tree-induction algorithm has been enhanced by the C4.5 algorithm [10]. A version of C4.5 called J48 is included in the WEKA classification package [11]. In C4.5, the information gain ratio is used as the splitting criterion [12]. The steps of this algorithm are as follows:

Step 1: Let T be a training set of class-labelled tuples. If an output test is selected, the training set T is split into subsets {T1, T2, ..., Tn}. The entropy of the set T can be calculated (in bits) as

Entropy(T) = - Σi p(Ci, T) log2 p(Ci, T)    (1)

where p(Ci, T) is the proportion of tuples in T belonging to class Ci.

Step 2: Divide the training sample by the values of the selected attribute X; the expected information requirement after the split is

Info_X(T) = Σi (|Ti| / |T|) × Entropy(Ti)    (2)

Step 3: The difference between the basic information requirement and the new information requirement is the information gain. Equations (1) and (2) give the gain criterion:

Gain(X) = Entropy(T) - Info_X(T)    (3)

Step 4: When building a compact decision tree, the gain criterion is useful, but it has a significant disadvantage: it is biased toward attributes with many outcomes. It is therefore normalised by the split information:

SplitInfo(X) = - Σi (|Ti| / |T|) log2(|Ti| / |T|)    (4)

The new gain criterion, the gain ratio, is:

GainRatio(X) = Gain(X) / SplitInfo(X)    (5)

The real estate training dataset (see Table V) is provided as input to Weka J48.
1) Generate the dataset in MS Excel or MS Access and save it in CSV format.
2) Start the Weka Explorer, load the CSV file, and run the J48 classifier. Click on the "Visualize Tree" option in the pop-up menu to view the graphical representation of the tree. Fig. 3 depicts the Weka J48 generated tree in graphical form [13].

The resulting rules are classified into two classes, "YES" and "NO". The following study discloses only one decision rule for each class.
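The splitting criterion described above can be sketched in a few lines of Python. This is an illustrative computation of entropy and gain ratio on a tiny made-up "growth → buy" dataset, not the Table V data or Weka's implementation:

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of class labels, as in equation (1)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain_ratio(rows, labels, attr_index):
    """Gain ratio of one attribute, combining equations (2)-(5)."""
    n = len(rows)
    base = entropy(labels)
    cond = 0.0         # Info_X(T), equation (2)
    split_info = 0.0   # SplitInfo(X), equation (4)
    for v in set(r[attr_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr_index] == v]
        p = len(subset) / n
        cond += p * entropy(subset)
        split_info -= p * math.log2(p)
    gain = base - cond                                  # equation (3)
    return gain / split_info if split_info else 0.0     # equation (5)

# Toy dataset: one attribute (price growth) and a buy-decision label.
rows = [("High",), ("High",), ("Low",), ("Low",)]
labels = ["YES", "YES", "NO", "NO"]
print(entropy(labels))               # 1.0
print(gain_ratio(rows, labels, 0))   # 1.0
```

Here the attribute separates the classes perfectly, so the gain ratio reaches its maximum; C4.5 repeatedly picks the highest-ratio attribute when growing the tree.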

IX. CONCLUSION
The decision to buy real estate is a substantial financial decision. Buyers should spend considerable time choosing the best property from all available options. This research finds that there are no existing standard rules for making smart real estate purchase decisions. We therefore propose a method that generates standard rules for selecting the best real estate property to buy, using web scraping technology and machine learning algorithms. This research will save buyers' time and provide a complete guide for making smart real estate buying decisions.

ACKNOWLEDGMENT
The authors would like to thank the Center of Innovations in Computer Science (CICS), Sir Syed University of Engineering and Technology, for providing resources and support to perform the experiments. This work is also supported by the University of Nottingham, UK.