Intelligent System for Price Premium Prediction in Online Auctions

The use of data mining techniques in the field of auctions has attracted considerable interest from the research community. In auctions, users try to achieve the highest gain and avoid loss as much as possible. Therefore, data mining techniques can be applied in the auctioning domain to develop an intelligent method for users of online auctions. However, determining the factors that affect the result of an auction, especially the initial price, is critical. In addition, the intelligent system must be built on clean data to ensure the accuracy of the results. In this paper, we propose an intelligent system (classifier) to predict the initial price of auctions. The proposed system uses the double smoothing method (DSM) for data cleaning as a preprocessing step. This system is implemented on a data set collected from the eBay website and cleaned using the proposed DSM. In the training phase, the CART technique is employed for the classifier construction. Compared to similar techniques, the proposed system exhibits better performance in terms of accuracy and robustness against noisy data, as determined using ROC curves.

Keywords: Classification; auction; CART; training; testing; preprocessing; noise; outlier; DSM


I. INTRODUCTION
Importance of the eBay website and auctions. The eBay website is one of the most important websites in the business area, and it can be considered as a leader among e-commerce and internet sites. eBay is used to sell and buy goods and products online or provide services worldwide. The importance of eBay is a result of the financial benefits generated by more than 168 million active buyers and more than 20 million active sellers, with billions of dollars of transactions occurring on the site [1]. By participating in online auctions, people can directly obtain and purchase the items that they desire, without the hassle/risk of traveling.
Motivation and problem statement. The final bid price of an auction is tightly coupled with the influencing factors. Although these factors were addressed in many previous works, the price premium was not considered. In the context of online auctions, the price premium is defined as "the monetary amount above the average price received by multiple sellers for a certain matching product" [2]. In addition, changes in the price premiums can indicate product shortages, excess inventories, or other changes in the relationship between supply and demand. This aspect thus affects the gain earned by businesspeople. Therefore, the corresponding research question pertains to the determination or prediction of the factors that are related to and affect the price premium, thereby enabling businesspeople to achieve the best gain and avoid loss by correctly estimating the initial price of the product. In addition to determining the factors, cleaning the data and eliminating the noise and outliers are critical to ensure accurate results.
The use of artificial intelligence in various domains has received considerable attention from the research community. In particular, by using an intelligent machine that takes some of the influencing factors as inputs, businesspeople can predict the price premium. In this context, the contributions of this work are as follows:
• A novel preprocessing method for data cleaning, known as the double smoothing method (DSM), is proposed. This method uses the binning method to process the noisy data, and a clustering-based technique is later used to filter the outliers.
• The CART technique is used to produce the decision rules for sellers in online auctions. CART is a type of decision tree, which acts as a classifier, and it can be used in the classification process to address categorical data as a final decision (in this work, reaching or not reaching the price premium). Furthermore, this technique can be used for regression to deal with continuous data [3].
• The variables that exert the most considerable effects on the auction outcomes are examined using the CART technique.
• Extensive experiments are performed on real data derived from the eBay website to evaluate the proposed approach against other similar approaches.
The remaining work is structured as follows: Section II presents the related work, followed by the description of the proposed artificial system and approach in Section III. Section IV describes the metrics used for the evaluation, and the subsequent section presents the experimental results and evaluations. Finally, the work is concluded in Section VI.

II. RELATED WORK
In this section, we first present an overview of internet auctions and then describe the decision tree induction techniques. Finally, we explore some works related to conducting auctions.

A. Internet Auctions
The development of the internet led to a revolution in the field of auctions, which were conducted only physically in the past. By using the internet, auctions could be conducted via emails or discussion lists [4]. At the end of the 1990s and the beginning of the 2000s, the number of auction sites was estimated to be 200 [5]. Subsequently, well-known companies launched competing auction sites, such as QXL.com in Europe, Taobao.com in Asia, and MercadoLibre in Latin America. Recently, the eBay website has become the most famous site used to conduct auctions.

B. Decision Tree Induction Techniques
Decision trees can be considered as a reflection of the rules used for classification in artificial intelligence research. Such trees visually represent the rules in the form of nested if-then statements [6]. Various algorithms are utilized to form decision trees, such as CART, QUEST, ID3, and CHAID. These algorithms differ in terms of the mechanism used to determine the root of the tree and subsequently form the other branches. In addition, these techniques differ depending on the data type that they can handle. For example, ID3 and CHAID can handle only categorical data, while CART can handle both categorical and continuous data [7].
The stochastic differential equation (SDE) has been used to model eBay prices [8]. In this study, the authors utilized the SDE to represent the price velocity and acceleration. In the experiments, the authors used a database of 63 training samples and 30 testing samples from auctions of the Microsoft Xbox gaming system. Subsequently, by performing a differential analysis, the authors extracted the features and collected the results. Most importantly, it was indicated that the SDE is more suitable for this task than the ordinary differential equation (ODE) approach [9].
The authors of [10] proposed a price prediction and insurance service for online auctions. The objective of this service was to guarantee a minimum end price for sellers in auctions such as those conducted on eBay. It was concluded that auctions with a reserve price option lead to a worse end price, which is why the price insurance service is desirable. For data gathering, a crawler was employed to collect data from the eBay website for two months. The features extracted to build the database were related to the seller, the items, and the auction. By using multiple classification and regression algorithms, the final classifier was generated to classify new data.
Dass et al. proposed a dynamic price forecasting method [11]. The features that distinguish this work are as follows: (1) the technique handles the same product that is involved in multiple auctions; (2) the price dynamics are considered, and static data such as the initial price and seller reputation are ignored; and (3) the source of the price dynamics is treated in the context of competition among buyers. The drawback of ignoring such static data is that the initial price can vary, and it depends on the seller's judgment rather than eBay's policy, which in turn increases the bidding price, especially at the first stage of the auction.
A neural network-based approach was proposed [12] to solve nonparametric price prediction models. The key idea was to map the nonlinear data and approximate the end price regardless of any assumptions, by adjusting the weights of the inputs of the neural network.
Gregg et al. proposed an intelligent recommendation system, which can act as an adviser to the users for price prediction [13]. To support bidding decisions that must be made within a short amount of time, this system targets the bargain-search process. The key idea is to present the users with relevant information such as the current bid and a recommended price based on recently closed auctions. This system was enhanced in another study [14], primarily by linking the price prediction process with various features of the auction, such as the feedback rating and item description.

III. PROPOSED ARTIFICIAL SYSTEM ARCHITECTURE
In this section, we describe the system architecture and present the details regarding the data set used for the training and testing phases.

A. Artificial System Architecture
The general system architecture consists of three main components, as illustrated in Fig. 1.
As shown in Fig. 1, the first component of the system is the database, which is represented by tables that contain data. The second component is the classifier, which is trained on the data collected and stored in the database. The trained classifier follows certain rules, which may be complex. Therefore, the third component is responsible for constructing and pruning the decision tree that the classifier later uses to make the final decision (i.e., classifying a new record or unknown data). Table I summarizes the data set used in this work, on which the classifier is trained and tested. The eBay website is used for the data collection for the following reasons: First, this website includes real data, which is preferable when conducting practical experiments [16]. Second, sellers consider the website a continuous source of data because it often motivates users to compete at any time and in any location [17]. Finally, the mechanism employed by eBay is favorable to the auctioneers in most cases [18].

B. Used Data Set
Data preprocessing. The data collected from the eBay website represent real data. In data mining, real data are often considered dirty, as they may be incomplete, noisy, or inconsistent, or may contain missing values or outliers. Such dirty data (due to instrument faults, human or computer errors, or transmission errors) negatively affect the quality of the intelligent system and the accuracy of its outcomes [19,27]. In addition, many issues should be taken into consideration during the preprocessing phase, such as ensuring data privacy [28,29,30,31], enhancing the performance using high-performance computing techniques [32], securing the data using hiding or blurring techniques [33], and employing agent-based software technology to address transmission challenges [34]. These issues are not considered in this work and are left for future work.
To solve this problem and clean the data, a double smoothing method (DSM) is utilized, in which the binning method is followed by a clustering-based technique. The binning method is used for data smoothing to eliminate noisy data [20], and it consists of two main steps: (1) sorting the data and partitioning them into (equal-frequency) bins; and (2) smoothing the data (the boundary-based method is used in this study [21]). The goal of the clustering method is to detect and eliminate the outliers, which are considered the most harmful type of noise that can be located within the data. In the context of this work, the outliers refer to the extremely low (or high) values of the attributes used for constructing the database [22].
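The two DSM stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `max_gap` and `min_size` parameters, and the gap-based one-dimensional clustering rule are all hypothetical, since the text does not name the clustering algorithm that was used.

```python
def smooth_by_bin_boundaries(values, n_bins):
    """Equal-frequency binning, then smoothing by bin boundaries."""
    # Positions of the values in ascending order; the input order is preserved.
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = list(values)
    n = len(values)
    for b in range(n_bins):
        # Sorted positions that fall in bin b (bin sizes differ by at most 1).
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        low, high = values[idx[0]], values[idx[-1]]
        for i in idx:
            # Replace each value with the closer bin boundary (ties go to low).
            smoothed[i] = low if values[i] - low <= high - values[i] else high
    return smoothed


def drop_small_clusters(values, max_gap, min_size):
    """Assumed clustering rule: split sorted values at gaps larger than
    max_gap and discard clusters with fewer than min_size members."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > max_gap:
            clusters.append([v])      # large gap: start a new cluster
        else:
            clusters[-1].append(v)
    keep = {v for c in clusters if len(c) >= min_size for v in c}
    return [v for v in values if v in keep]


def double_smooth(values, n_bins, max_gap, min_size):
    """Binning first, clustering-based outlier filtering second."""
    return drop_small_clusters(
        smooth_by_bin_boundaries(values, n_bins), max_gap, min_size)


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 250]  # 250 is an injected outlier
cleaned = double_smooth(prices, n_bins=3, max_gap=20, min_size=2)
```

In this toy run, the binning stage pulls each value to the nearer boundary of its bin, and the isolated value 250 forms a single-member cluster that the second stage removes.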
After cleaning and smoothing the data, the curse of dimensionality problem may arise. Because the analysis of complex data (that include many attributes or dimensions) on the complete data set may require a considerable amount of time, the dimensionality of the data must be reduced. In this work, we rely on the feature selection method to reduce the dimensionality [23]. After this process, the obtained database, as shown in Fig. 2, can be used to train the classifier.
The features selected to reduce the dimensionality are those that have the most considerable impact on the auction decision. Such factors include the shipping cost, reputation (expressed by rating), initial bid price, and auction ending time. These features are considered as the variables that determine the final price in the context of the auction process.
Statistics operations are applied to the auction data to obtain the descriptive statistics, that is, the mean and standard deviation for continuous or real (float values) data variables and the frequencies of the categorical data. Table II summarizes the descriptive statistics obtained in this work.
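As an illustration of these descriptive statistics, the following sketch computes the mean and standard deviation of continuous attributes and the frequencies of categorical ones. The attribute names and records are hypothetical and do not come from Table II, and the sample (n - 1) standard deviation is an assumption, as the text does not state which variant was used.

```python
from collections import Counter
from statistics import mean, stdev


def describe(records, continuous, categorical):
    """Summarize a list of dict records attribute by attribute."""
    summary = {}
    for name in continuous:
        column = [r[name] for r in records]
        # Sample standard deviation (assumed variant).
        summary[name] = {"mean": mean(column), "std": stdev(column)}
    for name in categorical:
        # Frequency of each category value.
        summary[name] = dict(Counter(r[name] for r in records))
    return summary


# Hypothetical auction records, for illustration only.
auctions = [
    {"shipping_cost": 2.0, "premium": "yes"},
    {"shipping_cost": 4.0, "premium": "no"},
    {"shipping_cost": 6.0, "premium": "yes"},
]
stats = describe(auctions, continuous=["shipping_cost"], categorical=["premium"])
```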

C. Model Construction (Classifier)
To create the classifier, the cross-validation method is used, which is a common tool in data mining. This approach consists of two main stages, namely, the training stage and the testing stage [24]. The goal of the training stage is to construct the classifier by training it on the data set, while the objective of the testing stage is to estimate the performance of the classifier in terms of accuracy. The $k$-fold cross-validation method involves two main steps: (1) randomly partitioning the data into $k$ mutually exclusive subsets $D_1, D_2, \ldots, D_k$ of approximately equal size; and (2) at the $i$-th iteration, using subset $D_i$ as the test set and the remaining subsets as the training set. Fig. 3 illustrates the process flow of the cross-validation method.
Algorithm 1 shows the steps of the cross-validation method.
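A minimal sketch of k-fold cross-validation is given here for illustration; it is not a reproduction of Algorithm 1. The function names, the `seed` parameter, and the majority-class learner used in the example are assumptions introduced only to make the sketch runnable.

```python
import random
from collections import Counter


def k_fold_indices(n_records, k, seed=0):
    """Randomly partition record indices into k disjoint folds of near-equal size."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(records, labels, k, train_fn, predict_fn, seed=0):
    """Return the overall accuracy: each fold serves exactly once as the test set."""
    folds = k_fold_indices(len(records), k, seed)
    correct = 0
    for i, test_idx in enumerate(folds):
        # All folds except fold i form the training set for this iteration.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train_fn([records[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        correct += sum(predict_fn(model, records[j]) == labels[j]
                       for j in test_idx)
    return correct / len(records)


# Hypothetical stand-in learner: always predicts the majority training label.
def train_majority(train_records, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, record):
    return model


toy_labels = ["yes"] * 8 + ["no"] * 2
toy_records = list(range(10))
toy_accuracy = cross_validate(toy_records, toy_labels, 5,
                              train_majority, predict_majority)
```

Because every record is tested exactly once, the returned accuracy is computed over the whole data set, which is the property the text relies on when it compares cross-validation with the holdout method.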

D. Construction of the Decision Tree
The classifier follows certain rules in the process of defining the initial price to decide if a user should continue in the auction or not. In other words, the class that the classifier predicts is the initial price based on the following rules that are formed using (and, or) operators located among the predefined features. The rules are considered as the training space and used to form a decision tree.
The CART algorithm, which involves a nonparametric procedure, is employed to create the optimal decision tree. The main advantage of CART is that it supports categorical data types in classification and real or continuous data types in regression. The findings obtained using CART are simple to understand and visualize. The strategy followed by CART to construct the decision tree can be summarized as follows:
1) Select the features or variables. In this step, the features extracted to achieve the dimensionality reduction are used as the variables for the CART algorithm.
2) Determine the splitting condition, that is, determine the best of the selected features as the root of the decision tree.
3) Determine the stopping criterion, which indicates the completion of the decision tree. In this work, the stopping condition is met when no more data are available in the data set.
4) Perform pruning, which is aimed at avoiding the overfitting problem. In this work, this problem is avoided as the most suitable features are selected as variables for the CART algorithm and double data cleaning is performed against noisy data and outliers.
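The splitting step can be illustrated with a short sketch. It assumes Gini impurity as the splitting criterion, which is the criterion commonly associated with CART classification trees; the text does not state which criterion was used, and the function names and toy data below are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Find the binary split `row[f] <= t` with the lowest weighted Gini impurity.

    Returns (feature_index, threshold, weighted_gini), or None if no split exists.
    """
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Toy rows: (initial bid price, shipping cost); labels: premium reached or not.
rows = [(1.0, 5.0), (2.0, 3.0), (8.0, 4.0), (9.0, 6.0)]
labels = ["no", "no", "yes", "yes"]
root = best_split(rows, labels)
```

Applying `best_split` recursively to the left and right partitions, until a node is pure or no data remain, yields the nested if-then rules that the tree represents.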

IV. EVALUATION METRICS
In this work, two main performance metrics are utilized for the evaluation: the confusion matrix and the ROC curve, which is derived from the confusion matrix.

A. Confusion Matrix
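In general, the confusion matrix is a useful tool for analyzing how well a classifier can recognize the tuples of different classes. The confusion matrix is formed considering the following terms [25]:
• True positives (TP): positive tuples that are correctly labeled by the classifier.
• True negatives (TN): negative tuples that are correctly labeled by the classifier.
• False positives (FP): negative tuples that are incorrectly labeled as positive.
• False negatives (FN): positive tuples that are incorrectly labeled as negative.
Based on the confusion matrix, the accuracy of a given classifier can be calculated by considering the recognition rate, which is the percentage of the test set tuples that are correctly classified. The accuracy can be obtained using the following formula:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy-based evaluation. In this context, a higher accuracy corresponds to a better classifier output. The maximum value of the accuracy metric is 1 (or 100%), which is achieved when the classifier classifies the data correctly without any error in the classification process.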
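The recognition rate can be computed from the four standard confusion-matrix counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below is illustrative only; the function names and the example labels are hypothetical and are not results from this work.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP, and FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Recognition rate: fraction of test tuples that are correctly classified."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels, for illustration only.
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]
counts = confusion_counts(actual, predicted)
```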

B. ROC Curve
Receiver operating characteristic (ROC) curves are used to enable the visual comparison of different classification models. These curves indicate the balance between the true and false positive rates, and the area under the ROC curve denotes the accuracy of the classifier [26].
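To make the relationship between the curve and the area under it concrete, the following sketch builds ROC points from classifier scores and integrates them with the trapezoidal rule. It is an illustration under stated assumptions: the function names and the toy scores are hypothetical, labels are encoded as 1 (positive) and 0 (negative), and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The decision threshold is swept from the highest score to the lowest;
    tuples with tied scores are consumed together.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores, for illustration only.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(toy_labels, toy_scores))
```

An area of 1.0 corresponds to a classifier that ranks every positive tuple above every negative one, while a classifier that assigns the same score to all tuples produces the diagonal line with an area of 0.5.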
ROC-based evaluation. In this context, a model whose curve lies closer to the diagonal line (i.e., whose area under the curve is closer to 0.5) is a less accurate model.

V. EXPERIMENTAL RESULTS AND EVALUATIONS
The proposed approach is implemented using the MATLAB programming language. The system is executed on a laptop with the following configuration: a 2.4 GHz Intel(R) processor with 4.00 GB of RAM, running Microsoft Windows 7 Ultimate. The proposed system is compared with the recommendation system for price prediction (RSPP) that was presented in [14].

1) Evaluation based on the confusion matrix:
In this context, the same data set size is used to enable a fair comparison. C1 refers to a suitable predicted initial price, while ¬C1 refers to an unsuitable predicted initial price. These aspects are represented as C1 = yes and ¬C1 = no.
Discussion. The proposed system outperforms the RSPP system in terms of accuracy. This finding can be attributed to the training phase. In the proposed system, the cross-validation is performed 10 times, which means that the classifier is trained on the complete data set. In other words, in each run, a different part of the data set is used as the training set, which leads to comprehensive training, thereby providing the classifier with more alternatives to deal with new data. In the RSPP system, a holdout method is used, which divides the complete data set into two main data sets (training and testing). However, the training set constitutes only 80% of the original data set, with the remainder reserved for testing. Since only the training set is employed to construct the model (i.e., the classifier), the time spent in the training phase is considerably smaller than that in the proposed system. In general, more extensive training tends to yield more accurate outputs.

2) Evaluation based on ROC curves:
In this context, we use the same data set size under the same conditions considered in the previous comparison. In addition, we evaluate the systems involved in the comparison in terms of robustness. Robustness refers to the ability of a classifier to provide correct predictions in the case of noisy data or data with outliers. Fig. 4 shows the ROC curves for both the proposed system and the RSPP system.
Discussion. Fig. 4 illustrates that the proposed system performs better than the RSPP system under noisy data (particularly, in the presence of outliers). In general, the accuracy of a classifier is negatively affected by outliers because they introduce dramatically decreased (or increased) values that are reflected as low accuracy in the classification process. This low accuracy is an expected consequence of rules that are generated to fit only the outliers. However, the proposed system can address both noisy data and outliers effectively because the preprocessing step (i.e., the DSM) is performed before training the classifier. The noisy data added to the cleaned data for robustness testing are filtered using the binning method. Moreover, the outliers inserted within the original data are grouped and deleted using the clustering-based method. Consequently, the abnormal data are filtered before training the classifier in the proposed system, which is reflected in a high classification accuracy. In the RSPP system, bag of words (BOW) as well as frequency-based methods are used to preprocess the data. However, a certain frequency threshold is used to determine if a given value corresponds to a noisy data point or outlier. Therefore, many such values remain and are used in the training phase. The presence of such abnormal data (that are not filtered and contribute to the classifier construction) leads to a lower classification accuracy compared to that of the proposed system.

VI. CONCLUSION
The final bid price of the auction is tightly coupled with the factors influencing the price. Among these factors, the initial price plays a vital role in the decision-making process conducted by the user regarding whether to become involved in an auction. To realize effective decision making and help users, artificial intelligence techniques can be employed. The process of development of an intelligent system involves two main phases, namely, training and testing. However, preprocessing the data used in the training phase for the model construction is critical to ensure accurate results. In this work, we employ the double smoothing method (DSM) for data cleaning. The data are cleaned by subjecting them to two processes, namely, (1) the binning method, which is responsible for noisy data elimination; and (2) the clustering-based method, which is responsible for outlier detection and deletion. The classifier is built based on the cleaned data by using the cross-validation method (with k=10). The data that the classifier trains on are collected from the eBay website and arranged in a database of 1000 cleaned records. The decision tree, which contains the rules that the classifier follows in the process of classification, is formed using the CART algorithm as it can deal with numerical, continuous and categorical data. A confusion matrix and ROC curves are employed to evaluate the proposed system against similar systems. The results show that the proposed system achieved an accuracy of 95% compared to that of 75% for an existing system. This result is supported by the ROC curves, which indicate that the proposed system exhibits a better accuracy and robustness against noisy data and outliers.
In future work, we intend to enhance the proposed system to achieve a higher accuracy and ensure the privacy protection of the processed data. In addition, different data sets can be used to assess the scalability of the proposed work.