A Fast and Efficient Algorithm for Outlier Detection Over Data Streams

Outlier detection over data streams is an important task in data mining, with applications such as fraud detection, public health, and computer network security. Many approaches have been proposed for outlier detection over data streams, such as distance-, clustering-, density-, and learning-based approaches. In this paper, we are interested in density-based outlier detection over data streams. Specifically, we propose an improvement of DILOF, a recent density-based algorithm. We observed that the main disadvantage of DILOF is its summarization method, which takes a lot of time and significantly degrades the algorithm's accuracy. Our new algorithm, called DILOF_C, utilizes an efficient summarization method. Our performance study shows that DILOF_C outperforms DILOF in terms of both total response time and outlier detection accuracy.

Keywords—Data mining; outlier detection; data streams; density-based approach; clustering-based approach


I. INTRODUCTION
Outlier detection (OD) is an important data mining task. Its objective is to discover elements (points) that deviate significantly from the expected behaviour; such elements are called outliers. For example, consider the two-dimensional data points in Fig. 1. This dataset contains three normal regions, N1, N2, and N3. Data points that are significantly far away from these three regions are outliers; in this example, o1, o2, o3, and o4 are outliers. The prominent causes of outliers are malicious activity, changes in the environment, instrumentation error, and human error. OD plays a significant role in several real-world applications, such as intrusion detection systems, interesting sensor events, credit-card fraud, law enforcement, and medical diagnosis.
Outlier detection raises significant challenges when a stream-based environment is considered [1], [2], [3]. A data stream potentially contains an infinite number of data points, while memory limitations constrain how many data points can be held and processed at a given time. Moreover, no information about a data point is available before it enters memory; that is, the status of the current data point as an outlier or inlier must be established before dealing with subsequent data points. For example, in wireless sensor networks, only limited memory is available at each sensor node and outliers must be detected in reasonable time; the communication cost of these networks is also an essential factor. Many approaches to outlier detection over data streams exist, such as clustering-based [4], statistics-based [5], [6], distance-based [7], [8], [9], [10], [11], and density-based [12], [13], [14], [15], [16] outlier detection. In this paper, we are interested in density-based outlier detection over data streams. Specifically, we propose an improvement of DILOF, a recent density-based algorithm. Our new algorithm is called DILOF_C (Density Incremental LOF using summarization based on a novel m-Center clustering algorithm). We observed that the main problem of DILOF is its summarization method, which takes a lot of time and significantly degrades the algorithm's accuracy. Note that DILOF is one of the best-known algorithms that apply the density-based outlier detection approach. In this approach, the density of each point is compared with the density of its local neighbors, based on the assumption that the density of a normal data point is similar to the density of its neighbors, whereas the density of an outlier is dissimilar to that of its local neighbors.
For each point, the density is computed via an outlier score called LOF (Local Outlier Factor) [17]. We discuss LOF and DILOF in detail in Sections II-A and II-B, respectively.
In the remaining sections, we discuss the problem definition and related work in Section II. Section III presents our proposed algorithm. We report the experimental results in Section IV. Finally, Section V concludes the paper.

II. PROBLEM DEFINITION AND RELATED WORK
Definition 2.1 A data stream is a possibly infinite sequence of data points P = {p_1, p_2, p_3, ..., p_n, ...}, where data point p_n arrives at time p_n.t.

In this definition, the data points are sorted by the timestamp at which they arrive. Since the size of a data stream is unbounded, the stream is processed in a sliding window, i.e., a collection of active data points small enough to be held in main memory. Windowing splits the data stream into overlapping finite sets of data points (sliding windows). The splitting can be done by the arrival time of the data points (time-based windows) or by the count of the data points (count-based windows). In this paper, we focus on the count-based window.
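For illustration, a count-based window can be sketched as a bounded buffer that evicts the oldest point when full; the class and names below are ours, not from the paper:

```python
from collections import deque

class CountBasedWindow:
    """A count-based sliding window: holds at most `size` recent points."""
    def __init__(self, size):
        self.size = size
        self.points = deque()

    def insert(self, point):
        # When the window is full, evict the oldest point before appending.
        if len(self.points) == self.size:
            self.points.popleft()
        self.points.append(point)

    def __len__(self):
        return len(self.points)

w = CountBasedWindow(size=3)
for p in [(0, 0), (1, 1), (2, 2), (3, 3)]:
    w.insert(p)
```

After inserting four points into a window of size 3, only the three most recent points remain.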

Problem Definition
Given a data stream P = {p_1, p_2, ..., p_n, ...}, the objective is to calculate the LOF score of each data point p_i and to decide whether p_i is an outlier, subject to the following constraints:
• We store only a small number of data points, m << |P|. Note that here m equals the window size, |W|.
• The outlier detection for an incoming data point p_i must be done as soon as p_i arrives.
• The data distribution is unknown.

A. LOF (Local Outlier Factor), iLOF, and MILOF
LOF [17] is a well-known algorithm for outlier detection in static datasets. Its objective is to calculate the LOF score of each data point. Suppose the following:
• All data points are available,
• k is the number of nearest neighbors considered,
• dist(x, y) is the Euclidean distance between two data points x and y,
• dist_k(x) is the Euclidean distance between a data point x and its k-th nearest neighbor,
• N_k(x) is the set of the k nearest neighbors of the data point x.
Using the following definitions, we compute the LOF score of every data point.

Definition 2.2 Given two data points x and y, the reachability distance reach_dist_k(x, y) is defined by

reach_dist_k(x, y) = max{dist_k(y), dist(x, y)}    (1)

Definition 2.3 The local reachability density of a data point x, lrd_k(x), is given by

lrd_k(x) = |N_k(x)| / Σ_{y ∈ N_k(x)} reach_dist_k(x, y)    (2)

Definition 2.4 The local outlier factor of a data point x, LOF_k(x), is given by

LOF_k(x) = ( Σ_{y ∈ N_k(x)} lrd_k(y) / lrd_k(x) ) / |N_k(x)|    (3)

To check whether a data point x is an outlier, we compare its local outlier factor LOF_k(x) with a given threshold T: if LOF_k(x) ≥ T, then x is classified as an outlier. Note that the LOF algorithm computes the LOF scores of all data points only once, and it detects outliers in static datasets only. iLOF (incremental LOF) [18] was proposed for stream datasets, but it stores all data points in memory; thus iLOF requires a very large memory and is not applicable to data streams, whose size increases sharply. Another algorithm, MiLOF [19], was proposed to decrease the space complexity. It keeps only a small number of data points in memory by using the k-means clustering method [20] to summarize old data points. However, the accuracy of MiLOF is poor, since k-means summarization does not preserve the dataset density. To overcome the drawbacks of MiLOF, the authors of [13] proposed a new algorithm called DILOF (Density summarizing Incremental LOF). Since our proposed algorithm is based on DILOF, we discuss it in detail in the next section.
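The three definitions above translate directly into code. The following is a plain, unoptimized sketch of the standard LOF formulas (function names are ours) applied to a toy dataset with one obvious outlier:

```python
import math

def dist(x, y):
    """Euclidean distance between two points."""
    return math.dist(x, y)

def knn(points, x, k):
    """The k nearest neighbors of x among the other points."""
    others = [p for p in points if p != x]
    return sorted(others, key=lambda p: dist(x, p))[:k]

def k_dist(points, x, k):
    """dist_k(x): distance from x to its k-th nearest neighbor."""
    return dist(x, knn(points, x, k)[-1])

def reach_dist(points, x, y, k):
    """Definition 2.2: reach_dist_k(x, y) = max{dist_k(y), dist(x, y)}."""
    return max(k_dist(points, y, k), dist(x, y))

def lrd(points, x, k):
    """Definition 2.3: local reachability density of x."""
    nn = knn(points, x, k)
    return len(nn) / sum(reach_dist(points, x, y, k) for y in nn)

def lof(points, x, k):
    """Definition 2.4: local outlier factor of x."""
    nn = knn(points, x, k)
    return sum(lrd(points, y, k) for y in nn) / (len(nn) * lrd(points, x, k))

# Toy dataset: a tight unit-square cluster plus one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = {p: lof(pts, p, 2) for p in pts}
```

Points inside the unit-square cluster get LOF scores near 1, while the isolated point (5, 5) gets a much larger score.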

B. DILOF Algorithm
DILOF [13] is a well-known density-based algorithm for outlier detection over data streams. It applies two steps, as follows. The first is the Last Outlier-aware Detection step (LOD), which checks whether the incoming data point x is an outlier. This is done by computing LOF_k(x) over a window of data, W. The algorithm then updates the information (lrd_k and LOF_k) of the old data points in W that are affected by inserting x (i.e., the data points whose neighbor information changes when x is inserted). Note that x is inserted into W whether it is an outlier or an inlier. A skipping scheme strategy was also proposed to detect a long sequence of outliers; in other words, this scheme distinguishes outliers from the data points of newly emerging classes. It works by not inserting new outliers into the window, which preserves the low-density regions where outliers exist. Formally, the skipping scheme works as follows. First, DILOF computes dist_1(p), the distance to the nearest neighbor, for each data point p ∈ M (where M is the set of data points in memory), and then the average avg_dist_1 of these distances. Let dist(o, p_c) be the Euclidean distance between the last detected outlier o and the current data point p_c. If avg_dist_1 > dist(o, p_c), then o is set to p_c and p_c is not inserted into W; in this case, the Skipping Schema parameter is set to true.
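The skipping test just described can be sketched as follows; the names are ours, and unlike a real implementation this recomputes dist_1 from scratch on every call:

```python
import math

def skip_outlier(memory, last_outlier, p_c):
    """Return True if p_c should be skipped (treated as part of a long
    outlier sequence) rather than inserted into the window.
    `memory` is the list of in-memory points; dist_1(p) is the distance
    from p to its single nearest neighbor in memory."""
    def dist1(p):
        return min(math.dist(p, q) for q in memory if q != p)
    avg_dist1 = sum(dist1(p) for p in memory) / len(memory)
    # Skip p_c when it lies closer to the last detected outlier than the
    # average nearest-neighbor distance of the points in memory.
    return math.dist(last_outlier, p_c) < avg_dist1

# Four tightly packed points: every nearest-neighbor distance is 1.
memory = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
near_skip = skip_outlier(memory, last_outlier=(10.0, 10.0), p_c=(10.2, 10.0))
far_skip = skip_outlier(memory, last_outlier=(10.0, 10.0), p_c=(20.0, 20.0))
```

A point 0.2 away from the last outlier is skipped (it continues the outlier sequence), while a far-away point is not.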
The second step is the Nonparametric Density Summarization step (NDS), which decreases memory consumption by summarizing the old data points while respecting the dataset density. NDS uses a nonparametric Renyi divergence estimator to measure the divergence between the original data points and a candidate summary. When the number of data points reaches |W|, NDS summarizes the oldest |W|/2 data points into |W|/4 representative data points such that the density difference between them is minimized. See Fig. 2.
Note that the previous two steps are repeated: LOD executes on every insertion of a data point, while NDS executes when the number of data points in memory equals |W|. In the experimental evaluation of DILOF on real-world datasets, DILOF significantly outperformed MiLOF with respect to both accuracy and execution time.

III. PROPOSED ALGORITHM
The summarization method of the DILOF algorithm takes many iterations to find its output, and therefore takes a lot of time. Moreover, through extensive experiments on several real datasets, we found that it significantly degrades the algorithm's accuracy. To overcome these issues, we propose a new summarization method called sum_m_center, which is injected into the DILOF algorithm in place of its current, inefficient summarization method. The proposed summarization method is based on a new clustering technique called m-center. We first present the m-center clustering algorithm and then discuss the proposed summarization method.

Previous clustering algorithms such as k-means require a large number of iterations to compute their output. To address this problem, we propose the m-center clustering algorithm, in which each partition representative (medoid) is sampled from the original data. We search efficiently for the medoid of each cluster as follows: in the first iteration we find the medoid of the first cluster, in the second iteration the medoid of the second cluster, and so on. Thus, if the number of clusters is k, there are only k iterations. In each iteration I we execute the following steps. For each data point p ∈ P (the set of all data points), we compute its set of m nearest neighbors, mNN(p). We then compute the Euclidean distance dist(p, p_j) between p and each p_j ∈ mNN(p), and the sum sum_dist(p) = Σ_{j=1}^{m} dist(p, p_j). Next, we select the point p_i ∈ P with minimum sum_dist(p_i) as the medoid of the cluster currently being processed, C_I, and add all points in mNN(p_i) to C_I. Finally, we remove every point p ∈ C_I from P. Since there are k iterations, we repeat these steps k − 1 more times after the initial iteration, giving k clusters.

If some data points remain unassigned after the k iterations (i.e., P ≠ ∅), we add each remaining data point p_r to its closest cluster, based on the distance between p_r and the medoid of each cluster. The next algorithm (lines 1-12) outlines the m-center algorithm.
Algorithm: m-center(P, k, m, d_t)
Input: P: the set of data points; k: the number of clusters; m: the size of the m-nearest-neighbors set of a data point; d_t: distance threshold.
Output: C: the set of k clusters.
1. for I = 1 to k do
2.    for each data point p_i ∈ P do
3.       Compute the m nearest neighbors set mNN(p_i)
4.       Compute sum_dist(p_i) = Σ_{p_j ∈ mNN(p_i)} dist(p_i, p_j)
5.    end for
6.    Select p_x ∈ P where sum_dist(p_x) has the smallest value.
7.    Add the point p_x and the set mNN(p_x) to cluster C_I, where p_x is the cluster medoid.
8.    Remove the point p_x and the set mNN(p_x) from P.
9. end for
10. for each remaining data point p_r ∈ P do
11.    Add p_r to its closest cluster.
12. end for
13. Find the nearest-clusters sets according to d_t (see Section III-A)
14. for each nearest-clusters set NC do
15.    Merge the clusters in NC into one big cluster
16. end for
17. return C

Example 3.1 In the second iteration, we find that the point p_6 = (7, 2) is the medoid of the second cluster, C_2, since mNN(p_6) = {(6, 2), (7, 1), (7, 3), (8, 2)} has the minimum sum_dist, namely sum_dist(p_6) = 4. We then remove p_6 and mNN(p_6) from P. After the two iterations, we check whether any data points remain in P. In this example, nine data points remain, since five data points were removed in each iteration. We therefore add each remaining data point to its closest cluster, obtaining the following two clusters.
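Under the stated assumptions, the m-center steps (without the merging optimization of the next subsection) can be sketched as follows; the function and variable names are ours:

```python
import math

def m_center(P, k, m):
    """Sketch of m-center clustering: each iteration picks as medoid the
    unassigned point with the smallest summed distance to its m nearest
    unassigned neighbors, and forms a cluster from the medoid plus those
    neighbors. Leftover points join the cluster with the closest medoid."""
    P = list(P)
    clusters, medoids = [], []
    for _ in range(k):
        if len(P) <= 1:
            break
        def sum_dist(p):
            nn = sorted((q for q in P if q != p), key=lambda q: math.dist(p, q))[:m]
            return sum(math.dist(p, q) for q in nn), nn
        best = min(P, key=lambda p: sum_dist(p)[0])   # medoid of this cluster
        _, nn = sum_dist(best)
        clusters.append([best] + nn)
        medoids.append(best)
        for p in [best] + nn:                          # remove the new cluster from P
            P.remove(p)
    for p in P:                                        # assign any leftover points
        i = min(range(len(medoids)), key=lambda i: math.dist(p, medoids[i]))
        clusters[i].append(p)
    return medoids, clusters

# Two well-separated groups of five points each; k = 2, m = 4.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
       (7, 2), (6, 2), (7, 1), (7, 3), (8, 2)]
medoids, clusters = m_center(pts, k=2, m=4)
```

The algorithm recovers one medoid per group: (0.5, 0.5) for the first cluster and (7, 2), as in the example above, for the second.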

A. Optimization
In the previous example we set k = 2. If we set k = 3 or 4, we obtain three or four clusters, respectively, although the original data contains only two clusters. If k is larger than the number of clusters actually present in the data, m-center clusters the data inefficiently. Therefore, we optimize m-center as follows: in this case, some clusters should be merged. Which clusters should be merged? The nearest clusters. We first define nearest clusters: two clusters C_i and C_j are nearest clusters if the Euclidean distance between their medoids is less than the distance threshold d_t, i.e., dist(m_i, m_j) < d_t. For example, let k = 5, so that we have five clusters C_1, C_2, C_3, C_4, and C_5. Assume that, after checking for nearest clusters, we find two nearest-clusters sets: NC_1 = {C_1, C_2, C_3} and NC_2 = {C_4, C_5}. We merge the clusters in each nearest-clusters set into one big cluster, obtaining two big clusters CB_1 = C_1 ∪ C_2 ∪ C_3 and CB_2 = C_4 ∪ C_5. Lines 13-16 of the m-center algorithm outline this optimization, and the next example illustrates it.
Example 3.2 Consider a two-dimensional data point set P with 32 data points, as in Fig. 4, and let k = 5, m = 4, and d_t = 5. Applying the m-center clustering method yields five clusters: C_1 with medoid m_1 = (3, 4), C_2 with medoid m_2 = (7, 5), C_3 with medoid m_3 = (7, 2), C_4 with medoid m_4 = (17, 7), and C_5 with medoid m_5 = (16, 4). With the distance threshold d_t = 5, the optimization finds two nearest-clusters sets. The first is NC_1 = {C_1, C_2, C_3}, since every pair in NC_1 consists of nearest clusters; for example, C_1 and C_2 are nearest clusters because dist(m_1, m_2) = 4.12 < 5 = d_t. The second is NC_2 = {C_4, C_5}, since C_4 and C_5 are nearest clusters with dist(m_4, m_5) = 3.16 < 5 = d_t. The optimization then merges the clusters in each nearest-clusters set into one big cluster, as follows.
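The optimization amounts to transitively grouping clusters whose medoids lie within d_t of each other. The union-find formulation below is our own choice, not taken from the paper:

```python
import math

def merge_nearest(medoids, clusters, d_t):
    """Merge clusters whose medoids are within distance d_t of each other
    (transitively), using a small union-find over cluster indices."""
    parent = list(range(len(medoids)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(len(medoids)):
        for j in range(i + 1, len(medoids)):
            if math.dist(medoids[i], medoids[j]) < d_t:
                parent[find(i)] = find(j)  # i and j are nearest clusters
    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(i), []).extend(c)
    return list(merged.values())

# The five medoids of Example 3.2, each cluster reduced to its medoid here.
medoids_ex = [(3, 4), (7, 5), (7, 2), (17, 7), (16, 4)]
big = merge_nearest(medoids_ex, [[m] for m in medoids_ex], d_t=5.0)
```

On the medoids of Example 3.2 with d_t = 5, this produces the two big clusters CB_1 (three merged clusters) and CB_2 (two merged clusters).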

B. Summarization Step
In the summarization step, we delete half of the data points in the window. Which half? We consider two deletion methods. In the first deletion method, for each cluster C_I we keep only the half of the data points that are closest to the medoid of C_I and delete the other half. If the cluster-merging optimization has been applied, the medoid of a big cluster CB is the average of the medoids of the small clusters it contains.

In the second deletion method, we delete the old data points that do not affect the data density; that is, in each cluster we keep the half of the data points that preserves the cluster density and delete the other half. In other words, we delete the half of the data points that are old and have high LOF scores. The two deletion methods are compared in the experimental evaluation section. The next algorithm outlines the summarization step, sum_m_center.
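The two deletion methods can be sketched as below. The text does not spell out how "old" and "high LOF score" are combined in the second method, so ranking points by the product of age and LOF score is our assumption:

```python
import math

def keep_half_closest(cluster, medoid):
    """First deletion method: keep the half of the cluster closest to its medoid."""
    ranked = sorted(cluster, key=lambda p: math.dist(p, medoid))
    return ranked[:(len(cluster) + 1) // 2]

def keep_half_density(cluster, age, lof):
    """Second deletion method (sketch): delete the points that are old and
    have high LOF scores. Combining the two criteria as age * LOF is our
    assumption; points with the lowest combined score are kept."""
    ranked = sorted(cluster, key=lambda p: age[p] * lof[p])
    return ranked[:(len(cluster) + 1) // 2]

cluster = [(0, 0), (1, 1), (2, 2), (3, 3)]
kept_close = keep_half_closest(cluster, medoid=(0, 0))

# Hypothetical ages (larger = older) and LOF scores for the same points.
age = {(0, 0): 4, (1, 1): 3, (2, 2): 2, (3, 3): 1}
lof = {(0, 0): 2.0, (1, 1): 1.0, (2, 2): 1.0, (3, 3): 1.0}
kept_dense = keep_half_density(cluster, age, lof)
```

The first method keeps the two points nearest the medoid; the second keeps the two newest, lowest-LOF points.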
Algorithm: sum_m_center(P, k, m, d_t)
Input: P: the set of data points; k: the number of clusters; m: the size of the m-nearest-neighbors set of a data point; d_t: distance threshold.
Output: S: a summary of the k clusters.
1. C = m-center(P, k, m, d_t)
2. if the first deletion method is enabled then
3.    for each cluster x in C do
4.       Delete the half of the data points in x that are farthest from the medoid of x
5.    S = C
6. if the second deletion method is enabled then
7.    for each cluster x in C do
8.       Delete the half of the data points in x that are old and have high LOF scores
9.    S = C
10. return S

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 11, 2021

C. DILOF_C Pseudocode

Recall that the adapted algorithm is called DILOF_C; the next algorithm outlines it. For each data point p_i arriving from the stream, we do the following. If the skipping scheme is enabled and returns true, we continue to the next data point (lines 3-5); see Section II-B for details of the skipping scheme strategy. Otherwise, we add p_i to the set of data points in memory, M (line 6). We then compute the LOF score of p_i according to equations (1), (2), and (3), and add p_i to the set of outliers, O, if its LOF score is greater than the LOF threshold T (lines 7-10). At the same time, we update the LOF score of each data point p_j in the reverse-neighbor set of p_i, and if p_j changes from outlier to inlier, we remove it from O (lines 11-16). When the number of data points in memory, |M|, reaches the window size |W|, we call the function sum_m_center to summarize the oldest |W|/2 data points in M and replace them with the summary it returns (lines 17-22).

Algorithm: DILOF_C
1. while a new data point p_i arrives from the stream do
2.    Skipping Schema = FALSE
3.    Apply the skipping scheme to p_i (which may set Skipping Schema = TRUE)
4.    if Enable Skipping Schema Strategy and Skipping Schema = TRUE then
5.       continue to the next data point
6.    Add p_i to M
7.    Compute the LOF score of p_i according to equations (1), (2), and (3)
8.    if LOF_k(p_i) ≥ T then
9.       Add p_i to O
10.   end if
11.   for each data point p_j in the set reverse mNN(p_i) do
12.      Update the LOF score of p_j
13.      if p_j transferred from outlier to inlier then
14.         Remove p_j from O
15.      end if
16.   end for
17.   if |M| = |W| then
18.      Let M_old be the oldest |W|/2 data points in M
19.      S = sum_m_center(M_old, k, m, d_t)
20.      Remove M_old from M
21.      Add the points in S to M
22.   end if
23. end while
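The control loop above can be sketched at a high level as follows; `detect_lof` and `summarize` are placeholders standing in for the LOD step and sum_m_center, and the skipping scheme and reverse-neighbor updates are omitted for brevity:

```python
def dilof_c_stream(stream, W, T, detect_lof, summarize):
    """High-level control loop of DILOF_C (sketch only)."""
    memory, outliers = [], []
    for p in stream:
        memory.append(p)                      # line 6: add p_i to M
        if detect_lof(memory, p) >= T:        # lines 7-10: score and classify p_i
            outliers.append(p)
        if len(memory) == W:                  # lines 17-22: summarize the old half
            old, recent = memory[:W // 2], memory[W // 2:]
            memory = summarize(old) + recent
    return memory, outliers

# Stub scorer and summarizer, used only to exercise the control flow.
memory, outliers = dilof_c_stream(
    stream=range(8), W=4, T=1.0,
    detect_lof=lambda mem, p: 0.0,             # stub: nothing is flagged
    summarize=lambda old: old[:len(old) // 2]  # stub: keep the first half
)
```

With W = 4 and a summarizer that halves the old points, memory never exceeds the window size regardless of the stream length.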

D. Time Complexity
Note that the summarization step is one of the main operations in outlier detection algorithms over data streams. Therefore, in this section, we analyze the time complexity of the m-center summarization step of the proposed algorithm, DILOF_C. The time complexity of m-center is O(|W|km), where |W| is the window size, k is the number of clusters, and m is the size of the nearest-neighbors set of a data point. In contrast, the time complexity of the summarization step of DILOF is O((|W|/2)^2) [13], and that of MiLOF is O(IDk|W|/2) [19], where I is the maximum number of iterations, D is the dimensionality of the dataset, and k is the number of clusters.

IV. EXPERIMENTAL EVALUATION
In this section, we evaluate the performance of DILOF_C on four real datasets, comparing it with DILOF [13]. MiLOF [19] was excluded from this experiment, since the experimental results of DILOF already showed that DILOF outperforms MiLOF. All experiments were performed on a PC with an Intel i5-6700 2.4 GHz CPU and 8 GB of memory, running the Windows 10 64-bit operating system. DILOF_C was implemented in standard C++ with STL library support. In the next section, we discuss the datasets and experiment settings.

A. Dataset and Experiment Settings
DILOF_C's performance was evaluated by applying it to four real-world datasets; Table I lists their properties. For the two datasets KDD Cup 99 smtp and KDD Cup 99 http, the number of classes is set to 10, since the true number of classes of these datasets is unknown. The hyper-parameters of DILOF, η and λ, are set to 0.3 and 0.001, respectively, for all datasets. We set the default values of the DILOF parameter t (t-nearest neighbors) as follows: t = 19 for UCI Vowel, and t = 8 for the three datasets UCI Pendigit, KDD Cup 99 smtp, and KDD Cup 99 http.
For our algorithm DILOF_C, the two parameters m (number of nearest neighbors) and k (number of clusters) are set to 5 and 11, respectively, for the three datasets UCI Pendigit, KDD Cup 99 smtp, and KDD Cup 99 http; for the UCI Vowel dataset, k and m are set to 10 and 11, respectively. The parameter d_t (distance threshold for merging clusters) is set to 3.3 for the three datasets UCI Vowel, KDD Cup 99 smtp, and KDD Cup 99 http, and to 2.4 for UCI Pendigit.
Recall that DILOF_C and DILOF apply the summarization method when the number of data points equals the window size, |W|; therefore, the performance of outlier detection is measured for different values of |W|. Since the two datasets UCI Vowel and UCI Pendigit contain a small number of data points, we selected small window sizes for them: |W| = {100, 120, 140, 160, 180

B. Effects of Optimization and Deletion Methods
In this experiment, we show the effect of the optimization (merging clusters) and of the two deletion methods. For this purpose, we implemented four versions of DILOF_C: DILOF_C1, which does not use the merging optimization and uses the first deletion method; DILOF_C2, which does not use the merging optimization and uses the second deletion method; DILOF_C3, which uses the merging optimization and the first deletion method; and DILOF_C4, which uses the merging optimization and the second deletion method.

1) Outlier Detection Accuracy: Fig. 5 reports the outlier detection accuracy obtained by running the four versions on two datasets (UCI Vowel: Fig. 5(a) and KDD Cup 99 smtp: Fig. 5(b)). On the UCI Vowel dataset, DILOF_C4 has the highest accuracy of all versions except at window sizes 100 and 120, where DILOF_C2 shows the best accuracy. On the KDD Cup 99 smtp dataset, DILOF_C4 has accuracy at least as high as the other versions except at window size 300, where DILOF_C2 shows the best accuracy.

2) Total Response Time: Fig. 6 reports the total response time (sec) obtained by running the four versions on two datasets (UCI Vowel: Fig. 6(a) and KDD Cup 99 smtp: Fig. 6(b)). On the UCI Vowel dataset, DILOF_C4 always takes less time than the other versions. On the KDD Cup 99 smtp dataset, DILOF_C4 always takes less time than the other versions except at window sizes 100 and 120, where DILOF_C2 shows the best execution time.

C. DILOF_C against DILOF
From Section IV-B, DILOF_C4 has the best performance among the versions of the proposed algorithm with respect to both outlier detection accuracy and total response time. Therefore, in this experiment we use the DILOF_C4 version and refer to it simply as DILOF_C. We compare DILOF_C against DILOF on the four datasets (UCI Vowel, UCI Pendigit, KDD Cup 99 smtp, and KDD Cup 99 http) with respect to outlier detection accuracy and total response time, as reported in the next two sections.

1) Outlier Detection Accuracy: Fig. 7 shows the outlier detection accuracy of DILOF_C and DILOF with respect to the window size on the four datasets (UCI Vowel: Fig. 7(a), UCI Pendigit: Fig. 7(b), KDD Cup 99 smtp: Fig. 7(c), and KDD Cup 99 http: Fig. 7(d)). On the UCI Vowel dataset, DILOF_C shows higher accuracy than DILOF at all window sizes; for example, at window size 200 the accuracy of DILOF_C is 95% whereas that of DILOF is 91%. On the UCI Pendigit dataset, DILOF_C and DILOF have approximately the same accuracy. On the KDD Cup 99 smtp dataset, DILOF_C shows higher accuracy than DILOF in most cases (for example, at window size 700 the accuracy of DILOF_C is 86% whereas that of DILOF is 73%), except at window size 600, where the accuracy of DILOF_C is 86% and that of DILOF is 88%. On the KDD Cup 99 http dataset, DILOF_C shows higher accuracy than DILOF at all window sizes (for example, at window size 700 the accuracy of DILOF_C is 83% whereas that of DILOF is 75%).

2) Total Response Time: Fig. 8 shows the total response time (sec) of DILOF_C and DILOF with respect to the window size on the four datasets (UCI Vowel: Fig. 8(a), UCI Pendigit: Fig. 8(b), KDD Cup 99 smtp: Fig. 8(c), and KDD Cup 99 http: Fig. 8(d)). DILOF_C shows the best execution time on all datasets: on UCI Vowel, UCI Pendigit, KDD Cup 99 smtp, and KDD Cup 99 http, DILOF_C outperforms DILOF by factors of about four, three, more than two, and approximately two, respectively.

V. CONCLUSION
Outlier detection over data streams is an important task in data mining. In this paper, we proposed an efficient density-based algorithm called DILOF_C for outlier detection over data streams. DILOF_C is based on DILOF, the state-of-the-art density-based algorithm for outlier detection over data streams. Our modification to DILOF is as follows: we replace its inefficient summarization method with a new, efficient summarization method based on a new clustering technique called m-center. The time complexity of our summarization method is very small compared to that of the DILOF summarization method. We also optimize the m-center clustering technique by merging the nearest clusters. Through an extensive evaluation on real datasets, we showed that DILOF_C outperforms its state-of-the-art competitor DILOF by a factor of more than two in response time, while also achieving very high outlier detection accuracy. As future work, we plan to evaluate our method in a real sensor network.