Local Average of Nearest Neighbors: Univariate Time Series Imputation

Imputation is one of the most important tasks in the homogenization of time series, because its quality and precision directly influence the accuracy of time series predictions. This paper proposes two simple yet powerful algorithms for univariate time series imputation, both based on the mean of the nearest neighbors of the missing value. The first, Local Average of Nearest Neighbors (LANN), calculates the missing value as the average of the neighbor immediately before it and the neighbor immediately after it. The second, Local Average of Nearest Neighbors+ (LANN+), adds a threshold parameter that selects the calculation according to the difference between those two neighbors: when the difference is less than or equal to the threshold, the missing value is calculated with LANN; when it is greater, the missing value is calculated as the average of the four closest neighbors, two before and two after the missing value. Imputation results on different time series are promising.
Keywords: Univariate time series; imputation; LANN; LANN+


I. INTRODUCTION
Time series data are used in a large variety of real-world applications, and they often suffer from missing values due to data transmission errors, machine malfunction, or human error [1]. While imputation in general is a well-known problem covered by many tools, finding algorithms or techniques able to fill missing values in univariate time series is more complicated [2]. The reason is that most imputation algorithms rely on inter-attribute correlations, whereas univariate time series imputation must instead exploit time dependencies.
For univariate time series, the applicable techniques fall into three groups: univariate algorithms, univariate time series algorithms, and multivariate algorithms on lagged data [3].
Time series can contain NA gaps of different sizes. A quick classification is: short-gaps, from 1 to 2 consecutive NAs; medium-gaps, from 3 to 10 consecutive NAs; and big-gaps, more than 10 consecutive NAs. In this paper we focus only on short-gaps.
In meteorological time series we find all three types of gaps mentioned above. We could even add a fourth category, very big-gaps, since some time series contain NA gaps lasting between approximately 1 and 72 months. For example, a 72-month NA gap was found in the Punta de Coles time series between 1960/01/01 and 1965/12/31 (1978 consecutive NAs).
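As an illustration of the gap classification above, the following minimal Python sketch locates runs of consecutive NAs and labels them by size. The function names `find_gaps` and `classify_gap`, the use of `None` to represent an NA, and the sample values are assumptions made for this example only.

```python
from typing import List, Optional, Tuple


def find_gaps(series: List[Optional[float]]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for every run of consecutive NAs (None)."""
    gaps = []
    start = None
    for i, value in enumerate(series):
        if value is None:
            if start is None:
                start = i  # a new gap begins here
        elif start is not None:
            gaps.append((start, i - start))
            start = None
    if start is not None:  # the series ends inside a gap
        gaps.append((start, len(series) - start))
    return gaps


def classify_gap(length: int) -> str:
    """Map a gap length to the categories named in the text."""
    if length <= 2:
        return "short"
    if length <= 10:
        return "medium"
    return "big"


# Invented values, used only to exercise the functions.
series = [24.1, None, 24.5, None, None, 25.0, None, None, None, 24.8]
labels = [(start, length, classify_gap(length))
          for start, length in find_gaps(series)]
```

Here `labels` holds one entry per gap, so a series can be screened for gaps that fall outside the short-gap range before imputation.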
In this paper, we propose two algorithms for short-gaps of NAs. Both belong to the univariate time series algorithms category and are based on local averages of numerical time series. The first, Local Average of Nearest Neighbors (LANN), is based on the average of the two nearest neighbors of the missing value or NA: the neighbor before it and the neighbor after it. The second, Local Average of Nearest Neighbors+ (LANN+), is based on the difference (d) between the neighbor before and the neighbor after the missing value. This difference is compared with a threshold parameter that determines how the missing value is calculated. When the difference is less than or equal to the threshold, the missing value is calculated with the LANN algorithm; when the difference is greater than the threshold, the missing value is calculated from the four neighbors closest to the NA value, the two before it and the two after it.
The paper is structured as follows: Section II briefly reviews the state of the art related to the proposals of this work; Section III presents the theoretical background needed to understand the paper; Section IV describes the proposed algorithms; Section V describes and discusses the results on time series of different sizes and compares them with similar works; Section VI presents the conclusions of the study; and Section VII outlines future work that could improve the proposals.

II. RELATED WORK
This section reviews the state of the art in univariate time series imputation.
Commonly-used methods for univariate time series are relatively simple and include the arithmetic mean, interpolation, and last observation carried forward (LOCF) [4].
Last Observation Carried Forward (LOCF) [5] replaces each NA with the most recent non-NA value prior to it; that is, each missing value is replaced by the last observed value of that variable. In this work, the zoo R package was used to implement LOCF imputation.
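The LOCF rule can be sketched in a few lines of Python. This is an illustrative sketch only, not the zoo implementation used in this work; the function name `locf` and the use of `None` for NA are assumptions, and leading NAs are simply left unfilled because no earlier observation exists.

```python
def locf(series):
    """Replace each NA (None) with the most recent observed value before it."""
    filled = []
    last_observed = None  # nothing has been observed yet
    for value in series:
        if value is not None:
            last_observed = value
        # Leading NAs stay None because there is nothing to carry forward.
        filled.append(last_observed)
    return filled


# Invented values, used only to exercise the function.
example = locf([None, 24.0, None, None, 25.5, None])
```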
Hot-deck imputation [6] dates back to the days when data sets were saved on punch cards, with "hot-deck" referring to the "hot" stack of cards (as opposed to the "cold" deck of cards from the previous period). Most of the time, hot-deck imputation refers to sequential hot-deck imputation, meaning that the data set is sorted and missing values are imputed sequentially while running through the data set line by line (observation by observation). In this work, the VIM R package was used to implement hot-deck imputation.
46 | Page www.ijacsa.thesai.org
Missing value imputation by weighted moving average [7] replaces each NA with a mean taken from an equal number of observations on either side of a central value. For an NA value at position i of a time series, the observations i-2, i-1 and i+1, i+2 (assuming a window size of k=2) are used to calculate the mean. There are three algorithms in this category: Simple Moving Average (SMA), Linear Weighted Moving Average (LWMA) and Exponential Weighted Moving Average (EWMA).
In the Simple Moving Average (SMA) [2], all observations in the window are equally weighted when calculating the mean. For gap sizes equal to 1 and the parameter k equal to 1, SMA produces the same results as LANN; in other cases the results are different.
In the Linear Weighted Moving Average (LWMA) [2], weights decrease in arithmetical progression. The observations directly next to a central value i (i-1, i+1) have weight 1/2, the observations one further away (i-2, i+2) have weight 1/3, and the next (i-3, i+3) have weight 1/4.
The Exponential Weighted Moving Average (EWMA) [2] [8] imputes the missing values by calculating an exponentially weighted moving average. Initially, the size of the moving average window is set; the mean is then calculated from an equal number of observations on either side of a central missing value [8]. The observations directly next to a central value i (i-1, i+1) have weight $(1/2)^1$, the observations one further away (i-2, i+2) have weight $(1/2)^2$, and the next (i-3, i+3) have weight $(1/2)^3$.
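The three weighting schemes can be compared with a small Python sketch. It is a simplified illustration under stated assumptions, not the imputeTS implementation: the names `weight` and `wma_impute_at` are hypothetical, NAs are represented by `None`, and neighbors that are missing or fall outside the series are simply skipped.

```python
def weight(distance, scheme):
    """Weight of an observation `distance` positions away from the NA."""
    if scheme == "simple":       # SMA: equal weights
        return 1.0
    if scheme == "linear":       # LWMA: 1/2, 1/3, 1/4, ...
        return 1.0 / (distance + 1)
    if scheme == "exponential":  # EWMA: (1/2)^1, (1/2)^2, (1/2)^3, ...
        return 0.5 ** distance
    raise ValueError(f"unknown scheme: {scheme}")


def wma_impute_at(series, i, k=2, scheme="simple"):
    """Weighted mean of up to k observed neighbors on each side of position i."""
    numerator = 0.0
    denominator = 0.0
    for distance in range(1, k + 1):
        for j in (i - distance, i + distance):
            # Assumption: skip neighbors that are out of range or missing.
            if 0 <= j < len(series) and series[j] is not None:
                w = weight(distance, scheme)
                numerator += w * series[j]
                denominator += w
    return numerator / denominator if denominator > 0 else None


# Invented values, used only to exercise the functions.
series = [10.0, 12.0, None, 16.0, 22.0]
sma = wma_impute_at(series, 2, k=2, scheme="simple")
lwma = wma_impute_at(series, 2, k=2, scheme="linear")
ewma = wma_impute_at(series, 2, k=2, scheme="exponential")
sma_k1 = wma_impute_at(series, 2, k=1, scheme="simple")
```

With k=1 and a gap of size 1, the simple scheme reduces to the average of the two adjacent neighbors, which is the case in which SMA and LANN coincide.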
In this work, the imputeTS R package is used to implement SMA, LWMA and EWMA imputations.
Kalman smoothing [8] on the state space representation of an autoregressive integrated moving average (ARIMA) model is usually a good approach for the imputation of highly seasonal univariate data [9]. In this work, we use the imputeTS R package to implement ARIMA-Kalman imputation.
Datawig is a Python library that learns machine learning models using deep neural networks to impute missing values in a dataframe. This method works very well with categorical and non-numerical features; therefore, it was not considered in the comparisons made in this work.
To compare the accuracy of the proposed imputation techniques with multivariate imputation techniques, two well-known multiple imputation algorithms were also tested: MICE [10] (Multiple Imputation by Chained Equations) and KNN [11] [12] (K-Nearest Neighbor). The results can be seen in Section V.

A. Time Series
A time series is a set of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Some examples of time series are daily temperatures, weekly sales, customers per day, number of monthly visits, etc.
Studying the past behavior of a series will help to identify patterns and make better forecasts. When plotted, many time series exhibit one or more of the following features:

B. Missing Data
Depending on what causes missing data, the gaps will have a certain distribution. Understanding this distribution may be helpful in two ways [3]. First, it may be employed as background knowledge for selecting an appropriate imputation algorithm. Second, this knowledge may help to design a reasonable simulator that removes values from a complete test set; such a simulator generates data with gaps for which the true values are known. Hence, the quality of an imputation algorithm can be tested.
Missing data mechanisms can be divided into three categories: Missing Completely at Random (MCAR), Missing at Random (MAR) and Not Missing at Random (NMAR). In practice, assigning data gaps to a category can be blurry, because the underlying mechanisms are simply unknown [3]. While diagnosing MAR and NMAR requires manual analysis of the patterns in the data and the application of domain knowledge, MCAR can be tested for with a t-test [3].

C. Univariate Time Series
A univariate time series is a sequence of single observations $o_1, o_2, o_3, \ldots, o_n$ at successive points $t_1, t_2, t_3, \ldots, t_n$ in time. Although a univariate time series is usually considered as one column of observations, time is in fact an implicit variable [3].

D. Univariate Imputation Methods
Techniques capable of imputing univariate time series can be roughly divided into three categories [3]: univariate algorithms, univariate time series algorithms, and multivariate algorithms on lagged data. Regarding the last category, multivariate algorithms usually cannot be applied to univariate data. But since time is an implicit variable for time series, it is possible to add time information as covariates so that multivariate imputation algorithms can be applied. This process is all about making the time information available to multivariate algorithms. The usual way to do this is via lags and leads. Lags are variables that take the value of another variable in the previous time period, whereas leads take the value of another variable in the next time period.
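The construction of lags and leads can be illustrated with a short Python sketch. The function name `add_lag_lead`, the tuple layout and the sample values are assumptions made for this example; a multivariate imputation algorithm would then receive the resulting columns as covariates.

```python
def add_lag_lead(series, lag=1, lead=1):
    """Return rows of (lag value, current value, lead value) for each time step.

    Positions with no earlier or later period get None in that column.
    """
    n = len(series)
    rows = []
    for t in range(n):
        lag_value = series[t - lag] if t - lag >= 0 else None
        lead_value = series[t + lead] if t + lead < n else None
        rows.append((lag_value, series[t], lead_value))
    return rows


# Invented values; the None in the middle column is the value to impute.
rows = add_lag_lead([20.0, 21.0, None, 23.0])
```

In the third row the missing observation is accompanied by its lag (21.0) and its lead (23.0), which is the information a multivariate method can then exploit.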

A. Local Average of Nearest Neighbors (LANN)
LANN is an imputation algorithm for univariate time series that is based on the average of the two closest neighbors. It follows from the analysis of several meteorological time series, where it was observed that the previous neighbor $v_{i-1}$ and the next neighbor $v_{i+1}$ usually have values close to a given value $v_i$. In an imputation problem, $v_i$ is the NA value, that is, the value to be imputed. Table I shows the difference or distance between each time series value and the other values. The time series corresponds to 15 days of maximum daily temperatures at a weather station in the province of Ilo, Moquegua Region, from 2016-01-01 to 2016-01-15.
As mentioned earlier, this algorithm provides the same results as the SMA algorithm [2] when SMA is configured with the parameter k = 1 and the sizes of the gaps are equal to 1. When the size of the gaps is greater than 1 the results are different.
From Table I, the average of the elements on the diagonal immediately below (or above) the main diagonal gives the average difference between an element of the series and its first neighbor. Similarly, the following diagonal gives the average difference between an element of the series and its second neighbor, and so on. Table II shows the average differences for the 15-day time series. According to Table II, for the time series analyzed, the closest neighbors to a given value are, in order, the 1st, 3rd, 6th, 9th and 2nd.
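The diagonal averages described above can be computed directly from the series, without building the full difference matrix. The sketch below assumes that the "difference or distance" in Table I is the absolute difference between two values; the function name `mean_abs_difference_by_distance` and the sample temperatures are illustrative and are not the data of Table I.

```python
def mean_abs_difference_by_distance(series, max_distance):
    """Average |v_i - v_{i+k}| for each neighbor distance k = 1..max_distance.

    Each k corresponds to the k-th diagonal of the pairwise difference matrix.
    Assumes a complete series and max_distance < len(series).
    """
    n = len(series)
    averages = {}
    for k in range(1, max_distance + 1):
        diffs = [abs(series[i] - series[i + k]) for i in range(n - k)]
        averages[k] = sum(diffs) / len(diffs)
    return averages


# Invented temperatures, used only to exercise the function.
temps = [25.0, 25.5, 25.0, 26.0, 25.5]
averages = mean_abs_difference_by_distance(temps, max_distance=2)
```

Sorting the resulting dictionary by value gives a ranking of neighbor distances of the kind reported in Table II.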
Next, we generate random NA values in the previous time series and calculate them by applying the average of the nearest neighbors (previous and next) with the LANN algorithm according to equation (1).
$$v_i = \frac{v_{i-1} + v_{i+1}}{2} \qquad (1)$$
Table III shows the randomly generated NAs and the values calculated for them using equation (1), for missing-data percentages of 40%, 26.67% and 13.33%. The algorithm in Table IV was used to ensure that no missing data are generated at the beginning or at the end of the time series and that no gap contains more than two consecutive NAs.
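The constraints placed on the NA generator can be sketched as follows. This is not the algorithm of Table IV, whose details are not reproduced here; it is a minimal rejection-based sketch that only enforces the two stated constraints. The function name `insert_random_nas`, its parameters and the use of `None` for NA are assumptions.

```python
import random


def insert_random_nas(series, n_missing, seed=None, max_attempts=10_000):
    """Return a copy of `series` with up to `n_missing` values set to None.

    The first and last positions are never removed, and no run of NAs
    longer than two is created. Assumes len(series) >= 3.
    """
    rng = random.Random(seed)
    result = list(series)
    n = len(result)
    inserted = 0
    attempts = 0
    while inserted < n_missing and attempts < max_attempts:
        attempts += 1
        i = rng.randint(1, n - 2)  # interior positions only
        if result[i] is None:
            continue
        # Count the NAs already adjacent to position i on each side.
        left = 0
        j = i - 1
        while j >= 0 and result[j] is None:
            left += 1
            j -= 1
        right = 0
        j = i + 1
        while j < n and result[j] is None:
            right += 1
            j += 1
        # Accept only if the resulting run has at most two NAs.
        if left + right + 1 <= 2:
            result[i] = None
            inserted += 1
    return result


# 15 invented values with 4 NAs (26.67% missing), reproducible via the seed.
complete = [25.0 + 0.1 * t for t in range(15)]
with_nas = insert_random_nas(complete, n_missing=4, seed=42)
```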
The LANN algorithm implemented in JavaScript can be seen in Table V.
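Independently of the listing in Table V, the LANN rule can be sketched in Python as follows. The sketch assumes that NAs are represented by `None` and, for gaps of two consecutive NAs, that the nearest observed value on each side is used, so both NAs of such a gap receive the same value; that choice for two-NA gaps is an assumption of this example rather than something stated in the text. The function name `lann` is illustrative.

```python
def lann(series):
    """Impute each NA (None) with the mean of its nearest observed neighbors."""
    n = len(series)
    result = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        # Walk left to the nearest observed value (assumption for 2-NA gaps).
        prev_i = i - 1
        while prev_i >= 0 and series[prev_i] is None:
            prev_i -= 1
        # Walk right to the nearest observed value.
        next_i = i + 1
        while next_i < n and series[next_i] is None:
            next_i += 1
        # Leave the NA untouched if a neighbor is missing on either side.
        if prev_i >= 0 and next_i < n:
            result[i] = (series[prev_i] + series[next_i]) / 2
    return result


# Invented values, used only to exercise the function.
imputed = lann([24.0, None, 25.0, 26.0, None, None, 27.0])
```

Neighbors are always read from the original series, so an imputed value never feeds into the imputation of another NA.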

B. Local Average of Nearest Neighbors+ (LANN+)
LANN+ is based on the LANN technique, but it conditionally uses the average of the four closest neighbors instead of only two as in LANN. The algorithm uses a threshold parameter; the higher the threshold, the closer the imputation results are to those of the LANN algorithm. This parameter must be set according to the nature of the time series. For a temperature time series, the most appropriate value is probably 1.0; for an air passenger time series, the most suitable value is probably 110.
The threshold is motivated by the idea that not all missing values in a time series should be imputed with the same technique. Each missing value and its neighbors have their own characteristics, so the technique should adapt to them in such a way that the imputed value reflects these characteristics. Although LANN+ does not perform an exhaustive extraction of the characteristics of time series with missing data, it considers at least one: the difference (d) between the neighbor before and the neighbor after the missing value or NA.
Two neighbors are used for differences less than or equal to the threshold and four neighbors for differences greater than the threshold because the analysis of different temperature time series showed that, for small differences, the average of the two closest neighbors ($v_{i-1}$, $v_{i+1}$) produced good results in most cases, while for larger differences it was more appropriate to use the average of the four nearest neighbors ($v_{i-2}$, $v_{i-1}$, $v_{i+1}$, $v_{i+2}$). This can be seen by comparing the RMSE values of Table III with those of Table VI.
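The threshold rule of LANN+ can be sketched in Python as follows. Several details are assumptions of this example and are not specified in the text: d is taken as the absolute difference between the nearest observed neighbors, NA neighbors are skipped when collecting values, and near the edges of the series the average uses however many of the four neighbors are available. The names `nearest_observed` and `lann_plus` are illustrative.

```python
def nearest_observed(series, i, step, count):
    """Collect up to `count` observed values, walking from i in direction `step`."""
    values = []
    j = i + step
    while 0 <= j < len(series) and len(values) < count:
        if series[j] is not None:
            values.append(series[j])
        j += step
    return values


def lann_plus(series, threshold):
    """Impute NAs (None) with 2 or 4 neighbors depending on the difference d."""
    result = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        before = nearest_observed(series, i, -1, 2)  # closest neighbor first
        after = nearest_observed(series, i, +1, 2)
        if not before or not after:
            continue  # no neighbor on one side: leave the NA untouched
        d = abs(before[0] - after[0])  # assumption: absolute difference
        if d <= threshold:
            # Small difference: same rule as LANN.
            result[i] = (before[0] + after[0]) / 2
        else:
            # Large difference: mean of the (up to) four closest neighbors.
            neighbors = before + after
            result[i] = sum(neighbors) / len(neighbors)
    return result


# Invented values, used only to exercise the functions.
series = [24.0, 24.5, None, 25.0, 25.5, 26.0, None, 28.0, 29.0]
imputed = lann_plus(series, threshold=1.0)
```

In this example the first NA has d = 0.5 and is imputed from two neighbors, while the second has d = 2.0 and is imputed from four.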

A. Comparison with other Univariate Imputation Techniques
The LANN and LANN+ algorithms are compared with other imputation techniques on a 15-day maximum temperature time series. Table VIII shows the results.
According to Table VIII, for a percentage of NAs equal to 40%, the algorithm with the best precision was LWMA (0.4572), followed by EWMA (0.4692) and, in third place, the proposed algorithm LANN+ (0.4873). For a percentage of NAs equal to 26.67%, the best-performing algorithm was the proposed LANN+ (0.4308), followed by LANN, SMA and ARIMA-Kalman with the same RMSE (0.4330). For a percentage of NAs equal to 13.33%, LANN, SMA and ARIMA-Kalman tied for first place with the same RMSE (0.4950).
The same algorithms were also evaluated on a longer time series: instead of 15 days, 90 days of maximum daily temperatures, from 2016-01-01 to 2016-03-30. Table IX shows the results.
According to Table IX, for a percentage of NAs of 48.89%, the algorithm with the best precision is LANN (0.6059), followed by LANN+ (0.6196) and SMA (0.6211). For a percentage of NAs of 32.22%, the best precision was again obtained by LANN (0.5099), followed by LANN+ (0.5296) and SMA (0.5451). For a percentage of NAs of 23.33%, the best algorithm was EWMA (0.4765), followed by LWMA (0.4970), with LANN and SMA sharing third place with an RMSE of 0.5085.
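The RMSE values used in these comparisons can be computed along the following lines. The sketch assumes that the error is measured only at the positions where values were artificially removed, which is the usual practice but is not spelled out in the text; the function name `rmse_on_imputed` and the sample values are illustrative.

```python
import math


def rmse_on_imputed(truth, with_nas, imputed):
    """RMSE between true and imputed values at the positions that were NA."""
    squared_errors = [
        (imputed[i] - truth[i]) ** 2
        for i, value in enumerate(with_nas)
        if value is None and imputed[i] is not None
    ]
    if not squared_errors:
        raise ValueError("no imputed positions to evaluate")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented values, used only to exercise the function.
truth = [24.0, 24.8, 25.0, 25.6, 26.0]
with_nas = [24.0, None, 25.0, None, 26.0]
imputed = [24.0, 24.5, 25.0, 25.5, 26.0]
score = rmse_on_imputed(truth, with_nas, imputed)
```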
The proposed algorithms were also compared with two well-known multiple imputation algorithms, MICE and KNN; the results are shown in Table X. In this case, data for the same date range from the Ilo station, the meteorological station nearest to the Punta de Coles station, were used. The Ilo station is located in the El Algarrobal district of the province of Ilo.
Table XI shows that the proposed algorithms also performed well on time series with characteristics different from those of maximum temperatures. Table XII shows the results with the Beersales time series, where the LANN algorithm showed the best accuracy in the imputation of missing data.

VI. CONCLUSIONS
The proposed algorithms performed very well in the imputation of short-gaps of NAs in the different time series analyzed. They outperformed many well-known imputation algorithms, such as ARIMA-Kalman, Hot-deck, LOCF, MICE and KNN, at different percentages of missing data.
For meteorological time series such as maximum temperature series, LANN and LANN+ are highly recommended due to the good accuracy achieved.
For time series with high trend and seasonality, the use of the LANN+ algorithm is recommended; for time series with low trend and high seasonality, the use of LANN is recommended.

VII. FUTURE WORK
The algorithms proposed in the present work have been analysed and evaluated on short-gaps of NAs. In future work, it is important to adapt them to larger gaps, of three or more values, and to evaluate the corresponding accuracy.
The proposed algorithms could also be improved by combining them with forecasting models based on Deep Learning, in particular Recurrent Neural Networks [14] such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) networks, which may improve the accuracy of the estimates.