Outlier Detection using Nonparametric Depth-Based Techniques in Hydrology

Several issues arise when extending outlier detection methods from one dimension to higher dimensions. These issues include limited means of visualization, the inadequacy of marginal methods, the lack of a natural order, and the limitations of parametric modeling. To overcome these limitations, nonparametric outlier identifiers based on depth functions are introduced. These identifiers comprise four threshold-type outlyingness functions for outlier detection, based on Mahalanobis distance, Tukey depth, spatial Mahalanobis depth, and projection depth. The objective of the present research is the application of this nonparametric technique in hydrology. The study is carried out in two frameworks: multivariate hydrological data analysis and functional hydrological data analysis. A flood event is represented graphically by a hydrograph, whose components are used to compute the flood characteristics peak (p) and volume (v). These characteristics are frequently employed in multivariate analyses, whereas the entire hydrograph is employed in functional data analysis so that no important information about the flood event is lost. In the multivariate framework the technique is applied to the bivariate flood characteristics; in the functional framework it is applied to the first two principal component scores, since the first two principal components capture the major variation of the data.
Keywords: Outlyingness functions; nonparametric techniques; flood characteristics; principal component scores; multivariate analysis; functional analysis


I. INTRODUCTION
Detecting and identifying outliers in a data set is crucial for both nonparametric and parametric inference. Outliers are observations that are inconsistent with or far from the majority of the data points, or that lie within the bulk of the data but behave unusually. Their presence can adversely affect the outcomes of estimation, inference, and testing procedures. Outliers must therefore be identified and treated so that inferences are not distorted by unusual observations [1,2].
Identifying outliers marginally, by checking each coordinate separately, is inadequate: a multivariate outlier can appear nonoutlying in every coordinate. Approaches that are algorithmic and take the underlying geometry into account are required. A suitable outlyingness function may be formulated together with a specified threshold. One choice is the Mahalanobis distance, a highly tractable outlyingness function that is nevertheless constrained to elliptically symmetric outlyingness contours, regardless of whether the underlying model is elliptically symmetric.
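The inadequacy of marginal checks can be illustrated with a small sketch. The data below are synthetic and purely hypothetical (they are not the Kotri Barrage series), and the helper names are illustrative only. A point that looks unremarkable in each coordinate separately can still lie far from a strongly correlated cloud once the Mahalanobis distance is computed.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Sample covariance with the n - 1 denominator."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)

def marginal_z_scores(point, xs, ys):
    """Standardised distance of the point from the mean, one coordinate at a time."""
    zx = (point[0] - mean(xs)) / math.sqrt(covariance(xs, xs))
    zy = (point[1] - mean(ys)) / math.sqrt(covariance(ys, ys))
    return zx, zy

def mahalanobis_distance(point, xs, ys):
    """Mahalanobis distance of a bivariate point from the sample (xs, ys)."""
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = point[0] - mean(xs), point[1] - mean(ys)
    # Quadratic form d' S^{-1} d written out for the 2 x 2 case.
    d2 = (dx * dx * syy - 2.0 * dx * dy * sxy + dy * dy * sxx) / det
    return math.sqrt(d2)

# Synthetic, strongly correlated sample: y equals x plus a small alternating error.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x + e for x, e in zip(xs, [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1])]
candidate = (1.5, -1.5)  # moderate in each coordinate, but off the main diagonal

zx, zy = marginal_z_scores(candidate, xs, ys)
d = mahalanobis_distance(candidate, xs, ys)
print(f"marginal z-scores: {zx:.2f}, {zy:.2f}; Mahalanobis distance: {d:.1f}")
```

Both marginal z-scores stay below one in absolute value, while the Mahalanobis distance is large because the candidate departs from the direction along which the sample varies.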
Reference [3] introduced a nonparametric technique based on depth functions, which order multidimensional data from the center outward. Higher depth represents higher centrality, whereas lower depth represents greater outlyingness. Any depth function can be associated with an equivalent outlyingness function. For a suitable choice of depth function, the contours of equal outlyingness follow the actual geometrical structure and shape of the data. Four affine invariant outlyingness functions were derived: Mahalanobis distance outlyingness (MO), projection depth outlyingness (PO), halfspace or Tukey depth outlyingness (TO), and spatial Mahalanobis outlyingness (SO). For each function, the points whose outlyingness values exceed the threshold of that function are declared outliers.
The nonparametric approaches introduced by [3] have been applied in hydrology by [4] and [5]. Reference [4] carried out a multivariate hydrological data analysis using two frequently employed flood characteristics, peak (p) and volume (v), to identify unusual observations, i.e., outliers. Reference [5] extended the work of [4] to functional hydrological data analysis, applying the nonparametric outlier identification technique to the first two principal component scores in order to detect outliers in a functional context. In the multivariate analysis the flood characteristics are mutually dependent and correlated, whereas the principal component scores used in the functional analysis are uncorrelated.
The functional framework is motivated by the claim in [5] that the flood characteristics used in multivariate hydrological studies are computed by a subjective approach and do not account for the complete data series, so that the inferences of a multivariate study lack reliability. It is therefore important to conduct research in a functional framework, so that a reliable estimate of the flood risk is obtained by incorporating the complete phenomenon described by the data series. The objective of the present research is to implement nonparametric techniques based on depth functions in both the multivariate and the functional framework, using hydrological data from the Kotri Barrage on the Indus River in Pakistan.
www.ijacsa.thesai.org

II. LITERATURE REVIEW
The methods presented here are based mainly on the statistical notion of depth functions, which provide convenient ranking tools for ordering multivariate data. Depth functions were first applied in hydrology by [6]. Several univariate techniques have been extended by analogy to multivariate analysis. However, component-wise analysis performs poorly when the variables are mutually dependent, whereas moment-based techniques require the existence of moments.
For a detailed review of the techniques used in classical multivariate analysis, see [7,8]. Depth-based techniques avoid the drawbacks above, since depth functions order multivariate data by a center-outward ranking [9]. Indeed, depth-based techniques are not component-wise; they are also affine invariant and moment-free. Depth-based ranking enables numerous outlier detection techniques, and a number of depth functions have been derived for multivariate analysis. Location inference based on depth regions, considered by [3], is evaluated on the sample space. A general treatment of multivariate quantile and center-outward rank functions, and of the connections between them, can be found in [10,11]. For other inferential applications of depth, see [12,13]. Numerous studies in hydrology have used nonparametric approaches. Depth-based functions have recently been employed for outlier detection by [14,15]. According to [16], nonparametric models are suitable for capturing subtle aspects of flood frequency estimation. Flood inundation and flood damage were analysed with distributed hydrological models through nonparametric techniques [17]. Other recent hydrological studies on outlier detection and risk estimation using nonparametric approaches are [18,19]. Drought characteristics were assessed in a multivariate context with a nonparametric approach by [20][21][22]. Reference [23] discussed the cleaning of water consumption data and the estimation of uncertainty in hydrologic modeling. The notion of depth was applied in regression and the performance of a runoff model was evaluated in [24][25][26]. Reference [27] used parametric and nonparametric multivariate approaches to design a rainfall framework, whereas [28] applied rank-based nonparametric techniques to study rainfall trends.
Analysis of functional principal components (AFPC) reduces the dimension of the data and thereby offers a convenient way of analyzing hydrological data. Notable AFPC-based work includes the classification of streamflow profiles, the selection of a minimal set of indicators, and the application of functional data analysis to streamflow. Drought interval simulation and drought changes were analysed by [29,30]. References [31][32][33] studied rainfall variability modeling, pattern identification, and outlier detection. Other relevant studies, [34][35][36][37][38], provide further information on useful applications of AFPC in hydrology.
This paper is organized as follows. Section III presents the proposed methodologies. Section IV describes the hydrological data used in the present research. Section V applies the methodology to these data, the results are provided in Section VI, and Section VII contains the conclusions drawn from the research.

III. METHODOLOGY
This section presents the methods for computing the bivariate series of flood characteristics (p, v) and the bivariate series of the first two principal component scores. These two series are required for detecting outliers in the multivariate and the functional context, respectively, using the threshold-type nonparametric techniques that are also discussed later in this section.

A. Flood Characteristics
The flood peak (p) and volume (v) are the fundamental and most studied flood characteristics [39][40][41]; their computation here follows [41].
The bivariate series is generated from the hydrograph components using the following formulas.
The flow peak series is calculated as
$$p_j = \max_{k} \, q_{jk} \qquad (1)$$
where $q_{jk}$ is the flow recorded on the kth day of the jth year, so that $p_j$ is the highest flow recorded in the jth year.
The flow volume series is calculated as in (2), where $q_{jk}$ are the flows recorded on the kth day of the jth year, and the two remaining terms are the flows recorded on the starting day and the ending day, respectively, of the flood period in the jth year.
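As an illustration of how the two characteristics can be obtained from one flood hydrograph, the following minimal sketch computes a peak and a volume from a list of daily flows. The function name, the argument layout, and the sample values are hypothetical. The volume is computed with the trapezoidal rule over a one-day step, which is only an assumed convention; it uses the daily flows together with the flows on the starting and ending days, but it is not claimed to reproduce (2) exactly (for instance, no base-flow separation is applied).

```python
def flood_characteristics(daily_flow, start_day, end_day):
    """Return (peak, volume) for one flood period.

    daily_flow : list of daily flows for one year (index = day of the year)
    start_day, end_day : inclusive indices delimiting the flood period
    """
    period = daily_flow[start_day:end_day + 1]
    if len(period) < 2:
        raise ValueError("the flood period must span at least two days")
    # Peak: highest daily flow recorded within the flood period.
    peak = max(period)
    # Volume (assumed convention): trapezoidal rule with a one-day step,
    # i.e. the sum of daily flows minus half of the first and last flows.
    volume = sum(period) - 0.5 * (period[0] + period[-1])
    return peak, volume

# Hypothetical five-day hydrograph (flow units are arbitrary here).
example_flows = [10.0, 20.0, 50.0, 30.0, 10.0]
p, v = flood_characteristics(example_flows, 0, 4)
print(p, v)
```

With flows expressed in m³/s and a daily step, the volume returned by this sketch is in (m³/s)·day; multiplying by 86,400 converts it to m³.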

B. Analysis of Functional Principal Component
Analysis of principal components (APC) is used in multivariate studies to reduce dimensionality by computing new variables that are linear combinations of the original values, chosen so that they capture as much of the data variation as possible. Once the data have been converted into functions, analysis of functional principal components (AFPC) allows new functions to be computed that reveal particular kinds of variation in the curve data [5]. The AFPC method maximizes the sample variance of the scores subject to orthonormality constraints. It decomposes the centred functional observations in an orthogonal basis and is defined as follows.
Let $x_i(t)$, $i = 1, \dots, n$, be the functional observations obtained after smoothing the discrete observations. The mean curve represents the variation common to most of the curves, which can be removed by centering. Let $x_i^{*}(t) = x_i(t) - \bar{x}(t)$ be the centred functional observations, where $\bar{x}(t)$ is the mean function of the $x_i(t)$. AFPC is then applied to $x_i^{*}(t)$ to create a small set of functions, known as harmonics, which reveal the types of variation that are important for the analysis. The first principal component, denoted $\xi_1(t)$, is the function for which the variance of the corresponding real-valued scores
$$z_{i1} = \int \xi_1(t)\, x_i^{*}(t)\, dt$$
is maximized under the constraint $\int \xi_1^{2}(t)\, dt = 1$. The next principal component $\xi_k(t)$ is computed by maximizing the variance of the corresponding scores
$$z_{ik} = \int \xi_k(t)\, x_i^{*}(t)\, dt$$
under the constraints $\int \xi_k^{2}(t)\, dt = 1$ and $\int \xi_k(t)\, \xi_m(t)\, dt = 0$ for $m < k$.
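The variance maximization described above can be sketched numerically. The following is a minimal illustration, not the procedure of [5]: it assumes the curves have already been smoothed and evaluated on a common, equally spaced grid with unit step, so that the integrals reduce to dot products, and it replaces a full eigendecomposition with power iteration followed by deflation. All function names and the sample curves are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center_curves(curves):
    """Subtract the pointwise mean curve from every curve."""
    n, m = len(curves), len(curves[0])
    mean_curve = [sum(c[t] for c in curves) / n for t in range(m)]
    return [[c[t] - mean_curve[t] for t in range(m)] for c in curves]

def leading_harmonic(centered, iters=300):
    """Power iteration for the unit-norm direction with maximal score variance."""
    m = len(centered[0])
    v = [float(t + 1) for t in range(m)]  # arbitrary non-zero starting vector
    norm = math.sqrt(dot(v, v))
    v = [a / norm for a in v]
    for _ in range(iters):
        scores = [dot(x, v) for x in centered]
        # Apply the sample covariance operator: w = (1/n) * sum_i score_i * x_i
        w = [sum(s * x[t] for s, x in zip(scores, centered)) / len(centered)
             for t in range(m)]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variation left in the data
            break
        v = [a / norm for a in w]
    return v

def fpca_scores(curves, n_components=2):
    """Return (harmonics, scores) for the first n_components components."""
    residual = center_curves(curves)
    harmonics, scores = [], []
    for _ in range(n_components):
        v = leading_harmonic(residual)
        s = [dot(x, v) for x in residual]
        harmonics.append(v)
        scores.append(s)
        # Deflation: remove the component just found, so that the next
        # harmonic is orthogonal to the previous ones.
        residual = [[x[t] - si * v[t] for t in range(len(v))]
                    for x, si in zip(residual, s)]
    return harmonics, scores

# Hypothetical discretized curves (four curves observed at four grid points).
curves = [[1.0, 2.0, 4.0, 3.0], [2.0, 3.0, 6.0, 4.0],
          [0.5, 1.0, 2.5, 2.0], [1.5, 2.5, 5.0, 3.5]]
harmonics, scores = fpca_scores(curves)
pc_scores = list(zip(scores[0], scores[1]))  # bivariate series of the first two scores
```

The pairs in `pc_scores` play the role of the bivariate series to which the outlyingness functions are applied in the functional framework. The sign of each harmonic, and hence of each score series, is arbitrary.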

C. Detection of Outliers
The outlier detection approach employed by [4] in the multivariate context was adapted by [5] to the functional context by applying the outlyingness functions to the scores of the first two principal components. The purpose of this adaptation is to allow a comparison between the multivariate and the functional results.
Outlyingness functions are described here in the multivariate context and used to detect outliers. These functions take values in the interval [0,1]. The outlyingness of a point is measured relative to the whole sample: a value close to 1 indicates high outlyingness, and a value close to 0 indicates centrality. An observation is declared an outlier when its outlyingness value exceeds the threshold defined for the corresponding function. The depth-based outlyingness functions introduced by [3] are presented in the following subsection.

1) Outlyingness functions:
A depth function is transformed into a depth-based outlyingness function for a given distribution F and a given point. Reference [3] studied the following outlyingness functions, whose expressions are given in [4].
a) Halfspace
b) Mahalanobis distance, defined through a location measure and a non-singular scale matrix measure.
c) Projection
d) Spatial, where $\|\cdot\|$ is the Euclidean norm, X has distribution F, the multidimensional sign function is $S(x) = x/\|x\|$ for $x \neq 0$, and C is any affine invariant, symmetric, positive definite matrix.
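To make the ingredients of the spatial outlyingness concrete, the sketch below evaluates a sample version for bivariate data. It assumes the usual sample form, namely the Euclidean norm of the average sign vector of the standardized differences between the point and each observation, and it takes C to be the sample covariance matrix; both choices are assumptions for illustration and the function names are hypothetical. Standardization is done with the inverse Cholesky factor of C rather than the symmetric inverse square root; the two differ only by a rotation, which leaves the norm of the averaged sign vectors unchanged.

```python
import math

def sample_covariance(points):
    """Entries (sxx, sxy, syy) of the 2 x 2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def spatial_outlyingness(x, points):
    """Norm of the mean sign vector of the standardized differences x - X_i."""
    sxx, sxy, syy = sample_covariance(points)
    if sxx <= 0.0 or sxx * syy - sxy * sxy <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    # Lower-triangular Cholesky factor L of C, so that C = L L'.
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    sum_u = sum_v = 0.0
    for p in points:
        dx, dy = x[0] - p[0], x[1] - p[1]
        # Standardize the difference by solving L w = (dx, dy).
        wx = dx / l11
        wy = (dy - l21 * wx) / l22
        norm = math.hypot(wx, wy)
        if norm > 0.0:  # the sign function is taken as zero at the origin
            sum_u += wx / norm
            sum_v += wy / norm
    n = len(points)
    return math.hypot(sum_u / n, sum_v / n)

# Hypothetical symmetric sample: the centre is deep, a distant point is outlying.
sample = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
central = spatial_outlyingness((0.0, 0.0), sample)
distant = spatial_outlyingness((100.0, 0.0), sample)
```

Because every sign vector has unit length, the value lies between 0 and 1: it is close to 0 when the observations surround the point in all directions and approaches 1 when they all lie on one side of it.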
2) Threshold: An essential step in outlier detection is the appropriate selection of the threshold, which is related to the true positive and false positive rates.
Let $\alpha_n$ denote the false positive rate, defined as the proportion of nonoutliers misidentified as outliers. This constant is closely related to the true positive rate $\varepsilon_n$, which represents the theoretical proportion of real outliers (also known as contaminants). Ideally, $\alpha_n$ should be smaller than $\varepsilon_n$. Reference [3] fixed the ratio of false outliers $\delta = \alpha_n/\varepsilon_n$ and used another coefficient $\beta = \sqrt{n}\,\varepsilon_n$ in order to define the threshold for the outlyingness values as the $(1 - \delta\beta/\sqrt{n})$ quantile,
where the false positive rate is $\alpha_n = \delta\beta/\sqrt{n}$ and the true positive rate is $\varepsilon_n = n_\varepsilon/n$; $n_\varepsilon$ is the number of true outliers and $n$ is the number of observations, so that $\varepsilon_n = \beta/\sqrt{n}$. For further calculations and applications, readers are referred to [4].
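The threshold rule can be sketched as follows. The sketch assumes that the quantile level equals one minus the false outlier ratio multiplied by the number of true outliers and divided by the number of observations. How the quantile itself is evaluated (empirically, by interpolation, or from a reference distribution) is not specified here, so the linearly interpolated sample quantile below is only one assumed convention. The function names, the sample size of 38, and the outlyingness values are hypothetical.

```python
import math

def quantile_level(n_obs, false_ratio=0.15, n_true=5):
    """Quantile level 1 - false_ratio * n_true / n_obs used for the threshold."""
    if n_obs <= 0 or not 0.0 < false_ratio * n_true < n_obs:
        raise ValueError("inputs must give a level strictly between 0 and 1")
    return 1.0 - false_ratio * n_true / n_obs

def empirical_quantile(values, level):
    """Linearly interpolated sample quantile (an assumed convention)."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * level
    lo = math.floor(h)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (h - lo) * (ordered[hi] - ordered[lo])

def flag_outliers(labels, outlyingness, threshold):
    """Labels of the observations whose outlyingness exceeds the threshold."""
    return [lab for lab, o in zip(labels, outlyingness) if o > threshold]

# Hypothetical sample size; 15% false outlier ratio and 5 true outliers.
level = quantile_level(38, false_ratio=0.15, n_true=5)
print(round(level, 4))

# Hypothetical outlyingness values for three labelled observations.
flagged = flag_outliers(["A", "B", "C"], [0.20, 0.95, 0.50], threshold=0.90)
```

With these inputs the level is close to 0.98, which is why a 15% false outlier ratio and five true outliers lead to a quantile near 98% for a sample of this size.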

IV. DATA DESCRIPTION
The major source of hydrological data is daily streamflow. The daily flow data series of the Kotri Barrage is available from the Sindh Irrigation Department, Sindh Secretariat, Karachi, Pakistan.
The data consist of daily flow observations at the Kotri Barrage, which is located on the Indus River between Jamshoro and Hyderabad in Sindh province, Pakistan. The barrage has a discharge capacity of 875,000 cusecs (approximately 24,800 m³/s).

V. APPLICATION
The two most studied flood characteristics, peak (p) and volume (v), are the focus here. The bivariate (p, v) series is computed using (1) and (2), and the results are displayed in Table I. Following [4], the approach developed by [3] is applied to the series in Table I. The thresholds corresponding to each outlyingness function are computed by selecting a 15% false outlier ratio and 5 true outliers; this selection is similar to the choices made by [4]. An observation is declared an outlier when its outlyingness value exceeds the respective threshold. Hence, the 98% quantile is the corresponding threshold. The results are reported in Table II, whereas Fig. 3 displays the detected outliers corresponding to MO, PO, SO and TO with respect to their respective threshold values.

A. Multivariate Result
The detected outliers are displayed in Fig. 2 so that the above interpretation is explicitly comprehensible. The years 1978, 1990, 1994, 2000, 2004, 2010, 2011, 2012 and 2014 are computed as outliers by the outlyingness functions; among them, the years 1978 and 1992 lie farther out than the rest of the years, whereas the years 1994 and 2010 appear as outliers.
It can be inferred from the values in Table II that the year 1994 is detected as an outlier by all four outlyingness functions, whereas the years 1991, 1992 and 2010 are identified as outliers by three of the outlyingness functions. This interpretation is clearer in the scatter plot of the scores of the first two principal components (PC score 1 and PC score 2) shown in Fig. 3, which reveals that the years 1981, 1985, 1991, 1992, 1994, 2000, 2001, 2004 and 2010 are computed as outliers by the outlyingness functions; among them, the years 1991 and 1992 lie farther out than the rest of the years, whereas the years 1994 and 2010 appear as outliers.
The functional results are largely consistent with the multivariate results: the years 1992, 1994 and 2010 are detected as the most unusual flows in both the multivariate and the functional context.

VII. CONCLUSION
Nonparametric outlier identifiers based on depth functions have been applied in two frameworks: multivariate hydrological data analysis and functional hydrological data analysis. Identifying outliers is essential for selecting suitable hydrologic models, so that the risk associated with flood events can be estimated reliably. The methods employed in the present research are multivariate methods that improve on the classical methods used previously, which are moment-based, component-wise, and rely on a normality assumption. The implemented techniques are based on the notion of depth functions; they are moment-free, do not require a normality assumption, and are affine invariant.
The proposed approaches have been implemented in both frameworks in order to gauge their performance in the multivariate and the functional context. The two most widely used flood characteristics in hydrological analysis, peak (p) and volume (v), were used for the multivariate hydrological data analysis. The first two principal component scores were used as the bivariate series for the functional hydrological data analysis, since the first two principal components capture the major variation of the data.
The outliers detected in the two frameworks are largely consistent, but the results of the functional analysis can be considered more reliable, since they are based on the complete information in the flood hydrograph, whereas flood characteristics cannot reproduce the hydrograph even when more than two characteristics are included in the study. Nevertheless, the multivariate results cannot be ignored and should be used as a complement to the functional results, so that the dynamics of a hydrological event can be analysed and comprehensive information about the causes of a flood can be obtained.