Peeling multivariate data sets: a new approach, GC Porzio, G Ragozini

Tags: data set, convex hull, observations, computational effort, data structure, data region, peeling, data point, data analysis, peeling procedures, convex hulls, multivariate data sets, multivariate data, multivariate data analysis, data points, Titterington, SIAM Journal on Scientific and Statistical Computing, Exploratory Data Analysis, leverage points, M., Corbellini, outlier detection, Journal of the American Statistical Association, extreme observations, Computational Statistics and Data Analysis, univariate case
Content: Quaderni di Statistica Vol. 2, 2000 Peeling multivariate data sets: a new approach Giovanni C. Porzio Dipartimento di Economia e Territorio, Universita` degli Studi di Cassino E-mail: [email protected] Giancarlo Ragozini Dipartimento di Scienze Statistiche, Universita` degli Studi di Napoli Federico II E-mail: [email protected]
Summary: Peeling a data set consists of identifying its successive layers, from the outer-
most to the innermost. It is used for many purposes in multivariate data analysis, includ-
ing data ordering, trimming, outlier detection, robust estimation of location, correlation
and probability contours. However, existing peeling procedures, mainly based on the
convex hull idea, require a remarkable computational effort (many convex hulls have to
be computed), and can fail with respect to their own goals in presence of irregularities at
the boundary of the data region. To overcome these drawbacks, noting that in practice
peeling procedures are essentially used to identify the bulk of the data, we propose a
new peeling approach that splitЎўs Ў¤thЈ eҐ method distinguishes between
oabnsder¦ v§©aЁ tЈioҐ ndsaitna,jiudsetnttwifoyilnagyearss.¦
§©IЁnЈ
Ґparticular, the ЎўЎ¤thЈ oҐ se obser-
vations that lie closer to the boundary region than to the remainder (the
data). It
exploits the first outer convex hull and a quasi-clustering procedure: observations close
to the boundary region are clusterized through appropriate radial projections around the
convex hull vertices.
Key words: Convex hull, Partial ordering, §©Ё Ј Ґ and ЎўЎ¤Ј Ґ data.
85
1. Introduction The aim of a peeling procedure consists of identifying nested layers of a data set, assigning to each observation an index which considers the proximity of that point with respect to the outside of the data swarm. To determine such an index, an assigned shape with minimum volume that contains the data is computed, and observations lying on its border take index value one. Then, the procedure is iterated on the remaining observations yielding a sequence of shapes. Data points lying on the border of the shape ( "!ў# $ $ $ # ) will take index value (Green, 1981). Although given geometric shapes can be considered, such as rectangles (Nath, 1971), ellipses (Silverman and Titterington, 1980), or circles (Daniels, 1952), the main peeling approach is based on the convex hull of the data (i.e. the smallest convex set containing them). Indeed, usually it is not possible to make assumptions on the data set shape, and hence the use of a more flexible and conservative figure such as the convex hull is more appropriate (Brooks et al., 1988). Convex hull peeling is used in different settings in statistics. Considering the analogy between the convex hull vertices and the extremes of an univariate set, it has been applied to order multivariate data (Barnett, 1976; Eddy, 1981). Eddy and Hartigan (1977) used it to estimate probability density contours, while Bebbington (1978) exploited the peeling to trim bivariate data sets in order to obtain a robust estimate of the correlation coefficient. Recently, a bivariate boxplot based on this kind of method has been proposed by Zani et al. (1998), while Liu et al. (1999) considered convex hull peeling within the framework of multivariate analysis by data depth. However, this peeling approach has some drawbacks, related to the computational effort (Petitjean and Saporta, 1992) and the effectiveness of the procedure (Donoho and Gasko, 1992). As in the practice of data analysis a complete convex hull peeling may be not required, in this paper we propose a suitable peeling approach that identify only two layers, distinguishing between %&%&' ( and ) 01 ' ( 86
data. The method allows at the same time to correctely identify the bulk of the data, even under irregular data structure, and to avoid intensive computations. The paper is organized as follows. In section 2 we discuss convex hull peeling along with its main drawbacks, while section 3 proposes a distinction between 2©314 5 6 and 7 8&8&5 6 data, giving foundations for our approach. Sections 4 and 5 illustrate our proposal along with some examples, while final remarks follow in section 6. 2. Convex hull peeling Convex hull peeling consists of computing iteratively the nested hulls of the set, from the outermost to the innermost. At each step, the convex hull vertices are deleted, the convex hull of the remainder is constructed and new vertices are identified. Each convex hull in the nested sequence defines a deeper layer, from the outer to the inner. As an illustration, in Figure 1 we present a complete convex hull peeling for a 50 observation data set from a bivariate Normal Distribution. Some drawbacks affect convex hull peeling. First, the computational effort for a complete peeling of the data is heavy, especially when dimensions increase (Petitjean and Saporta, 1992). To overcome this drawback, it has been proposed to consider only few nested layers, or to peel projected data. However, both approaches could be misleading. The first one is uneffective in the case of multiple outliers, while projections could hide the real structure of the data. Besides, we note that the existing peeling procedures could fail with respect to their own goals. With irregular data structures, such as clustered data along the boundary or marked asimmetry, they assign the same index layer both to observations lying on the periphery of the data region and to others belonging to its bulk (see also Donoho and Gasko, 1992). To show such an effect, we consider the ad hoc simulated data set displayed in Figure 2. 87
2
1
0
x2
o
o o
o
o oo
o
o
o
o
o
o
o
o
o
o
o o
o
o
o
o
o
o
oo o
o o
o o o
o oo
o
o
o
o
oo
o
o
oo o o
o
o
-2
-1
9 0
1
2
x1
-1
-2
Figure 1: Simulated data set. Complete convex hull peeling.
The data set consists of 200 observations generated from a mixture of two normal distributions: @ [email protected] PGQ RS T [email protected] P Q R` T UCF T
with
R
SGacbed dgf
ThR
`iacbed1B p
q qgf
TrU
acb A d¤p s
d¤A p s f p
D
The mixture parameter was set to 0.10, and the second distribution was
shifted in mean in order to obtain a small well-separated subsample.
According to the classical peeling procedure, the first seven nested
convex hulls were superimposed to the data set. We note that the same
index layer (e.g. the 6-th) is assigned to observations belonging both to
the periphery and to the core of the main structure of the data. As a
consequence, if the identification of the main structure of the data is the
issue, the procedure will fail. If in addition, according to the practice,
88
0
o
o
oo
o
o
o
o
ooooooooooooooooooo oooooooooooooooooooooooo o o ooooooooooooooooo oooooooooo oooooo o ooooooooo oo o ooooo oooo ooo ooooooooooooo oo o ooooo oooo oooooo oooooooooooo ooooooo o ooooo
o o
o
oo
o
o
o o
-2
x2
-4
o
oo
o ooo
oo
o
o
o
o
oo o
oo o
-3
-2
-1
90
1
2
3
x1
-6
Figure 2: The classical convex hull peeling when a cluster of data lies on the periphery of the scatter. The first seven layers.
observations lying on the most outer convex hull layers are removed, it will be caused a loss of information. To overcome the above drawbacks, we propose a new peeling approach, based on a distinction between t©u1v w x and y &&w x data. 3. u1v w x and y &&w x observations: a new peeling approach In order to analize a data set structure, a fast and broad ordering could be as useful as a complete one. In the univariate case, Tukey (1977, Chap.2) proposed a sort of partial order, distinguishing among far out, outside, adjacent observations and inner values. In the multivariate case, we propose to distinguish among t u1v w x and y &&w x observations, and we provide the corresponding tool to allocate the observations according to 89
such a distinction. We intend as ©1 observations those far from the rest of the data scatter, closer to its boundary then to the bulk (in some sense they correspond to the far out and the outside observations of Tukey's univariate case). As a counterpart, we call && observations those that are not 1 . Accordingly, to split the data in two layers (the 1 and && ones), we propose a new peeling approach that combines the main idea of the classical convex hull peeling procedure with an ad hoc quasi-clustering method. We identify as 1 observations the convex hull vertices along with their closest data points. The && layer is then defined as the complement to the ©1 one with respect to the whole data set. The convex hull vertices are included in the first layer as they are the most extreme observations by definition (Barnett, 1976). Then we clusterize around them all the observations lying in a neighbourhood of the boundary region. In particular, points closer to the vertices than to the rest of the data will be included in the 1 layer. To identify which are the closest observations, we consider for each vertex the distances to the remaining points along a radial projection. Then, along each of these directions, we look at the univariate ordering of the points from the closest to the furthest. The presence of 1 & , if any, in these univariate orderings will point out some empty space in the data structure, splitting the ©1 observations from the && ones. This kind of peeling allows us to strip out all the 1 observations in a single run. In this way we obtain a relevant computational gain, as we compute a single convex hull instead of a nested series. Consequently, it is more feasible to perform this peeling procedure in the complete variable space without relying on projections. Finally, we note that our method is able to point out particular structures, such as clusters of data, on the boundary of the data region. On the other hand, in their absence, the ©1 layer will consist essentially of the convex hull vertices. 90
4. Clustering around vertices: the algorithm To clarify how our peeling method works, we present in detail the algorithm, providing a graphical illustration of some of its steps. Given a data set of points in dimensions d e f g h h h g e&i¤jkg the peeling algorithm works as follows. Step 1: Construct the convex hull of , l&mW nj , and let odqp e&r s tug with v wexzyўw g be the set of { vertices of l&mW nj ; Step 2: Compute the {g|} distance matrix ~" og njcp k v w g ¤j t , v wYxy¤w , ўg h h h g , with & v w g ¤jz e&r sGe&© the distance between the v w -th vertex and the -th data point. Let k v w g j be the row vector of ~" o}g nj , which collect the distances between e&r s and ekexg , ўg h h h g (note that it is not required to compute the complete distance matrix ); Step 3: For each vertex v w , sort in ascending order the k v w g j vector; call 1 v w g j the -th element in the sorted sequence. Then compute the first difference of the sorted distance vectors 1 f v w g ¤jE¤ v w g © j , for q©g h h h g . For each vector compute the maximum of such first differences, and let r s be the corresponding : ©r sGў G ¤ f v w g ¤jE¤ v w g j h For each vertex er s , such a maximum identifies a possible ¤ in the data structure, and hence a cluster, if any, of 1Ў ў Ј observations around the vertex can be selected. Let us denote with ¤ r s such a cluster for the v w vertex, v wexzyўw . As an illustration, Figure 3 shows some radial projections with respect to the vertex er s , the corresponding univariate ordering (on the right side of the figure), the gap, and the cluster ¤ir s around er s for an artificial bҐ i v¦ a.riate data set with different kind of ©1Ў ў Ј observations created ў 91
Figure 3: Artificial data set. §Ё © Є« of the proposed procedure. The radial projections with respect to the vertex ¬ ® (left), along with the univariate ordering, the gap, and the cluster Їi ® (right). Step 4: Once ° ® has been identified, determine the clusters Їi ® of ± І Ё © і observations according to the following rules. ґ µ if ° ®i¶· , set Ї ®i¶ё ¬ ® № : in this case the maximum gap occurs next to the vertex ¬ ® , and hence the latter is an ±©І Ё © і observation lying far from the others, (see for instance obs. 1 in Figure 4) or on the border near the core of the data (obs. 31 and 112); ґ ґ µ if ·gє° ®єј» Ѕ , set Ї ®¶cѕ©¬kїeА¤Б¤В Г Д Е ґ Ж З И¤µiЙ Б1В Г ® Д Е ґ Ж З И©К µ Л : in this case the maximun gap occurs before the median position, the vertex ¬& ® is the outermost of an ±©І Ё © і group (e. g. groups around obs. 4 and 42), and all the observations before the gap, ¬& ® included, belong to the ±©І Ё © і layer; ґ ґ ґ µ if °© ®gМНј» Ѕ , set Їi ® ¶Оё ¬& ® № : the maximum gap occurs after the median position in the sequence of the sorted distances, usually 92
Figure 4: Artificial data set as in Figure 3. ПР С Т}У of the proposed peeling procedure: clusters of Ф Х1Р С Ц data with respect to the convex hull vertices. near the end of the sequence (e.g. obs. 47 has its maximum gap with respect to obs. 112), and Ч&Ш Щ is a single Ф©Х1Р С Ц observation as in Ъ Ы . This last step yields hence a set of clusters, each of them containing some Ф Х1Р С Ц observations. The Ф Х1Р С Ц layer Ь is obtained as the union of such clusters: ЬЭЯЮ Ш Щ а б Щ в г Ш Щ дўе The Ъ ж&ж&С Ц layer з will be its complement: зEЭйЬ и . 5. Some illustrative examples In this section, we illustrate our method when the data have a regular structure, and when such a structure presents either clustered observations along the boundary or outliers. For visual enhancement, for each data 93
sк л&etл&wм н elcaoymerp, uantedawlseohtihgehcliognhvtesxuchhulalnoк fл&tл&heм н
observations belonging to the region through a shaded area
in the plots.
To illustrate the method when data do not present irregularities, we
generated 500 observations from a normal bivariate distribution. The cor-
creosrpreosnpdoinndgisncgatttoertphleoкtл&isл&dм iн splalayyeer.d in Figure 5, along with the shaded area
x2
o
o
o
o
o
o
o
o
п
-2
0
2
o
o o o o
o
ooooooooooooooooo oo oo oo oooooooooooo oo ooooooooooooooooooooo ooooooo ooooooo ooooooooooooo oooooo ooo ooooooooooo o oo ooooo oo oooo ooo o ooo o ooo oo o ooooo oo ooooooo oo ooo oooo ooo oo o oooooo ooo oo oo ooo o ooo oo oo ooo ooo ooooo ooooooo oo oo o o oooooo ooooo oooo oo ooo oooo ooo o ooooooo oo oooo ooo o oo oooooo oo oo oo o oo oooo o oo ooo oooo oooo ooooo oooo ooooooo oooo oooo ooooooo o oo ooooooooooooo ooooooooooooooooooooo oo o oooooooo ooo oo o oooo oooo oooooo ooooooo oooo oooooooooooo ooooo ooooooo oooooooooooooooooooooooooooooooo
o oo o
o o
o o
o
oooo
o
oo
o o
-3
-2
-1
о0
1
2
3
x1
Figure 5: Normal data scatter. The propoк sл&eл&dм нpeeling approach. The
dshaatadeadreaрreс1aт м
iн n the plot corresponds observations.
to
the
layer. The remaining
We note that the р с1т м н layer includes just the observations lying on the first convex hull and few others thinly scattered. In other words, in such a case, our approach yields results in agreement with the classical peeling procedure. To allow for the presence of unusual structures, we consider again the data set displayed in Figure 2 (200 observations generated from a mixture
94
of two normal distributions). For these data, we recall that the classical peeling procedure fails, combining у ф&ф&х ц and ч ш1щ х ц data up to the ъ ы ь layer (Figure 2). On the other hand, our proposed approach correctly selects as ч ш1щ х ц the 20 clustered observations plus few more (Figure 6). That is, the method identify properly the ч ш1щ х ц and у ф&ф&х ц layer, providing in addition a computational gain.
o o
o
0
o
o oo
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo oooooooooooooooooooooooooooooooooooooo ooooooooooooooooooo ooooo oooooooooooooooooooooooooooooo ooooo
o o
o
o
oo
п
-2
o
o oo
o
x2
-4
o
oo
o ooo
oo
o
o
o
o
oo o
oo o
-3
-2
-1
о0
1
2
3
x1
-6
Figure 6: Data set as in Figure 2. A cluster of data lie along the periphery of the scatter. The proposed peeling approach. The shaded area in the plot corresponds to the у ф&ф&х ц layer. The remaining data are ч ш1щ х ц observations.
Finally, we consider the Brain and Body weight data (Rousseeuw and Leroy, 1987; pag. 57), a well known data set in outlier analysis, referring to the logarithm of the body and brain weight of 28 species. Rousseeuw and van Zomeren (1990) in a regression setting identified as outliers 5 observations (6, 16, 25, 14, and 17) using the minimum volume ellipsoid. Figure 7 shows the scatterplot of the data along with the shaded у ф&ф&х ц 95
region according to our procedure. The three clustered outliers on the right, and the other two atypical observations belong to the э©ю1я Ў layer, although as expected this latter includes other э ю1я Ў observations as well.
o o
8
6
4
Log Brain Weigth
o o o
o 14 o
o oo o
o 17 o
oo o oo o
o o o o
o
0
5
Log Body Weigth
16 o o 6
25o
10
2
0
Figure shaded
7: Brain and Body data. The propoў Ј¤seЈ d area in the plot corresponds to the
peeling Ў layer.
approach. The The remaining
data are э ю1я Ў observations.
It is worth noting that, in this example, many data points are identified as э ю1я Ў observations. This is due to the sparseness and the regression nature of the data. Indeed, our procedure looks for a dense core of the data scatter, and does not consider the remoteness from a possible regression line.
96
7. Final remarks Our approach can be viewed as a resistant partial peeling method, and can be applied in an iterative and interactive way as exploratory and diagnostic tool. It can be used to identify and compare the main structure of different groups in cluster analysis. Moreover, it may represent a first step in robust analysis and trimming procedures. In addition, we stress that Ґ§¦©Ё data are not necessarily outliers. However, our procedure seems to be particularly attractive to deal with such occurence in the data. Indeed, outliers will be naturally included in what we have called the Ґ§¦©Ё layer. In particular, our peeling method should be well suited to deal with the masking effect, that occurs when outlier clusters hide themselves to the single case diagnostic. In fact, our approach allows to identify candidate outlier clusters in neighbourhoods of the boundary region. This property allows also to avoid the combinatorial size computations required by the classical block-deletion procedures (see also Porzio and Ragozini, 2000). Finally, it has to be noted that, as any convex hull based procedure, our peeling approach is thought to deal with convex shape data sets. This notwithstanding, we believe that little deviation from convexity should not compromise its performance. In any case, although our approach is particularly effective with heavy and sparse tail data set, we recommend to iterate the procedure if unduly complex structures are suspected. further developments of this work could involve the possibility of considering more layers through an iterative use of our peeling approach. In this way a finer ordering of the whole data set could be obtained. Acknowledgment: Our work was supported by funds from the MURST. 97
References Barnett, V. (1976) The ordering of multivariate data (with discussion), Journal of Royal Statistical Society A, 139, 318-354. Bebbington, A.C. (1978) A method of bivariate trimming for robust estimation of the correlation coefficient, Applied Statistics, 27, 221-226. Brooks, D. G., Carroll, S. S., Verdini, W. A. (1988) Characterizing the Domain of Regression Model, The American Statistician, 42, 187-190. Daniels, H. E. (1952) The covering circle of a sample from a circular normal distribution, Biometrika, 39, 137-143. Donoho, D.L., Gasko, M. (1992) Breakdown properties of location estimates based on halfspace depth and projected outlyingness, Annals of Statistics, 20, 1803-1827. Eddy, W.F. (1981) Comment on: Graphics for Multivariate Two-Sample Problem (Friedman J.H., Rafsky, L.C.), Journal of the American Statistical Association, 76, 287-289. Eddy, W.F., Hartigan, J.A. (1977) Uniform convergence of the empirical distribution function over convex sets, Annals of Statistics, 5, 370-374. Green, P.J. (1981) Peeling bivariate data, in: Interpreting Multivariate Data, V. Barnett Ed., John Wiley, New York, 3-19. Liu, R.Y., Parelius, J.M., Singh, K. (1999) Multivariate Analysis by Data Depth: Descriptive statistics, Graphics and Inference, The Annals of Statistics, 27, 783-858. Nath, G. B. (1971) Estimation in truncate bivariate normal distributions, Applied Statistics, 20, 313-319. Petitjean, P., Saporta G. (1992) On the Performance of Peeling Algorithms, Applied stochastic models and Data Analysis, 9, 91-98. Porzio, G.C., Ragozini G. (2000) Exploring the Periphery of Data Scatters: Are There Outliers?, in: Data Analysis, Classification, and Related Methods, H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen, M. Schader Ed., Springer-Verlag, Heidelberg, 235-240. Rousseeuw, P.J., Leroy, A. (1987) Robust Regression and Outlier Detection, John Wiley, New York. 98
Rousseeuw, P.J., van Zomeren, B.C. (1990) Unmasking multivariate outliers and leverage points (with discussion), Journal of the American Statistical Association, 85, 633-651. Silverman, B. W. and Titterington, D. M. (1980), Minimum covering ellipses, SIAM Journal on Scientific and Statistical Computing, 1, 401409. Tukey, J.W. (1977) Exploratory Data Analysis, Addison-Wesley, Reading, MA. Zani, S., Riani, M., Corbellini, A. (1998) Robust bivariate boxplots and multiple outlier detection, Computational Statistics and Data Analysis, 28, 257-270. 99

GC Porzio, G Ragozini

File: peeling-multivariate-data-sets-a-new-approach.pdf
Author: GC Porzio, G Ragozini
Published: Mon Oct 30 16:51:04 2000
Pages: 15
File size: 0.12 Mb


, pages, 0 Mb

Re-Selling Japan, 11 pages, 1.24 Mb

, pages, 0 Mb
Copyright © 2018 doc.uments.com