Data Warehousing and Data Mining

                                   Note: Methods 3 to 6 bias the data. The filled-in value may not be correct. Method 6, however,
                                   is a popular strategy. In comparison to the other methods, it uses the most information from the
                                   present data to predict missing values.

                                   8.3.2 Noisy Data

                                   Noise is a random error or variance in a measured variable. Given a numeric attribute such as
                                   price, we can “smooth” out the data by using the following techniques:

                                   Binning Methods

                                   Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values
                                   around it. The sorted values are distributed into a number of “buckets” or bins. Because binning
                                   methods consult the neighborhood of values, they perform local smoothing. The following
                                   example illustrates some binning techniques. In this example, the data for price are first sorted
                                   and partitioned into equi-depth bins (of depth 3).
                                   1.   In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
                                       For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value
                                       in this bin is replaced by the value 9.
                                   2.   Similarly, smoothing by bin medians can be employed, in which each bin value is replaced
                                       by the bin median.

                                   3.   In smoothing by bin boundaries, the minimum and maximum values in a given bin are
                                       identified as the bin boundaries. Each bin value is then replaced by the closest boundary
                                       value. In general, the larger the width, the greater the effect of the smoothing. Alternatively,
                                       bins may be equi-width, where the interval range of values in each bin is constant.
                                   1.   Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

                                   2.   Partition into (equi-depth) bins:
                                       (a)   Bin 1: 4, 8, 15
                                       (b)   Bin 2: 21, 21, 24
                                       (c)   Bin 3: 25, 28, 34
                                   3.   Smoothing by bin means:

                                       (a)   Bin 1: 9, 9, 9
                                       (b)   Bin 2: 22, 22, 22
                                       (c)   Bin 3: 29, 29, 29
                                   4.   Smoothing by bin boundaries:

                                       (a)   Bin 1: 4, 4, 15
                                       (b)   Bin 2: 21, 21, 24
                                       (c)   Bin 3: 25, 25, 34
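
The worked example above can be reproduced with a short Python sketch. This is an illustration only, not code from the text: the function names (equi_depth_bins, equi_width_bins, smooth_by_means, smooth_by_medians, smooth_by_boundaries) are hypothetical, and two choices are assumptions the text does not specify: a value equally close to both boundaries is sent to the lower one, and equi-width bins are half-open intervals with the maximum value placed in the last bin.

```python
from statistics import mean, median


def equi_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def equi_width_bins(values, k):
    """Split the value range into k intervals of equal width (assumes max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # Assumption: the maximum value goes into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: ties are resolved toward the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equi_depth_bins(prices, 3)

print("bins:      ", bins)
print("means:     ", smooth_by_means(bins))
print("medians:   ", smooth_by_medians(bins))
print("boundaries:", smooth_by_boundaries(bins))
print("equi-width:", equi_width_bins(prices, 3))
```

On the price data, the mean and boundary results agree with steps 3 and 4 of the example. The equi-width partition of the same data yields bins of unequal size, which shows how the two partitioning schemes differ.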


                                   Outliers may be detected by clustering, where similar values are organized into groups or
                                   “clusters”. Intuitively, values that fall outside of the set of clusters may be considered outliers
                                   (Figure 8.1).
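
The idea can be sketched for a single numeric attribute as follows. This is a deliberately simple stand-in for a real clustering algorithm, not the method used in the text: sorted values are grouped whenever consecutive values lie within a chosen gap, and any group smaller than a minimum size is treated as falling outside the clusters. The names gap_clusters and find_outliers, the parameters max_gap and min_size, and the extra value 95 appended to the price data are all hypothetical and chosen for illustration.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; a new cluster starts when the gap exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size=2):
    """Return values that belong to clusters with fewer than min_size members."""
    return [v
            for cluster in gap_clusters(values, max_gap)
            if len(cluster) < min_size
            for v in cluster]


# Price data from the binning example plus one made-up extreme value (95).
data = [4, 8, 15, 21, 21, 24, 25, 28, 34, 95]
clusters = gap_clusters(data, max_gap=10)
outliers = find_outliers(data, max_gap=10)

print("clusters:", clusters)
print("outliers:", outliers)
```

With max_gap=10 the nine original prices form one cluster and the added value 95 is isolated, so it is the only value reported. Both thresholds are tuning choices; different settings will flag different values.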

          158                              Lovely Professional University