Page 164 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Data Warehousing and Data Mining
Note: Methods 3 to 6 bias the data, because the filled-in value may not be correct. Method 6,
however, is a popular strategy. Compared with the other methods, it uses the most information
from the present data to predict missing values.
8.3.2 Noisy Data
Noise is a random error or variance in a measured variable. Given a numeric attribute such as
price, we can “smooth” the data by using the following techniques:
Binning Methods
Binning methods smooth a sorted data value by consulting the “neighborhood”, or values
around it. The sorted values are distributed into a number of ‘buckets’ or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing. The following
example illustrates some binning techniques. In this example, the data for price are first sorted
and partitioned into equi-depth bins (of depth 3).
1. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value
in this bin is replaced by the value 9.
2. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced
by the bin median.
3. In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary
value. In general, the larger the width, the greater the effect of the smoothing. Alternatively,
bins may be equi-width, where the interval range of values in each bin is constant.
Example
1. Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
2. Partition into (equi-depth) bins:
(a) Bin 1: 4, 8, 15
(b) Bin 2: 21, 21, 24
(c) Bin 3: 25, 28, 34
3. Smoothing by bin means:
(a) Bin 1: 9, 9, 9
(b) Bin 2: 22, 22, 22
(c) Bin 3: 29, 29, 29
4. Smoothing by bin boundaries:
(a) Bin 1: 4, 4, 15
(b) Bin 2: 21, 21, 24
(c) Bin 3: 25, 25, 34
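The example above can be reproduced with a short Python sketch. The function names (equi_depth_bins, smooth_by_means, smooth_by_medians, smooth_by_boundaries) are illustrative and do not come from the text. The rule that a value equally distant from both boundaries goes to the lower one is an assumption, since the text does not say how ties are resolved.

```python
from statistics import mean, median


def equi_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin by the median of that bin."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: a value equally far from both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equi_depth_bins(prices, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

With the price data from the example, the means and boundaries results match steps 3 and 4. The medians variant gives 8, 21 and 28 for the three bins.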
Clustering
Outliers may be detected by clustering, where similar values are organized into groups or
“clusters”. Intuitively, values which fall outside of the set of clusters may be considered outliers
(Figure 8.1).
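As a rough illustration of the idea, the sketch below groups sorted one-dimensional values into clusters wherever neighbouring values lie close together, and flags values that end up in very small groups as outliers. This gap-based grouping is a deliberately simple stand-in for a real clustering algorithm, not the method the text has in mind. The names gap_clusters and find_outliers, the parameters max_gap and min_size, and the extra value 95 added to the price data are all assumptions made for the example.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; a new cluster starts when the gap exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size):
    """Return the values that fall in clusters smaller than min_size."""
    outliers = []
    for cluster in gap_clusters(values, max_gap):
        if len(cluster) < min_size:
            outliers.extend(cluster)
    return outliers


# Price data from the binning example, plus a hypothetical extreme value (95).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 95]

print(gap_clusters(prices, max_gap=7))
print(find_outliers(prices, max_gap=7, min_size=2))
```

Here the nine original prices form one cluster because no two neighbours are more than 7 apart, while 95 sits alone and is reported as an outlier. The result depends strongly on the chosen max_gap and min_size.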
Lovely Professional University