Page 165 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Unit 8: Data Warehouse Refreshment
Figure 8.1: Outliers may be detected by clustering analysis
Combined Computer and Human Inspection
Outliers may be identified through a combination of computer and human inspection. In one
application, for example, an information-theoretic measure was used to help identify outlier
patterns in a handwritten character database for classification. The measure’s value reflected
the “surprise” content of the predicted character label with respect to the known label. Outlier
patterns may be informative (e.g., identifying useful data exceptions, such as different versions
of the characters “0” or “7”), or “garbage” (e.g., mislabeled characters). Patterns whose surprise
content is above a threshold are output to a list. A human can then sort through the patterns in
the list to identify the actual garbage ones.
This is much faster than having to manually search through the entire database. The garbage
patterns can then be removed from the (training) database.
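The thresholding step described above can be sketched as follows. The text does not say which information-theoretic measure was used, so this sketch assumes the self-information (in bits) of the probability the classifier assigned to the known label. The pattern identifiers, probabilities, and the 3-bit threshold are hypothetical.

```python
import math

def surprise(prob_of_known_label):
    """Self-information in bits; an assumed stand-in for the measure in the text."""
    # Clamp to avoid log(0) when the classifier gives the known label zero probability.
    return -math.log2(max(prob_of_known_label, 1e-12))

def flag_for_review(patterns, threshold_bits):
    """Return (pattern_id, surprise) pairs above the threshold, most surprising first."""
    flagged = []
    for pattern_id, prob in patterns:
        s = surprise(prob)
        if s > threshold_bits:
            flagged.append((pattern_id, s))
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical data: (pattern id, probability the classifier gave to the known label).
patterns = [("img_001", 0.95), ("img_002", 0.02), ("img_003", 0.60), ("img_004", 0.001)]

review_list = flag_for_review(patterns, threshold_bits=3.0)
for pattern_id, s in review_list:
    print(f"{pattern_id}: {s:.2f} bits")
```

Only the flagged patterns go to the human reviewer, who decides which ones are informative exceptions and which are garbage.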
Regression
Data can be smoothed by fitting the data to a function, such as with regression. Linear regression
involves finding the “best” line to fit two variables, so that one variable can be used to predict
the other. Multiple linear regression is an extension of linear regression, where more than two
variables are involved and the data are fit to a multidimensional surface. Using regression to find
a mathematical equation to fit the data helps smooth out the noise.
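As a minimal sketch of smoothing by linear regression, the code below fits a least-squares line to two variables and replaces each observed value with the value on the fitted line. The function names and the sample data are illustrative assumptions, not taken from the text.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth(xs, ys):
    """Replace each noisy y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
smoothed = smooth(xs, ys)
print(slope, intercept)
print(smoothed)
```

Multiple linear regression follows the same idea with more than one predictor variable; in practice a numerical library would normally be used for that case.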
Task: Discuss why, for the integration of multiple heterogeneous information sources,
many companies in industry prefer the update-driven approach to the query-driven
approach.
Inconsistent Data
There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies
may be corrected manually using external references. For example, errors made at data entry may
be corrected by performing a paper trace. This may be coupled with routines designed to help
correct the inconsistent use of codes. Knowledge engineering tools may also be used to detect
the violation of known data constraints. For example, known functional dependencies between
attributes can be used to find values contradicting the functional constraints.
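The functional-dependency check mentioned above can be sketched as follows. The code assumes a hypothetical dependency in which a postal code determines a city; the attribute names and records are invented for illustration. A determinant value that maps to more than one dependent value is reported as a violation.

```python
from collections import defaultdict

def find_fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}

# Hypothetical records; the dependency zip -> city is assumed to hold.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chcago"},  # data-entry error
]

violations = find_fd_violations(rows, "zip", "city")
print(violations)
```

The check only locates contradicting values. Deciding which value is correct still needs an external reference, such as the paper trace described above.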
Lovely Professional University 159