Page 165 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Unit 8: Data Warehouse Refreshment




          Figure 8.1: Outliers may be detected by clustering analysis

          Combined Computer and Human Inspection

          Outliers may be identified through a combination of computer and human inspection. In one
          application, for example, an information-theoretic measure was used to help identify outlier
          patterns in a handwritten character database for classification. The measure’s value reflected
          the “surprise” content of the predicted character label with respect to the known label. Outlier
          patterns may be informative (e.g., identifying useful data exceptions, such as different versions
          of the characters “0” or “7”), or “garbage” (e.g., mislabeled characters). Patterns whose surprise
          content is above a threshold are output to a list. A human can then sort through the patterns in
          the list to identify the actual garbage ones.

          This is much faster than having to manually search through the entire database. The garbage
          patterns can then be removed from the (training) database.
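
          The thresholding step described above can be sketched as follows. The text does not say
          which information-theoretic measure was used, so this sketch assumes the common surprisal
          definition, -log2 of the probability that a classifier assigns to the known label. The
          names surprise and flag_for_review, the record layout, and the threshold value are all
          hypothetical and chosen only for illustration.

```python
import math


def surprise(predicted_probs, known_label):
    """Surprisal (in bits) of the known label under the predicted distribution.

    Assumption: the measure is -log2 p(known label); the source does not
    specify the exact formula.
    """
    p = predicted_probs.get(known_label, 0.0)
    # A small floor avoids log(0) when the known label gets zero probability.
    return -math.log2(max(p, 1e-12))


def flag_for_review(records, threshold_bits):
    """Return (record_id, surprise) pairs above the threshold, most surprising first.

    Each record is a hypothetical (record_id, predicted_probs, known_label) tuple.
    """
    flagged = []
    for record_id, predicted_probs, known_label in records:
        s = surprise(predicted_probs, known_label)
        if s > threshold_bits:
            flagged.append((record_id, s))
    # A human reviewer would then inspect this short list rather than the whole database.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Illustrative data only: three patterns with classifier outputs and known labels.
records = [
    ("a", {"0": 0.9, "7": 0.1}, "0"),    # prediction agrees with the label
    ("b", {"0": 0.05, "7": 0.95}, "0"),  # label looks suspicious
    ("c", {"7": 0.5, "1": 0.5}, "7"),    # ambiguous but not extreme
]
review_list = flag_for_review(records, threshold_bits=2.0)
```

          Only the patterns in review_list go to the human inspector, who decides whether each one
          is an informative exception or garbage.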

          Regression

          Data can be smoothed by fitting the data to a function, such as with regression. Linear regression
          involves finding the “best” line to fit two variables, so that one variable can be used to predict
          the other. Multiple linear regression is an extension of linear regression, where more than two
          variables are involved and the data are fit to a multidimensional surface. Using regression to find
          a mathematical equation to fit the data helps smooth out the noise.
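
          As a minimal sketch of smoothing by linear regression, the code below fits a
          least-squares line to two variables and replaces each observed value with the value on
          the fitted line. The function names fit_line and smooth and the sample data are
          illustrative assumptions, not part of the original text.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth(xs, ys):
    """Replace each noisy y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Illustrative noisy data that roughly follows y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1.1, 2.9, 5.2, 6.8]
slope, intercept = fit_line(xs, ys)
smoothed = smooth(xs, ys)
```

          The smoothed values lie exactly on the fitted line, so the point-to-point noise in ys is
          removed while the overall trend is kept.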




              Task    Discuss why, for the integration of multiple heterogeneous information sources,
             many companies in industry prefer the update-driven approach rather than the query-driven
             approach.


          Inconsistent Data

          There may be inconsistencies in the data recorded for some transactions. Some data inconsistencies
          may be corrected manually using external references. For example, errors made at data entry may
          be corrected by performing a paper trace. This may be coupled with routines designed to help
          correct the inconsistent use of codes. Knowledge engineering tools may also be used to detect
          the violation of known data constraints. For example, known functional dependencies between
          attributes can be used to find values contradicting the functional constraints.
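
          The functional-dependency check mentioned above can be sketched as follows. If attribute
          A functionally determines attribute B, then every value of A should appear with exactly
          one value of B, so any A value paired with several B values marks a contradiction. The
          function name find_fd_violations, the attribute names zip_code and city, and the sample
          rows are hypothetical.

```python
from collections import defaultdict


def find_fd_violations(rows, determinant, dependent):
    """Return determinant values that map to more than one dependent value.

    rows is a list of dicts; determinant -> dependent is the assumed
    functional dependency to check.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    # Keep only the determinant values that contradict the dependency.
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Illustrative records; the dependency zip_code -> city is an assumption.
rows = [
    {"zip_code": "10001", "city": "New York"},
    {"zip_code": "10001", "city": "Newark"},   # contradicts the first row
    {"zip_code": "60601", "city": "Chicago"},
    {"zip_code": "60601", "city": "Chicago"},
]
violations = find_fd_violations(rows, "zip_code", "city")
```

          The check only reports that the values conflict. Deciding which value is correct still
          requires an external reference, such as the paper trace described above.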



                                           Lovely Professional University                                   159