Page 163 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Unit 8: Data Warehouse Refreshment




          interface. This API is currently supported by several middleware products such as ODBC and
          IDAPI. The RDA standard communication protocol specifies the messages to be exchanged
          between clients and servers. Its specialization to SQL requests enables the transport of requests
          generated by a CLI interface.
          Despite these efforts, existing middleware products do not actually offer a standard interface for
          client-server development. Some products, such as DAL/DAM or SequeLink, offer their own API,
          although they sometimes provide compatibility with other tools, such as ODBC. Furthermore,
          database vendors have developed their own middleware. For instance, Oracle offers several
          levels of interface, such as the Oracle Call Interface (OCI), on top of its client-server protocol
          named SQL*Net. The OCI offers a set of functions similar to those of CLI, and enables any client
          that has SQL*Net to connect to an Oracle server using any kind of communication protocol.

          Finally, an alternative way to provide transparent access to database servers is to use Internet
          protocols. Indeed, the World Wide Web is itself simply a standards-based client-server
          architecture.

          8.3 Data cleaning

          Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data
          cleaning routines “clean” the data by filling in missing values, smoothing noisy data, identifying or
          removing outliers, and resolving inconsistencies. Dirty data can cause confusion for the mining
          procedure. Although most mining routines have some procedures for dealing with incomplete
          or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting
          the data to the function being modeled. Therefore, a useful preprocessing step is to run your
          data through some data cleaning routines. Some of the basic methods for data cleaning are as
          follows:

          8.3.1 Data cleaning for missing values

          The following methods can be used to clean data for missing values in a particular attribute:
          1.   Ignore the tuple: This is usually done when the class label is missing (assuming the mining
               task involves classification or description). This method is not very effective, unless
               the tuple contains several attributes with missing values. It is especially poor when the
               percentage of missing values per attribute varies considerably.
          2.   Fill in the missing value manually: In general, this approach is time-consuming and may
               not be feasible given a large data set with many missing values.

          3.   Use a global constant to fill in the missing value: Replace all missing attribute values by
               the same constant, such as a label like “Unknown”. If missing values are replaced by, say,
               “Unknown”, then the mining program may mistakenly think that they form an interesting
               concept, since they all have a value in common, namely “Unknown”. Hence, although this
               method is simple, it is not recommended.
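          4.   Use the attribute mean to fill in the missing value: Replace each missing value with the
               average of the known values of that attribute.
          5.   Use the attribute mean for all samples belonging to the same class as the given tuple:
               The missing value is replaced by the mean computed only over the tuples that share the
               class label of the given tuple.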

          6.   Use the most probable value to fill in the missing value: This may be determined with
               inference-based tools using a Bayesian formalism or decision tree induction. For example,
               using the other customer attributes in your data set, you may construct a decision tree to
               predict the missing values for income.
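
          Methods 1, 3, 4 and 5 above can be sketched in a few lines of Python. This is only a minimal
          illustration under stated assumptions, not a prescribed implementation: the customer records,
          the attribute names (income, risk) and the function names are all hypothetical, and a missing
          value is assumed to be represented by None. Method 6 is not covered, since it requires a
          trained model such as a decision tree.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value (an assumption
# made for this sketch, not a convention taken from the text).
customers = [
    {"id": 1, "income": 40000, "risk": "low"},
    {"id": 2, "income": None,  "risk": "low"},
    {"id": 3, "income": 60000, "risk": "low"},
    {"id": 4, "income": 20000, "risk": "high"},
    {"id": 5, "income": None,  "risk": "high"},
    {"id": 6, "income": 30000, "risk": None},
]


def ignore_tuples(rows, label):
    """Method 1: drop every tuple whose class label is missing."""
    return [r for r in rows if r[label] is not None]


def fill_constant(rows, attr, constant="Unknown"):
    """Method 3: replace each missing value of attr by one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Method 4: replace each missing value by the mean of the known values."""
    avg = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: avg if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, label):
    """Method 5: replace each missing value by the mean of its own class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(values) for c, values in by_class.items()}
    # A tuple whose class has no known value for attr is left unfilled.
    return [
        {**r, attr: class_means.get(r[label]) if r[attr] is None else r[attr]}
        for r in rows
    ]


if __name__ == "__main__":
    for row in fill_class_mean(customers, "income", "risk"):
        print(row)
```

          Each function returns a new list and leaves the input untouched, so the methods can be
          compared side by side on the same data. In this sample the class-based mean (method 5) fills
          the two missing incomes with different values for the two risk classes, whereas the global
          mean (method 4) assigns both the same value.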






