Page 163 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Unit 8: Data Warehouse Refreshment
interface. This API is currently supported by several middleware products such as ODBC and
IDAPI. The RDA standard communication protocol specifies the messages to be exchanged
between clients and servers. Its specialization to SQL requests enables the transport of requests
generated by a CLI interface.
Despite these efforts, existing middleware products do not actually offer a standard interface for
client-server development. Some products, such as DAL/DAM or SequeLink, offer their own API,
although they sometimes provide some compatibility with other tools, such as ODBC. Furthermore,
database vendors have developed their own middleware. For instance, Oracle offers several
levels of interface, such as the Oracle Call Interface (OCI), on top of its client-server protocol,
SQL*Net. The OCI offers a set of functions close to those of CLI, and enables any client
that has SQL*Net to connect to an Oracle server using any kind of communication protocol.
Finally, an alternative way to provide transparent access to database servers is to use Internet
protocols. In fact, the World Wide Web is simply a standards-based client-server
architecture.
8.3 Data cleaning
Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data
cleaning routines “clean” the data by filling in missing values, smoothing noisy data, identifying or
removing outliers, and resolving inconsistencies. Dirty data can confuse the mining
procedure. Although most mining routines have some procedures for dealing with incomplete
or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting
the data to the function being modeled. Therefore, a useful preprocessing step is to run your
data through some data cleaning routines. Some of the basic methods for data cleaning are as
follows:
8.3.1 Data cleaning for missing values
The following methods can be used to clean data for missing values in a particular attribute:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective, unless
the tuple contains several attributes with missing values. It is especially poor when the
percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may
not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by
the same constant, such as a label like “Unknown”. If missing values are replaced by, say,
“Unknown”, then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown”. Hence, although this
method is simple, it is not recommended.
4. Use the attribute mean to fill in the missing value: Replace each missing value with the
mean of the values observed for that attribute.
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with
inference-based tools using a Bayesian formalism or decision tree induction. For example,
using the other customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.
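
Methods 1, 3, 4 and 5 above can be illustrated with a minimal Python sketch. The customer records, the attribute names (income, risk) and the helper function names are hypothetical and chosen only for illustration; None marks a missing value. Method 6 is not shown, since it requires training a predictive model on the other attributes.

```python
from statistics import mean

# Hypothetical toy data set; None marks a missing value.
customers = [
    {"income": 30000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": 10000, "risk": "high"},
    {"income": None,  "risk": "high"},
    {"income": 15000, "risk": None},   # class label is missing
]

def ignore_missing_class(rows, label):
    """Method 1: drop tuples whose class label is missing."""
    return [r for r in rows if r[label] is not None]

def fill_constant(rows, attr, constant="Unknown"):
    """Method 3: replace every missing value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_mean(rows, attr):
    """Method 4: replace missing values with the mean of the observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]

def fill_class_mean(rows, attr, label):
    """Method 5: replace missing values with the mean of the tuple's own class.

    Assumes every class has at least one observed value for attr.
    """
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(values) for c, values in observed.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]

labelled = ignore_missing_class(customers, "risk")        # 5 tuples remain
overall = fill_mean(labelled, "income")                   # gaps become 30000
per_class = fill_class_mean(labelled, "income", "risk")   # 40000 (low), 10000 (high)
```

In this toy data the overall mean (30000) differs from both class means, so method 5 fills each gap with a value closer to the tuple's own class than method 4 does.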
Lovely Professional University