Page 64 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
P. 64

Data Warehousing and Data Mining




                    notes          introduction

                                   Classification is a data mining (machine learning) technique used to predict group membership
                                   for data instances.

                                          Example:  You may wish to use classification to predict whether the weather on a particular
                                   day will be “sunny”, “rainy” or “cloudy”. Popular classification techniques include decision trees
                                   and neural networks.
                                   Data classification is a two step process. In the first step, a model is built describing a predetermined
                                   set of data classes or concepts. The model is constructed by analyzing database tuples described
                                   by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of
                                   the  attributes,  called  the  class  label  attribute.  In  the  context  of  classification,  data  tuples  are
                                   also referred to as samples, examples, or objects. The data tuples analyzed to build the model
                                   collectively  form  the  training  data  set.  The  individual  tuples  making  up  the  training  set  are
                                   referred to as training samples and are randomly selected from the sample population.
                                   Since the class label of each training sample is provided, this step is also known as supervised
                                   learning (i.e., the learning of the model is ‘supervised’ in that it is told to which class each training
                                   sample belongs). It contrasts with unsupervised learning (or clustering), in which the class labels
                                   of the training samples are not known, and the number or set of classes to be learned may not be
                                   known in advance.
                                   Typically, the learned model is represented in the form of classification rules, decision trees,
                                   or  mathematical  formulae.  For  example,  given  a  database  of  customer  credit  information,
                                   classification rules can be learned to identify customers as having either excellent or fair credit
                                   ratings (Figure 4.1a). The rules can be used to categorize future data samples, as well as provide a
                                   better understanding of the database contents. In the second step (Figure 4.1b), the model is used
                                   for classification. First, the predictive accuracy of the model (or classifier) is estimated.
                                   The holdout method is a simple technique which uses a test set of class-labeled samples. These
                                   samples are randomly selected and are independent of the training samples. The accuracy of a
                                   model on a given test set is the percentage of test set samples that are correctly classified by the
                                   model. For each test sample, the known class label is compared with the learned model’s class
                                   prediction for that sample. Note that if the accuracy of the model were estimated based on the
                                   training data set, this estimate could be optimistic since the learned model tends to over fit the
                                   data (that is, it may have incorporated some particular anomalies of the training data which are
                                   not present in the overall sample population). Therefore, a test set is used.
                                   (a)   Learning: Training data are analyzed by a classification algorithm. Here, the class label
                                       attribute is credit rating, and the learned model or classifier is represented in the form of
                                       classification rule.
                                   (b)   Classification: Test data are used to estimate the accuracy of the classification rules. If the
                                       accuracy is considered acceptable, the rules can be applied to the classification of new data
                                       tuples.



















          58                               LoveLy professionaL university
   59   60   61   62   63   64   65   66   67   68   69