




          6.   The recursive partitioning stops only when any one of the following conditions is true (a code sketch of these stopping rules follows this list):
               (a)   All samples for a given node belong to the same class (steps 2 and 3), or

               (b)   There are no remaining attributes on which the samples may be further partitioned
                    (step 4). In this case, majority voting is employed (step 5). This involves converting
                    the given node into a leaf and labeling it with the class in majority among samples.
                    Alternatively, the class distribution of the node samples may be stored; or
               (c)   There are no samples for the branch test-attribute = a_i (step 11). In this case, a leaf is
                     created with the majority class in samples (step 12).
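
          To make the stopping rules above concrete, the following Python sketch shows a minimal, illustrative version of recursive partitioning. The data layout (lists of attribute-value dictionaries), the domains argument and the pluggable select_attribute measure are assumptions made for this example, not part of the algorithm as stated in the text.

          ```python
          from collections import Counter

          def majority_class(labels):
              """Majority voting: the most frequent class label among the samples."""
              return Counter(labels).most_common(1)[0][0]

          def induce_tree(samples, labels, attributes, domains, select_attribute):
              # (a) All samples belong to the same class: the node becomes a class leaf.
              if len(set(labels)) == 1:
                  return labels[0]
              # (b) No attributes remain to partition on: label the leaf by majority vote.
              if not attributes:
                  return majority_class(labels)
              best = select_attribute(samples, labels, attributes)  # e.g. highest information gain
              remaining = [a for a in attributes if a != best]
              branches = {}
              for value in domains[best]:
                  idx = [i for i, smp in enumerate(samples) if smp[best] == value]
                  if not idx:
                      # (c) No samples for the branch test-attribute = value:
                      # create a leaf with the majority class of the current samples.
                      branches[value] = majority_class(labels)
                  else:
                      branches[value] = induce_tree(
                          [samples[i] for i in idx], [labels[i] for i in idx],
                          remaining, domains, select_attribute)
              return (best, branches)
          ```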
          Decision tree induction algorithms have been used for classification in a wide range of application
          domains. Such systems do not use domain knowledge. The learning and classification steps of
          decision tree induction are generally fast. Classification accuracy is typically high for data where
          the mapping of classes consists of long and thin regions in concept space.
          Attribute Selection Measure

          The information gain measure is used to select the test attribute at each node in the tree. Such
          a  measure  is  also  referred  to  as  an  attribute  selection  measure.  The  attribute  with  the  highest
          information gain is chosen as the test attribute for the current node. This attribute minimizes
          the information needed to classify the samples in the resulting partitions and reflects the least
          randomness or “impurity” in these partitions. Such an information-theoretic approach minimizes
          the expected number of tests needed to classify an object and guarantees that a simple (but not
          necessarily the simplest) tree is found.

          Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values
          defining m distinct classes, C_i (for i = 1, ..., m). Let s_i be the number of samples of S in class C_i.
          The expected information needed to classify a given sample is given by:

                                   I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)
          where p_i is the probability that an arbitrary sample belongs to class C_i and is estimated by s_i/s.
          Note that a log function to the base 2 is used since the information is encoded in bits.
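
          As a numerical illustration (not taken from the text), the expected information can be computed directly from the per-class sample counts. The function name and the example counts below are chosen for this sketch.

          ```python
          from math import log2

          def expected_info(counts):
              """I(s_1, ..., s_m): expected information, in bits, needed to classify a
              sample, given the number of samples in each class.  Classes with a count
              of zero contribute nothing to the sum."""
              s = sum(counts)
              return -sum((si / s) * log2(si / s) for si in counts if si > 0)

          # Example: 9 samples of one class and 5 of the other class.
          # expected_info([9, 5]) ≈ 0.940 bits
          ```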
          Attribute A can be used to partition S into v subsets, {S_1, S_2, ..., S_v}, where S_j contains those
          samples in S that have value a_j of A. If A were selected as the test attribute (i.e., the best attribute
          for splitting), then these subsets would correspond to the branches grown from the node containing
          the set S. Let s_ij be the number of samples of class C_i in a subset S_j. The entropy, or expected
          information based on the partitioning into subsets by A, is given by:
                                   E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} I(s_{1j}, \ldots, s_{mj})
          The term (s_{1j} + ... + s_{mj})/s acts as the weight of the jth subset and is the number of samples in
          the subset (i.e., having value a_j of A) divided by the total number of samples in S. The smaller the
          entropy value is, the greater the purity of the subset partitions. The encoding information that
          would be gained by branching on A is
                                   Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A).
          In other words, Gain(A) is the expected reduction in entropy caused by knowing the value of
          attribute A.
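
          Combining the two formulas, Gain(A) can be computed from the class counts within each subset S_j. The sketch below reuses the expected_info function from the earlier example; the three-valued attribute and its subset counts in the comment are hypothetical and not drawn from the text.

          ```python
          def entropy_after_split(partition_counts, total):
              """E(A): the weighted expected information after partitioning on A.
              partition_counts holds, for each subset S_j, its per-class counts
              [s_1j, ..., s_mj]; total is the overall number of samples s."""
              return sum(sum(cj) / total * expected_info(cj) for cj in partition_counts)

          def information_gain(class_counts, partition_counts):
              """Gain(A) = I(s_1, ..., s_m) - E(A)."""
              total = sum(class_counts)
              return expected_info(class_counts) - entropy_after_split(partition_counts, total)

          # Hypothetical attribute with three values splitting 14 samples (9 of one
          # class, 5 of the other) into subsets with class counts [2, 3], [4, 0], [3, 2]:
          # information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]) ≈ 0.246
          ```

          The attribute with the largest such gain would be chosen as the test attribute for the current node.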




