Page 55 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Unit 3: Data Mining Techniques




          3.3.5 Overlap

          As its name implies, the Overlap coefficient measures the degree to which two sets
          overlap. The Overlap coefficient is computed as
          $$\mathrm{sim}(d_q, d_j) = D(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \cong \frac{\sum_{k=1}^{n} w_{kq}\, w_{kj}}{\min\left(\sum_{k=1}^{n} w_{kq}^{2},\ \sum_{k=1}^{n} w_{kj}^{2}\right)}$$
          The Overlap coefficient is sometimes calculated using the max operator in place of the min.
             Note  The denominator does not necessarily normalize the similarity values produced by
             this measure. As a result, Overlap values are typically higher in magnitude than those
             produced by other similarity measures.
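
          The formula above can be sketched in Python as follows. This is a minimal illustration, not
          a reference implementation: the function names `overlap_sets` and `overlap_weighted`, the
          `use_max` flag (for the max variant mentioned above), the sample term sets and weights, and
          the choice to return 0 when the denominator is zero are all assumptions made for the example.

```python
from math import fsum


def overlap_sets(a, b, use_max=False):
    """Set form: |A ∩ B| / min(|A|, |B|), or max(...) if use_max is True."""
    a, b = set(a), set(b)
    denom = max(len(a), len(b)) if use_max else min(len(a), len(b))
    if denom == 0:
        return 0.0  # assumption: similarity involving an empty set is 0
    return len(a & b) / denom


def overlap_weighted(wq, wj, use_max=False):
    """Weighted form: sum(w_kq * w_kj) / min(sum(w_kq^2), sum(w_kj^2))."""
    if len(wq) != len(wj):
        raise ValueError("weight vectors must have the same length")
    numerator = fsum(x * y for x, y in zip(wq, wj))
    sq_q = fsum(x * x for x in wq)
    sq_j = fsum(y * y for y in wj)
    denom = max(sq_q, sq_j) if use_max else min(sq_q, sq_j)
    if denom == 0:
        return 0.0  # assumption: an all-zero vector has similarity 0
    return numerator / denom


# Hypothetical term sets for a query and a document.
query_terms = {"data", "mining", "tree"}
doc_terms = {"data", "mining", "cluster", "rule", "tree"}

print(overlap_sets(query_terms, doc_terms))                # 3 / 3 = 1.0
print(overlap_sets(query_terms, doc_terms, use_max=True))  # 3 / 5 = 0.6

# Hypothetical term weights: 6 / min(5, 9) = 1.2
print(overlap_weighted([1, 2, 0], [2, 2, 1]))
```

          In the weighted example the result is greater than 1, which illustrates why the denominator
          is said not to normalize the values.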




              Task    “Statistics is mathematics but it’s very useful in data mining.” Discuss




          3.4 Decision trees

          A decision tree is a structure that can be used to divide up a large collection of records into
          successively smaller sets of records by applying a sequence of simple decision rules. With each
          successive division, the members of the resulting sets become more and more similar to one
          another. The familiar division of living things into kingdoms, phyla, classes, orders, families,
          genera, and species, invented by the Swedish botanist Carl Linnaeus in the 1730s, provides
          a good example. Within the animal kingdom, a particular animal is assigned to the phylum
          Chordata if it has a spinal cord. Additional characteristics are used to further subdivide the
          chordates into the birds, mammals, reptiles, and so on. These classes are further subdivided until,
          at the lowest level in the taxonomy, members of the same species are not only morphologically
          similar, they are capable of breeding and producing fertile offspring.
          Decision trees are a simple form of knowledge representation that classify examples into a
          finite number of classes. The nodes are labelled with attribute names, the edges are labelled
          with the possible values of that attribute, and the leaves are labelled with the classes. An
          object is classified by following a path down the tree, at each node taking the edge that
          corresponds to the object's value for that node's attribute.
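
          The classification procedure described above can be sketched in Python. The tree below is
          hypothetical and chosen only for illustration; it is not taken from this unit's diagram. The
          attribute names (`outlook`, `humidity`, `windy`), their values, the nested-tuple
          representation, and the function name `classify` are all assumptions of the sketch.

```python
# Internal node: (attribute_name, {attribute_value: subtree}).
# Leaf: a class label string ("P" for positive, "N" for negative).
# Hypothetical tree, for illustration only.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain": ("windy", {True: "N", False: "P"}),
})


def classify(tree, obj):
    """Follow a path from the root to a leaf and return the leaf's class."""
    node = tree
    while isinstance(node, tuple):      # still at an internal node
        attribute, branches = node
        value = obj[attribute]          # the object's value for this attribute
        if value not in branches:
            raise KeyError(f"no edge for {attribute}={value!r}")
        node = branches[value]          # take the matching edge
    return node                         # reached a leaf: its class label


day = {"outlook": "sunny", "humidity": "high", "windy": False}
print(classify(TREE, day))  # N
```

          Each loop iteration corresponds to one edge on the path, so the number of attribute tests
          needed to classify an object is at most the depth of the tree.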
          The following is an example of objects that describe the weather at a given time. The objects
          contain information on the outlook, humidity, etc. Some objects are positive examples, denoted
          by P, and others are negative examples, denoted by N. Classification in this case is the
          construction of a tree structure, illustrated in the following diagram, which can be used to
          classify all the objects correctly.













                                           Lovely Professional University