Page 55 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Unit 3: Data Mining Techniques
3.3.5 Overlap
As its name implies, the Overlap coefficient attempts to determine the degree to which two sets
overlap. The Overlap coefficient is computed as
\[
\operatorname{sim}(d_q, d_j) = D(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \cong \frac{\sum_{k=1}^{n} w_{kq}\, w_{kj}}{\min\left(\sum_{k=1}^{n} w_{kq}^{2},\ \sum_{k=1}^{n} w_{kj}^{2}\right)}
\]
The Overlap coefficient is sometimes calculated using the max operator in place of the min.
Note: The denominator does not necessarily normalize the similarity values produced by
this measure. As a result, the Overlap values are typically higher in magnitude than other
similarity measures.
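The Overlap coefficient can be sketched in a few lines of Python. The function names `overlap_sets` and `overlap_weighted`, the `use_max` flag, and the sample data are assumptions made for illustration. The set form divides the size of the intersection by the size of the smaller set; the weighted form divides the sum of products of term weights by the smaller of the two sums of squared weights.

```python
def overlap_sets(a, b, use_max=False):
    """Set form: |A ∩ B| / min(|A|, |B|), or max(...) if use_max is True."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # convention chosen here for empty sets (an assumption)
    denom = max(len(a), len(b)) if use_max else min(len(a), len(b))
    return len(a & b) / denom


def overlap_weighted(wq, wj):
    """Weighted form: sum(wq*wj) / min(sum(wq^2), sum(wj^2))."""
    if len(wq) != len(wj):
        raise ValueError("weight vectors must have the same length")
    dot = sum(x * y for x, y in zip(wq, wj))
    denom = min(sum(x * x for x in wq), sum(y * y for y in wj))
    return dot / denom if denom else 0.0


# Hypothetical term sets for a query and a document.
query = {"data", "mining"}
doc = {"data", "mining", "tree"}
print(overlap_sets(query, doc))                # min in the denominator
print(overlap_sets(query, doc, use_max=True))  # max in the denominator

# With unequal weights the result can exceed 1, which is what the
# Note means by the denominator not normalizing the values.
print(overlap_weighted([3, 0], [1, 0]))
```

With binary (0/1) weights the weighted form reduces to the set form, since the dot product counts shared terms and each sum of squares counts the terms in one set.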
Task: “Statistics is mathematics but it’s very useful in data mining.” Discuss.
3.4 Decision Trees
A decision tree is a structure that can be used to divide up a large collection of records into
successively smaller sets of records by applying a sequence of simple decision rules. With each
successive division, the members of the resulting sets become more and more similar to one
another. The familiar division of living things into kingdoms, phyla, classes, orders, families,
genera, and species, invented by the Swedish botanist Carl Linnaeus in the 1730s, provides
a good example. Within the animal kingdom, a particular animal is assigned to the phylum
Chordata if it has a spinal cord. Additional characteristics are used to further subdivide the
chordates into the birds, mammals, reptiles, and so on. These classes are further subdivided until,
at the lowest level in the taxonomy, members of the same species are not only morphologically
similar, they are capable of breeding and producing fertile offspring.
A decision tree is a simple knowledge representation that classifies examples into a finite number
of classes. The nodes are labelled with attribute names, the edges are labelled with the possible
values of that attribute, and the leaves are labelled with the classes. An object is classified by
following a path down the tree, taking at each node the edge that corresponds to the object's
value for that attribute.
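The path-following procedure can be sketched in Python. The tree below is a hypothetical one built only for illustration: the attribute `windy`, the branch values, and the leaf labels are assumptions, not a transcription of any diagram in this unit. An internal node is a pair of an attribute name and a dictionary mapping each attribute value (edge) to a subtree; a leaf is simply a class label.

```python
# Hypothetical tree: (attribute, {value: subtree}) for internal nodes,
# a plain string ("P" or "N") for leaves.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain": ("windy", {"true": "N", "false": "P"}),
})


def classify(tree, obj):
    """Follow the edges matching obj's attribute values until a leaf is reached."""
    node = tree
    while not isinstance(node, str):
        attribute, branches = node
        # Take the edge labelled with the object's value for this attribute.
        node = branches[obj[attribute]]
    return node


example = {"outlook": "sunny", "humidity": "high", "windy": "false"}
print(classify(TREE, example))
```

A value that has no matching edge raises `KeyError` in this sketch; a fuller implementation would need a policy for unseen values.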
The following is an example of objects that describe the weather at a given time. The objects
contain information on the outlook, humidity, etc. Some objects are positive examples, denoted by
P, and others are negative, denoted by N. Classification is in this case the construction of a tree structure,
illustrated in the following diagram, which can be used to classify all the objects correctly.
LoveLy professionaL university 49