6. The recursive partitioning stops only when any one of the following conditions is true (a minimal sketch of these stopping tests follows this list):
(a) All samples for a given node belong to the same class (steps 2 and 3), or
(b) There are no remaining attributes on which the samples may be further partitioned (step 4). In this case, majority voting is employed (step 5): the node is converted into a leaf and labeled with the class in the majority among its samples. Alternatively, the class distribution of the node samples may be stored; or
(c) There are no samples for the branch test-attribute = a_i (step 11). In this case, a leaf is created with the majority class in samples (step 12).
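These stopping conditions can be seen in the skeleton of a recursive tree-building routine. The following is a minimal sketch under simplifying assumptions: samples are (attribute-value dictionary, class label) pairs, and the helper names (build_tree, majority_class, best_split_attribute) are illustrative rather than taken from the text.

```python
from collections import Counter

def majority_class(samples):
    """Majority voting: most frequent class label among the samples."""
    return Counter(label for _, label in samples).most_common(1)[0][0]

def best_split_attribute(samples, attributes):
    """Placeholder: a real implementation picks the attribute with the
    highest information gain (see the attribute selection measure below)."""
    return attributes[0]

def build_tree(samples, attributes, attribute_values):
    """Recursive partitioning skeleton showing the three stopping conditions.
    attribute_values maps each attribute to the set of its known values."""
    labels = {label for _, label in samples}

    # (a) All samples at this node belong to the same class.
    if len(labels) == 1:
        return ("leaf", labels.pop())

    # (b) No remaining attributes: leaf labeled by majority voting.
    if not attributes:
        return ("leaf", majority_class(samples))

    test_attr = best_split_attribute(samples, attributes)
    remaining = [a for a in attributes if a != test_attr]
    children = {}
    for value in attribute_values[test_attr]:
        branch = [(attrs, label) for attrs, label in samples
                  if attrs[test_attr] == value]
        # (c) No samples for this branch: leaf with the parent's majority class.
        if not branch:
            children[value] = ("leaf", majority_class(samples))
        else:
            children[value] = build_tree(branch, remaining, attribute_values)
    return ("node", test_attr, children)
```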
Decision tree induction algorithms have been used for classification in a wide range of application
domains. Such systems do not use domain knowledge. The learning and classification steps of
decision tree induction are generally fast. Classification accuracy is typically high for data where
the mapping of classes consists of long and thin regions in concept space.
Attribute Selection Measure
The information gain measure is used to select the test attribute at each node in the tree. Such
a measure is also referred to as an attribute selection measure. The attribute with the highest
information gain is chosen as the test attribute for the current node. This attribute minimizes
the information needed to classify the samples in the resulting partitions and reflects the least
randomness or “impurity” in these partitions. Such an information-theoretic approach minimizes
the expected number of tests needed to classify an object and guarantees that a simple (but not
necessarily the simplest) tree is found.
Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m distinct classes, C_i (for i = 1, ..., m). Let s_i be the number of samples of S in class C_i. The expected information needed to classify a given sample is given by:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability that an arbitrary sample belongs to class C_i and is estimated by s_i/s. Note that a log function to the base 2 is used since the information is encoded in bits.
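As a quick illustration, a small helper (a sketch; the function name expected_information is introduced here, not in the text) computes I(s_1, ..., s_m) from a list of class counts:

```python
import math

def expected_information(class_counts):
    """I(s_1, ..., s_m) = -sum_i p_i * log2(p_i), with p_i = s_i / s."""
    total = sum(class_counts)
    info = 0.0
    for count in class_counts:
        if count > 0:                 # a 0 * log2(0) term contributes nothing
            p = count / total
            info -= p * math.log2(p)
    return info
```

For example, with 9 samples of one class and 5 of another, expected_information([9, 5]) evaluates to roughly 0.940 bits.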
Attribute A can be used to partition S into v subsets, {S_1, S_2, ..., S_v}, where S_j contains those samples in S that have value a_j of A. If A were selected as the test attribute (i.e., the best attribute for splitting), then these subsets would correspond to the branches grown from the node containing the set S. Let s_{ij} be the number of samples of class C_i in a subset S_j. The entropy, or expected information based on the partitioning into subsets by A, is given by:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})
The term (s_{1j} + \cdots + s_{mj})/s acts as the weight of the jth subset and is the number of samples in the subset (i.e., having value a_j of A) divided by the total number of samples in S. The smaller the entropy value is, the greater the purity of the subset partitions. The encoding information that would be gained by branching on A is:

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)
In other words, Gain(A) is the expected reduction in entropy caused by knowing the value of
attribute A.
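Continuing the sketch above (and reusing the expected_information helper; all names here are illustrative assumptions), E(A) and Gain(A) can be computed from the same (attribute dictionary, class label) sample representation, and the attribute with the highest gain then serves as the test attribute:

```python
from collections import Counter, defaultdict

def entropy_after_split(samples, attribute):
    """E(A): weighted expected information of the subsets produced by splitting on A."""
    total = len(samples)
    counts_by_value = defaultdict(Counter)   # value a_j -> class counts s_1j, ..., s_mj
    for attrs, label in samples:
        counts_by_value[attrs[attribute]][label] += 1
    return sum((sum(c.values()) / total) * expected_information(list(c.values()))
               for c in counts_by_value.values())

def information_gain(samples, attribute):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    class_counts = Counter(label for _, label in samples)
    return (expected_information(list(class_counts.values()))
            - entropy_after_split(samples, attribute))

def best_attribute(samples, attributes):
    """Choose the test attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(samples, a))
```

Plugging best_attribute in as the splitting rule of the earlier build_tree sketch yields the usual information-gain-based decision tree induction.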