The algorithm computes the information gain of each attribute. The attribute with the highest
information gain is chosen as the test attribute for the given set S. A node is created and labeled
with the attribute, branches are created for each value of the attribute, and the samples are
partitioned accordingly.
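To make this selection step concrete, here is a minimal Python sketch of choosing the test attribute by information gain. The entropy-based gain computation is the standard one; the function names and the toy dataset are illustrative, not taken from the text.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Entropy of the whole set minus the weighted entropy of the
    # partitions induced by the attribute at attr_index.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attr_indices):
    # Pick the attribute with the highest information gain.
    return max(attr_indices,
               key=lambda i: information_gain(rows, labels, i))

# Toy samples: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "yes", "no", "no"]
print(best_attribute(rows, labels, [0, 1]))  # prints 0: outlook is chosen

Here outlook separates the classes perfectly, so it has the highest gain and becomes the test attribute; a branch would then be created for each of its values.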
4.7.3 Tree Pruning
After a decision tree is built, a tree-pruning step can be performed to reduce its size. Decision
trees that are too large are susceptible to a phenomenon known as overfitting. Pruning trims
branches that reflect anomalies in the training data caused by noise or outliers, which improves
the tree's ability to generalize. Pruning methods typically use statistical measures to remove the
least reliable branches, generally resulting in faster classification and an improved ability to
correctly classify independent test data.
There are two common approaches to tree pruning.
Pre-pruning Approach
In the pre-pruning approach, a tree is “pruned” by halting its construction early (e.g., by deciding
not to further split or partition the subset of training samples at a given node). Upon halting, the
node becomes a leaf. The leaf may hold the most frequent class among the subset samples, or the
probability distribution of those samples.
When constructing a tree, measures such as statistical significance, the χ² statistic, information gain, etc., can
be used to assess the goodness of a split. If partitioning the samples at a node would result in a split
that falls below a pre-specified threshold, then further partitioning of the given subset is halted.
There are difficulties, however, in choosing an appropriate threshold. High thresholds could
result in oversimplified trees, while low thresholds could result in very little simplification.
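As an illustration, the following sketch uses scikit-learn (an assumed library choice; the text does not prescribe one), whose min_impurity_decrease parameter plays the role of the pre-specified threshold described above: a node is split only if the split improves impurity by at least that amount. The dataset and the threshold value are arbitrary.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Halt construction early: refuse any split whose impurity decrease
# falls below the threshold, and never grow past depth 4. Nodes where
# construction halts become leaves labeled with the majority class.
tree = DecisionTreeClassifier(min_impurity_decrease=0.01, max_depth=4)
tree.fit(X, y)
print(tree.get_n_leaves(), tree.get_depth())

Raising min_impurity_decrease here is exactly the threshold trade-off noted above: a high value yields a smaller, possibly oversimplified tree, while a low value barely prunes at all.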
Post-pruning Approach
The post-pruning approach removes branches from a "fully grown" tree. A tree node is pruned
by removing its branches; the pruned node becomes a leaf and is labeled with the most frequent
class among its former branches.
The cost complexity pruning algorithm is an example of the post-pruning approach. For
each non-leaf node in the tree, the algorithm calculates the expected error rate that would occur if
the subtree at that node were pruned. Next, the expected error rate occurring if the node were not
pruned is calculated using the error rates for each branch, combined by weighting according to
the proportion of observations along each branch. If pruning the node leads to a greater expected
error rate, then the subtree is kept. Otherwise, it is pruned. After generating a set of progressively
pruned trees, an independent test set is used to estimate the accuracy of each tree. The decision
tree that minimizes the expected error rate is preferred.
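The sketch below illustrates this procedure with scikit-learn's cost-complexity pruning support (again an assumed library choice): grow a full tree, enumerate the progressively pruned trees along the pruning path, and use an independent test set to pick the tree with the lowest error, as described above. The dataset and split are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each ccp_alpha along the path corresponds to one progressively
# pruned version of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_test, y_test)  # accuracy on independent data
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.5f}, held-out accuracy={best_score:.3f}")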
Example: A company is trying to decide whether or not to bid for a certain contract. They
estimate that merely preparing the bid will cost £10,000. If they bid, they estimate a 50%
chance that their bid will be put on the "short-list"; otherwise their bid will be rejected.
Once “short-listed” the company will have to supply further detailed information (entailing costs
estimated at £5,000). After this stage their bid will either be accepted or rejected.
The company estimates that the labour and material costs associated with the contract are £127,000.
They are considering three possible bid prices, namely £155,000, £170,000 and £190,000. They