The algorithm computes the information gain of each attribute. The attribute with the highest information gain is chosen as the test attribute for the given set S. A node is created and labeled with the attribute, branches are created for each value of the attribute, and the samples are partitioned accordingly.
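A minimal Python sketch of this selection step is given below. The toy samples, attribute names, and helper functions are illustrative assumptions, not part of the algorithm's specification; the point is only that the attribute maximizing information gain is chosen as the test attribute.

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a collection of class labels.
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(samples, attribute):
        # Entropy of S minus the weighted entropy of the partitions of S
        # induced by the values of the attribute.
        before = entropy([s["class"] for s in samples])
        partitions = {}
        for s in samples:
            partitions.setdefault(s[attribute], []).append(s["class"])
        after = sum(len(p) / len(samples) * entropy(p)
                    for p in partitions.values())
        return before - after

    # Toy training set: the attribute with the highest gain becomes
    # the test attribute at this node.
    samples = [
        {"outlook": "sunny", "windy": "no",  "class": "play"},
        {"outlook": "sunny", "windy": "yes", "class": "stay"},
        {"outlook": "rain",  "windy": "no",  "class": "play"},
        {"outlook": "rain",  "windy": "yes", "class": "stay"},
    ]
    best = max(["outlook", "windy"],
               key=lambda a: information_gain(samples, a))
    print(best)  # "windy": it separates the two classes perfectly here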
4.7.3 Tree Pruning


After building a decision tree, a tree pruning step can be performed to reduce its size. Decision trees that are too large are susceptible to a phenomenon called overfitting. Pruning helps by trimming the branches that reflect anomalies in the training data due to noise or outliers, thereby improving the generalization capability of the tree. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data.

                                   There are two common approaches to tree pruning.

Pre-pruning Approach

                                   In the pre-pruning approach, a tree is “pruned” by halting its construction early (e.g., by deciding
                                   not to further split or partition the subset of training samples at a given node). Upon halting, the
                                   node becomes a leaf. The leaf may hold the most frequent class among the subset samples, or the
                                   probability distribution of those samples.
When constructing a tree, measures such as statistical significance, the χ² statistic, information gain, etc., can be used to assess the goodness of a split. If partitioning the samples at a node would result in a split that falls below a pre-specified threshold, then further partitioning of the given subset is halted.
                                   There are difficulties, however, in choosing an appropriate threshold. High thresholds could
                                   result in oversimplified trees, while low thresholds could result in very little simplification.
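Libraries that build decision trees typically expose such thresholds as stopping parameters. As a hedged illustration (the iris dataset and the specific threshold values are arbitrary stand-ins), scikit-learn's min_impurity_decrease halts construction in exactly this way:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # min_impurity_decrease plays the role of the pre-specified threshold:
    # any candidate split whose impurity reduction falls below it is
    # refused, so construction halts early and the node becomes a leaf.
    # min_samples_split likewise refuses to split small subsets.
    pre_pruned = DecisionTreeClassifier(min_impurity_decrease=0.01,
                                        min_samples_split=10)
    pre_pruned.fit(X, y)
    print(pre_pruned.get_depth(), pre_pruned.get_n_leaves())

Raising the threshold yields a smaller (possibly oversimplified) tree; lowering it yields very little simplification, mirroring the trade-off described above.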

Post-pruning Approach

                                   The post-pruning approach removes branches from a “fully grown” tree. A tree node is pruned
                                   by removing its branches.

The cost complexity pruning algorithm is an example of the post-pruning approach. The pruned node becomes a leaf and is labeled with the most frequent class among its former branches. For
                                   each non-leaf node in the tree, the algorithm calculates the expected error rate that would occur if
                                   the subtree at that node were pruned. Next, the expected error rate occurring if the node were not
                                   pruned is calculated using the error rates for each branch, combined by weighting according to
                                   the proportion of observations along each branch. If pruning the node leads to a greater expected
                                   error rate, then the subtree is kept. Otherwise, it is pruned. After generating a set of progressively
                                   pruned trees, an independent test set is used to estimate the accuracy of each tree. The decision
                                   tree that minimizes the expected error rate is preferred.
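scikit-learn implements this idea as cost-complexity pruning. The sketch below (the iris data, random seed, and train/test split are stand-ins; the split plays the role of the independent test set mentioned above) grows a full tree, enumerates the progressively pruned trees, and keeps the one with the lowest error rate on the held-out data:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Enumerate the sequence of progressively pruned trees: each value
    # in ccp_alphas identifies one tree on the cost-complexity path.
    path = DecisionTreeClassifier(random_state=0) \
        .cost_complexity_pruning_path(X_tr, y_tr)

    # Refit one tree per alpha and keep the one whose estimated error
    # rate on the independent test set is lowest (accuracy is highest).
    trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a)
             .fit(X_tr, y_tr) for a in path.ccp_alphas]
    best = max(trees, key=lambda t: t.score(X_te, y_te))
    print(best.get_n_leaves(), best.score(X_te, y_te))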


Example: A company is trying to decide whether or not to bid for a certain contract. They estimate that merely preparing the bid will cost £10,000. If the company bids, they estimate that there is a 50% chance that their bid will be put on the "short-list"; otherwise their bid will be rejected.

Once "short-listed", the company will have to supply further detailed information (entailing costs estimated at £5,000). After this stage their bid will either be accepted or rejected.

The company estimates that the labour and material costs associated with the contract are £127,000. They are considering three possible bid prices, namely £155,000, £170,000 and £190,000. They




