
Unit 10: Data Mining Tools and Techniques




You could create even more training records than 10 by starting a new record at every data point. For instance, you could take the first 10 data points and create a record, then take the 10 consecutive data points starting at the second data point, then the 10 consecutive data points starting at the third data point, and so on. Even though some of the data points overlap from one record to the next, the prediction value is always different. In our example of 100 initial data points, 90 different training records could be created this way, as opposed to the 10 training records created via the other method.
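As a rough illustration, here is a minimal Python sketch of this sliding-window construction: each record takes 10 consecutive points as inputs and the point that follows as its prediction value. The function and variable names are illustrative, not from the text.

```python
def make_training_records(series, window=10):
    """Build one (inputs, target) training record starting at every data
    point: `window` consecutive points as inputs, the next point as the
    prediction value."""
    records = []
    for start in range(len(series) - window):
        inputs = series[start:start + window]
        target = series[start + window]
        records.append((inputs, target))
    return records

series = list(range(100))           # 100 initial data points
records = make_training_records(series)
print(len(records))                 # 90 overlapping training records
```

Running this on 100 data points yields the 90 records described above, compared with the 10 non-overlapping records of the other method.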

          Why Voting is Better – K Nearest Neighbors

One of the improvements usually made to the basic nearest neighbor algorithm is to take a vote among the "K" nearest neighbors rather than relying on the sole nearest neighbor to the unclassified record.
Example: In Figure 10.1 we can see that unclassified example C has a nearest neighbor that is a defaulter, yet it is surrounded almost exclusively by records that are good credit risks. In this case the nearest neighbor to record C is probably an outlier, which may be incorrect data or some non-repeatable idiosyncrasy. In either case it is more than likely that C is a non-defaulter, yet C would be predicted to be a defaulter if the sole nearest neighbor were used for the prediction.
Figure 10.1: The Nearest Neighbors are Shown Graphically for Three Unclassified Records: A, B, and C
In cases like these, a vote of the 9 or 15 nearest neighbors would provide better prediction accuracy for the system than just the single nearest neighbor. Usually this is accomplished by taking the majority or plurality of predictions from the K nearest neighbors when the prediction column is binary or categorical, or by taking the average value of the prediction column from the K nearest neighbors when it is numeric.
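As a rough sketch of how this voting works, the Python below classifies a record by a majority (plurality) vote of its K nearest neighbors for a categorical prediction column, and by averaging for a numeric one. The Euclidean distance measure and all function names are illustrative assumptions, since the text does not prescribe them.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two numeric feature vectors (an
    assumed measure; any distance function could be substituted)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(training, query, k):
    """training: list of (features, prediction) pairs; return the K
    records closest to the query."""
    return sorted(training, key=lambda rec: distance(rec[0], query))[:k]

def predict_vote(training, query, k=9):
    """Majority/plurality vote for a binary or categorical column."""
    labels = [label for _, label in k_nearest(training, query, k)]
    return Counter(labels).most_common(1)[0][0]

def predict_average(training, query, k=9):
    """Average of the prediction column for a numeric target."""
    values = [value for _, value in k_nearest(training, query, k)]
    return sum(values) / len(values)
```

Applied to record C in Figure 10.1, a 9-neighbor vote would return the good-risk class, since the single defaulting neighbor is outvoted by the surrounding good credit risks.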
Another important aspect of any system used to make predictions is that the user be provided with not only the prediction but also some sense of the confidence in that prediction (e.g., the prediction is "defaulter" with a 60% chance of being correct). The nearest neighbor algorithm provides this confidence information in a number of ways.
The distance to the nearest neighbor provides one level of confidence. If the neighbor is very close or an exact match, then there is much higher confidence in the prediction than if the nearest record is a great distance from the unclassified record.
The degree of homogeneity amongst the predictions within the K nearest neighbors can also be used. If all of the nearest neighbors make the same prediction, then there is much higher confidence in the prediction than if the neighbors disagree.
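Both confidence signals are straightforward to compute together. The sketch below, reusing the illustrative distance() and k_nearest() helpers from the previous sketch, returns the winning prediction alongside the share of neighbors that agree with it and the distance to the closest match; prediction_with_confidence() is a made-up name, not an established API.

```python
from collections import Counter

def prediction_with_confidence(training, query, k=9):
    """Return the voted prediction plus two confidence signals:
    vote homogeneity (1.0 = unanimous) and nearest-neighbor distance
    (0.0 = exact match)."""
    neighbors = k_nearest(training, query, k)
    labels = [label for _, label in neighbors]
    winner, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)                   # homogeneity
    nearest_dist = distance(neighbors[0][0], query)   # closeness
    return winner, agreement, nearest_dist

# e.g. ('defaulter', 0.6, 1.7): predicted defaulter, with 60% of the
# K neighbors agreeing and the nearest neighbor at distance 1.7
```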



