Page 164 - DCAP208_Management Support Systems

Unit 10: Data Mining Tools and Techniques




          business decisions still just become more informed guesses. The more and better the data, and
          the better the understanding of statistics, the better the decisions that can be made.
          Statistics has been around for a long time: easily a century, and arguably many centuries, since
          the ideas of probability began to gel. It could even be argued that the data collected by the
          ancient Egyptians, Babylonians, and Greeks were all statistics long before the field was officially
          recognized. Today data mining has been defined independently of statistics, though “mining
          data” for patterns and predictions is really what statistics is all about.





             Notes  Some of the techniques that are classified under data mining, such as CHAID and
            CART, really grew out of the statistical profession more than anywhere else, and the basic
            ideas of probability, independence, causality, and overfitting are the foundation on
            which both data mining and statistics are built.

          10.2.2 Nearest Neighbor

          Clustering and the nearest neighbor prediction technique are among the oldest techniques
          used in data mining. Most people have an intuitive understanding of what clustering is:
          like records are grouped, or clustered, together. Nearest neighbor is a prediction
          technique that is quite similar to clustering. Its essence is that, to predict the
          prediction value for one record, you look for records with similar predictor values in the
          historical database and use the prediction value from the record that is “nearest” to the
          unclassified record.
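
          The lookup described above can be sketched in a few lines of Python. This is only a minimal
          illustration: the function name, the sample records, and the choice of Euclidean distance
          as the measure of “nearness” are all assumptions, since the text does not fix any of them.

```python
import math

def nearest_neighbor_predict(history, query):
    """Return the prediction value of the historical record closest to query.

    history: list of (predictors, prediction_value) pairs, where predictors
             is a tuple of numbers.
    query:   tuple of predictor values for the unclassified record.
    """
    if not history:
        raise ValueError("history must contain at least one record")
    # Euclidean distance is an assumption; the text only says "nearest".
    nearest = min(history, key=lambda rec: math.dist(rec[0], query))
    return nearest[1]

# Hypothetical historical database: (age, years_as_customer) -> responded?
history = [
    ((25, 1), "no"),
    ((40, 10), "yes"),
    ((60, 20), "yes"),
]
prediction = nearest_neighbor_predict(history, (38, 9))
```

          Here the unclassified record (38, 9) lies closest to the historical record (40, 10), so it
          takes that record's prediction value.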
          A Simple Example of Clustering


          A simple example of clustering is the sorting that most people perform when they do
          the laundry: grouping the permanent press, dry cleaning, whites, and brightly colored clothes.
          The grouping is important because the items in each group have similar characteristics, and
          it turns out they share important attributes in the way they behave (and can be ruined) in
          the wash. To “cluster” your laundry, most of your decisions are relatively straightforward.
          There are of course difficult decisions to be made, such as which cluster your white shirt
          with red stripes goes into (since it is mostly white but has some color and is permanent
          press). When clustering is used in business, the clusters are often much more dynamic, even
          changing from week to week or month to month, and many more of the decisions concerning
          which cluster a record falls into can be difficult.
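
          The laundry sorting above can be mimicked with a deliberately simple sketch that groups
          items whose attributes match exactly. This is a stand-in for the idea of grouping like
          records, not a full clustering algorithm; the garments, attribute names, and values are
          hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def cluster_by_attributes(items, keys):
    """Group items whose values agree on every attribute in keys."""
    clusters = defaultdict(list)
    for item in items:
        # Items with the same attribute values share one signature.
        signature = tuple(item[k] for k in keys)
        clusters[signature].append(item["name"])
    return dict(clusters)

# Hypothetical laundry items and attributes, invented for illustration.
laundry = [
    {"name": "white t-shirt", "color": "white", "care": "machine"},
    {"name": "white socks", "color": "white", "care": "machine"},
    {"name": "red sweater", "color": "bright", "care": "machine"},
    {"name": "wool suit", "color": "dark", "care": "dry clean"},
    {"name": "white shirt, red stripes", "color": "mixed", "care": "permanent press"},
]
loads = cluster_by_attributes(laundry, keys=("color", "care"))
```

          With exact matching, the striped shirt ends up in a group of its own, which is the same
          difficulty the text describes: it does not fit cleanly into any existing load.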
          A Simple Example of Nearest Neighbor


          A simple example of the nearest neighbor prediction algorithm is to look at the people
          in your neighborhood (in this case, the people who are in fact geographically near to you). You
          may notice that, in general, you all have somewhat similar incomes. Thus if your neighbor has
          an income greater than $100,000, chances are good that you too have a high income. Certainly the
          chances that you have a high income are greater when all of your neighbors have incomes over
          $100,000 than when all of your neighbors have incomes of $20,000. Within your neighborhood there
          may still be a wide variety of incomes, even among your “closest” neighbors, but if you
          had to predict someone’s income knowing only their neighbors, your best chance of
          being right would be to predict the income of the neighbors who live closest to the unknown
          person.
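
          The neighborhood example can be sketched as follows. The text does not say how to combine
          several neighbors' incomes, so averaging the k closest is an assumption made here, as are
          the function name, the coordinates, and the income figures.

```python
import math

def predict_income(neighbors, location, k=3):
    """Average the incomes of the k neighbors closest to location.

    neighbors: list of ((x, y), income) pairs with known incomes.
    location:  (x, y) position of the person whose income is unknown.
    """
    if k < 1 or not neighbors:
        raise ValueError("need k >= 1 and at least one neighbor")
    # Sort by geographic distance and keep the k closest neighbors.
    closest = sorted(neighbors, key=lambda n: math.dist(n[0], location))[:k]
    # Averaging is an assumption; the text only says to use the nearest incomes.
    return sum(income for _, income in closest) / len(closest)

# Hypothetical street: positions in arbitrary units, incomes in dollars.
neighbors = [
    ((0, 1), 110_000),
    ((1, 0), 120_000),
    ((1, 1), 130_000),
    ((9, 9), 20_000),
    ((10, 9), 25_000),
]
estimate = predict_income(neighbors, (0, 0), k=3)
```

          A person located at (0, 0) is surrounded by high-income neighbors and so receives a high
          estimate, while the same call for a location near (10, 10) would draw on the two
          low-income neighbors instead.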
          The nearest neighbor prediction algorithm works in very much the same way, except that
          “nearness” in a database may depend on a variety of factors, not just where the person lives.
          It may, for instance, be far more important to know which school someone attended and what




                                           LOVELY PROFESSIONAL UNIVERSITY                                   157