Page 164 - DCAP208_Management Support Systems
Unit 10: Data Mining Tools and Techniques
business decisions still just become more informed guesses. The more and better the data, and
the better the understanding of statistics, the better the decisions that can be made.
Statistics has been around for a long time – easily a century, and arguably many centuries, since
the ideas of probability began to gel. It could even be argued that the data collected by the
ancient Egyptians, Babylonians, and Greeks were all statistics long before the field was officially
recognized. Today data mining has been defined independently of statistics, though "mining
data" for patterns and predictions is really what statistics is all about.
Notes Some of the techniques that are classified under data mining, such as CHAID and
CART, really grew out of the statistical profession more than anywhere else, and the basic
ideas of probability, independence, causality, and overfitting are the foundation on
which both data mining and statistics are built.
10.2.2 Nearest Neighbor
Clustering and the nearest neighbor prediction technique are among the oldest techniques
used in data mining. Most people have an intuition that they understand what clustering is –
namely, that like records are grouped or clustered together. Nearest neighbor is a prediction
technique that is quite similar to clustering: to predict the unknown value in one record, look
for records with similar predictor values in the historical database and use the prediction
value from the record that is "nearest" to the unclassified record.
A Simple Example of Clustering
A simple example of clustering is what most people perform when they do the laundry –
grouping the permanent press, dry cleaning, whites, and brightly colored clothes is important
because they have similar characteristics. It turns out they have important attributes in
common in the way they behave (and can be ruined) in the wash. To "cluster" your laundry,
most of your decisions are relatively straightforward. There are, of course, difficult decisions
to be made about which cluster your white shirt with red stripes goes into (since it is mostly
white but has some color and is permanent press). When clustering is used in business, the
clusters are often much more dynamic – even changing weekly to monthly – and many more
of the decisions concerning which cluster a record falls into can be difficult.
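The grouping idea described above can be sketched with a minimal k-means clustering routine. The data and field choice (wash temperatures for two natural groups of garments) are invented purely for illustration; real business clustering would work over many attributes at once.

```python
# A minimal k-means clustering sketch (illustrative; the data are invented).
# Records are assigned to the nearest of k centroids, and centroids are
# recomputed as cluster means until the assignments stop changing.

def kmeans(points, k, iterations=100):
    # Start with the first k points as initial centroids (a simple choice).
    centroids = points[:k]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

# Toy data: wash temperatures for two natural groups of garments.
temps = [30, 31, 29, 32, 60, 62, 61, 59]
centroids, clusters = kmeans(temps, k=2)
print(sorted(centroids))  # -> [30.5, 60.5]
```

The two centroids settle at the means of the "cool wash" and "hot wash" groups, just as sorting laundry separates garments with similar washing behavior.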
A Simple Example of Nearest Neighbor
A simple example of the nearest neighbor prediction algorithm arises if you look at the people
in your neighborhood (in this case, those people who are in fact geographically near to you).
You may notice that, in general, you all have somewhat similar incomes. Thus, if your neighbor
has an income greater than $100,000, chances are good that you too have a high income. Certainly
the chances that you have a high income are greater when all of your neighbors have incomes
over $100,000 than if all of your neighbors have incomes of $20,000. Within your neighborhood
there may still be a wide variety of incomes, even among your "closest" neighbors, but if you
had to predict someone's income based only on knowing their neighbors, your best chance of
being right would be to predict the income of the neighbors who live closest to the unknown
person.
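The income example can be sketched as a 1-nearest-neighbor predictor. The historical records and the "nearness" measure (a single position coordinate standing in for where each neighbor lives) are invented here for illustration.

```python
# A minimal 1-nearest-neighbor prediction sketch (illustrative data).
# To predict an unknown value, copy it from the historical record whose
# position is closest to the query.

def nearest_neighbor_predict(history, query_position):
    """Return the income of the historical record nearest to the query."""
    nearest = min(history, key=lambda rec: abs(rec["position"] - query_position))
    return nearest["income"]

# Historical records: neighbors whose locations and incomes are known.
neighbors = [
    {"position": 1.0, "income": 110_000},
    {"position": 2.0, "income": 105_000},
    {"position": 9.0, "income": 22_000},
    {"position": 10.0, "income": 20_000},
]

# Someone at position 1.2 sits at the high-income end of the street,
# so the prediction is copied from the neighbor at position 1.0.
print(nearest_neighbor_predict(neighbors, 1.2))  # -> 110000
```

In a real database, "nearness" would be computed over many predictor fields at once rather than a single geographic coordinate, which is exactly the generalization the next paragraph describes.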
The nearest neighbor prediction algorithm works in much the same way, except that
"nearness" in a database may consist of a variety of factors, not just where the person lives.
It may, for instance, be far more important to know which school someone attended and what