Page 52 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Data Warehousing and Data Mining
Another important aspect of knowledge discovery is unsupervised learning, or clustering:
the categorization of the observations in a dataset into an a priori unknown number of
groups, based on some characteristic of the observations. This is a very difficult problem, and
the difficulty is only compounded when the database is massive. Hierarchical clustering,
probability-based methods, and optimization-based partitioning algorithms are all difficult to
apply to datasets of this size. Maitra (2001) develops, under restrictive Gaussian
equal-dispersion assumptions, a multipass scheme that clusters an initial sample, filters out
the observations that can be reasonably classified by these clusters, and repeats the procedure
on the remainder. This method is scalable, meaning that it can, in principle, be applied to
datasets of any size.
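The sample, filter, and iterate loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not Maitra's actual procedure: plain k-means with a fixed k stands in for the Gaussian model-based clustering step, and a fixed distance threshold stands in for the rule that decides whether an observation is "reasonably classified". The names multipass_cluster, kmeans, sample_size and radius are hypothetical and chosen for this sketch.

```python
import math
import random


def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd's k-means on a list of coordinate tuples; returns k centroids."""
    rng = rng or random.Random(0)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # New centroid = mean of its group; keep the old one if the group is empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    return centroids


def multipass_cluster(data, k=2, sample_size=50, radius=1.5, max_passes=10, seed=0):
    """Illustrative multipass loop (assumed stand-ins, not the published method).

    Each pass clusters a random sample of the remaining observations, then
    filters out every observation lying within `radius` of a centroid found
    in that pass. Returns all centroids found and the unclassified remainder.
    """
    rng = random.Random(seed)
    remaining = list(data)
    all_centroids = []
    for _ in range(max_passes):
        if len(remaining) < k:
            break
        # Step 1: cluster a sample of what is left.
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        centroids = kmeans(sample, k, rng=rng)
        all_centroids.extend(centroids)
        # Step 2: keep only the observations not yet well classified.
        remaining = [
            p for p in remaining
            if min(math.dist(p, c) for c in centroids) > radius
        ]
    return all_centroids, remaining
```

Calling multipass_cluster(points) on a list of coordinate tuples returns the centroids collected over all passes together with any observations still unclassified when the pass limit is reached. The max_passes limit is an assumption added so that the loop always terminates, since a pass is not guaranteed to filter anything out.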
The field of data mining, like statistics, concerns itself with “learning from data” or “turning data
into information”.
3.2 What is statistics and why is statistics needed?
Statistics is the science of learning from data. It includes everything from planning for the
collection of data and subsequent data management to end-of-the-line activities such as drawing
inferences from numerical facts called data and presenting the results. Statistics is concerned
with one of the most basic of human needs: the need to find out more about the world and how
it operates in the face of variation and uncertainty. Because of the increasing use of statistics, it has
become very important to understand and practice statistical thinking. Or, in the words of H.G.
Wells: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and
write”.
But why is statistics needed? Knowledge is what we know. Information is the communication
of knowledge. Data are crude information, not knowledge by themselves. The
sequence from data to knowledge is as follows: from data to information (data become information
when they become relevant to the decision problem); from information to facts (information
becomes facts when the data can support it); and finally, from facts to knowledge (facts become
knowledge when they are used in the successful completion of the decision process). Figure
3.1 illustrates this statistical thinking process, based on data, in constructing statistical models for
decision making under uncertainties. That is why we need statistics. Statistics arose from the
need to place knowledge on a systematic evidence base. This required a study of the laws of
probability, the development of measures of data properties and relationships, and so on.
Figure 3.1: The Statistical Thinking Process Based on Data in Constructing Statistical Models
for Decision-making under Uncertainties
Lovely Professional University