
Data Warehousing and Data Mining




                                   Another important aspect of knowledge discovery is unsupervised learning, or clustering,
                                   which is the categorization of the observations in a dataset into an a priori unknown number of
                                   groups, based on some characteristic of the observations. This is a very difficult problem, and
                                   it is only compounded when the database is massive. Hierarchical clustering, probability-based
                                   methods, and optimization-based partitioning algorithms are all difficult to apply here. Maitra
                                   (2001) develops, under restrictive Gaussian equal-dispersion assumptions, a multipass scheme
                                   which clusters an initial sample, filters out observations that can be reasonably classified by
                                   these clusters, and iterates the above procedure on the remainder. This method is scalable, which
                                   means that it can be used on datasets of any size.
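
The sample, filter, and iterate idea described above can be illustrated with a small sketch. This is not Maitra's (2001) procedure: it swaps in plain k-means for the sample clustering and a fixed distance threshold for the classification step, where the original works under Gaussian equal-dispersion assumptions. The function names and the parameters `k`, `sample_size`, `radius` and `max_passes` are hypothetical, chosen only for illustration.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of 2-D tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[j].append(p)
        # Recompute each centroid as the group mean; keep the old one if the group is empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def multipass_cluster(data, k=2, sample_size=50, radius=1.5, max_passes=10, seed=0):
    """Cluster a sample, filter out points those clusters explain, repeat on the rest."""
    rng = random.Random(seed)
    remainder = list(data)
    clusters = []  # list of (centroid, members) pairs
    for _ in range(max_passes):
        if len(remainder) < k:
            break
        # Step 1: cluster only a small random sample of what is left.
        sample = rng.sample(remainder, min(sample_size, len(remainder)))
        centroids = kmeans(sample, k, seed=seed)
        # Step 2: filter out every remaining point that lies close to a centroid.
        members = [[] for _ in centroids]
        leftover = []
        for p in remainder:
            j = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            if math.dist(p, centroids[j]) <= radius:
                members[j].append(p)
            else:
                leftover.append(p)
        clusters.extend((c, m) for c, m in zip(centroids, members) if m)
        if len(leftover) == len(remainder):  # nothing was classified; stop
            break
        # Step 3: iterate on the points that could not be classified.
        remainder = leftover
    return clusters, remainder


# Synthetic example: three well-separated groups of 100 points each (made-up data).
rng = random.Random(1)
data = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
        for cx, cy in [(0, 0), (5, 5), (10, 0)] for _ in range(100)]
clusters, unassigned = multipass_cluster(data)
print(len(clusters), "clusters;", len(unassigned), "points left unassigned")
```

Each pass runs the clustering step on at most `sample_size` points, so only the cheap filtering step touches the full remainder. This is the sense in which such a scheme can scale to large datasets.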
                                   The field of data mining, like statistics, concerns itself with “learning from data” or “turning data
                                   into information”.

                                   3.2 What is statistics and why is statistics needed?


                                   Statistics is the science of learning from data. It includes everything from planning for the
                                   collection of data and subsequent data management to end-of-the-line activities such as drawing
                                   inferences from numerical facts, called data, and presenting the results. Statistics is concerned
                                   with one of the most basic of human needs: the need to find out more about the world and how
                                   it operates in the face of variation and uncertainty. Because of the increasing use of statistics, it has
                                   become very important to understand and practice statistical thinking. Or, in the words of H.G.
                                   Wells: “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and
                                   write”.
                                   But why is statistics needed? Knowledge is what we know. Information is the communication
                                   of knowledge. Data are crude information and not knowledge by themselves. The
                                   sequence from data to knowledge is as follows: from data to information (data become information
                                   when they become relevant to the decision problem); from information to facts (information
                                   becomes facts when the data can support it); and finally, from facts to knowledge (facts become
                                   knowledge when they are used in the successful completion of the decision process). Figure
                                   3.1 illustrates this statistical thinking process based on data in constructing statistical models for
                                   decision making under uncertainties. That is why we need statistics. Statistics arose from the
                                   need to place knowledge on a systematic evidence base. This required a study of the laws of
                                   probability, the development of measures of data properties and relationships, and so on.

                                       Figure 3.1: The Statistical Thinking Process Based on Data in Constructing Statistical Models
                                                          for Decision-making under Uncertainties






























                                           Lovely Professional University