Page 51 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Unit 3: Data Mining Techniques




          Most companies already collect and refine massive quantities of data. Data mining techniques
          can be implemented rapidly on existing software and hardware platforms to enhance the value of
          existing information resources, and can be integrated with new products and systems as they are
          brought on-line. When implemented on high performance client/server or parallel processing
          computers, data mining tools can analyze massive databases to deliver answers to questions such
          as, “Which clients are most likely to respond to my next promotional mailing, and why?”

          3.1 Statistical Perspective on Data Mining


          The information age has been matched by an explosion of data. This surfeit has been a result of
          modern, improved and, in many cases, automated methods for both data collection and storage.
          For instance, many stores tag their items with a product-specific bar code, which is scanned
          in when the corresponding item is bought. This automatically creates a gigantic repository of
          information  on  products  and  product  combinations  sold.  Similar  databases  are  also  created
          by automated book-keeping, digital communication tools or by remote sensing satellites, and
          aided by the availability of affordable and effective storage mechanisms – magnetic tapes, data
          warehouses and so on. This has created a situation of plentiful data and the potential for new and
          deeper understanding of complex phenomena. The very size of these databases however means
          that any signal or pattern may be overshadowed by “noise”.
          Consider  for  instance  the  database  created  by  the  scanning  of  product  bar  codes  at  sales
          checkouts. Originally adopted for reasons of convenience, this now forms the basis for gigantic
          databases as large stores maintain records of products bought by customers in any transaction.
          Some businesses have gone further: by providing customers with an incentive to use a magnetic-
          striped frequent shopper card, they have created a database not just of product combinations
          but also time-sequenced information on such transactions. The goal behind collecting such data
          is the ability to answer questions such as “If potato chips and ketchup are purchased together,
          what is the item that is most likely to be also bought?”, or “If shampoo is purchased, what is the
          most common item also bought in that same transaction?”. Answers to such questions result in
          what are called association rules. Such rules can be used, for instance, in deciding on store layout
          or on promotions of certain brands of products by offering discounts on select combinations.
          Indeed, applications of association rules extend well beyond sales transaction data.
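          The support and confidence measures that underlie association rules can be sketched in a few
          lines of Python. The transactions and product names below are purely illustrative, not taken
          from any real store database:

```python
# Minimal sketch of association-rule measures; data is hypothetical.
transactions = [
    {"chips", "ketchup", "cola"},
    {"chips", "ketchup"},
    {"shampoo", "conditioner"},
    {"chips", "cola"},
    {"shampoo", "soap"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent in basket | antecedent in basket)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)

print(support({"chips", "ketchup"}, transactions))               # 0.4
print(confidence({"chips", "ketchup"}, {"cola"}, transactions))  # 0.5
```

          A rule such as "if chips and ketchup are bought, cola is also bought" would be reported
          together with its support (how often the full combination occurs) and its confidence (how often
          the consequent follows, given the antecedent).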
          An oft-stated goal of data mining is the discovery of patterns and relationships among different
          variables  in  the  database.  This  is  no  different  from  some  of  the  goals  of  statistical  inference:
          consider, for instance, simple linear regression. Similarly, the pair-wise relationships between the
          products sold above can be nicely represented by an undirected weighted graph, with products
          as the nodes and an edge between two products weighted by the number of transactions in
          which that pair appears together. While undirected graphs provide a graphical display, directed
          acyclic graphs are perhaps more interesting: they offer insight into the phenomena driving the
          relationships between the variables. The nature of these relationships
          can be analyzed using classical and modern statistical tools such as regression, neural networks
          and so on.
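          The undirected weighted graph described above can be built directly from transaction data:
          each edge weight counts the transactions containing both products. A minimal sketch, using
          hypothetical transactions:

```python
from collections import Counter
from itertools import combinations

# Illustrative transactions; product names are hypothetical.
transactions = [
    {"chips", "ketchup", "cola"},
    {"chips", "ketchup"},
    {"chips", "cola"},
    {"shampoo", "conditioner"},
]

# Edge weight = number of transactions in which both products appear.
edge_weight = Counter()
for basket in transactions:
    for a, b in combinations(sorted(basket), 2):  # sort so each pair has one canonical key
        edge_weight[(a, b)] += 1

print(edge_weight[("chips", "ketchup")])  # 2
print(edge_weight[("chips", "cola")])     # 2
```

          The resulting `edge_weight` mapping is exactly the weighted edge list of the graph: nodes
          are products, and heavier edges mark product pairs that co-occur in many transactions.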
          Another  aspect  of  knowledge  discovery  is  supervised  learning.  Statistical  tools  such  as
          discriminant analysis or classification trees often need to be refined for these problems. Some
          additional methods to be investigated here are k-nearest neighbor methods, bootstrap aggregation
          or bagging, and boosting which originally evolved in the machine learning literature, but whose
          statistical properties have been analyzed in recent years by statisticians. Boosting is particularly
          useful in the context of data streams – when we have rapid data flowing into the system and
          real-time classification rules are needed. Such capability is especially desirable in the context
          of  financial  data,  to  guard  against  credit  card  and  calling  card  fraud,  when  transactions  are
          streaming in from several sources and an automated split-second determination of fraudulent or
          genuine use has to be made, based on past experience.
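          Of the supervised methods mentioned, k-nearest neighbors is the simplest to illustrate: a new
          observation is assigned the majority label among its k closest training points. The sketch below
          uses invented two-dimensional data loosely echoing the fraud-detection setting; the labels and
          coordinates are purely illustrative:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled transactions as 2-D feature vectors.
train = [
    ((0.0, 0.0), "genuine"), ((0.1, 0.2), "genuine"), ((0.2, 0.1), "genuine"),
    ((5.0, 5.0), "fraud"),   ((5.1, 4.9), "fraud"),   ((4.9, 5.2), "fraud"),
]

print(knn_predict(train, (0.3, 0.1)))  # genuine
print(knn_predict(train, (4.8, 5.0)))  # fraud
```

          In a real streaming setting, speed matters: the naive scan above is O(n) per query, which is
          why methods such as boosting, with cheap per-observation scoring, are preferred for real-time
          classification of data streams.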





                                            Lovely Professional University