Page 51 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Unit 3: Data Mining Techniques
Most companies already collect and refine massive quantities of data. Data mining techniques
can be implemented rapidly on existing software and hardware platforms to enhance the value of
existing information resources, and can be integrated with new products and systems as they are
brought on-line. When implemented on high performance client/server or parallel processing
computers, data mining tools can analyze massive databases to deliver answers to questions such
as, “Which clients are most likely to respond to my next promotional mailing, and why?”
3.1 Statistical Perspective on Data Mining
The information age has been matched by an explosion of data. This surfeit has been a result of
modern, improved and, in many cases, automated methods for both data collection and storage.
For instance, many stores tag their items with a product-specific bar code, which is scanned
in when the corresponding item is bought. This automatically creates a gigantic repository of
information on products and product combinations sold. Similar databases are also created
by automated book-keeping, digital communication tools or by remote sensing satellites, and
aided by the availability of affordable and effective storage mechanisms – magnetic tapes, data
warehouses and so on. This has created a situation of plentiful data and the potential for new and
deeper understanding of complex phenomena. The very size of these databases however means
that any signal or pattern may be overshadowed by “noise”.
Consider for instance the database created by the scanning of product bar codes at sales
checkouts. Originally adopted for reasons of convenience, this now forms the basis for gigantic
databases as large stores maintain records of products bought by customers in any transaction.
Some businesses have gone further: by providing customers with an incentive to use a magnetic-
striped frequent shopper card, they have created a database not just of product combinations
but also time-sequenced information on such transactions. The goal behind collecting such data
is the ability to answer questions such as “If potato chips and ketchup are purchased together,
what is the item that is most likely to be also bought?”, or “If shampoo is purchased, what is the
most common item also bought in that same transaction?”. Answers to such questions result in
what are called association rules. Such rules can be used, for instance, in deciding on store layout
or on promotions of certain brands of products by offering discounts on select combinations.
Indeed, applications of association rules extend well beyond sales transaction data.
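The strength of an association rule is usually measured by its support (how often the items occur together) and confidence (how often the consequent appears given the antecedent). A minimal sketch in Python, using a small hypothetical set of transactions with illustrative item names:

```python
# Hypothetical transaction data; item names are illustrative only.
transactions = [
    {"chips", "ketchup", "soda"},
    {"chips", "ketchup", "bread"},
    {"shampoo", "conditioner"},
    {"chips", "soda"},
    {"shampoo", "soap"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Rule: {chips, ketchup} -> {soda}
print(support({"chips", "ketchup"}, transactions))               # 0.4
print(confidence({"chips", "ketchup"}, {"soda"}, transactions))  # 0.5
```

Here the rule "chips and ketchup imply soda" holds in half of the baskets that contain both chips and ketchup; a store might act on such a rule only when both support and confidence exceed chosen thresholds.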
An oft-stated goal of data mining is the discovery of patterns and relationships among different
variables in the database. This is no different from some of the goals of statistical inference:
consider for instance, simple linear regression. Similarly, the pair-wise relationship between the
products sold above can be nicely represented by means of an undirected weighted graph, with
products as the nodes and edge weights proportional to the number of transactions in which the
corresponding product pair occurs. While undirected graphs provide a graphical display, directed
acyclic graphs are perhaps more interesting: they offer insight into the phenomena driving the
relationships between the variables. The nature of these relationships
can be analyzed using classical and modern statistical tools such as regression, neural networks
and so on.
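Such a co-occurrence graph can be built directly from transaction data. The sketch below (again with made-up item names) counts, for every product pair, the number of transactions containing both; these counts serve as the edge weights:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions; item names are illustrative only.
transactions = [
    {"chips", "ketchup", "soda"},
    {"chips", "ketchup"},
    {"chips", "soda"},
]

# Edge weight = number of transactions containing both products of the pair.
edge_weights = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        edge_weights[pair] += 1

print(edge_weights[("chips", "ketchup")])  # 2
print(edge_weights[("chips", "soda")])     # 2
print(edge_weights[("ketchup", "soda")])   # 1
```

Sorting each basket before forming pairs ensures that ("chips", "soda") and ("soda", "chips") are counted as the same undirected edge.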
Another aspect of knowledge discovery is supervised learning. Statistical tools such as
discriminant analysis or classification trees often need to be refined for these problems. Some
additional methods to be investigated here are k-nearest neighbor methods, bootstrap aggregation
(bagging), and boosting, which originally evolved in the machine learning literature but whose
statistical properties have been analyzed in recent years by statisticians. Boosting is particularly
useful in the context of data streams – when we have rapid data flowing into the system and
real-time classification rules are needed. Such capability is especially desirable in the context
of financial data, to guard against credit card and calling card fraud, when transactions are
streaming in from several sources and an automated split-second determination of fraudulent or
genuine use has to be made, based on past experience.
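Of the methods listed above, k-nearest neighbor is the simplest to illustrate: a new observation is assigned the majority label among the k closest training points. A minimal sketch, using made-up two-dimensional points and labels in place of real transaction features:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.
    train: list of ((x, y), label) pairs; distance is squared Euclidean."""
    by_dist = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters (labels are illustrative only).
train = [((0, 0), "genuine"), ((1, 0), "genuine"), ((0, 1), "genuine"),
         ((5, 5), "fraud"), ((6, 5), "fraud"), ((5, 6), "fraud")]

print(knn_classify(train, (0.5, 0.5)))  # genuine
print(knn_classify(train, (5.5, 5.5)))  # fraud
```

In a streaming setting such as fraud detection, the same vote would be computed against a continually updated window of recent labelled transactions, which is one reason fast, incremental methods such as boosting are attractive there.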
Lovely Professional University 45