Page 121 - DMGT308_CUSTOMER_RELATIONSHIP_MANAGEMENT

Customer Relationship Management




                                   An oft-stated goal of data mining is the discovery of patterns and relationships among different
                                   variables in the database. This is no different from some of the goals of statistical inference:
                                   consider, for instance, simple linear regression. Similarly, the pair-wise relationship between
                                   the products sold above can be nicely represented by means of an undirected weighted graph,
                                   with products as the nodes and each edge weighted in proportion to the number of transactions
                                   that contain that particular product pair. While undirected graphs provide a graphical
                                   display, directed acyclic graphs are perhaps more interesting: they provide understanding of
                                   the phenomena driving the relationships between the variables. The nature of these relationships
                                   can be analyzed using classical and modern statistical tools such as regression, neural networks
                                   and so on.
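
                                   The weighted product graph described above can be sketched in a few lines. This is only an
                                   illustration: the function name cooccurrence_graph and the sample baskets are hypothetical,
                                   and each edge weight is simply the count of transactions containing both products.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(transactions):
    """Return {(product_a, product_b): weight} for an undirected graph.

    The weight of an edge is the number of transactions that contain
    both products. Pairs are stored in sorted order so that (a, b) and
    (b, a) map to the same edge.
    """
    edges = Counter()
    for basket in transactions:
        # set() drops repeated items within one basket; sorted() gives
        # each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(basket)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical transactions, invented for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread"],
]

graph = cooccurrence_graph(transactions)
for (a, b), weight in sorted(graph.items()):
    print(f"{a} -- {b}: weight {weight}")
```

                                   Products that never appear together simply have no edge, and a single-item basket contributes
                                   nothing to the graph.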
                                   Another aspect of knowledge discovery is supervised learning. Statistical tools such as
                                   discriminant analysis or classification trees often need to be refined for these problems. Some
                                   additional methods to be investigated here are k-nearest neighbour methods, bootstrap
                                   aggregation or bagging, and boosting, which originally evolved in the machine learning
                                   literature, but whose statistical properties have been analyzed in recent years by statisticians.
                                   Boosting is particularly useful in the context of data streams, when we have rapid data flowing
                                   into the system and real-time classification rules are needed. Such capability is especially desirable
                                   in the context of financial data, to guard against credit card and calling card fraud, when transactions
                                   are streaming in from several sources and an automated split-second determination of fraudulent
                                   or genuine use has to be made, based on past experience.
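
                                   As a minimal sketch of one of the methods named above, the k-nearest neighbour rule labels a
                                   new transaction by majority vote among the k most similar past transactions. The function name
                                   knn_predict, the two features and all the data values are assumptions made for illustration;
                                   this is not a production fraud detector, and it does not illustrate bagging or boosting.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.

    `train` is a list of (features, label) pairs, where features is a
    tuple of numbers. Distance is plain Euclidean distance.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical past transactions: (amount, distance from home), both
# already rescaled to comparable units, with a known outcome.
train = [
    ((0.2, 0.1), "genuine"),
    ((0.5, 0.2), "genuine"),
    ((0.3, 0.4), "genuine"),
    ((9.0, 8.5), "fraud"),
    ((8.2, 9.1), "fraud"),
    ((7.9, 7.5), "fraud"),
]

small_local = knn_predict(train, (0.4, 0.3))
large_remote = knn_predict(train, (8.5, 8.0))
print(small_local, large_remote)
```

                                   Because the rule compares raw distances, features measured on very different scales would
                                   need to be standardized first.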
                                   Another important aspect of knowledge discovery is unsupervised learning or clustering, which
                                   is the categorization of the observations in a dataset into an a priori unknown number of
                                   groups, based on some characteristic of the observations. This is a very difficult problem, and is
                                   only compounded when the database is massive. Hierarchical clustering, probability-based
                                   methods, as well as optimization partitioning algorithms are all difficult to apply here. Maitra
                                   (2001) develops, under restrictive Gaussian equal-dispersion assumptions, a multipass scheme
                                   which clusters an initial sample, filters out observations that can be reasonably classified by
                                   these clusters, and iterates the above procedure on the remainder. This method is scalable,
                                   which means that it can be applied to very large datasets.
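
                                   The sample, filter and iterate structure of such a multipass scheme can be sketched as follows.
                                   This is a loose illustration, not Maitra's algorithm: the clustering step is a simple "leader"
                                   rule standing in for a proper method, "reasonably classified" is replaced by a fixed distance
                                   cutoff (radius), and all names and data are hypothetical.

```python
import math
import random


def leader_cluster(sample, radius):
    """Stand-in clustering step: a point founds a new cluster unless it
    already lies within `radius` of an existing cluster center."""
    centers = []
    for point in sample:
        if all(math.dist(point, c) > radius for c in centers):
            centers.append(point)
    return centers


def multipass_cluster(data, sample_size, radius, seed=0):
    """Cluster `data` in passes; returns (centers, {index: cluster id})."""
    rng = random.Random(seed)
    centers, labels = [], {}
    remainder = list(range(len(data)))  # indices not yet classified
    while remainder:
        # 1. Cluster a sample drawn from the unclassified observations.
        picked = rng.sample(remainder, min(sample_size, len(remainder)))
        new_centers = leader_cluster([data[i] for i in picked], radius)
        offset = len(centers)
        centers.extend(new_centers)
        # 2. Filter out every observation close enough to a new center.
        still_left = []
        for i in remainder:
            dists = [math.dist(data[i], c) for c in new_centers]
            j = min(range(len(dists)), key=dists.__getitem__)
            if dists[j] <= radius:
                labels[i] = offset + j
            else:
                still_left.append(i)
        # 3. Iterate on whatever could not be classified.
        remainder = still_left
    return centers, labels


# Hypothetical data: three tight, well-separated groups.
data = [
    (0.0, 0.0), (0.5, 0.2), (0.1, 0.8),
    (10.0, 10.0), (10.4, 9.7), (9.8, 10.5),
    (20.0, 0.0), (19.6, 0.3), (20.2, 0.9),
]
centers, labels = multipass_cluster(data, sample_size=2, radius=3.0)
print(len(centers), "clusters found")
```

                                   Every sampled point lies within the cutoff of some new center, so each pass classifies at
                                   least the sample itself and the loop always terminates. Only the sample is clustered in each
                                   pass; the full remainder is touched once per pass, in a single filtering scan.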
                                   The field of data mining, like statistics, concerns itself with “learning from data” or “turning
                                   data into information”.

                                   5.2.3 Clustering

                                   Cluster analysis is used to form groups or clusters of similar records based on several measures
                                   made on these records. The key idea is to characterize the clusters in ways that would be useful
                                   for the aims of the analysis. Cluster analysis has been applied in many areas, including astronomy,
                                   archaeology, medicine, chemistry, education, psychology, linguistics and sociology.
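
                                   A minimal sketch of grouping records on several measures is given below, using k-means as one
                                   common choice of method (the text does not prescribe a specific algorithm). The function names,
                                   the farthest-point starting rule and the six records are assumptions made for illustration.

```python
import math


def assign(records, centers):
    """Label each record with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(r, centers[j]))
        for r in records
    ]


def kmeans(records, k, iterations=20):
    """Plain k-means on tuples of numeric measures."""
    # Deterministic start: the first record, then repeatedly the record
    # farthest from all centers chosen so far.
    centers = [records[0]]
    while len(centers) < k:
        centers.append(
            max(records, key=lambda r: min(math.dist(r, c) for c in centers))
        )
    for _ in range(iterations):
        labels = assign(records, centers)
        new_centers = []
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                # New center = per-measure mean of the cluster members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, assign(records, centers)


# Hypothetical records: (age in years, monthly spend in hundreds).
records = [(25, 2), (27, 3), (24, 2.5), (60, 9), (62, 8), (58, 9.5)]
centers, labels = kmeans(records, k=2)
print(labels)
```

                                   The resulting cluster centers are one simple way to characterize each group, since each
                                   center is the average record of its cluster.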

                                          Example: Biologists have made extensive use of classes and subclasses to organize species.
                                   A spectacular success of the clustering idea in chemistry was Mendeleev’s periodic table of the
                                   elements.

                                   One popular use of cluster analysis in marketing is for market segmentation: customers are
                                   segmented based on demographic and transaction history information and a marketing strategy
                                   is tailored for each segment. Another use is market structure analysis: identifying groups of
                                   similar products according to competitive measures of similarity. In marketing and political
                                   forecasting, clustering of neighbourhoods using U.S. postal zip codes has been used successfully
                                   to group neighbourhoods by lifestyles. Claritas, a company that pioneered this approach, grouped
                                   neighbourhoods into 40 clusters using various measures of consumer expenditure and
                                   demographics. Examining the clusters enabled Claritas to come up with evocative names, such



          116                               LOVELY PROFESSIONAL UNIVERSITY