Page 38 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Data Warehousing and Data Mining




                                   Q appear together in a transaction and c (for confidence) is the conditional probability that Q
                                   appears in a transaction when P is present. For example, the hypothetical association rule

                                   RentType(X, “game”) ∧ Age(X, “13-19”) → Buys(X, “pop”) [s = 2%, c = 55%]
                                   would indicate that 2% of the transactions considered are of customers aged between 13 and
                                   19 who are renting a game and buying pop, and that there is a certainty of 55% that teenage
                                   customers who rent a game also buy pop.
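
The two measures can be computed directly by counting transactions. The following is a minimal sketch, assuming each transaction is represented as a set of item strings; the function name `support_and_confidence` and the toy data are hypothetical and are not taken from the text.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (s, c) for the rule antecedent -> consequent.

    s: fraction of all transactions that contain antecedent and consequent together.
    c: fraction of the transactions containing the antecedent that also contain
       the consequent (the conditional probability described above).
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    s = n_both / n_total if n_total else 0.0
    c = n_both / n_antecedent if n_antecedent else 0.0
    return s, c


# Made-up transactions, for illustration only.
transactions = [
    {"rent_game", "age_13_19", "buys_pop"},
    {"rent_game", "age_13_19"},
    {"rent_movie", "age_20_29", "buys_pop"},
    {"rent_game", "age_13_19", "buys_pop"},
    {"rent_movie", "age_13_19"},
]
s, c = support_and_confidence(
    transactions, {"rent_game", "age_13_19"}, {"buys_pop"}
)
```

In this toy data, two of the five transactions contain all three items, and three contain the antecedent, so the support is 2/5 and the confidence is 2/3.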

                                   Classification

                                   Classification is the process of finding a set of models (or functions) that describe and
                                   distinguish data classes or concepts, so that the model can be used to predict
                                   the class of objects whose class label is unknown. The derived model is based on the analysis of
                                   a set of training data (i.e., data objects whose class label is known). The derived model may be
                                   represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical
                                   formulae, or neural networks. For example, after starting a credit policy, the Our Video Store
                                   managers could analyse the customers’ behaviour vis-à-vis their credit, and accordingly label the
                                   customers who received credit with one of three possible labels: “safe”, “risky” and “very risky”. The
                                   classification analysis would generate a model that could be used to either accept or reject credit
                                   requests in the future.
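
As an illustration of how IF-THEN rules might be derived from labelled training data, here is a minimal sketch that learns one rule per value of a single attribute by majority vote (similar in spirit to the simple 1R approach). It is a deliberate simplification, not the method used by the text; the attribute `late_payments`, the function names, and the training records are all hypothetical.

```python
from collections import Counter, defaultdict


def learn_rules(training_data, attribute):
    """Learn one IF-THEN rule per attribute value.

    Each rule reads: IF attribute == value THEN label, where label is the
    majority class label seen with that value in the training data.
    """
    counts = defaultdict(Counter)
    for record, label in training_data:
        counts[record[attribute]][label] += 1
    return {
        value: label_counts.most_common(1)[0][0]
        for value, label_counts in counts.items()
    }


def classify(rules, record, attribute, default="risky"):
    """Predict a label for a record whose class label is unknown."""
    return rules.get(record[attribute], default)


# Made-up training set: (record, known class label).
training_data = [
    ({"late_payments": "none"}, "safe"),
    ({"late_payments": "none"}, "safe"),
    ({"late_payments": "few"}, "risky"),
    ({"late_payments": "few"}, "safe"),
    ({"late_payments": "few"}, "risky"),
    ({"late_payments": "many"}, "very risky"),
]
rules = learn_rules(training_data, "late_payments")
label = classify(rules, {"late_payments": "few"}, "late_payments")
```

A real classifier would consider several attributes and be evaluated on data held out from training; the sketch only shows the step from labelled examples to a reusable model.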

                                   Prediction

                                   Classification can be used for predicting the class label of data objects. There are two major types
                                   of prediction: one can try to predict either (1) some unavailable data values or pending trends,
                                   or (2) a class label for some data. The latter is tied to classification. Once a classification model
                                   is built based on a training set, the class label of an object can be foreseen based on the attribute
                                   values of the object and the attribute values of the classes. However, prediction more often
                                   refers to forecasting missing numerical values, or increase/decrease trends in time-related
                                   data. The main idea is to use a large number of past values to estimate probable future values.
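
One simple way to use past values to estimate a future value is to fit a straight line through them and extrapolate one step ahead. The sketch below assumes equally spaced observations and a roughly linear trend; the function names and the rental figures are hypothetical, and least-squares fitting is only one of many possible forecasting techniques.

```python
def fit_line(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two past values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_next(values):
    """Extrapolate the fitted line to the next time step."""
    slope, intercept = fit_line(values)
    return slope * len(values) + intercept


# Made-up monthly rental counts, for illustration only.
monthly_rentals = [100, 110, 120, 130]
forecast = predict_next(monthly_rentals)
```

Because the made-up series grows by exactly 10 per month, the fitted line passes through every point and the forecast for the next month is 140.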




                                   Note: Classification and prediction may need to be preceded by relevance analysis,
                                   which attempts to identify attributes that do not contribute to the classification or prediction
                                   process. These attributes can then be excluded.
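
The text does not say how relevance is measured. One common choice, assumed here purely for illustration, is information gain: the reduction in class-label entropy obtained by splitting the data on an attribute. Attributes with a gain of (nearly) zero do not help to separate the classes and are candidates for exclusion. The attribute names and records below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(records, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    groups = {}
    for record, label in zip(records, labels):
        groups.setdefault(record[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up records: late_payments separates the classes, favourite_genre does not.
records = [
    {"late_payments": "yes", "favourite_genre": "comedy"},
    {"late_payments": "yes", "favourite_genre": "drama"},
    {"late_payments": "no", "favourite_genre": "comedy"},
    {"late_payments": "no", "favourite_genre": "drama"},
]
labels = ["risky", "risky", "safe", "safe"]

gains = {a: information_gain(records, labels, a) for a in records[0]}
# Keep only attributes whose gain exceeds a small (assumed) threshold.
relevant = [a for a, g in gains.items() if g > 1e-9]
```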

                                   Clustering

                                   Similar to classification, clustering is the organisation of data into classes. However, unlike
                                   classification, it is used to place data elements into related groups without advance knowledge
                                   of the group definitions, i.e., class labels are unknown and it is up to the clustering algorithm
                                   to discover acceptable classes. Clustering is also called unsupervised classification, because the
                                   classification is not dictated by given class labels. There are many clustering approaches, all based
                                   on the principle of maximising the similarity between objects in the same class (intra-class similarity)
                                   and minimising the similarity between objects of different classes (inter-class similarity).
                                   Clustering can also facilitate taxonomy formation, that is, the organisation of observations into a
                                   hierarchy of classes that group similar events together.
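
To make the idea of grouping without class labels concrete, here is a minimal sketch of k-means, one well-known clustering approach among many, applied to one-dimensional data. The text does not prescribe this algorithm; the function name, the initial centers, and the ages are assumptions chosen for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd-style k-means on plain numbers, starting from the given centers.

    Each pass assigns every value to its nearest center, then moves each
    center to the mean of the values assigned to it.
    """
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Made-up customer ages; no class labels are supplied.
ages = [18, 19, 21, 40, 42, 45]
centers, clusters = kmeans_1d(ages, centers=[18, 45])
```

With these starting centers the ages split into a younger group (18, 19, 21) and an older group (40, 42, 45). In practice the result depends on the number of clusters and on the initial centers, both of which are supplied by the caller here.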


                                   Example: For a data set with two attributes, AGE and HEIGHT, the following rule
                                   represents most of the data assigned to cluster 10:

                                   If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10
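
The rule above translates directly into code. In this small sketch the function name `assign_cluster` is hypothetical, heights are assumed to be given in feet, and `None` is returned for records the rule does not cover.

```python
def assign_cluster(age, height_ft):
    """Return 10 if the record satisfies the example rule, otherwise None."""
    if 25 <= age <= 40 and 5.0 <= height_ft <= 5.5:
        return 10
    return None


inside = assign_cluster(30, 5.2)   # satisfies both range conditions
outside = assign_cluster(50, 5.2)  # AGE falls outside 25..40
```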




                                   Lovely Professional University