How Effective are Bayesian Classifiers?

In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. In practice, however, this is not always the case, owing to inaccuracies in the assumptions made for their use, such as class-conditional independence, and to the lack of available probability data. Nevertheless, various empirical studies comparing this classifier with decision tree and neural network classifiers have found it to be comparable in some domains.

Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes' theorem. For example, under certain assumptions, it can be shown that many neural network and curve-fitting algorithms output the maximum a posteriori (MAP) hypothesis, as does the naive Bayesian classifier.
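As a rough illustration (not taken from the text), the sketch below shows how a naive Bayesian classifier selects the MAP class: it picks the class c maximizing P(c) times the product of the estimated conditional probabilities P(x_i | c). The toy dataset and the Laplace smoothing constant are invented for illustration.

from collections import Counter, defaultdict

def train(samples):
    """samples: list of (feature_tuple, class_label) pairs."""
    priors = Counter(label for _, label in samples)
    cond = defaultdict(Counter)  # cond[(position, label)][value] = count
    for features, label in samples:
        for i, value in enumerate(features):
            cond[(i, label)][value] += 1
    return priors, cond, len(samples)

def classify(features, priors, cond, n):
    """Return the MAP class: argmax over c of P(c) * prod P(x_i | c)."""
    best_class, best_score = None, -1.0
    for label, count in priors.items():
        score = count / n  # prior P(c)
        for i, value in enumerate(features):
            # Laplace smoothing so unseen values do not zero the product
            score *= (cond[(i, label)][value] + 1) / (count + 2)
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# Hypothetical toy data: (outlook, windy) -> play
data = [(("sunny", "no"), "yes"), (("rain", "yes"), "no"),
        (("sunny", "yes"), "yes"), (("rain", "no"), "no")]
priors, cond, n = train(data)
print(classify(("sunny", "no"), priors, cond, n))  # -> "yes"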



Task     “Classification is a data mining technique used to predict group membership for data instances”. Discuss.


                                   4.5 Distance-based algorithms

Distance-based algorithms assume that an object is more similar to objects within its own class than to objects from other classes. Therefore, the classification of the target object is determined by the objects that are similar to it. The concept of distance is used to measure the dissimilarity between objects; in other words, two similar objects can be considered close to each other in the sample space. The two key issues in distance-based classification are choosing a proper distance function and designing the classification algorithm. Many kinds of distance functions can be used, such as the city-block distance or the Euclidean distance. Different distance functions have different characteristics, which fit various types of data. A classification algorithm must determine the class of the target object according to the objects close to it. One of the most effective techniques is K-Nearest Neighbors (KNN): using the K closest objects, the target object is assigned the class that is most common among those neighbors. KNN is widely used in text classification, web mining and stream data mining.
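A minimal sketch of KNN classification under these assumptions is given below; the Euclidean distance function, the toy dataset and the choice k = 3 are illustrative, not prescribed by the text.

import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(target, training_data, k=3):
    """Assign target the majority class among its k closest training objects."""
    neighbors = sorted(training_data, key=lambda item: euclidean(target, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-class data in 2-D feature space
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((5.5, 4.8), "B"), ((4.9, 5.2), "B")]
print(knn_classify((1.1, 0.9), training))  # -> "A": the nearest objects are class A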

                                   4.6 Distance functions

Distance-based algorithms rely on distance functions to measure the dissimilarity between objects. Selecting a distance function is not only the first step of these algorithms, but also a critical one: different distance functions have different characteristics, which fit various types of data, and no single distance function can deal with every type of data. The performance of the algorithm therefore heavily depends on whether a proper distance function is chosen for the particular data at hand. For a set X, a distance function d: X × X → ℝ satisfies, for all x, y, z ∈ X:

d(x, y) ≥ 0 (non-negativity),
d(x, y) = 0 if and only if x = y (identity of indiscernibles),
d(x, y) = d(y, x) (symmetry law), and
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
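As a quick sanity check, the following sketch verifies the four axioms empirically for the city-block and Euclidean distances on a handful of sample points; the points themselves are invented for illustration.

import math
import itertools

def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (3.0, 4.0), (1.0, -2.0)]
for d in (city_block, euclidean):
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, y) >= 0                           # non-negativity
        assert (d(x, y) == 0) == (x == y)             # identity of indiscernibles
        assert d(x, y) == d(y, x)                     # symmetry
        assert d(x, z) <= d(x, y) + d(y, z) + 1e-12   # triangle inequality
print("all four axioms hold on the sample points")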
                                   Interestingly, several distance functions used in practice do not necessarily satisfy all four of
                                   the constraints listed above. For example, the squared Euclidean distance does not satisfy the
                                   triangle inequality and the Kullback-Leibler distance function used in document clustering is not
                                   symmetric. A good distance function should be invariant to the natural data transformations that
                                   do not affect the class of the objects.
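The sketch below illustrates both violations on invented data: the squared Euclidean distance breaks the triangle inequality for collinear points, and the Kullback-Leibler distance gives different values depending on the order of its arguments.

import math

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Collinear points 0, 1, 2 on the real line:
x, y, z = (0.0,), (1.0,), (2.0,)
print(squared_euclidean(x, z))                            # 4.0
print(squared_euclidean(x, y) + squared_euclidean(y, z))  # 2.0 < 4.0: inequality violated

# Asymmetry of the Kullback-Leibler distance:
p, q = (0.9, 0.1), (0.5, 0.5)
print(kl_divergence(p, q), kl_divergence(q, p))  # two different values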





