How Effective are Bayesian Classifiers?
In theory, Bayesian classifiers have the minimum error rate in comparison with all other classifiers.
In practice, however, this is not always the case, owing to inaccuracies in the assumptions made
for their use, such as class conditional independence, and to the lack of available probability data.
Nevertheless, various empirical studies comparing this classifier with decision tree and neural
network classifiers have found it comparable in some domains.
Bayesian classifiers are also useful in that they provide a theoretical justification for other
classifiers that do not explicitly use Bayes' theorem. For example, under certain assumptions,
it can be shown that many neural network and curve-fitting algorithms output the maximum
a posteriori hypothesis, as does the naive Bayesian classifier.
Task “Classification is a data mining technique used to predict group membership
for data instances.” Discuss.
4.5 Distance-based algorithms
Distance-based algorithms assume that an object is more similar to objects within its own
class than to objects from other classes. Therefore, the classification of the target object
is affected by the objects that are similar to it. The concept of distance is used to measure the
dissimilarity between objects. In other words, two similar objects can be considered close to each
other in the sample space. The two key issues in distance-based classification are choosing the
proper distance function and the design of the classification algorithm. Many kinds of distance
functions can be used, such as city block distance or Euclidean distance. Different distances have
different characteristics, which fit various types of data. A classification algorithm must then
determine the class of the target object according to the objects close to it. One of the most
effective techniques is K-Nearest Neighbors (KNN): using the K closest objects, the target object
is assigned the class that occurs most often among them. KNN is widely used in text classification,
web mining and stream data mining.
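
The following is a minimal sketch of KNN classification in plain Python (the function names and the sample data are illustrative, not taken from the text): it finds the K training objects closest to the target under Euclidean distance and assigns the class that occurs most often among them.

    import math
    from collections import Counter

    def euclidean(a, b):
        # Euclidean distance between two feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(training_set, target, k=3):
        # training_set is a list of (feature_vector, class_label) pairs.
        # Sort the training objects by distance to the target and keep the k closest.
        neighbors = sorted(training_set, key=lambda item: euclidean(item[0], target))[:k]
        # Assign the class held by the majority of those neighbors.
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Illustrative data: two classes well separated in feature space.
    data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
    print(knn_classify(data, (1.1, 0.9), k=3))  # -> "A"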
4.6 Distance functions
Distance-based algorithms rely on distance functions to measure the dissimilarity between
objects. Selecting a distance function is not only the first step of these algorithms but also a
critical one. Different distance functions have different characteristics, which fit various types
of data. No single distance function can deal with every type of data, so the performance of the
algorithm depends heavily on whether a proper distance function is chosen for the data at hand.
For a set X, a distance function d: X × X → R satisfies, for all x, y, z ∈ X:
d(x, y) ≥ 0 (non-negativity),
d(x, y) = 0 if and only if x = y (identity),
d(x, y) = d(y, x) (symmetry law), and
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
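
For illustration, the Euclidean and city-block distances mentioned earlier can be written as follows (a minimal sketch in Python; the function names are ours, not from the text). Both satisfy all four properties above.

    import math

    def euclidean(x, y):
        # Euclidean (L2) distance: straight-line distance in feature space.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def city_block(x, y):
        # City-block (Manhattan, L1) distance: sum of absolute differences.
        return sum(abs(a - b) for a, b in zip(x, y))

    p, q = (0.0, 0.0), (3.0, 4.0)
    print(euclidean(p, q))   # 5.0
    print(city_block(p, q))  # 7.0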
Interestingly, several distance functions used in practice do not satisfy all four of the
properties listed above. For example, the squared Euclidean distance does not satisfy the
triangle inequality (see the sketch below), and the Kullback-Leibler divergence used in document
clustering is not symmetric. A good distance function should be invariant to the natural data
transformations that do not affect the class of the objects.
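
A quick numeric check makes the squared-Euclidean counterexample concrete (the points are chosen here purely for illustration):

    def squared_euclidean(x, y):
        # Squared Euclidean distance: omits the square root, so it is not a metric.
        return sum((a - b) ** 2 for a, b in zip(x, y))

    # Three collinear points, with y midway between x and z.
    x, y, z = (0.0,), (1.0,), (2.0,)
    print(squared_euclidean(x, z))                            # 4.0
    print(squared_euclidean(x, y) + squared_euclidean(y, z))  # 1.0 + 1.0 = 2.0
    # d(x, z) = 4 > 2 = d(x, y) + d(y, z): the triangle inequality fails.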