Page 87 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Unit 4: Data Mining Classification
1R Algorithm
One of the simplest approaches to finding classification rules is called 1R, as it generates a one-level decision tree. The algorithm examines “rules that classify an object on the basis of a single attribute”.
The basic idea is that rules are constructed to test a single attribute and branch on each of its values. Each branch is assigned the class that occurs most often in the training data for that branch. The error rate of a rule set is then determined by counting the instances that do not belong to the majority class of their branch. Finally, the error rate of each attribute’s rule set is evaluated, and the rule set with the minimum error rate is chosen.
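The procedure above can be sketched in a few lines of Python. The function and data below are illustrative, not taken from the original 1R paper: each attribute gets one rule set (value → majority class), and the attribute with the fewest training errors wins.

```python
from collections import Counter, defaultdict

def one_r(instances, labels, attributes):
    """1R: for each attribute, build a one-level rule set (one branch per
    value, predicting that branch's majority class) and keep the attribute
    whose rules make the fewest errors on the training data."""
    best = None  # (error_count, attribute, rules)
    for attr in attributes:
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for inst, label in zip(instances, labels):
            counts[inst[attr]][label] += 1
        # Each value predicts its most frequent class.
        rules = {val: ctr.most_common(1)[0][0] for val, ctr in counts.items()}
        # Errors = instances not in their branch's majority class.
        errors = sum(sum(ctr.values()) - ctr.most_common(1)[0][1]
                     for ctr in counts.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best[1], best[2]

# Tiny hypothetical weather-style training set.
data = [{"outlook": "sunny", "windy": "true"},
        {"outlook": "sunny", "windy": "false"},
        {"outlook": "rainy", "windy": "true"},
        {"outlook": "rainy", "windy": "false"}]
labels = ["no", "no", "yes", "yes"]

attr, rules = one_r(data, labels, ["outlook", "windy"])
# "outlook" classifies this training set with zero errors,
# while "windy" makes two, so 1R selects "outlook".
```

Here `windy` splits each branch 1–1 between classes (two errors), while `outlook` predicts perfectly, so its rule set is chosen.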
A comprehensive comparative evaluation of the performance of 1R and other methods on 16 datasets (many of them commonly used in machine learning research) was performed. Despite its simplicity, 1R produced surprisingly accurate rules, just a few percentage points lower in accuracy than the decision trees produced by the state-of-the-art algorithm (C4). The decision trees produced by C4 were in most cases considerably larger than 1R’s rules, and the rules generated by 1R were much easier to interpret. 1R therefore provides a baseline performance for a rudimentary technique, to be tried before progressing to more sophisticated algorithms.
Other Algorithms
Basic covering algorithms construct rules that classify the training data perfectly; that is, they tend to overfit the training set, causing poor generalization and difficulty in processing new data. For applications in real-world domains, however, methods for handling noisy data, mechanisms for avoiding overfitting even on the training data, and relaxation of the constraints are needed. Pruning is one way of dealing with these problems: it approaches overfitting by learning a general concept from the training set “to improve the prediction of unseen instances”. In Reduced Error Pruning (REP), some of the training examples are withheld as a test set and the performance of a rule is measured on them. Incremental Reduced Error Pruning (IREP) has also proven efficient in handling overfitting, and it forms the basis of RIPPER. SLIPPER (Simple Learner with Iterative Pruning to Produce Error Reduction) uses “confidence-rated boosting to learn an ensemble of rules.”
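The REP idea above can be illustrated with a minimal sketch. The names, the rule representation as a list of (attribute, value) conditions, and the greedy drop-the-last-condition strategy are assumptions for illustration, not the exact REP or IREP procedure: a rule is simplified as long as its accuracy on the withheld pruning set does not decrease.

```python
def covers(rule, instance):
    """A rule is a list of (attribute, value) conditions; it covers an
    instance only when all of its conditions hold."""
    return all(instance.get(attr) == value for attr, value in rule)

def holdout_accuracy(rule, cls, prune_set):
    """Fraction of covered pruning-set instances whose label matches the
    rule's predicted class; 0.0 if the rule covers nothing."""
    covered = [label for inst, label in prune_set if covers(rule, inst)]
    return covered.count(cls) / len(covered) if covered else 0.0

def reduced_error_prune(rule, cls, prune_set):
    """Greedily drop the rule's final condition while accuracy on the
    held-out pruning set does not decrease (a minimal REP-style sketch)."""
    best = list(rule)
    while len(best) > 1:
        candidate = best[:-1]
        if holdout_accuracy(candidate, cls, prune_set) >= \
                holdout_accuracy(best, cls, prune_set):
            best = candidate  # simpler rule, no worse on held-out data
        else:
            break
    return best

rule = [("outlook", "sunny"), ("windy", "true")]
prune_set = [({"outlook": "sunny", "windy": "true"}, "no"),
             ({"outlook": "sunny", "windy": "false"}, "no"),
             ({"outlook": "rainy", "windy": "true"}, "yes")]
pruned = reduced_error_prune(rule, "no", prune_set)
# The "windy" test adds nothing on the held-out data, so it is dropped.
```

The key point mirrored from the text: the pruning decision is driven by examples withheld from training, not by the training set the rule was grown on.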
Applications of Rule-based Algorithms
Rule-based algorithms are widely used for deriving classification rules applied in the medical sciences for diagnosing illnesses, in business planning, in banking, in government and in different disciplines of science. Covering algorithms in particular have deep roots in machine learning. Within data mining, covering algorithms including SWAP-1, RIPPER, and DAIRY are used in text classification and have been adapted in gene expression programming for discovering classification rules.
Task: Explain how you would remove the training data covered by a rule R.
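As a hedged sketch of one possible answer (the rule representation and names below are assumptions): in a covering loop, once a rule R has been learned, every training instance R covers is removed, so the next rule is grown only on the instances that remain uncovered.

```python
def covers(rule, instance):
    """rule: list of (attribute, value) tests; all must hold."""
    return all(instance.get(attr) == value for attr, value in rule)

def remove_covered(rule, dataset):
    """Return the (instance, label) pairs NOT covered by the rule --
    the training data left for the next iteration of the covering loop."""
    return [(inst, label) for inst, label in dataset
            if not covers(rule, inst)]

dataset = [({"outlook": "sunny"}, "no"),
           ({"outlook": "sunny"}, "no"),
           ({"outlook": "rainy"}, "yes")]
R = [("outlook", "sunny")]
remaining = remove_covered(R, dataset)
# Both sunny instances are covered by R, so only the rainy one remains.
```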
4.10 Combining Techniques
Data mining is an application-driven field where research questions tend to be motivated by
real-world data sets. In this context, a broad spectrum of formalisms and techniques has been
proposed by researchers in a large number of applications. Organizing them is inherently rather
difficult; that is why we highlight the central role played by the various types of data motivating
the current research.
Lovely Professional University