Introduction
Classification is a data mining (machine learning) technique used to predict group membership
for data instances.
Example: You may wish to use classification to predict whether the weather on a particular
day will be “sunny”, “rainy” or “cloudy”. Popular classification techniques include decision trees
and neural networks.
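As an illustrative sketch of the weather example above (not part of the original text), a decision tree classifier can be trained on a few labelled observations and then asked to predict the class of a new day. The feature choice (humidity, pressure, wind speed) and the toy values are assumptions made only for demonstration.

# Minimal sketch: predicting "sunny", "rainy" or "cloudy" with a decision tree.
# The features and data are hypothetical, chosen only to illustrate classification.
from sklearn.tree import DecisionTreeClassifier

# Each tuple: (humidity %, pressure hPa, wind speed km/h)
X_train = [
    [30, 1022, 10],   # sunny
    [85, 1001, 25],   # rainy
    [70, 1010, 15],   # cloudy
    [25, 1025,  8],   # sunny
    [90,  998, 30],   # rainy
    [65, 1012, 12],   # cloudy
]
y_train = ["sunny", "rainy", "cloudy", "sunny", "rainy", "cloudy"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Predict the class of a new day's readings.
print(clf.predict([[40, 1018, 11]]))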
Data classification is a two-step process. In the first step, a model is built describing a predetermined
set of data classes or concepts. The model is constructed by analyzing database tuples described
by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of
the attributes, called the class label attribute. In the context of classification, data tuples are
also referred to as samples, examples, or objects. The data tuples analyzed to build the model
collectively form the training data set. The individual tuples making up the training set are
referred to as training samples and are randomly selected from the sample population.
Since the class label of each training sample is provided, this step is also known as supervised
learning (i.e., the learning of the model is ‘supervised’ in that it is told to which class each training
sample belongs). It contrasts with unsupervised learning (or clustering), in which the class labels
of the training samples are not known, and the number or set of classes to be learned may not be
known in advance.
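The contrast can be made concrete with the short sketch below (an illustration only, using the same hypothetical weather features as before): the classifier is given the class label of every training sample, whereas the clustering algorithm receives the samples with no labels at all, only the number of clusters to find.

from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

samples = [[30, 1022], [85, 1001], [70, 1010], [25, 1025]]
labels  = ["sunny", "rainy", "cloudy", "sunny"]   # the class label attribute

# Supervised learning: the class label of each training sample is provided.
classifier = DecisionTreeClassifier(random_state=0).fit(samples, labels)

# Unsupervised learning (clustering): no labels are given, only a cluster count.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(samples)

print(classifier.predict([[60, 1011]]))   # predicted class for a new sample
print(clusters)                           # cluster assignment of each sample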
Typically, the learned model is represented in the form of classification rules, decision trees,
or mathematical formulae. For example, given a database of customer credit information,
classification rules can be learned to identify customers as having either excellent or fair credit
ratings (Figure 4.1a). The rules can be used to categorize future data samples, as well as provide a
better understanding of the database contents. In the second step (Figure 4.1b), the model is used
for classification. First, the predictive accuracy of the model (or classifier) is estimated.
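As a sketch of how a learned model can be rendered as classification rules (the credit attributes, thresholds, and tuples below are invented for illustration), a decision tree fitted to class-labelled credit tuples can be printed as IF-THEN style rules and then applied to new data:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training tuples (age, income in thousands); class label = credit rating.
X = [[25, 20], [45, 80], [35, 60], [50, 90], [23, 18], [40, 75]]
y = ["fair", "excellent", "excellent", "excellent", "fair", "excellent"]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned model shown as rules, e.g. IF income <= ... THEN rating = fair.
print(export_text(model, feature_names=["age", "income"]))

# Second step: use the model to categorize a new data tuple.
print(model.predict([[30, 40]]))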
The holdout method is a simple technique which uses a test set of class-labeled samples. These
samples are randomly selected and are independent of the training samples. The accuracy of a
model on a given test set is the percentage of test set samples that are correctly classified by the
model. For each test sample, the known class label is compared with the learned model’s class
prediction for that sample. Note that if the accuracy of the model were estimated based on the
training data set, this estimate could be optimistic since the learned model tends to overfit the
data (that is, it may have incorporated some particular anomalies of the training data which are
not present in the overall sample population). Therefore, a test set is used.
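A minimal sketch of the holdout method follows (the data are placeholders, reusing the hypothetical credit tuples): the class-labelled samples are randomly split into independent training and test sets, the classifier is fitted on the training portion only, and accuracy is reported as the percentage of test samples classified correctly.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical class-labelled samples (age, income); labels are the credit rating.
X = [[25, 20], [45, 80], [35, 60], [50, 90], [23, 18],
     [40, 75], [28, 30], [55, 95], [33, 50]]
y = ["fair", "excellent", "excellent", "excellent", "fair",
     "excellent", "fair", "excellent", "excellent"]

# Hold out one third of the samples as a randomly selected test set,
# independent of the training samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on the test set: the fraction of test samples whose known class
# label matches the model's prediction (unlike accuracy measured on the
# training data, this is not inflated by overfitting).
print(accuracy_score(y_test, clf.predict(X_test)))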
(a) Learning: Training data are analyzed by a classification algorithm. Here, the class label
attribute is credit rating, and the learned model or classifier is represented in the form of
classification rules.
(b) Classification: Test data are used to estimate the accuracy of the classification rules. If the
accuracy is considered acceptable, the rules can be applied to the classification of new data
tuples.