Page 38 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Data Warehousing and Data Mining

Q appear together in a transaction and c (for confidence) is the conditional probability that Q
appears in a transaction when P is present. For example, the hypothetical association rule

RentType(X, “game”) ∧ Age(X, “13-19”) → Buys(X, “pop”) [s = 2%, c = 55%]
would indicate that 2% of the transactions considered are of customers aged between 13 and
19 who rent a game and buy a pop, and that there is a 55% certainty that teenage
customers who rent a game also buy pop.
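
To make the two measures concrete, the following minimal Python sketch computes s and c for a rule P → Q over a small, made-up list of transactions. The item names and the helper `support_confidence` are hypothetical and chosen only for illustration. Support is taken as the fraction of transactions containing both P and Q, and confidence as that count divided by the number of transactions containing P.

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item of the antecedent P.
    n_p = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain every item of both P and Q.
    n_pq = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_pq / n_total if n_total else 0.0
    confidence = n_pq / n_p if n_p else 0.0
    return support, confidence


# Hypothetical transactions, for illustration only.
transactions = [
    {"game", "pop"},
    {"game", "pop", "chips"},
    {"game"},
    {"movie", "pop"},
    {"movie"},
]

s, c = support_confidence(transactions, {"game"}, {"pop"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

In this toy data, 2 of the 5 transactions contain both items (support 40%), and 2 of the 3 transactions containing a game also contain pop (confidence about 67%).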

Classification

Classification is the process of finding a set of models (or functions) that describe and
distinguish data classes or concepts, so that the model can be used to predict
the class of objects whose class label is unknown. The derived model is based on the analysis of
a set of training data (i.e., data objects whose class label is known). The derived model may be
represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks. For example, after starting a credit policy, the Our Video Store
managers could analyse the customers’ behaviour with respect to their credit, and label the
customers who received credit accordingly with one of three possible labels: “safe”, “risky” and “very risky”. The
classification analysis would generate a model that could be used to either accept or reject credit
requests in the future.
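
As a rough illustration of learning from labelled training data, the sketch below uses a 1-nearest-neighbour classifier. This technique is swapped in for simplicity; the text does not prescribe any particular algorithm. The two attributes (number of late payments, average days overdue), the training records and the acceptance policy are all hypothetical.

```python
import math

# Hypothetical training data: (late_payments, average_days_overdue) -> label.
training_set = [
    ((0, 0.0), "safe"),
    ((1, 2.0), "safe"),
    ((3, 10.0), "risky"),
    ((4, 15.0), "risky"),
    ((8, 40.0), "very risky"),
    ((9, 60.0), "very risky"),
]


def classify(customer, training_set):
    """Predict a label with 1-nearest-neighbour (Euclidean distance)."""
    _, nearest_label = min(
        training_set, key=lambda example: math.dist(customer, example[0])
    )
    return nearest_label


def accept_credit(customer, training_set):
    """Hypothetical policy: accept only customers predicted to be 'safe'."""
    return classify(customer, training_set) == "safe"


new_customer = (2, 3.0)  # class label unknown
print(classify(new_customer, training_set))
print(accept_credit(new_customer, training_set))
```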

Prediction

Classification can be used for predicting the class label of data objects. There are two major types
of prediction: one can try to predict either (1) some unavailable data values or pending trends,
or (2) a class label for some data. The latter is tied to classification. Once a classification model
is built based on a training set, the class label of an object can be predicted from the attribute
values of the object and the attribute values of the classes. More often, however, prediction
refers to forecasting missing numerical values or increasing/decreasing trends in time-related
data. The main idea is to use a large number of past values to estimate probable future values.
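
A minimal sketch of numerical prediction, assuming a simple linear trend fitted by least squares to past values: the fitted slope indicates an increase or decrease, and the line is extended to estimate the next value. The function names and the monthly rental counts are hypothetical, and a linear trend is only one of many possible forecasting models.

```python
def fit_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Requires at least two values, otherwise the slope is undefined.
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, steps_ahead=1):
    """Extend the fitted line steps_ahead positions past the last value."""
    slope, intercept = fit_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Hypothetical monthly rental counts, for illustration only.
past_rentals = [100, 110, 120, 130, 140]
slope, _ = fit_trend(past_rentals)
print("trend:", "increase" if slope > 0 else "decrease or flat")
print("next value estimate:", forecast(past_rentals))
```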

Note: Classification and prediction may need to be preceded by relevance analysis,
which attempts to identify attributes that do not contribute to the classification or prediction
process. These attributes can then be excluded.
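
One possible way to carry out relevance analysis is to score each attribute by its information gain with respect to the class label and drop attributes whose score is close to zero. The sketch below assumes categorical attributes. The records, attribute names and the 0.01 threshold are hypothetical, and the text does not prescribe this particular measure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(records, attribute, label_key="label"):
    """Reduction in label entropy obtained by splitting on one attribute."""
    labels = [r[label_key] for r in records]
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[label_key])
    remainder = sum(
        len(group) / len(records) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


# Hypothetical labelled records, for illustration only.
records = [
    {"pays_late": "no", "shirt_colour": "red", "label": "safe"},
    {"pays_late": "no", "shirt_colour": "blue", "label": "safe"},
    {"pays_late": "yes", "shirt_colour": "red", "label": "risky"},
    {"pays_late": "yes", "shirt_colour": "blue", "label": "risky"},
]

gains = {a: information_gain(records, a) for a in ("pays_late", "shirt_colour")}
# Keep only attributes that contribute to separating the classes.
relevant = [a for a, g in gains.items() if g > 0.01]
print(gains)
print(relevant)
```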

Clustering

Similar to classification, clustering is the organisation of data into classes. However, unlike
classification, it places data elements into related groups without advance knowledge
of the group definitions; that is, class labels are unknown and it is up to the clustering algorithm
to discover acceptable classes. Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels. There are many clustering approaches, all based
on the principle of maximising the similarity between objects in the same class (intra-class similarity)
and minimising the similarity between objects of different classes (inter-class similarity).
Clustering can also facilitate taxonomy formation, that is, the organisation of observations into a
hierarchy of classes that group similar events together.
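
As one illustration of grouping unlabelled data by similarity, here is a minimal k-means sketch. k-means is only one of many clustering approaches and is not singled out by the text. The data points, the initial centroids and the function names are hypothetical.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def update(points, assignment, centroids):
    """Move each centroid to the mean of its points (keep it if empty)."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == i]
        if members:
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        else:
            new_centroids.append(old)
    return new_centroids


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iterations):
        assignment = assign(points, centroids)
        new_centroids = update(points, assignment, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignment, centroids


# Hypothetical (AGE, HEIGHT in ft) observations without class labels.
points = [(25, 5.1), (30, 5.3), (38, 5.4), (60, 5.8), (65, 6.0), (70, 5.9)]
assignment, centroids = k_means(points, centroids=[(25, 5.1), (70, 5.9)])
print(assignment)
print(centroids)
```

Because the attributes are used unscaled here, AGE dominates the distance; in practice attributes on different scales would usually be normalised first.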

Example: For a data set with two attributes: AGE and HEIGHT, the following rule
represents most of the data assigned to cluster 10:

If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10

Lovely Professional University