4. Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company.
Any of the methods and techniques used for classification may also be used, under appropriate circumstances, for prediction. These include the traditional statistical methods of point estimation and confidence interval estimation, simple linear regression and correlation, and multiple regression.
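To make the regression route concrete, here is a minimal Python sketch of prediction with simple linear regression fitted by ordinary least squares; the income and spending figures are invented purely for illustration.

    import numpy as np

    # Toy training data (invented): x = annual income, y = annual spending,
    # both in thousands.
    x = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
    y = np.array([2.1, 3.4, 4.9, 6.2, 7.8])

    # Ordinary least-squares estimates:
    #   slope b = cov(x, y) / var(x),  intercept a = mean(y) - b * mean(x)
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()

    # Prediction: a continuous value for an unseen input.
    print(f"predicted spending at income 60: {a + b * 60.0:.2f}")

Unlike classification, the output here is a continuous value rather than a class label.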
4.2 Issues regarding Classification and Prediction
To prepare the data for classification and prediction, the following preprocessing steps may be applied to help improve the accuracy, efficiency, and scalability of the classification or prediction process (a short sketch combining these steps appears after the list).
1. Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise
(by applying smoothing techniques, for example), and the treatment of missing values
(e.g., by replacing a missing value with the most commonly occurring value for that
attribute, or with the most probable value based on statistics). Although most classification
algorithms have some mechanisms for handling noisy or missing data, this step can help
reduce confusion during learning.
2. Relevance analysis: Many of the attributes in the data may be irrelevant to the classification
or prediction task. For example, data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application. Furthermore,
other attributes may be redundant. Hence, relevance analysis may be performed on the data
with the aim of removing any irrelevant or redundant attributes from the learning process.
In machine learning, this step is known as feature selection. Including such attributes may
otherwise slow down, and possibly mislead, the learning step. Ideally, the time spent on
relevance analysis, when added to the time spent on learning from the resulting “reduced”
feature subset, should be less than the time that would have been spent on learning from
the original set of features. Hence, such analysis can help improve classification efficiency
and scalability.
3. Data transformation: The data can be generalized to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for continuous-valued
attributes. For example, numeric values for the attribute income may be generalized to
discrete ranges such as low, medium, and high. Similarly, nominal-valued attributes,
like street, can be generalized to higher-level concepts, like city. Since generalization
compresses the original training data, fewer input/output operations may be involved
during learning. The data may also be normalized, particularly when neural networks or
methods involving distance measurements are used in the learning step. Normalization
involves scaling all values for a given attribute so that they fall within a small specified
range, such as -1.0 to 1.0 or 0 to 1.0. In methods that use distance measurements, for example, this prevents attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes).
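The following Python sketch walks through all three preprocessing steps on a toy set of records; every attribute name and value in it is invented for illustration, and the relevance judgment (dropping the day-of-week attribute) is hard-coded rather than computed.

    from statistics import mode

    # Toy records (invented); None marks a missing value.
    records = [
        {"income": 24000, "day": "Mon", "defaulted": "no"},
        {"income": 51000, "day": "Tue", "defaulted": "no"},
        {"income": None,  "day": "Wed", "defaulted": "yes"},
        {"income": 51000, "day": "Mon", "defaulted": "no"},
    ]

    # 1. Data cleaning: fill the missing income with the most commonly
    #    occurring observed value, as described in the text.
    fill = mode(r["income"] for r in records if r["income"] is not None)
    for r in records:
        if r["income"] is None:
            r["income"] = fill

    # 2. Relevance analysis: drop an attribute judged irrelevant to the
    #    class label, such as the day of the week.
    for r in records:
        del r["day"]

    # 3. Data transformation: min-max normalization of income into [0, 1],
    #    v' = (v - min) / (max - min).
    lo = min(r["income"] for r in records)
    hi = max(r["income"] for r in records)
    for r in records:
        r["income"] = (r["income"] - lo) / (hi - lo)

    print(records)

A real relevance analysis would score each attribute against the class (for example, with an information-gain or correlation measure) rather than hard-coding the choice.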
4.3 Statistical-based Algorithms
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.
Bayesian classification is based on Bayes' theorem. Bayesian classifiers have exhibited high accuracy and speed when applied to large databases.
Studies comparing classification algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be comparable in performance with decision tree and neural network classifiers.
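To make the idea concrete, here is a minimal naive Bayesian classifier for categorical attributes. Under the naive independence assumption, Bayes' theorem reduces the posterior to P(C|X) proportional to P(C) multiplied by the product of the P(xi|C); the weather-style training rows below are invented for illustration, and Laplace smoothing is added so unseen attribute values do not zero out the product.

    from collections import Counter, defaultdict

    # Each training row: (attribute values, class label). Invented data.
    train = [
        (("sunny", "hot"),  "no"),
        (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"),
        (("rainy", "cool"), "yes"),
        (("sunny", "cool"), "yes"),
    ]

    # Prior counts for P(C) and conditional counts for P(x_i | C).
    class_counts = Counter(label for _, label in train)
    cond_counts = defaultdict(Counter)  # (attr index, class) -> value counts
    for values, label in train:
        for i, v in enumerate(values):
            cond_counts[(i, label)][v] += 1

    def classify(values):
        """Return the class maximizing P(C) * prod_i P(x_i | C)."""
        best_label, best_score = None, -1.0
        total = sum(class_counts.values())
        for label, n in class_counts.items():
            score = n / total  # prior P(C)
            for i, v in enumerate(values):
                seen = cond_counts[(i, label)]
                vocab = len({row[0][i] for row in train})
                # Laplace smoothing: add 1 to every value count.
                score *= (seen[v] + 1) / (n + vocab)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    print(classify(("rainy", "hot")))  # -> yes

The smoothed counts stand in for the conditional probabilities that Bayes' theorem requires; without smoothing, a single attribute value never seen with a class would force that class's posterior to zero.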