Page 67 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Unit 4: Data Mining Classification




4.   Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company.
Any of the methods and techniques used for classification may also be used, under appropriate circumstances, for prediction. These include the traditional statistical methods of point estimation and confidence interval estimation, simple linear regression and correlation, and multiple regression.
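As a concrete illustration of prediction by simple linear regression, the sketch below fits a line y = a + b·x by ordinary least squares and uses it to predict a numeric value. The data (years of experience vs. salary) are invented for illustration.

```python
# Minimal sketch of prediction via simple linear regression (ordinary
# least squares). The dataset below is hypothetical.

def fit_simple_linear_regression(xs, ys):
    """Return intercept a and slope b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = covariance(x, y) / variance(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical training data: years of experience vs. salary (in $1000s).
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 41, 44, 50]
a, b = fit_simple_linear_regression(xs, ys)
predicted = a + b * 6   # point estimate of salary for 6 years of experience
```

Point estimation gives a single predicted value; in practice it is usually accompanied by a confidence interval that quantifies the uncertainty of the estimate.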

          4.2 Issues regarding Classification and Prediction


To prepare the data for classification and prediction, the following preprocessing steps may be applied to the data in order to help improve the accuracy, efficiency, and scalability of the classification or prediction process.

          1.   Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise
               (by applying smoothing techniques, for example), and  the treatment  of  missing values
               (e.g.,  by  replacing  a  missing  value  with  the  most  commonly  occurring  value  for  that
               attribute, or with the most probable value based on statistics). Although most classification
               algorithms have some mechanisms for handling noisy or missing data, this step can help
               reduce confusion during learning.
          2.   Relevance analysis: Many of the attributes in the data may be irrelevant to the classification
     or prediction task. For example, data recording the day of the week on which a bank loan
     application was filed is unlikely to be relevant to the success of the application. Furthermore,
               other attributes may be redundant. Hence, relevance analysis may be performed on the data
               with the aim of removing any irrelevant or redundant attributes from the learning process.
               In machine learning, this step is known as feature selection. Including such attributes may
               otherwise slow down, and possibly mislead, the learning step. Ideally, the time spent on
               relevance analysis, when added to the time spent on learning from the resulting “reduced”
               feature subset, should be less than the time that would have been spent on learning from
               the original set of features. Hence, such analysis can help improve classification efficiency
               and scalability.
          3.   Data  transformation:  The  data  can  be  generalized  to  higher-level  concepts.  Concept
               hierarchies may be used for this purpose. This is particularly useful for continuous-valued
               attributes. For example, numeric values for the attribute income may be generalized to
               discrete  ranges  such  as  low,  medium,  and  high.  Similarly,  nominal-valued  attributes,
               like  street,  can  be  generalized  to  higher-level  concepts,  like  city.  Since  generalization
               compresses the original training data, fewer input/output operations may be involved
               during learning. The data may also be normalized, particularly when neural networks or
               methods involving distance measurements are used in the learning step. Normalization
               involves scaling all values for a given attribute so that they fall within a small specified
     range, such as -1.0 to 1.0, or 0 to 1.0. In methods that use distance measurements, for
     example, this would prevent attributes with initially large ranges (such as income) from
     outweighing attributes with initially smaller ranges (such as binary attributes).
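Two of the preprocessing steps above can be sketched in a few lines: replacing missing values with the most commonly occurring value (data cleaning) and min-max normalization into a small range (data transformation). The attribute names and values below are invented for illustration.

```python
# Minimal sketch of two preprocessing steps: mode imputation for missing
# values, and min-max normalization of a numeric attribute to [0, 1].
from collections import Counter

def impute_with_mode(values):
    """Replace None entries with the most frequent observed value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale numeric values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

# Hypothetical attribute values with a missing entry.
credit = ["good", None, "good", "bad", "good"]
income = [20000, 50000, 80000]

clean_credit = impute_with_mode(credit)    # None replaced by "good"
scaled_income = min_max_normalize(income)  # income scaled into [0, 1]
```

After normalization, an income of 80,000 and a binary attribute both contribute values in the same [0, 1] range, so neither dominates a distance computation simply because of its original scale.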

4.3 Statistical-based Algorithms

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

Bayesian classification is based on Bayes' theorem. Bayesian classifiers have exhibited high accuracy and speed when applied to large databases.
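The approach can be sketched with a tiny naive Bayesian classifier over categorical attributes: Bayes' theorem is applied with the "naive" assumption that attributes are conditionally independent given the class, so probabilities are estimated by simple counting. The toy weather data below are invented for illustration, and no smoothing is applied to zero counts.

```python
# Minimal sketch of a naive Bayesian classifier on toy categorical data.
# Class scores are P(class) * product of P(attribute value | class),
# estimated directly from frequency counts (no smoothing).
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    priors = Counter(labels)               # class frequencies
    counts = defaultdict(Counter)          # (attr index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(i, label)][value] += 1
    return priors, counts

def classify(row, priors, counts):
    total = sum(priors.values())
    scores = {}
    for label, prior in priors.items():
        score = prior / total                        # P(class)
        for i, value in enumerate(row):
            score *= counts[(i, label)][value] / priors[label]  # P(value | class)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical samples: (outlook, temperature) -> play tennis?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "no", "yes", "no"]

priors, counts = train_naive_bayes(rows, labels)
prediction = classify(("rainy", "mild"), priors, counts)
```

For the sample ("rainy", "mild"), the "yes" class scores 1/4 · 1 · 1 = 0.25 while "no" scores 3/4 · 1/3 · 1/3 ≈ 0.083, so the classifier predicts "yes".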

          Studies comparing classification algorithms have found a simple Bayesian classifier known as the
          naive Bayesian classifier to be comparable in performance with decision tree and neural network




Lovely Professional University