Page 53 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Unit 3: Data Mining Techniques




          3.3 Similarity Measures

          Similarity measures provide the framework on which many data mining decisions are based.
          Tasks such as classification and clustering usually assume the existence of some similarity
          measure, while fields with poor methods for computing similarity often find that searching data
          is a cumbersome task. Several classic similarity measures are discussed, and the application of
          similarity measures to other fields is addressed.

          3.3.1 Introduction

          The goal of information retrieval (IR) systems is to meet users' needs. In practical terms, a need
          is usually expressed as a short textual query entered in the text box of an online search
          engine. IR systems typically do not answer a query directly; instead, they present a
          ranked list of documents that are judged relevant to that query by some similarity measure. Since
          similarity measures have the effect of clustering and classifying information with respect to a
          query, users will commonly find new interpretations of their information need that may or may
          not be useful to them when reformulating their query. When the query is itself a document
          from the initial collection, similarity measures can be used to cluster and classify documents
          within that collection. In short, similarity measures can add a rudimentary structure to a previously
          unstructured collection.

          3.3.2 Motivation


          Similarity measures used in IR systems can distort one’s perception of the entire data set. For
          example, if a user types a query into a search engine and does not find a satisfactory answer in
          the top ten returned web pages, then he/she will usually try to reformulate his/her query once
          or twice. If a satisfactory answer is still not returned, then the user will often assume that one
          does not exist. Rarely does a user understand or care what ranking scheme a particular search
          engine employs.
          An understanding of similarity measures, however, is crucial in today's business world. Many
          business decisions are based on answers to questions that are posed in much the same way as
          queries are given to search engines. Data miners do not have the luxury of assuming that the
          answers given to them by a database or IR system are correct or all-inclusive; they must know
          the drawbacks of any similarity measure used and adjust their business decisions accordingly.

          3.3.3 Classic Similarity Measures

          A similarity measure is defined as a mapping from a pair of tuples of size k to a scalar number.
          By convention, all similarity measures should map to the range [-1, 1] or [0, 1], where a similarity
          score of 1 indicates maximum similarity. Similarity measures should exhibit the property that
          their value increases as the number of common properties in the two items being compared
          increases.
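
          The two properties above (a bounded range and a score that grows with the number of shared
          properties) can be checked on a small example. The sketch below uses the Jaccard coefficient
          purely as an illustration; the text above does not name a specific measure, and the function
          name, the sample items, and the handling of two empty items are all assumptions.

```python
def jaccard_similarity(a, b):
    """Return |a ∩ b| / |a ∪ b|, a score that always lies in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty items are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical items, each described by a set of properties.
item = {"red", "round", "small", "sweet"}
one_shared = {"red", "large"}
two_shared = {"red", "round", "large"}

print(jaccard_similarity(item, one_shared))  # 0.2
print(jaccard_similarity(item, two_shared))  # 0.4
print(jaccard_similarity(item, item))        # 1.0, the maximum similarity
```

          Adding one more shared property raises the score from 0.2 to 0.4, and comparing an item with
          itself gives the maximum score of 1.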
          A popular model in many IR applications is the vector-space model, where documents are
          represented by a vector of size $n$, where $n$ is the size of the dictionary. Thus, document $i$ is
          represented by a vector $d_i = (w_{1i}, \ldots, w_{ni})$, where $w_{ki}$ denotes the weight associated
          with term $k$ in document $i$. In the simplest case, $w_{ki}$ is the frequency of occurrence of term $k$
          in document $i$. Queries are formed by creating a pseudo-document vector $q$ of size $n$, where
          $w_{kq}$ is assumed to be non-zero if and only if term $k$ occurs in the query.
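
          As a minimal sketch of the vector-space representation described above, the code below builds
          a dictionary from a tiny collection and turns each document, and then a query, into a
          term-frequency vector of size n. The helper names, the whitespace tokenisation, and the two
          sample documents are assumptions made for illustration only.

```python
from collections import Counter


def build_dictionary(documents):
    """Return the sorted list of distinct terms in the collection (size n)."""
    return sorted({term for doc in documents for term in doc.lower().split()})


def to_vector(text, dictionary):
    """Return a vector of size n whose k-th entry is the frequency of term k in text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in dictionary]


# Hypothetical two-document collection.
documents = [
    "data mining finds patterns in data",
    "search engines rank documents",
]
dictionary = build_dictionary(documents)
doc_vectors = [to_vector(doc, dictionary) for doc in documents]

# The query is treated as a pseudo-document over the same dictionary,
# so its weight for term k is non-zero only if term k occurs in the query.
query_vector = to_vector("mining data", dictionary)

print(dictionary)
print(doc_vectors[0])
print(query_vector)
```

          Every vector has the same length as the dictionary, so a document and a query can be compared
          position by position.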
          Given two similarity scores $sim(q, d_i) = s_1$ and $sim(q, d_j) = s_2$, $s_1 > s_2$ means that
          document $i$ is judged more relevant than document $j$ to query $q$. Since similarity measures
          are pairwise measures, the values of $s_1$ and $s_2$ do not imply a relationship between
          documents $i$ and $j$ themselves.
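
          The ranking rule above can be sketched as follows. The text does not fix a particular sim
          function at this point, so cosine similarity is used here as an assumption, and the three
          small weight vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical weight vectors over a four-term dictionary.
q = [1, 1, 0, 0]
d_i = [2, 1, 0, 1]
d_j = [0, 1, 3, 0]

s1 = cosine_similarity(q, d_i)
s2 = cosine_similarity(q, d_j)

# Rank the documents by their score against the query, highest first.
ranking = sorted([("d_i", s1), ("d_j", s2)], key=lambda pair: pair[1], reverse=True)
print(ranking)

# s1 and s2 each compare one document with q. How d_i relates to d_j
# is a separate pairwise comparison that must be computed on its own.
print(cosine_similarity(d_i, d_j))
```

          Here s1 exceeds s2, so document i is ranked above document j for this query. The score
          between the two documents is a third, independent value.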
                                           Lovely Professional University