Page 54 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING

Data Warehousing and Data Mining




                                   From a set theoretic standpoint, assume that a universe Ω exists from which subsets A, B are
                                   generated. From the IR perspective, Ω is the dictionary while A and B are documents, with
                                   A usually representing the query. Some similarity measures are more easily visualized via set
                                   theoretic notation.
                                   As a simple measure, |A∩B| denotes the number of shared index terms. However, this simple
                                   coefficient takes no information about the sizes of A and B into account. The Simple coefficient is
                                   analogous to the binary weighting scheme in IR that can be thought of as the frequency of term
                                   co-occurrence with respect to two documents. Although the Simple coefficient is technically a
                                   similarity measure,
                                   Most similarity measures are themselves evaluated by precision and recall. Let A denote the set of
                                   retrieved documents and B denote the set of relevant documents. Define precision and recall as

                                   $$P(A, B) = \frac{|A \cap B|}{|A|}$$
                                   and
                                   $$R(A, B) = \frac{|A \cap B|}{|B|}$$
                                   respectively. Informally, precision is the ratio of returned relevant documents to the total
                                   number of documents returned, while recall is the ratio of returned relevant documents to the
                                   total number of relevant documents. Precision is often evaluated at varying levels of recall
                                   (namely, i = 1, …, |B|) to produce a precision-recall graph. Ideally, IR systems generate high
                                   precision at all levels of recall. In practice, however, most systems exhibit lower precision
                                   values at higher levels of recall.
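
The set definitions of precision and recall can be illustrated with a short Python sketch. The function names, the document identifiers, and the convention of returning 0.0 for an empty set are assumptions made for illustration only; they do not come from the text. The third function records precision each time another relevant document is found in a ranked list, which is one common way to obtain the points of a precision-recall graph.

```python
def precision(retrieved, relevant):
    """P(A, B) = |A ∩ B| / |A|, with A = retrieved and B = relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    # Assumed convention: an empty retrieved set gives precision 0.0.
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """R(A, B) = |A ∩ B| / |B|, with A = retrieved and B = relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    # Assumed convention: an empty relevant set gives recall 0.0.
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def precision_at_recall_levels(ranking, relevant):
    """Return (recall, precision) pairs, one per relevant document found.

    `ranking` is an ordered list of document ids, best match first.
    """
    relevant = set(relevant)
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical data: five returned documents, four relevant ones.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4", "d8"}

p = precision(ranking, relevant)   # 3 of the 5 returned are relevant
r = recall(ranking, relevant)      # 3 of the 4 relevant are returned
curve = precision_at_recall_levels(ranking, relevant)
```

In this made-up ranking the precision values along the curve fall as recall rises, which matches the behaviour described above for most practical systems.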
                                   While the different notation styles may not yield exactly the same numeric values for each pair of
                                   items, the ordering of the items within a set is preserved.

                                   3.3.4 Dice

                                   The Dice coefficient is a generalization of the harmonic mean of the precision and recall measures.
                                   A system with a high harmonic mean should theoretically be closer to an ideal retrieval system in
                                   that it can achieve high precision values at high levels of recall. The harmonic mean for precision
                                   and recall is given by
                                                                          2
                                                                     E =
                                                                        1  +  1
                                                                        P  R
                                   while the Dice coefficient is denoted by
                                   $$\mathrm{sim}(d, d_j) = D(A, B) = \frac{|A \cap B|}{\alpha |A| + (1 - \alpha)|B|} \cong \frac{\alpha \sum_{k=1}^{n} w_{kq} w_{kj}}{\alpha \sum_{k=1}^{n} w_{kq}^{2} + (1 - \alpha) \sum_{k=1}^{n} w_{kj}^{2}},$$
                                   with α ∈ [0, 1]. To show that the Dice coefficient is a weighted harmonic mean, let α = ½.
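
As a numerical check of this claim, the following Python sketch evaluates the set form of the Dice coefficient for one pair of sets and compares it with the harmonic mean of precision and recall. Only the set form D(A, B) is used, not the weighted-vector form. The function names, the example sets, and the handling of a zero denominator are assumptions made for illustration, not part of the text.

```python
import math


def dice(a, b, alpha=0.5):
    """Set form: D(A, B) = |A ∩ B| / (alpha*|A| + (1 - alpha)*|B|)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    a, b = set(a), set(b)
    denom = alpha * len(a) + (1.0 - alpha) * len(b)
    # Assumed convention: a zero denominator gives similarity 0.0.
    return len(a & b) / denom if denom else 0.0


def harmonic_mean(p, r):
    """E = 2 / (1/P + 1/R); taken as 0.0 when P or R is zero."""
    return 2.0 / (1.0 / p + 1.0 / r) if p > 0 and r > 0 else 0.0


# Hypothetical sets: A = retrieved documents, B = relevant documents.
A = {"d1", "d3", "d4", "d7", "d9"}
B = {"d1", "d3", "d4", "d8"}

P = len(A & B) / len(A)   # precision, 3/5
R = len(A & B) / len(B)   # recall, 3/4

d_half = dice(A, B, alpha=0.5)
e = harmonic_mean(P, R)
same = math.isclose(d_half, e)   # Dice at alpha = 1/2 vs. harmonic mean
```

With these sets both quantities come out to 2/3. The endpoints are also informative: under the set form, alpha = 1 reduces D(A, B) to precision and alpha = 0 reduces it to recall, so alpha controls how the two are weighted.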











          48                               LoveLy professionaL university