Page 54 - DCAP603_DATAWARE_HOUSING_AND_DATAMINING
Data Warehousing and Data Mining
From a set-theoretic standpoint, assume that a universe Ω exists from which subsets A, B are
generated. From the IR perspective, Ω is the dictionary while A and B are documents, with
A usually representing the query. Some similarity measures are more easily visualized via
set-theoretic notation.
As a simple measure, |A ∩ B| denotes the number of shared index terms. However, this Simple
coefficient takes no information about the sizes of A and B into account. The Simple coefficient is
analogous to the binary weighting scheme in IR and can be thought of as the frequency of term
co-occurrence with respect to two documents. Although the Simple coefficient is technically a
similarity measure,
Most similarity measures are themselves evaluated by precision and recall. Let A denote the set of
retrieved documents and B denote the set of relevant documents. Define precision and recall as
$$P(A, B) = \frac{|A \cap B|}{|A|}$$
and
$$R(A, B) = \frac{|A \cap B|}{|B|}$$
respectively. Informally, precision is the ratio of returned relevant documents to the total
number of documents returned, while recall is the ratio of returned relevant documents to the
total number of relevant documents. Precision is
often evaluated at varying levels of recall (namely, i = 1, …, |B|) to produce a precision-recall
graph. Ideally, IR systems generate high precision at all levels of recall. In practice, however,
most systems exhibit lower precision values at higher levels of recall.
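The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the document identifiers, and the example ranking are all hypothetical, and returning 0.0 for an empty set is an assumed convention rather than something the text prescribes.

```python
def precision(retrieved, relevant):
    """P(A, B) = |A ∩ B| / |A|, with A = retrieved and B = relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # assumed convention when nothing is retrieved
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """R(A, B) = |A ∩ B| / |B|, with A = retrieved and B = relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


def precision_recall_points(ranking, relevant):
    """Walk a ranked result list and record (recall, precision) each time
    the i-th relevant document is found, for i = 1, ..., |B|."""
    relevant = set(relevant)
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical ranked output of a retrieval system and its relevant set.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4"}

overall_p = precision(ranking, relevant)
overall_r = recall(ranking, relevant)
points = precision_recall_points(ranking, relevant)
```

In this made-up ranking the recorded precision falls as recall rises, which is the typical shape of a precision-recall graph described above.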
While the different notation styles may not yield exactly the same numeric values for each pair of
items, the ordering of the items within a set is preserved.
3.3.4 Dice
The Dice coefficient is a generalization of the harmonic mean of the precision and recall measures.
A system with a high harmonic mean should theoretically be closer to an ideal retrieval system in
that it can achieve high precision values at high levels of recall. The harmonic mean for precision
and recall is given by
$$E = \frac{2}{\frac{1}{P} + \frac{1}{R}}$$
while the Dice coefficient is denoted by
$$D(A, B) = \frac{|A \cap B|}{\alpha |A| + (1 - \alpha)|B|} \cong \mathrm{sim}(d_q, d_j) = \frac{\sum_{k=1}^{n} w_{kq} w_{kj}}{\alpha \sum_{k=1}^{n} w_{kq}^{2} + (1 - \alpha) \sum_{k=1}^{n} w_{kj}^{2}},$$
with α ∈ [0, 1]. To show that the Dice coefficient is a weighted harmonic mean, let α = ½.
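With α = ½ the set form reduces to 2|A ∩ B| / (|A| + |B|), which is the same quantity as the harmonic mean 2PR / (P + R) once P = |A ∩ B| / |A| and R = |A ∩ B| / |B| are substituted. The sketch below checks this numerically on a small made-up example. The function names, the example term sets, and the zero-denominator convention are assumptions for illustration only; the binary weight vectors are one possible choice of the weights w_kq and w_kj.

```python
import math


def dice_sets(a, b, alpha=0.5):
    """Set form: |A ∩ B| / (alpha*|A| + (1 - alpha)*|B|)."""
    a, b = set(a), set(b)
    denom = alpha * len(a) + (1 - alpha) * len(b)
    return len(a & b) / denom if denom else 0.0  # assumed convention for empty sets


def dice_weights(w_q, w_j, alpha=0.5):
    """Vector form over term weights w_kq (query) and w_kj (document)."""
    num = sum(x * y for x, y in zip(w_q, w_j))
    denom = (alpha * sum(x * x for x in w_q)
             + (1 - alpha) * sum(y * y for y in w_j))
    return num / denom if denom else 0.0


def harmonic_mean(p, r):
    """E = 2 / (1/P + 1/R), written as 2PR / (P + R) to avoid dividing by zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical index-term sets: A plays the role of the query, B the document.
A = {"data", "mining", "warehouse"}
B = {"data", "warehouse", "olap", "cube"}

p = len(A & B) / len(A)   # precision-style ratio |A ∩ B| / |A|
r = len(A & B) / len(B)   # recall-style ratio |A ∩ B| / |B|

# Binary weights over a shared vocabulary reproduce the set form.
vocab = sorted(A | B)
w_q = [1 if term in A else 0 for term in vocab]
w_j = [1 if term in B else 0 for term in vocab]

d_set = dice_sets(A, B, alpha=0.5)
d_vec = dice_weights(w_q, w_j, alpha=0.5)
e = harmonic_mean(p, r)
```

Other values of α shift the weight between |A| and |B| in the denominator; only α = ½ treats the two sets symmetrically.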
48 Lovely Professional University