Unit 3: Data Mining Techniques
3.3 Similarity Measures
Similarity measures provide the framework on which many data mining decisions are based.
Tasks such as classification and clustering usually assume the existence of some similarity
measure, while fields with poor methods to compute similarity often find that searching data
is a cumbersome task. Several classic similarity measures are discussed, and the application of
similarity measures to other fields is addressed.
3.3.1 Introduction
The goal of information retrieval (IR) systems is to meet users' needs. In practical terms, a need
is usually manifested in the form of a short textual query entered in the text box of some online
search engine. IR systems typically do not answer a query directly; instead, they present a
ranked list of documents that are judged relevant to that query by some similarity measure. Since
similarity measures have the effect of clustering and classifying information with respect to a
query, users will commonly find new interpretations of their information need that may or may
not be useful to them when reformulating their query. When the query is itself a document
from the initial collection, similarity measures can be used to cluster and classify documents
within that collection. In short, similarity measures can add a rudimentary structure to a previously
unstructured collection.
3.3.2 Motivation
Similarity measures used in IR systems can distort one’s perception of the entire data set. For
example, if a user types a query into a search engine and does not find a satisfactory answer in
the top ten returned web pages, then he/she will usually try to reformulate his/her query once
or twice. If a satisfactory answer is still not returned, then the user will often assume that one
does not exist. Rarely does a user understand or care what ranking scheme a particular search
engine employs.
An understanding of similarity measures, however, is crucial in today’s business world. Many
business decisions are based on answers to questions that are posed in a way similar to how
queries are given to search engines. Data miners do not have the luxury of assuming that the
answers given to them by a database or IR system are correct or all-inclusive; they must know
the drawbacks of any similarity measure used and adjust their business decisions accordingly.
3.3.3 Classic Similarity Measures
A similarity measure is defined as a mapping from a pair of tuples of size k to a scalar number.
By convention, all similarity measures should map to the range [-1, 1] or [0, 1], where a similarity
score of 1 indicates maximum similarity. Similarity measures should exhibit the property that
their value increases as the number of common properties in the two items being compared
increases.
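As an illustration of these conventions, the following is a minimal Python sketch of one measure that maps to [0, 1] and grows with the number of shared properties. The Jaccard coefficient is chosen here only as an example and is not singled out by the text above; the function name and the sample items are hypothetical.

```python
def jaccard_similarity(a, b):
    """Share of properties common to both items, in the range [0, 1]."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two items with no properties at all count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical items, each described by four properties.
item = {"red", "round", "small", "sweet"}
one_shared = jaccard_similarity(item, {"red", "large", "sour", "long"})
two_shared = jaccard_similarity(item, {"red", "round", "sour", "long"})
identical = jaccard_similarity(item, item)

print(one_shared, two_shared, identical)
```

With items of equal size, the score rises as more properties are shared and reaches 1 only when the two items are identical.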
A popular model in many IR applications is the vector-space model, where documents are
represented by a vector of size n, where n is the size of the dictionary. Thus, document i is
represented by a vector $d_i = (w_{i1}, \ldots, w_{in})$, where $w_{ik}$ denotes the weight associated with term k
in document i. In the simplest case, $w_{ik}$ is the frequency of occurrence of term k in document i.
Queries are formed by creating a pseudo-document vector q of size n, where $w_{qk}$ is assumed to
be non-zero if and only if term k occurs in the query.
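The vector-space representation can be sketched in a few lines of Python. This is a minimal illustration for the simplest case, where each weight is a raw term frequency; the whitespace tokenizer, the function names, and the two sample documents are assumptions made for the example.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real systems do far more.
    return text.lower().split()


def build_dictionary(documents):
    """Sorted list of the distinct terms across all documents (size n)."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def tf_vector(text, dictionary):
    """Vector of size n whose k-th entry is the frequency of term k in text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in dictionary]


# Hypothetical two-document collection.
documents = [
    "data mining finds patterns in data",
    "search engines rank documents",
]
dictionary = build_dictionary(documents)
doc_vectors = [tf_vector(doc, dictionary) for doc in documents]

# The query is treated as a pseudo-document over the same dictionary.
query_vector = tf_vector("mining data", dictionary)

print(dictionary)
print(doc_vectors)
print(query_vector)
```

In this sketch, query terms that do not appear in the dictionary are silently dropped, so the query vector is non-zero exactly at the positions of dictionary terms that occur in the query.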
Given two similarity scores $\mathrm{sim}(q, d_i) = s_1$ and $\mathrm{sim}(q, d_j) = s_2$, $s_1 > s_2$ means that document i is
judged more relevant than document j to query q. Since similarity measures are pairwise measures,
the values of $s_1$ and $s_2$ do not imply a relationship between documents i and j themselves.
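The ranking step can be sketched in Python as follows. Cosine similarity is used here purely as an assumed stand-in for the generic sim function, since the passage above does not fix a particular measure; the function names and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors of equal size."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a vector with no non-zero weights matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, d)) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical term-weight vectors over a four-term dictionary.
q = [1, 1, 0, 0]
docs = [
    [2, 1, 0, 0],  # document 0: contains both query terms
    [0, 1, 3, 0],  # document 1: contains one query term
    [0, 0, 0, 5],  # document 2: contains neither query term
]
ranking = rank(q, docs)
print(ranking)
```

Each score compares one document with the query only. Nothing in the ranked list says how similar documents 0 and 1 are to each other; that would require a separate call comparing their vectors directly.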
Lovely Professional University