Page 251 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Information Analysis and Repackaging
• Average distance between documents is the average distance to this “centroid”.
If DV(k) > 0, k is a good discriminator (the whole collection without “k” is less specific
than before).
– by statistical distribution
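Trivial (non-content) words obey a single-Poisson distribution across the documents of a collection.
Therefore, term “relevance” can be computed by comparing the term's distribution to the single-Poisson one: the larger the deviation, the more likely the term carries content.
This can be done via a chi-square test.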
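One way such a comparison could be carried out is sketched below. This is a minimal illustration, not a prescribed procedure: the function names, the binning of counts into 0, 1, 2 and "3 or more", and the toy data are all assumptions made for the example. The Poisson rate is estimated as the mean number of occurrences per document.

```python
import math
from collections import Counter

def poisson_pmf(j, lam):
    """Probability of exactly j occurrences under a Poisson with rate lam."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

def chi_square_vs_poisson(counts, max_bin=3):
    """Chi-square statistic of per-document term counts against a single Poisson.

    counts[d] is the number of occurrences of the term in document d.
    Bins are 0, 1, ..., max_bin - 1 and a final ">= max_bin" bin (assumed binning).
    """
    n = len(counts)
    lam = sum(counts) / n  # rate estimated from the data
    observed = Counter(min(c, max_bin) for c in counts)
    stat = 0.0
    for j in range(max_bin + 1):
        if j < max_bin:
            p = poisson_pmf(j, lam)
        else:
            p = 1.0 - sum(poisson_pmf(i, lam) for i in range(max_bin))
        expected = n * p
        if expected > 0:
            stat += (observed.get(j, 0) - expected) ** 2 / expected
    return stat

# Hypothetical data for 100 documents.
# A function word spread roughly evenly over the collection:
function_word = [0] * 37 + [1] * 37 + [2] * 18 + [3] * 6 + [4] * 2
# A content word concentrated in a few documents:
content_word = [0] * 90 + [10] * 10

print(chi_square_vs_poisson(function_word))
print(chi_square_vs_poisson(content_word))
```

With four bins and one estimated parameter there are 2 degrees of freedom, so the 5% critical value is 5.991. The evenly spread word stays well below it, while the concentrated word exceeds it by a wide margin and would be kept as a content-bearing term.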
Statistical Selection of Phrases
A simple approach to phrase construction/selection is based on the following principles:
– Phrase terms must occur frequently together (“with fewer than k words in between them”)
– Phrase components must be relatively frequent
The resulting algorithm reads:
– Compute pair-wise co-occurrence within constraints
– If greater than a threshold value, compute the cohesion value:
$$\text{cohesion value}(t_i, t_j) = \frac{\text{Co-occurrence Freq.}(t_i, t_j)}{\text{Freq.}(t_i) \cdot \text{Freq.}(t_j)}$$
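The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cohesion_values`, its parameters, the treatment of pairs as ordered (first term, later term), and the toy sentence are all hypothetical and not part of the original procedure.

```python
from collections import Counter

def cohesion_values(tokens, k, threshold):
    """Cohesion of ordered term pairs that co-occur more than `threshold` times.

    Two terms co-occur when fewer than k words lie between them.
    """
    freq = Counter(tokens)
    cooc = Counter()
    for i, t in enumerate(tokens):
        # Tokens at positions i+1 .. i+k have fewer than k words in between.
        for u in tokens[i + 1 : i + 1 + k]:
            if u != t:
                cooc[(t, u)] += 1
    return {
        pair: n / (freq[pair[0]] * freq[pair[1]])
        for pair, n in cooc.items()
        if n > threshold
    }

text = ("information retrieval systems use information retrieval "
        "models for text retrieval")
tokens = text.split()

# k=1 means the two terms must be adjacent; keep pairs seen more than once.
phrases = cohesion_values(tokens, k=1, threshold=1)
print(phrases)
```

Here only the pair ("information", "retrieval") passes the threshold: it co-occurs twice, and its cohesion is 2 / (2 · 3), since "information" occurs twice and "retrieval" three times.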
Problems of Automated Thesaurus Construction
It must be noted that a purely statistical analysis cannot be expected to find the exact type of semantic
relationship between terms. This has much to do with the problem of natural language processing
(NLP), a promising field of research that involves both Artificial Intelligence and Linguistics.
A simple modification that can provide non-statistical insight into semantics is the
distinction of part-of-speech. This implies “tagging” each term with its corresponding
type (verb, noun, adjective, …). A program capable of doing this is called a “tagger”.
Although we have seen only the statistical approach, it is possible to either bypass it or complement
it with external relevance judgements. For example, given a query that divides all documents
into either “relevant” or “irrelevant”, a thesaural term that appears in only one of the classes would
be better for that specific query than one that appears in both. Many such approaches have been
proposed and tested, mostly with good results.
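The selection rule in this example can be illustrated with a short sketch. It assumes that each document is represented simply as a set of terms and that relevance judgements for the query are already available; the function name `terms_in_one_class` and the sample data are hypothetical.

```python
def terms_in_one_class(candidates, relevant_docs, irrelevant_docs):
    """Keep candidate terms that occur in exactly one of the two classes."""
    in_relevant = set().union(*relevant_docs)
    in_irrelevant = set().union(*irrelevant_docs)
    return [t for t in candidates if (t in in_relevant) != (t in in_irrelevant)]

# Hypothetical judgements for one query: documents as sets of terms.
relevant_docs = [{"thesaurus", "term", "index"}, {"thesaurus", "query"}]
irrelevant_docs = [{"index", "poisson"}, {"query", "chi-square"}]

candidates = ["thesaurus", "index", "query", "poisson"]
print(terms_in_one_class(candidates, relevant_docs, irrelevant_docs))
```

In this toy case "thesaurus" (relevant documents only) and "poisson" (irrelevant documents only) are kept, while "index" and "query" appear in both classes and are dropped.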
The problem with relevance judgements is their availability: as yet, only humans can produce them,
and therefore they are scarce for most collections. A truly automatically generated thesaurus should
not depend on such judgements.
Another problem in automated thesaurus construction is verification: how can the performance of
a thesaurus be tested? Usually, the ability of a thesaurus to extract more relevant documents during
a search is used as an indicator of thesaurus quality.
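As a rough illustration of this kind of test, the sketch below compares recall for a query with and without thesaurus expansion. Everything in it is an assumption made for the example: the simple "shares at least one term" retrieval rule, the helper names, and the three-document collection.

```python
def retrieve(query_terms, docs):
    """Return ids of documents sharing at least one term with the query."""
    return {doc_id for doc_id, terms in docs.items() if terms & query_terms}

def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def expand(query_terms, thesaurus):
    """Add the thesaurus entries of every query term to the query."""
    expanded = set(query_terms)
    for t in query_terms:
        expanded |= thesaurus.get(t, set())
    return expanded

# Hypothetical collection, judgements and thesaurus.
docs = {
    1: {"car", "engine"},
    2: {"automobile", "repair"},
    3: {"banana", "fruit"},
}
relevant = {1, 2}
thesaurus = {"car": {"automobile"}}
query = {"car"}

baseline = recall(retrieve(query, docs), relevant)
expanded = recall(retrieve(expand(query, thesaurus), docs), relevant)
print(baseline, expanded)  # 0.5 1.0
```

A thesaurus that raises recall in this way would be judged useful, although a fuller evaluation would also check that the added terms do not pull in many irrelevant documents.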
PhraseFinder
PhraseFinder is an automatically generated thesaurus that is integrated within a retrieval engine,
InQuery. InQuery is part of the TIPSTER project in the Information Retrieval Laboratory of the
Computer Science Department, University of Massachusetts, Amherst.
246 LOVELY PROFESSIONAL UNIVERSITY