Page 252 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Unit 11: Indexing Language: Types and Characteristics
About TIPSTER:
The TIPSTER Text Program was a Defense Advanced Research Projects Agency (DARPA) led
government effort to advance the state of the art in text processing technologies through the
cooperation of researchers and developers in Government, industry and academia. The resulting
capabilities were deployed within the intelligence community to provide analysts with improved
operational tools. Due to lack of funding, this program formally ended in the Fall of 1998.
Architecture
PhraseFinder takes tagged documents as input (the Church tagger is employed to assign each word
a part-of-speech), selects terms and phrases, and creates associations between these thesaural terms.
PhraseFinder distinguishes the following hierarchy in a text document:
Text Object → Paragraph → Sentence → Phrase → Word
Where the text object is simply the whole document, a paragraph is defined as either a “natural
paragraph” or a fixed number of sentences, and a phrase can be whatever fits a “phrase rule”.
Phrase rules are specified by their part-of-speech components and the restriction that a single phrase
cannot span more than one sentence. Individual words are filtered against a stopword list and stemmed,
but phrases are treated more conservatively.
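The phrase-rule idea described above can be illustrated with a minimal sketch. It assumes each sentence arrives as a list of (word, tag) pairs and that a rule is simply a sequence of part-of-speech tags; the rule set `PHRASE_RULES`, the tag names, and the function `find_phrases` are hypothetical and are not taken from PhraseFinder itself. Because the function only ever sees one sentence, a phrase cannot span a sentence boundary.

```python
from typing import List, Tuple

TaggedWord = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical phrase rules: each rule is a sequence of POS tags.
PHRASE_RULES = [("JJ", "NN"), ("NN", "NN")]


def find_phrases(sentence: List[TaggedWord], rules=PHRASE_RULES) -> List[str]:
    """Return every word span in ONE sentence whose tags match a rule."""
    tags = [tag for _, tag in sentence]
    phrases = []
    for rule in rules:
        n = len(rule)
        # Slide a window of the rule's length over the sentence.
        for start in range(len(sentence) - n + 1):
            if tuple(tags[start:start + n]) == tuple(rule):
                words = [word for word, _ in sentence[start:start + n]]
                phrases.append(" ".join(words))
    return phrases


sentence = [("automatic", "JJ"), ("query", "NN"),
            ("expansion", "NN"), ("helps", "VBZ")]
print(find_phrases(sentence))
```

Here the adjective-noun rule matches "automatic query" and the noun-noun rule matches "query expansion".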
Paragraph limits (max. number of sentences in a paragraph) also mark the limits for association
finding. Associations are built only within the same paragraph and have the following structure:
<termId, phraseId, associationFrequency>
where
associationFrequency = termFrequency × phraseFrequency
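As a rough illustration of how such triples could be built, the sketch below counts terms and phrases per paragraph and multiplies the two counts. It assumes that frequencies are counted within each paragraph and that the products are summed over paragraphs, which the text does not state explicitly; plain strings stand in for termId and phraseId, and `build_associations` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_associations(paragraphs):
    """paragraphs: list of (terms, phrases) pairs, one pair per paragraph.

    Returns {(term, phrase): associationFrequency}.
    """
    associations = defaultdict(int)
    for terms, phrases in paragraphs:
        term_freq = Counter(terms)
        phrase_freq = Counter(phrases)
        # Pairs are formed only inside the same paragraph.
        for term, tf in term_freq.items():
            for phrase, pf in phrase_freq.items():
                # Assumption: per-paragraph products are summed.
                associations[(term, phrase)] += tf * pf
    return dict(associations)


paragraphs = [
    (["index", "index", "term"], ["indexing language"]),
    (["index"], ["indexing language", "query expansion"]),
]
associations = build_associations(paragraphs)
for (term, phrase), freq in sorted(associations.items()):
    print(term, phrase, freq)
```

In this toy input, "index" occurs twice alongside "indexing language" in the first paragraph (2 × 1) and once in the second (1 × 1), giving an association frequency of 3.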
Since most associations (70%) occur only once in the TIPSTER database, and 90% only once in the
same document, association filtering is performed as follows:
– If an association has a frequency of 1, it is discarded
– If a simple term has too many associations, it is discarded as too general
This has the effect of both reducing the storage size and improving the search capabilities of the
thesaurus.
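The two filtering rules can be sketched as follows. The threshold `max_per_term` is a hypothetical parameter, since the text gives no value for "too many", and the sketch assumes a term's associations are counted after the frequency-1 associations have been removed, which the source does not specify.

```python
from collections import Counter


def filter_associations(associations, max_per_term):
    """associations: {(term, phrase): frequency}."""
    # Rule 1: discard associations that occur only once.
    kept = {pair: freq for pair, freq in associations.items() if freq > 1}
    # Rule 2: discard terms that still have too many associations
    # (assumed here to be counted after rule 1 is applied).
    per_term = Counter(term for term, _ in kept)
    return {
        (term, phrase): freq
        for (term, phrase), freq in kept.items()
        if per_term[term] <= max_per_term
    }


sample = {
    ("index", "indexing language"): 3,
    ("index", "query expansion"): 1,
    ("system", "indexing language"): 2,
    ("system", "query expansion"): 4,
    ("system", "inference network"): 5,
}
filtered = filter_associations(sample, max_per_term=2)
print(filtered)
```

With a threshold of 2, the term "system" is dropped as too general because it keeps three associations, while "index" loses only its frequency-1 association.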
Access to the Thesaurus and Query Expansion
Access to the thesaurus is done via InQuery, an IR system based on probabilistic (Bayesian) inference
networks.
The associations for each term are added as “pseudo-documents”, and this “pseudo-database” is
later searched to find the relevant phrases for a given query. The output is ranked by the search
engine.
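The lookup step can be approximated with a much simpler stand-in. The sketch below does not reproduce InQuery or its inference network; it merely sums the association frequencies of the query terms for each phrase and sorts the result, to show the shape of "query in, ranked phrases out". The function `rank_phrases` and the sample data are hypothetical.

```python
from collections import defaultdict


def rank_phrases(query_terms, associations, top_k=3):
    """Rank phrases by the summed association frequency of the query terms.

    This plain sum is a substitute for the search engine's own ranking.
    """
    scores = defaultdict(int)
    for (term, phrase), freq in associations.items():
        if term in query_terms:
            scores[phrase] += freq
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


associations = {
    ("index", "indexing language"): 3,
    ("term", "indexing language"): 2,
    ("index", "inverted file"): 4,
    ("query", "query expansion"): 6,
}
ranked = rank_phrases({"index", "term"}, associations)
print(ranked)
```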
The original query is then expanded with these results, although the weighting of the added query
terms is still done manually (smaller weights for smaller collections). When deciding which phrases
to add to a query, one of the following options must be chosen:
– duplicates: Add only those phrases in which all the words were already present in the original
query (this “reweights” the original query)
– nonduplicates: Add only those phrases in which at least one word is not present in the original query
– both
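The three options above amount to a simple set test on the words of each candidate phrase. The following is a minimal sketch, assuming whitespace tokenisation and case-insensitive matching with no stemming; the function `select_phrases` and the mode names are hypothetical.

```python
MODES = {"duplicates", "nonduplicates", "both"}


def select_phrases(query, phrases, mode="both"):
    """Keep candidate phrases according to the chosen expansion option."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    query_words = set(query.lower().split())
    selected = []
    for phrase in phrases:
        # A duplicate phrase contains only words already in the query.
        is_duplicate = set(phrase.lower().split()) <= query_words
        if (mode == "both"
                or (mode == "duplicates" and is_duplicate)
                or (mode == "nonduplicates" and not is_duplicate)):
            selected.append(phrase)
    return selected


query = "information retrieval system"
candidates = ["information retrieval", "retrieval evaluation", "text indexing"]
print(select_phrases(query, candidates, mode="duplicates"))
print(select_phrases(query, candidates, mode="nonduplicates"))
```

In this example "information retrieval" is a duplicate, while "retrieval evaluation" counts as a nonduplicate because "evaluation" does not appear in the query.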