Page 253 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Information Analysis and Repackaging
Results
The experimental results from this system are divided into two parts: those used to set particular
system parameters (e.g., phrase rules), tested on a smaller collection, and those designed to
test broader assumptions, run on a larger collection.
The “small collection” is NLP, an 11,429-document collection of titles and abstracts in the
area of physics.
The “big one” is TIPSTER, which includes 742,358 full-text documents from various sources (San
Jose Mercury News, Associated Press, Federal Register, ...). Generating a thesaurus for the whole
TIPSTER database takes about two weeks on a 1993 computer.
Best Type of Word
Figure 11.9: Percent improvement (vertical axis, –0.5 to 4.5) by phrase rule (horizontal axis: {V, G, D}, {J, R}, {N}, All), with series Duplicates, Nodup, and Both.
Verbs are the least useful, and adjectives and adverbs alone perform better than an “anything goes”
approach. Clearly, nouns are the most informative (although their effectiveness is most dramatic when
used to reweight a query).
Best Phrase Rule
Figure 11.10: Percent improvement (vertical axis, 0 to 10) by phrase rule (horizontal axis: {NNN, JNN, JJN, NN, JN, N}, {NNN, NN, N}, {NNN, NN}), with series Duplicates, Nondup, and Both.
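One way to read a phrase rule such as {NNN, NN, N} is as a set of part-of-speech patterns: a run of words is kept as a phrase when its tag sequence equals one of the patterns. The sketch below illustrates that reading only. It assumes single-letter tags (N for noun, J for adjective, V for verb), input that is already tagged, and a longest-match-first scan. The function name `extract_phrases` and the sample sentence are hypothetical and are not taken from the system described here.

```python
def extract_phrases(tagged, rule):
    """Collect word runs whose tag sequence matches a pattern in `rule`.

    tagged: list of (word, tag) pairs with single-letter tags (assumed format).
    rule:   set of tag patterns such as {"NNN", "NN", "N"}.
    At each position the longest matching pattern wins (an assumption).
    """
    patterns = sorted(rule, key=len, reverse=True)  # try longer patterns first
    phrases = []
    i = 0
    while i < len(tagged):
        for pattern in patterns:
            window = tagged[i:i + len(pattern)]
            tags = "".join(tag for _, tag in window)
            if tags == pattern:
                phrases.append(" ".join(word for word, _ in window))
                i += len(pattern)  # continue after the matched phrase
                break
        else:
            i += 1  # no pattern starts here, move on one word
    return phrases


# Hypothetical tagged sentence, for illustration only.
sentence = [
    ("thin", "J"), ("film", "N"), ("absorbs", "V"),
    ("laser", "N"), ("beam", "N"), ("energy", "N"),
]

nouns_only = extract_phrases(sentence, {"NNN", "NN", "N"})
with_adjectives = extract_phrases(sentence, {"NNN", "JNN", "JJN", "NN", "JN", "N"})
```

With the noun-only rule the adjective "thin" is skipped and "film" is kept on its own, while the broader rule keeps "thin film" as a single phrase. Both rules keep "laser beam energy".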
248 LOVELY PROFESSIONAL UNIVERSITY