Page 249 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Information Analysis and Repackaging
• Taxonomy and Synonymy: same meaning, different levels of specificity
• Antonymy: opposite (in some sense) meaning
However, these are not easily found during automatic thesaurus generation, as they require a great
deal of “semantic” knowledge that is not easy to capture from the documents alone. Instead, the
multi-purpose “associated with” relation is used.
Normalization
Manual thesauri use a very complex set of rules (few adjectives, strip some prepositions, noun
form, capitalization) to achieve vocabulary “normalization”: store only the “base form” of each
term, instead of all its variants. Normalization can be critical to reducing the amount of space needed.
The problem with this complex normalization is that the user must be aware of the normalized
form in order to use the thesaurus.
In automatic thesauri, a simpler (but less precise) approach is usually taken:
• Apply a stoplist filter
• Use a standard stemmer on the remaining words (e.g., Porter)
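The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stoplist is a tiny hypothetical one, and `crude_stem` is a simplistic suffix stripper standing in for a real stemmer such as Porter's, not an implementation of the Porter algorithm.

```python
import re

# Hypothetical, deliberately tiny stoplist; real stoplists are much longer.
STOPLIST = {"a", "an", "and", "of", "the", "in", "for", "to", "is", "are"}

# Crude suffix rules standing in for a real stemmer (assumption, not Porter).
# Suffixes are checked in order; only the first match is removed.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stoplist words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPLIST]


print(normalize("The indexing of documents"))
```

Because "indexing" and "indexes" both reduce to the same stem here, only one form needs to be stored, which is the space saving the text describes. The loss of precision is also visible: a rule-based stripper like this one cannot tell unrelated words apart when they happen to share a stem.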
The other side of the problem (a single word for multiple meanings) arises with “homographs”.
Homographs can be handled in manual thesauri via parenthetical specification (in INSPEC, the
terms “bond(chemical)” and “bond(cohesive)”). This is not so easy to do in automatically generated
ones, as the meaning can only be extracted from the term’s context.
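As a rough illustration of the parenthetical convention, the sketch below parses entries of the form "term(qualifier)" and groups the qualifiers under their base term. The two "bond" entries are the ones quoted above; the unqualified "thesaurus" entry and all function names are hypothetical, added only for the example.

```python
import re

# "bond(...)" entries come from the text; "thesaurus" is a hypothetical
# unqualified entry added for contrast.
ENTRIES = ["bond(chemical)", "bond(cohesive)", "thesaurus"]


def parse_entry(entry):
    """Split 'term(qualifier)' into (term, qualifier); qualifier is None if absent."""
    match = re.fullmatch(r"([^()]+)\(([^()]+)\)", entry)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return entry.strip(), None


def build_index(entries):
    """Map each base term to the list of qualifiers recorded for it."""
    index = {}
    for entry in entries:
        term, qualifier = parse_entry(entry)
        index.setdefault(term, []).append(qualifier)
    return index


def is_homograph(index, term):
    """Treat a term as a homograph if it has more than one qualified sense."""
    return len([q for q in index.get(term, []) if q is not None]) > 1


index = build_index(ENTRIES)
print(index["bond"], is_homograph(index, "bond"))
```

The sketch only works because a human editor has already written the qualifiers. An automatic procedure would first have to infer them from each term's context, which is the difficulty noted above.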
Automated Thesauri
Manual vs. Automatic thesauri for IR
This unit deals with the differences between manually and automatically generated thesauri for
the field of IR. The following tables illustrate these differences in terms of structure, goal,
construction and verification.
Table 11.3
            Manual                                        Automatic
Structure   – Hierarchy of thesaural terms                – Many different approaches, but not
                                                            always hierarchical
            – High level of coordination                  – Lower level of coordination (phrase
                                                            selection not easy to do)
            – Many types of relations between terms
            – Complex normalization rules                 – Simple normalization rules; hard to
                                                            separate homographs
            – Field limits are specified by the           – Field limits are specified by the
              creators                                      collection
Goal        – Main goal is to precisely define the        – Depending on level of coordination,
              vocabulary to be used in a technical field    can be used for indexing
            – Due to this precise definition, useful to   – Main use is to assist in retrieval
              index documents                               through (possibly automated) query
                                                            expansion/contraction
            – Assistance in developing search strategy
            – Assistance in retrieval through query
              expansion/contraction
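Query expansion, listed in the table as a main use of both kinds of thesauri, can be sketched as follows. The `ASSOCIATED` mapping is a hypothetical "associated with" thesaurus invented for the example, and the function shows only expansion, not contraction.

```python
# Hypothetical "associated with" thesaurus: each term maps to related terms.
ASSOCIATED = {
    "thesaurus": ["vocabulary", "index"],
    "retrieval": ["search"],
}


def expand_query(terms, thesaurus):
    """Return the query terms plus their associated terms, without duplicates."""
    expanded = []
    seen = set()
    for term in terms:
        # Keep the original term first, then add whatever the thesaurus relates to it.
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


print(expand_query(["thesaurus", "retrieval"], ASSOCIATED))
```

Keeping the original terms first and skipping duplicates is a design choice of this sketch. A real system might instead weight the added terms lower than the ones the user typed.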
244 LOVELY PROFESSIONAL UNIVERSITY