Page 248 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Unit 11: Indexing Language: Types and Characteristics
The focus will be on the creation and later use of the machine-generated sort, and as such I will try
to enlighten the reader with the methods and pitfalls encountered.
We will also explain the structure of an existing system, PhraseFinder, and the decisions involved
in its design.
A thesaurus is (Merriam-Webster Dictionary definition):
• a book of words or of information about a particular field or set of concepts; especially : a book
of words and their synonyms
• a list of subject headings or descriptors usually with a cross-reference system for use in the
organization of a collection of documents for reference and retrieval
This definition emphasises the difference between the “thesaurus” used by a creative author, and
that used in conjunction with information retrieval (IR) systems. A writer’s thesaurus contains
creative synonyms and related phrases that allow authors to enhance their vocabulary.
Thesaurus Structure
A thesaurus, as used for IR, is a collection of terms/phrases and relationships between those terms.
Basic decisions are:
Level of Coordination
The level of coordination determines what the thesaurus considers to be a term or phrase. High
coordination seeks to build bigger phrases, which produces a much more specific thesaurus. The
problem is that too much specificity is not useful (if we knew “exactly” what we were looking for,
we would not need a thesaurus). Indeed, too much coordination is harmful: the user must know the
exact rules the system uses to construct the long phrase being sought. On the practical side,
constructing these phrases automatically is a difficult problem.
Minimum coordination means using single terms as phrases. This is also not optimal: for example,
it misses the distinction between “school library” and “library school”, and thus mixes totally
separate concepts.
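The “school library” / “library school” problem can be illustrated with a minimal Python sketch. The two helper functions below, single_term_key and phrase_key, are hypothetical names introduced only for this illustration; they are not part of PhraseFinder or any other system described in this unit, and real systems use far more elaborate phrase-building rules.

```python
def single_term_key(phrase):
    """Low coordination: treat the phrase as an unordered set of single terms."""
    return frozenset(phrase.lower().split())


def phrase_key(phrase):
    """High coordination: treat the whole phrase as one ordered unit."""
    return tuple(phrase.lower().split())


a, b = "school library", "library school"

# With single terms, the two concepts collapse into the same index entry.
same_under_low = single_term_key(a) == single_term_key(b)

# With ordered phrases they stay distinct, but the searcher must
# know the word order that the system used to build the phrase.
same_under_high = phrase_key(a) == phrase_key(b)

print(same_under_low, same_under_high)
```

Under these assumptions the low-coordination keys are equal and the high-coordination keys differ, which is the trade-off between ease of searching and specificity.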
Table 11.2
Coordination level   High                              Low
+                    Greater specificity               Easy to build
                     Can be used for indexing          Easy to search with (user need not
                     High-frequency terms can be       worry about term ordering)
                     included into phrases to make
                     them more specific
–                    Hard to build automatically       Less specific
                     User must be familiar with        Only good for retrieval (bad
                     phrase-building rules             indexing capabilities)
Relationships
One of several taxonomies for relationships between thesaural terms is the following:
• Part – Whole: element – set
• Collocation: words that frequently come together
• Paradigmatic: words with similar semantic core (lunar – moon)
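One way to picture a thesaurus as “a collection of terms and relationships between those terms” is the following Python sketch. It is only an illustration under stated assumptions: the class Thesaurus, its methods, and the relation labels are hypothetical names, relations are stored symmetrically for simplicity (a real thesaurus would probably keep Part – Whole directed), and the collocation pair is an invented example.

```python
from collections import defaultdict

# The three relationship types from the taxonomy above.
RELATION_TYPES = {"part_whole", "collocation", "paradigmatic"}


class Thesaurus:
    def __init__(self):
        # term -> relation type -> set of related terms
        self._relations = defaultdict(lambda: defaultdict(set))

    def relate(self, term, relation, other):
        """Record that `term` and `other` stand in the given relation."""
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        # Assumption: store both directions so either term finds the other.
        self._relations[term][relation].add(other)
        self._relations[other][relation].add(term)

    def related(self, term, relation):
        """Return the terms related to `term`, in alphabetical order."""
        by_type = self._relations.get(term, {})
        return sorted(by_type.get(relation, set()))


thesaurus = Thesaurus()
thesaurus.relate("element", "part_whole", "set")
thesaurus.relate("lunar", "paradigmatic", "moon")
thesaurus.relate("information", "collocation", "retrieval")  # invented pair

print(thesaurus.related("moon", "paradigmatic"))
```

A query term can then be expanded with its related terms, selecting the relation type according to whether broader recall or closer synonymy is wanted.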
LOVELY PROFESSIONAL UNIVERSITY 243