Page 250 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Unit 11: Indexing Language: Types and Characteristics
Construction   1. Define the boundaries of the field and    1. Identify the collection to be used.
                  subdivide it into areas.
               2. Fix characteristics.                      2. Fix characteristics (fewer degrees of
                                                               freedom here).
               3. Collect term definitions from a variety   3. Select and normalize terms; construct
                  of sources (including encyclopaedias,        phrases.
                  expert advice, …).
               4. Analyze the data and set up               4. Use statistical analysis to find
                  relationships. From these, a hierarchy       relationships (only one kind).
                  should arise.
               5. Evaluate consistency, incorporate new     5. If desired, organize as a hierarchy
                  terms, or change relationships.              [3, 4].
               6. Create an inverted form and release the
                  thesaurus.
               7. Update the thesaurus periodically.
Verification   Soundness and coverage of the concept        Ability to improve retrieval performance.
               classification.
Techniques for Automatic Thesaurus Generation
We will now sample some techniques used in automated thesaurus construction. Note that there are
other approaches to this problem that do not involve statistical analysis of a document collection,
for example:
• Automatically merging existing thesauri to produce a combination of them.
• Using an expert system in conjunction with a retrieval engine to learn, through user feedback,
the necessary thesaural associations.
Some Techniques for Term Selection
Terms can be extracted from the title, the abstract, or the full text (if it is available). The first
step is usually to normalize the terms with a stopword filter and a stemmer.
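The normalization step described above can be sketched as follows. This is only an illustration under stated assumptions: the stopword list is a tiny hypothetical one, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer (such as Porter's), not an implementation of any particular algorithm.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

# Suffixes tried in order by the crude stand-in stemmer below.
SUFFIXES = ("ing", "ies", "es", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize on letters, drop stopwords, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


terms = normalize("Indexing the documents and retrieving indexed terms")
print(terms)
```

Note how "Indexing" and "indexed" collapse to the same normalized term, which is the purpose of stemming before any counting is done.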
Afterwards, “relevant” terms are determined, for example via one of the following techniques:
– by frequency of occurrence:
Terms can be classified into high, middle and low frequency.
• High-frequency terms are often too general to be of interest (although they can sometimes
be used in conjunction with others to build a less general phrase).
• Low-frequency terms will be too specific, and will probably lack the necessary
relationships to be of interest.
• Middle-frequency terms are usually the best ones to keep.
The thresholds must be specified manually.
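A minimal sketch of the frequency-based classification follows. It assumes that "frequency" means the total number of occurrences in the collection (document frequency would work the same way), and the function name `split_by_frequency` and the threshold values `low` and `high` are hypothetical, chosen by hand as the text requires.

```python
from collections import Counter


def split_by_frequency(docs, low, high):
    """Classify terms by total count: below `low`, above `high`, or in between."""
    counts = Counter(term for doc in docs for term in doc)
    bands = {"low": set(), "middle": set(), "high": set()}
    for term, n in counts.items():
        if n < low:
            bands["low"].add(term)
        elif n > high:
            bands["high"].add(term)
        else:
            bands["middle"].add(term)
    return bands


# Toy collection: each document is a list of already normalized terms.
docs = [
    ["index", "term", "thesaurus"],
    ["index", "term", "retrieval"],
    ["index", "term", "thesaurus", "stemmer"],
    ["index", "document"],
]
bands = split_by_frequency(docs, low=2, high=3)
print(sorted(bands["middle"]))
```

With these hand-picked thresholds, "index" falls in the high band, the terms that occur only once fall in the low band, and the remaining terms are kept as candidates.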
– by discrimination value:
DV(k) = (average similarity) - (average similarity without using term ‘k’)
Similarity is defined in terms of the distance between documents, where “distance” can be any
appropriate measure (usually that of the vector-space model).
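The definition above can be illustrated with a small brute-force sketch. It assumes cosine similarity as the vector-space measure and averages it over all document pairs; the function names and the toy weights are hypothetical. The sign follows the formula exactly as written in the text; some presentations define the discrimination value with the opposite sign, so the convention in use should be checked before interpreting the result.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_similarity(vectors):
    """Mean cosine similarity over all document pairs (assumes at least two documents)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discrimination_value(vectors, k):
    """DV(k) as written in the text: average similarity minus average similarity without term k."""
    without_k = [[w for i, w in enumerate(vec) if i != k] for vec in vectors]
    return average_similarity(vectors) - average_similarity(without_k)


# Toy term weights: rows are documents, columns are terms 0..2.
vectors = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
dv = [discrimination_value(vectors, k) for k in range(3)]
print(dv)
```

In this toy collection, term 0 occurs in every document, so removing it makes the documents less similar and, under the sign convention of the text, its value is positive; terms 1 and 2 help tell documents apart and get negative values. Comparing every pair of documents is quadratic in the collection size, which is why cheaper ways of estimating the average similarity are of interest.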
Average similarity in a collection can be calculated via the “method of centroids”:
• Calculate the average vector (the centroid) of the whole collection in the vector space
LOVELY PROFESSIONAL UNIVERSITY 245