Page 250 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING

Unit 11: Indexing Language: Types and Characteristics

Construction  1. Define the boundaries of the field and       1. Identify the collection to be used.
                 subdivide it into areas.
              2. Fix characteristics.                         2. Fix characteristics (fewer degrees of
                                                                 freedom here).
              3. Collect term definitions from a variety      3. Select and normalize terms; construct
                 of sources (including encyclopaedias,           phrases.
                 expert advice, …).
              4. Analyze the data and set up relationships.   4. Statistical analysis to find
                 From these, a hierarchy should arise.           relationships (only one kind).
              5. Evaluate consistency, incorporate new        5. If desired, organize as a hierarchy
                 terms or change relationships.                  [3, 4].
              6. Create an inverted form and release
                 the thesaurus.
              7. Update periodically.
Verification  Soundness and coverage of concept               Ability to improve retrieval
              classification.                                 performance.

Techniques for Automatic Thesaurus Generation

We will now sample some techniques used in automated thesaurus construction. Note that there are
other approaches to this problem that do not involve statistical analysis of a document collection,
for example:
• Automatic merging of existing thesauri to produce a combined thesaurus.
• Use of an expert system in conjunction with a retrieval engine to learn, through user feedback,
the necessary thesaural associations.

Some Techniques for Term Selection
Terms can be extracted from the title, the abstract, or the full text (if it is available). The first
step is usually to normalize the terms with a stopword filter and a stemmer.
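As an illustration of this normalization step, the following is a minimal sketch in Python. The stopword list and the suffix rules are hypothetical and deliberately tiny; in particular, `toy_stem` is a crude suffix stripper written for this example, not the Porter stemmer or any other published algorithm.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on", "with"}

# Toy suffix rules, tried in this order. This is NOT a real stemming algorithm.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


print(normalize("Indexing the documents of a collection"))
```

The order matters in this sketch: stopwords are removed before stemming, so that the stemmer cannot turn an ordinary word into something that looks like a stopword.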
Afterwards, “relevant” terms are determined, for example via one of the following techniques:
– by frequency of occurrence:
Terms can be classified into high, middle and low frequency.
• High-frequency terms are often too general to be of interest (although they can sometimes
be combined with others to build a less general phrase).
• Low-frequency terms will be too specific, and will probably lack the necessary relationships
to be of interest.
• Middle-frequency terms are usually the best ones to keep.
The thresholds must be specified manually.
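The frequency-based classification above can be sketched as follows. This is only an illustration: it assumes that "frequency" means the total number of occurrences across the collection (document frequency would be an equally plausible reading), and the parameter names `low_max` and `high_min` are hypothetical names for the two hand-picked thresholds.

```python
from collections import Counter


def classify_by_frequency(documents, low_max, high_min):
    """Split terms into low/middle/high classes by total occurrence count.

    documents: list of token lists (already normalized).
    low_max:   terms with count <= low_max are "low" (hand-picked threshold).
    high_min:  terms with count >= high_min are "high" (hand-picked threshold).
    """
    counts = Counter(term for doc in documents for term in doc)
    classes = {"low": set(), "middle": set(), "high": set()}
    for term, n in counts.items():
        if n <= low_max:
            classes["low"].add(term)
        elif n >= high_min:
            classes["high"].add(term)
        else:
            classes["middle"].add(term)
    return classes


docs = [
    ["index", "term", "thesaurus"],
    ["index", "term", "retrieval"],
    ["index", "term", "thesaurus"],
    ["index", "stem"],
]
classes = classify_by_frequency(docs, low_max=1, high_min=4)
print(sorted(classes["middle"]))  # candidate terms to keep
```

With these toy thresholds, "index" (4 occurrences) falls in the high class, "retrieval" and "stem" (1 each) in the low class, and "term" and "thesaurus" remain as middle-frequency candidates.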
– by discrimination value:
DV(k) = (average similarity) - (average similarity without using term k)
Similarity is measured between pairs of documents, using any appropriate measure (usually
that of the vector-space model).
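A minimal sketch of this computation is given below. It makes three assumptions: cosine similarity stands in for the vector-space measure, documents are represented as hypothetical `{term: weight}` dictionaries, and the average is taken by brute force over all document pairs. It follows the formula exactly as written above; some presentations define DV with the opposite sign, so the sign of the result should be read against the definition in use.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarity(docs):
    """Mean cosine similarity over all unordered document pairs."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discrimination_value(docs, term):
    """DV(k) as written above: average similarity minus average similarity without k."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return average_similarity(docs) - average_similarity(without)


docs = [
    {"common": 1.0, "a": 1.0},
    {"common": 1.0, "b": 1.0},
    {"common": 1.0, "c": 1.0},
]
print(discrimination_value(docs, "common"))  # positive under this sign convention
print(discrimination_value(docs, "a"))       # negative under this sign convention
```

In this toy collection, removing "common" makes every pair of documents dissimilar, so with the formula as written the term shared by all documents gets a positive value, while a term that appears in only one document gets a negative one. The pairwise average costs a number of similarity computations quadratic in the collection size, which is what motivates cheaper ways of estimating it.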

Average similarity in a collection can be calculated via the “method of centroids”:
• Calculate the average vector-space “vector” for the whole collection

LOVELY PROFESSIONAL UNIVERSITY 245
