Page 250 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING

Unit 11: Indexing Language: Types and Characteristics



              Construction  1. Define the boundaries of the field and     1. Identify the collection to be used
                               subdivide it into areas
                            2. Fix characteristics                        2. Fix characteristics (fewer degrees of
                                                                             freedom here)
                            3. Collect term definitions from a variety    3. Select and normalize terms; construct
                               of sources (including encyclopaedias,         phrases
                               expert advice, …)
                            4. Analyze the data and set up                4. Statistical analysis to find
                               relationships. From these, a hierarchy        relationships (only one kind)
                               should arise
                            5. Evaluate consistency; incorporate new      5. If desired, organize as a hierarchy
                               terms or change relationships [3, 4]
                            6. Create an inverted form and release
                               the thesaurus
                            7. Update periodically
              Verification  Soundness and coverage of concept             Ability to improve retrieval
                            classification                                performance


            Techniques for Automatic Thesaurus Generation

            We will now sample some techniques used in automated thesaurus construction. Note that other
            approaches to this problem do not involve statistical analysis of a document collection, for
            example:
                 •  Automatically merging existing thesauri to produce a combined thesaurus
                 •  Using an expert system in conjunction with a retrieval engine to learn, through user
                    feedback, the necessary thesaural associations

            Some Techniques for Term Selection
            Terms can be extracted from the title, the abstract, or the full text (if it is available). The first step
            is usually to normalize the terms with a stopword filter and a stemmer.
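
The normalization step can be sketched in a few lines of Python. This is only an illustration: the stopword list, the suffix list, and the helper names `toy_stem` and `normalize` are assumptions made for this example, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

# Suffixes removed by the toy stemmer, longest first.
SUFFIXES = ("ing", "es", "ed", "s")


def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, tokenize, drop stopwords, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


print(normalize("Indexing of the documents in a collection"))
# prints ['index', 'document', 'collection']
```

Filtering stopwords before stemming keeps the stopword list simple, since it only needs to contain surface forms.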
            Afterwards, “relevant” terms are determined, for example via one of the following techniques:
              – by frequency of occurrence:
                       Terms can be classified as high, middle or low frequency.
                    •  High-frequency terms are often too general to be of interest (although they can
                       sometimes be combined with others to build a less general phrase).
                    •  Low-frequency terms tend to be too specific, and will probably lack the relationships
                       needed to be of interest.
                    •  Middle-frequency terms are usually the best ones to keep.
                       The thresholds must be specified manually.
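
As a rough sketch of frequency-based selection, the following Python function splits terms into three bands using two manually chosen thresholds. The function name `select_by_frequency`, the parameters `low` and `high`, and the sample documents are assumptions made for this illustration; the text does not prescribe how the thresholds are chosen.

```python
from collections import Counter


def select_by_frequency(documents, low, high):
    """Split terms into low/middle/high bands by collection frequency.

    `documents` is a list of token lists (already normalized).
    `low` and `high` are the manually chosen thresholds: a term is
    "middle" when low <= count <= high.
    """
    counts = Counter(term for doc in documents for term in doc)
    bands = {"low": [], "middle": [], "high": []}
    for term, count in sorted(counts.items()):
        if count < low:
            bands["low"].append(term)
        elif count > high:
            bands["high"].append(term)
        else:
            bands["middle"].append(term)
    return bands


# Hypothetical, already-normalized documents.
docs = [
    ["index", "term", "document"],
    ["index", "thesaurus", "term"],
    ["index", "term", "retrieval"],
    ["index", "thesaurus", "hierarchy"],
]
bands = select_by_frequency(docs, low=2, high=3)
print(bands["middle"])  # prints ['term', 'thesaurus']
```

Here "index" occurs in every document and falls into the high band, while the terms that occur only once fall into the low band.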
              – by discrimination value:
                    DV(k) = (average similarity without term ‘k’) – (average similarity with term ‘k’)
                    A positive DV(k) means that term ‘k’ makes the documents less alike, so it is a good
                    discriminator.
                    Similarity between documents can be computed with any appropriate measure (usually
                    that of the vector-space model).
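
A minimal sketch of the discrimination value follows. It makes several assumptions that go beyond the text: documents are plain term-weight lists, similarity is the cosine measure, the average is taken over all document pairs rather than via centroids, and the sign convention is the common one in which DV(k) is the average similarity without term k minus the average similarity with it, so positive values mark good discriminators. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarity(vectors):
    """Mean cosine similarity over all pairs of documents."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discrimination_value(vectors, k):
    """Average similarity without term k minus average similarity with it."""
    without_k = [[w for i, w in enumerate(vec) if i != k] for vec in vectors]
    return average_similarity(without_k) - average_similarity(vectors)


# Hypothetical collection; columns are the terms (index, retrieval, thesaurus).
docs = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
for k in range(3):
    print(k, round(discrimination_value(docs, k), 4))
```

In this toy collection the first term occurs in every document, so removing it makes the documents less similar and its discrimination value is negative; the other two terms get positive values.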

                    Average similarity in a collection can be calculated via the “method of centroids”:
                    •  Calculate the average vector (the centroid) of the whole collection




