Page 248 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING

Unit 11: Indexing Language: Types and Characteristics




            The focus will be on the creation and later use of the machine-generated sort, and as such I will
            try to describe for the reader the methods used and the pitfalls encountered.
            We will also explain the structure of an existing system, PhraseFinder, and the decisions involved
            in its design.
            A thesaurus is (Merriam-Webster Dictionary definition):
              •  “a book of words or of information about a particular field or set of concepts; especially: a
                 book of words and their synonyms
              •  a list of subject headings or descriptors usually with a cross-reference system for use in the
                 organization of a collection of documents for reference and retrieval”
            This definition emphasises the difference between the “thesaurus” used by a creative author, and
            that used in conjunction with information retrieval (IR) systems. A writer’s thesaurus contains
            creative synonyms and related phrases that allow authors to enhance their vocabulary.

            Thesaurus Structure
            A thesaurus, as used for IR, is a collection of terms/phrases and relationships between those terms.
            Basic decisions are:

            Level of Coordination
            The level of coordination determines what the thesaurus considers to be a term or phrase. A high
            level of coordination seeks to build bigger phrases, which produces a much more specific
            thesaurus. The problem with this is that too much specificity is not useful (if we knew “exactly”
            what we were looking for, we would not need it). Too much coordination is also a burden on the
            user, who must know the exact rules the system uses to construct the long phrase he is looking
            for. On the practical side, automatically constructing these phrases is a difficult problem.
            Minimum coordination takes the form of using single terms as phrases. This is also not optimal:
            for example, it misses the distinction between “school library” and “library school”, and thus
            mixes totally separate concepts.
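
The “school library” / “library school” example can be made concrete with a small sketch. It is only an illustration under simple assumptions: text is split on whitespace, low coordination is modelled as an unordered set of single words, and higher coordination as ordered two-word phrases. The function names are hypothetical and are not taken from PhraseFinder or any other system discussed here.

```python
def index_single_terms(text):
    """Low coordination: every word is its own index term; word order is lost."""
    return frozenset(text.lower().split())


def index_phrases(text, size=2):
    """Higher coordination: adjacent words are kept together, in order."""
    words = text.lower().split()
    return frozenset(
        tuple(words[i:i + size]) for i in range(len(words) - size + 1)
    )


a = "school library"
b = "library school"

# With single terms the two concepts get identical index entries.
same_at_low = index_single_terms(a) == index_single_terms(b)

# With two-word phrases the entries differ, because order is preserved.
same_at_high = index_phrases(a) == index_phrases(b)

print(same_at_low)   # True
print(same_at_high)  # False
```

The same sketch also shows the cost of high coordination: a searcher who types the words in the wrong order no longer matches the phrase entry.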


                                                Table 11.2

              Coordination level   High                               Low

              Advantages (+)       Greater specificity                Easy to build
                                   Can be used for indexing           Easy to search with (user need not
                                                                      worry about term ordering)
                                   High-frequency terms can be
                                   included in phrases to make
                                   them more specific

              Disadvantages (–)    Hard to build automatically        Less specific
                                   User must be familiar with         Only good for retrieval (poor
                                   phrase-building rules              indexing capabilities)

            Relationships

            One of several taxonomies for relationships between thesaural terms is the following:
              •  Part – Whole: element – set
              •  Collocation: words that frequently come together
              •  Paradigmatic: words with similar semantic core (lunar – moon)
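
As a rough illustration of the collocation relationship, the sketch below counts how often two words appear next to each other and keeps the pairs that recur. Counting raw adjacent pairs with a fixed threshold is an assumption made for this example only; it is not presented as the method used by PhraseFinder, and the function names and sample sentences are invented.

```python
from collections import Counter


def adjacent_pair_counts(sentences):
    """Count each ordered pair of neighbouring words across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def collocations(sentences, min_count=2):
    """Return the word pairs seen together at least min_count times."""
    counts = adjacent_pair_counts(sentences)
    return {pair for pair, n in counts.items() if n >= min_count}


# Invented sample text, used only to exercise the functions.
sample = [
    "the school library opens early",
    "a school library needs a catalogue",
    "the library school trains librarians",
]

found = collocations(sample)
print(found)  # {('school', 'library')}
```

A practical system would normally discount very frequent words and use a statistical association measure rather than a raw count, but the raw count is enough to show what "frequently come together" means.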




                                             LOVELY PROFESSIONAL UNIVERSITY                                   243