Page 216 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING

Unit 11: Indexing Language: Types and Characteristics




            When full-text documents are available in digital form, a simple kind of automatic indexing
            can of course be produced by putting all words (except stop words) into a database and
            generating an index in alphabetical order. Such a primitive, mechanical index is easily made
            by computer but is extremely time-consuming for human beings to produce. Although such an
            index is very primitive compared to other kinds of indexes, it has important merits for certain
            kinds of queries, and most of us expect today to be able to identify documents and pages in
            which a certain word or phrase appears.

            We expect to do this kind of search in full-text documents on the Internet, and we may, for
            example, find books on Amazon in which the phrase “domain analysis” is mentioned on just one
            arbitrary page. Clearly such a technique is valuable in situations in which rare expressions
            are searched for.
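
            The word index and phrase lookup described above can be sketched in a few lines. This is a
            minimal illustration, not a production indexer: the stop-word list, the sample pages, and the
            function names build_index and find_phrase are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def build_index(pages):
    """Map each non-stop word to the sorted page numbers on which it occurs.

    `pages` is a dict of {page_number: text}.
    """
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(page_no)
    # Alphabetical order, as in a printed index.
    return {word: sorted(nums) for word, nums in sorted(index.items())}

def find_phrase(pages, phrase):
    """Return the page numbers whose text contains the exact phrase (case-insensitive)."""
    needle = phrase.lower()
    return sorted(n for n, text in pages.items() if needle in text.lower())

# Invented sample pages, for illustration only.
pages = {
    1: "Indexing is the analysis of documents.",
    2: "Domain analysis is mentioned on one page.",
    3: "An index of words in the documents.",
}
index = build_index(pages)
phrase_pages = find_phrase(pages, "domain analysis")
```

            Here the word "analysis" is indexed on two pages, while the rare phrase "domain analysis" is
            found on a single page, which is the situation in which such a simple index works well.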
            The main problem with such simple indexes is that in many cases their precision is too low,
            because normally we are searching not for rare expressions but for common words or phrases.
            Recall may also be a problem because of synonymy. We may, for example, use a brand name in
            searching for a drug where only the chemical name appears in the document. Another problem is
            the generic level: we may use terms that are too broad or too narrow. Basically, the problems
            in automatic indexing, as in other kinds of knowledge organization, are thus related to
            meanings and semantic relations.
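
            The synonymy problem can be made concrete with a small sketch. It assumes a hypothetical
            one-entry synonym table (SYNONYMS) linking a brand name to a chemical name, and three invented
            documents; a real system would draw on a thesaurus or similar resource.

```python
# Hypothetical mini thesaurus: brand name -> other names for the same drug.
SYNONYMS = {"aspirin": {"acetylsalicylic acid"}}

def search(docs, term, expand=False):
    """Return ids of documents containing the term, optionally expanded with synonyms."""
    terms = {term.lower()}
    if expand:
        terms |= SYNONYMS.get(term.lower(), set())
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if any(t in text.lower() for t in terms)
    )

# Invented sample documents, for illustration only.
docs = {
    "d1": "Acetylsalicylic acid reduces fever.",
    "d2": "Aspirin is sold without prescription.",
    "d3": "Paracetamol is another analgesic.",
}
plain = search(docs, "aspirin")                  # misses d1: a recall failure
expanded = search(docs, "aspirin", expand=True)  # finds d1 through the synonym
```

            Without expansion, the document that uses only the chemical name is missed; with expansion
            it is retrieved, at the cost of maintaining the synonym table.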
            Research in automatic indexing is, like research in indexing and IR in general, intended to
            improve recall and precision in document retrieval, including providing clues for query
            refinement and related problems. For this purpose, many different kinds of techniques are
            tested and otherwise explored.
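
            Recall and precision, as used here, follow the standard definitions: precision is the share
            of retrieved documents that are relevant, and recall is the share of relevant documents that
            are retrieved. A minimal sketch, with a hypothetical function name and invented document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three relevant overall, two in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
```

            In this invented case two of the four retrieved documents are relevant (precision 0.5), and
            two of the three relevant documents were found (recall of about 0.67).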




                     A very influential way to cope with the lack of precision of common search
                     terms is to provide some kind of weighting of terms, for example, tf-idf (term
                     frequency–inverse document frequency), which is used in many search engines
                     without users having to know about the underlying technique. The intuitive
                     philosophy behind tf-idf is that terms that are frequent in many documents are
                     less suited for discriminating between documents, while terms that are frequent
                     within a single document may indicate that this document has much information
                     about the things the terms refer to. This is, however, just one among a long
                     range of strategies, actually used or potentially useful, for coping with these
                     problems.
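
            The intuition behind tf-idf can be sketched as follows. Many variants of the weighting
            exist; this example assumes one common form, in which term frequency is the term's count
            divided by the document length and inverse document frequency is log(N / df), where N is the
            number of documents and df the number of documents containing the term. The function name
            and the sample documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

# Invented sample documents, already split into tokens.
docs = [
    ["the", "index", "index", "retrieval"],
    ["the", "thesaurus", "term"],
    ["the", "retrieval", "query"],
]
weights = tf_idf(docs)
```

            A word that occurs in every document ("the") receives weight zero, while "index", which is
            frequent in the first document and absent from the others, receives the highest weight there.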

            Approaches to automatic indexing
            Some techniques are fully automated, while others are semi-automatic or machine-aided. For
            example, the technique “text categorization” is based on manually predetermined categories,
            while another technique, “document clustering”, groups documents without predetermined
            categories.
            Automatic indexing may be based on terms and structures in documents alone, or it may be based
            on information about user preferences, external semantic resources (e.g. thesauri) or other
            kinds of external information. (Relevance feedback is a technique that relies heavily on user
            preferences, although it is less associated with automatic indexing than with information
            retrieval.)
            Some techniques, such as those based on vector space models, disregard structures in the texts,
            whereas other approaches utilize information about structures, for example, recent approaches
            in XML-based retrieval.
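
            How a vector space model disregards text structure can be shown with a small sketch. It
            assumes the simplest form of the model, in which each text is a vector of raw word counts and
            similarity is the cosine of the angle between two vectors; the function name and the sample
            sentences are invented for illustration.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between two texts represented as bags of words."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Same words in a different order: the two vectors are identical.
same_words = cosine("dog bites man", "man bites dog")
# No shared words: the vectors are orthogonal.
no_overlap = cosine("dog bites man", "cat sleeps")
```

            The two sentences with opposite meanings score as practically identical because word order
            is lost, which is the kind of structural information that structure-aware approaches try to
            retain.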
            “Natural language systems attempt to introduce a higher level of abstraction indexing on top of the
            statistical processes. Making use of rules associated with language assist in the disambiguation of
            terms and provide an additional layer of concepts that are not found in purely statistical systems.
            Use of natural language processing provides the additional data that could focus searches.”





                                             LOVELY PROFESSIONAL UNIVERSITY                                   211