Page 216 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Unit 11: Indexing Language: Types and Characteristics
When full-text documents are available in a digital medium, a simple kind of automatic indexing can
of course be made by putting all words (except stop words) into a database and producing an index in
alphabetical order. Such a primitive, mechanical index is easily made by computer, but is extremely
time-consuming for human beings to produce. Although such an index is very primitive compared
to other kinds of indexes, it has important merits for certain kinds of queries, and most of us expect
today to be able to identify the documents and pages in which a certain word or phrase appears.
We expect to do this kind of search in full-text documents on the Internet, and we may, for example,
find books on Amazon in which the phrase “domain analysis” is mentioned on just one arbitrary
page. Clearly, such a technique is valuable in situations in which rare expressions are searched
for.
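The simple index described above can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the sample pages and the function name build_index are assumptions made for the example, not part of any particular indexing system.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def build_index(pages):
    """Map each non-stop word to the sorted page numbers on which it appears."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(page_number)
    # Return the entries in alphabetical order, as in a printed index.
    return {word: sorted(index[word]) for word in sorted(index)}

# Hypothetical two-page document.
pages = {
    1: "Domain analysis is an approach to knowledge organization.",
    2: "The index of a book lists words in alphabetical order.",
}
index = build_index(pages)
for word, page_numbers in index.items():
    print(word, page_numbers)
```

Looking up a rare word such as "domain" leads straight to the single page that mentions it, which is the situation in which this kind of index works best.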
The main problem with such simple indexes is that in many cases their precision is too low,
because normally we are not searching for rare expressions, but for common words or phrases. Recall may
also be a problem because of synonymy. We may, for example, use a brand name in searching for a
drug where only the chemical name appears in the document. Another problem is generic level: we may
use terms that are too broad or too narrow. Basically, the problems in automatic indexing, as in other kinds
of knowledge organization, are thus related to meanings and semantic relations.
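The synonymy problem can be illustrated with a small sketch. The synonym table, the three sample documents and the function name search are hypothetical. In practice the synonyms would come from an external resource such as a thesaurus.

```python
# Hypothetical synonym ring linking a brand name to its generic names.
SYNONYMS = {"tylenol": {"tylenol", "acetaminophen", "paracetamol"}}

# Hypothetical mini-collection of documents.
documents = {
    "doc1": "acetaminophen reduces fever",
    "doc2": "tylenol dosage for adults",
    "doc3": "ibuprofen reduces inflammation",
}

def search(term, expand=False):
    """Return the ids of documents containing the term or, if expand is set, any synonym."""
    terms = SYNONYMS.get(term, {term}) if expand else {term}
    return sorted(
        doc_id for doc_id, text in documents.items()
        if terms & set(text.split())
    )

print(search("tylenol"))               # exact word match only
print(search("tylenol", expand=True))  # match on any synonym
```

Without expansion, the document that uses only the generic name is missed, which lowers recall. With expansion it is found.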
Research in automatic indexing is, like indexing and IR in general, intended to improve recall and
precision in document retrieval, including providing clues for query refinement and related
problems. For this purpose, many different kinds of techniques are tested and otherwise explored.
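As a reminder of what the two measures mean, the following sketch computes them for a single query using the standard set-based definitions. The document identifiers and the function name precision_recall are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: four documents retrieved, three relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).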
A very influential way to cope with the lack of precision of common
search terms is to provide some kind of weighting of terms, for example tf-idf (term
frequency–inverse document frequency), which is used in many search
engines without users having to know about the underlying technique. The intuitive
philosophy behind tf-idf is that terms that occur in many documents are less
suited to discriminating between documents, while terms that are frequent within a single
document may indicate that this document has much information about the things
the terms refer to. This is, however, just one among a long range of actually
used or potentially useful strategies to cope with these problems.
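The intuition behind tf-idf can be made concrete with a minimal sketch. Many variants of the weighting exist. The one below (relative term frequency multiplied by the natural logarithm of N/df) is just one common choice, and the sample documents and the function name tf_idf are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical three-document collection.
documents = [
    "indexing indexing of documents",
    "retrieval of documents",
    "clustering of documents",
]
weights = tf_idf(documents)
for doc_weights in weights:
    print(doc_weights)
```

The words "of" and "documents" occur in every document and therefore receive a weight of zero, while "indexing", which is frequent in the first document and absent from the others, receives the highest weight.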
Approaches to automatic indexing
Some techniques are fully automated, while others are semi-automatic or machine-aided. For example,
the technique “text categorization” is based on manually predetermined categories, while another
technique, “document clustering”, derives its groups from the documents themselves.
Automatic indexing may be based on terms and structures in documents alone, or it may be based
on information about user preferences, external semantic resources (e.g. thesauri) or other kinds of
external information. (Relevance feedback is a technique that relies heavily on user preferences,
although it is less associated with automatic indexing than with information retrieval.)
Some techniques, such as those based on vector space models, disregard structures in the texts,
whereas other approaches utilize information about structures, for example recent approaches
in XML-based retrieval.
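What it means for a vector space model to disregard structure can be seen in a small sketch that represents each text as a bag of words and compares texts by cosine similarity. The raw-count weighting, the sample sentences and the function name cosine_similarity are assumptions for illustration. Real systems typically use weighted vectors.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # A Counter returns 0 for terms it does not contain.
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

same_words = cosine_similarity("dog bites man", "man bites dog")
different = cosine_similarity("dog bites man", "automatic indexing methods")
print(same_words, different)
```

The first two sentences have opposite meanings but contain exactly the same words, so the model treats them as essentially identical (similarity of about 1). The word order, which is a structural feature of the text, plays no role.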
“Natural language systems attempt to introduce a higher level of abstraction indexing on top of the
statistical processes. Making use of rules associated with language assist in the disambiguation of
terms and provide an additional layer of concepts that are not found in purely statistical systems.
Use of natural language processing provides the additional data that could focus searches.”
LOVELY PROFESSIONAL UNIVERSITY 211