Page 121 - DLIS405_INFORMATION_STORAGE_AND_RETRIEVAL
Information Storage and Retrieval
12.3 Indexing Languages
There are three main types of indexing languages.
• Controlled indexing language: Only approved terms can be used by the indexer to describe
the document.
• Natural language indexing language: Any term from the document in question can be used
to describe the document.
• Free indexing language: Any term (not only from the document) can be used to describe the
document.
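The difference between the three types can be sketched as a rule that decides which candidate terms an
indexer may assign. This is only an illustration: the vocabulary, the sample document, and the function
name allowed_terms are assumptions made for the example, and "occurs in the document" is approximated by
a simple substring test.

```python
# Hypothetical controlled vocabulary; a real one would come from a thesaurus
# or authority file maintained for the collection.
CONTROLLED_VOCABULARY = {"association football", "rugby union"}


def allowed_terms(candidates, document_text, language):
    """Return the candidate index terms permitted by the given indexing language."""
    text = document_text.lower()
    permitted = []
    for term in candidates:
        key = term.lower()
        if language == "controlled":
            ok = key in CONTROLLED_VOCABULARY  # only approved terms
        elif language == "natural":
            ok = key in text                   # term must occur in the document (substring test)
        elif language == "free":
            ok = True                          # any term at all
        else:
            raise ValueError(f"unknown indexing language: {language}")
        if ok:
            permitted.append(term)
    return permitted


document = "Soccer clubs increasingly rely on match statistics."
candidates = ["soccer", "association football", "sports analytics"]

for language in ("controlled", "natural", "free"):
    print(language, allowed_terms(candidates, document, language))
```

With these sample values, the controlled language accepts only the approved term, the natural language
accepts only the term that appears in the text, and the free language accepts all three candidates.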
When indexing a document, the indexer also has to choose the level of indexing exhaustivity, that is,
the level of detail in which the document is described. For example, with low indexing exhaustivity,
minor aspects of the work will not be described with index terms. In general, the higher the indexing
exhaustivity, the more terms are indexed for each document.
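A minimal sketch of this trade-off follows. It assumes, purely for illustration, that the importance of
an aspect can be approximated by how often a word occurs, and that exhaustivity is expressed as the
maximum number of terms assigned. Real indexers judge importance in other ways; the stopword list and
the function name index_terms are hypothetical.

```python
from collections import Counter

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "of", "is", "a", "and"}


def index_terms(text, exhaustivity):
    """Return up to `exhaustivity` index terms, most frequent words first."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [term for term, _ in counts.most_common(exhaustivity)]


text = "Football rules: the referee enforces the rules of football. Football is popular."

print(index_terms(text, 1))  # low exhaustivity: only the dominant aspect
print(index_terms(text, 5))  # high exhaustivity: minor aspects are indexed too
```

Raising the exhaustivity parameter adds terms for the minor aspects of the text, which is the behaviour
described above.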
Controlled vocabularies are often claimed to improve on the accuracy of free text searching, for example
by reducing the number of irrelevant items in the retrieval list. These irrelevant items (false
positives) are often caused by the inherent ambiguity of natural language. Take the English word
football, for example. Football is the name given to a number of different team sports. Worldwide, the
most popular of these team sports is Association football, which is also called soccer in several
countries. A free text search for football can therefore retrieve documents about any of these sports,
whereas a controlled vocabulary can give each sport its own authorized term.
Compared to free text searching, the use of a controlled vocabulary can dramatically increase the
performance of an information retrieval system, if performance is measured by precision (the
percentage of documents in the retrieval list that are actually relevant to the search topic).
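Precision as defined above can be computed directly from a retrieval list and a set of relevance
judgements. The sketch below returns a fraction between 0 and 1 rather than a percentage; the document
identifiers are made up for the example.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant to the search topic."""
    if not retrieved:
        return 0.0  # convention chosen here for an empty retrieval list
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)


# Hypothetical retrieval list and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7"}

print(precision(retrieved, relevant))  # 2 of the 4 retrieved documents are relevant
```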
In some cases a controlled vocabulary can enhance recall as well, because, unlike natural language
schemes, once the correct authorized term is searched, there is no need to search for other terms that
might be synonyms of that term.
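The effect of synonym control on recall can be illustrated with a small sketch. The synonym table, the
three sample documents, and the function names are all assumptions for the example; here the "indexer"
is simulated by mapping any variant found in the text to its authorized term.

```python
# Hypothetical mapping from variant terms to a single authorized term.
AUTHORIZED = {
    "soccer": "association football",
    "association football": "association football",
}

# Hypothetical document collection.
DOCUMENTS = {
    "d1": "A history of soccer in Brazil",
    "d2": "Association football tactics",
    "d3": "Rugby union laws",
}


def free_text_search(query):
    """Return ids of documents whose text contains the query string."""
    q = query.lower()
    return sorted(d for d, text in DOCUMENTS.items() if q in text.lower())


def build_index():
    """Index each document under the authorized term of any variant it contains."""
    index = {}
    for doc_id, text in DOCUMENTS.items():
        for variant, term in AUTHORIZED.items():
            if variant in text.lower():
                index.setdefault(term, set()).add(doc_id)
    return index


def controlled_search(query, index):
    """Translate the query to its authorized term, then look it up in the index."""
    term = AUTHORIZED.get(query.lower())
    return sorted(index.get(term, set()))


index = build_index()
print(free_text_search("soccer"))          # finds only the document that uses the word itself
print(controlled_search("soccer", index))  # finds every document indexed under the authorized term
```

In this sample collection the free text search misses the document that says "Association football",
while the controlled search retrieves both relevant documents with a single query.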
However, a controlled vocabulary search may also lead to unsatisfactory recall, in that it can fail to
retrieve some documents that are actually relevant to the search question.
This is particularly problematic when the search question involves terms that are sufficiently
tangential to the subject area that the indexer might have tagged the document with a different term,
even though the searcher would consider the two terms equivalent. Essentially, this can be avoided only
by an experienced user of the controlled vocabulary whose understanding of the vocabulary coincides with
the way it is used by the indexer.
Controlled vocabularies also become outdated quickly, and in fast-developing fields of knowledge the
authorized terms may not cover new concepts unless the vocabulary is updated regularly. Even in the
best-case scenario, controlled language is often not as specific as the words of the text itself.
Indexers trying to choose the appropriate index terms might misinterpret the author, while a free text
search is in no danger of doing so, because it uses the author’s own words.
The use of controlled vocabularies can be costly compared to free text searches because human
experts or expensive automated systems are necessary to index each entry. Furthermore, the user
has to be familiar with the controlled vocabulary scheme to make the best use of the system. But, as
already mentioned, the control of synonyms and homographs can help increase precision.
Numerous methodologies have been developed to assist in the creation of controlled vocabularies,
including faceted classification, which enables a given data record or document to be described in
multiple ways.
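As a rough illustration of faceted classification, each record below is described under several
independent facets and can be retrieved by any combination of them. The facet names (subject, form,
language), the sample records, and the function name filter_by_facets are hypothetical and are not taken
from any particular scheme.

```python
# Hypothetical records, each described along several independent facets.
RECORDS = [
    {"title": "Laws of the Game",
     "facets": {"subject": "association football", "form": "rulebook", "language": "English"}},
    {"title": "Rugby Union Coaching Manual",
     "facets": {"subject": "rugby union", "form": "manual", "language": "English"}},
    {"title": "Reglas del Juego",
     "facets": {"subject": "association football", "form": "rulebook", "language": "Spanish"}},
]


def filter_by_facets(records, **wanted):
    """Return titles of records whose facets match every requested facet value."""
    return [
        record["title"]
        for record in records
        if all(record["facets"].get(name) == value for name, value in wanted.items())
    ]


print(filter_by_facets(RECORDS, subject="association football"))
print(filter_by_facets(RECORDS, form="rulebook", language="English"))
```

The same record is reachable through its subject, its form, or its language, which is the sense in which
a faceted scheme describes a document in multiple ways.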
Types of Controlled Vocabularies
Currier (2005) distinguishes between the following kinds of controlled vocabularies, to which we have
added metadata schemes.
116 LOVELY PROFESSIONAL UNIVERSITY