Page 229 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Information Analysis and Repackaging
I will begin by examining several tasks routinely performed by human book indexers, pointing out
along the way the difficulties involved and their implications. These tasks provide concrete examples
to inform the discussion of the relationship between author, indexer, reader, language, and the
world. This leads to considering the language acquisition process of children as an indexing process,
and then indexing processes in general. After considering the relationship between indexes and
maps, a list of requirements for human-quality indexing by machine indexers (MIs: computer software
and hardware) will be presented.
Page Ranges
Let us begin by examining what seems to be a simple, easy-to-implement concept. One of the first
concepts book index users learn is that of page ranges. For example,
HTML, 203-207.
would indicate that the topic of HTML is taken up beginning on page 203 and ending
somewhere on page 207. Human indexers note page ranges almost automatically, perhaps
occasionally thinking “does HTML end here, or perhaps at this other point?” Yet MIs can’t accomplish
this simple task without humans prepping the text for them. Why? Because to a machine indexer
HTML is a string of characters. How far after the string HTML appears is the author still discussing
HTML?
As someone who is a computer programmer as well as an indexer, I could make some rules for MIs
to try to fake a knowledge of page ranges. A simple one would be:
If the string “HTML” appears, find its last appearance and mark the end of the page range at the
end of the paragraph in which the last “HTML” appears.
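The rule above can be sketched in a few lines of Python. This is only an illustration, not a real indexing tool: the function names, the sample pages, and the assumption that the text arrives as a mapping from page number to a list of paragraphs (with no paragraph crossing a page boundary) are all hypothetical.

```python
def naive_page_range(pages, term):
    """Return (first_page, last_page) for a term, or None if it never appears.

    pages: dict mapping page number -> list of paragraph strings.
    Assumes no paragraph crosses a page boundary, so the end of the
    paragraph holding the last match lies on the page of that match.
    """
    hits = [
        page_no
        for page_no, paragraphs in sorted(pages.items())
        if any(term in paragraph for paragraph in paragraphs)
    ]
    if not hits:
        return None
    return hits[0], hits[-1]


def format_locator(term, page_range):
    """Format an index entry such as 'HTML, 203-207.'"""
    first, last = page_range
    if first == last:
        return f"{term}, {first}."
    return f"{term}, {first}-{last}."


# Hypothetical sample: five pages about HTML, with the string itself used once.
sample_pages = {
    203: ["HTML is a markup language for the web."],
    204: ["Hypertext links connect one document to another."],
    205: ["Tags such as <a> and <p> mark up the text."],
    206: ["Attributes refine what a tag does."],
    207: ["That concludes the discussion of markup."],
}

print(format_locator("HTML", naive_page_range(sample_pages, "HTML")))
```

On this sample the sketch reports a single page, 203, although a human reader would see the topic running through page 207.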
If, however, the author writes substantially about HTML for 5 pages, talking about hypertext links
and tags, there may be only one appearance of the string “HTML.” A computer algorithm that was
prepped with a list of words related to HTML might look for the last occurrence of any word
in the list. But words are ambiguous. For instance, “tag” does not necessarily mean “HTML tag.” So
the list of related words may contain words that give false results because they are also used in
contexts unrelated to the subject.
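A minimal sketch of the related-word approach shows how an ambiguous word stretches the range. The word list, the sample pages, and the function name are invented for illustration, and the matching is deliberately crude: whole words, case-insensitive, one string per page.

```python
import re


def related_word_range(pages, related_words):
    """Return (first_page, last_page) spanning every page that contains
    at least one of the related words, or None if no page does.

    pages: dict mapping page number -> page text (a single string).
    """
    wanted = {word.lower() for word in related_words}
    hits = []
    for page_no in sorted(pages):
        words_on_page = set(re.findall(r"[a-z]+", pages[page_no].lower()))
        if words_on_page & wanted:
            hits.append(page_no)
    if not hits:
        return None
    return hits[0], hits[-1]


# Hypothetical sample: the HTML discussion ends on page 205, but the
# word "tag" turns up again on page 207 in an unrelated sense.
ambiguous_pages = {
    203: "HTML is a markup language for the web.",
    204: "A hypertext link points to another document.",
    205: "Each tag is written in angle brackets.",
    206: "The next chapter turns to marathon logistics.",
    207: "Every runner wears a tag with a race number.",
}
related_to_html = ["HTML", "hypertext", "tag"]

print(related_word_range(ambiguous_pages, related_to_html))
```

The sketch returns a range ending at page 207 because of the runner's tag. Dropping "tag" from the list ends the range at page 204 instead, which misses page 205. Neither list gives the range a human indexer would choose.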
“The index has an error. The author stopped writing about HTML at page 205.” Humans could
argue about such a statement, but what of machine indexers? When the humans are finished arguing,
will there be a rule that lets the MI know when the next passage about HTML ends? Will the rule be
easily transferable to passages about HTTP? To passages about RNA, or about vaguer concepts like
“immune system”? Will RNA-based virus boundary rules work for texts about computer viruses
written in C++?
Hand-writing a computer program that will do a good job of indexing a particular book is currently
more expensive than paying a professional indexer for the service.
Another problem with page ranges is that they can be subdivided. Unlike the subdivision performed by
algorithms for fractal generation or most forms of analysis, subdividing an index entry and its page
range appropriately takes real, that is human, intelligence. Some indexers use guidelines of the sort “if the
page locators cover 6 or more pages, break down into sub-entries.” A more difficult question is
whether to break a page range into two or more entries at the same level, or to keep it as one entry at
that level with subentries at the next level. The only general answer (if the goal is a good, useful
index) is that the indexer (machine or human) must understand the subject matter, the author’s
intent, and the needs of the index users in order to make such a decision in a specific case.
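The mechanical half of such a guideline is easy to write down, as the sketch below suggests. The function names and the default threshold of 6 pages are assumptions taken from the example guideline quoted above. The code can only flag an entry as a candidate for subdivision; it says nothing about where or how to split it.

```python
def pages_covered(first_page, last_page):
    """Number of pages in an inclusive page range."""
    return last_page - first_page + 1


def needs_subentries(first_page, last_page, threshold=6):
    """Flag a page range that covers `threshold` or more pages.

    This applies only the counting rule; choosing the sub-entries
    still requires understanding the text.
    """
    return pages_covered(first_page, last_page) >= threshold


print(needs_subentries(203, 207))  # 5 pages: below the threshold
print(needs_subentries(203, 208))  # 6 pages: flagged for subdivision
```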
Repetition
Some indexers create page references for all mentions, or substantial mentions, of a topic. There are
some types of books where this may be appropriate. But have you ever looked in an index and
spent time finding page after page, getting no useful information that was not in an earlier reference
224 LOVELY PROFESSIONAL UNIVERSITY