Page 229 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING

Information Analysis and Repackaging



                                 I will begin by examining several tasks routinely performed by human book indexers, pointing out
                                 along the way the difficulties involved and their implications. That provides concrete examples to
                                 inform the discussion of the relationship between author, indexer, reader, language, and the world.
                                 This leads to considering the language acquisition process of children as an indexing process, and
                                 then to indexing processes in general. After considering the relationship between indexes and maps,
                                 a requirements list for human-quality indexing by machine indexers, or MIs (computer software and
                                 hardware), will be presented.

                                 Page Ranges
                                 Let us begin by examining what seems to be a simple, easy-to-implement concept. One of the first
                                 concepts book index users learn is that of page ranges. For example,
                                 HTML, 203-207.
                                 would indicate that the topic of HTML is taken up beginning on page 203 and ending
                                 somewhere on page 207. Human indexers note page ranges almost automatically, perhaps
                                 occasionally thinking “does HTML end here, or perhaps at this other point?” Yet MIs can’t accomplish
                                 this simple task without humans prepping the text for them. Why? Because to a machine indexer
                                 HTML is a string of characters. How far after the string HTML appears is the author still discussing
                                 HTML?
                                 As someone who is a computer programmer as well as an indexer, I could make some rules for MIs
                                 to try to fake a knowledge of page ranges. A simple one would be:
                                 If the string “HTML” appears, find its last appearance and mark the end of the page range at the
                                 end of the paragraph in which the last “HTML” appears.
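The rule just stated can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layout of `pages` (a page number mapped to the list of paragraphs on that page), the function names `naive_page_range` and `format_entry`, and the sample text are all invented for the sketch and do not come from any real indexing program.

```python
def naive_page_range(pages, term):
    """Return (first_page, last_page) for a literal string, or None.

    `pages` maps a page number to the list of paragraphs on that page
    (a hypothetical layout chosen for this sketch). The range ends with
    the paragraph that holds the last literal occurrence of `term`.
    """
    hits = [
        page_no
        for page_no in sorted(pages)
        for paragraph in pages[page_no]
        if term in paragraph
    ]
    if not hits:
        return None
    return hits[0], hits[-1]


def format_entry(term, page_range):
    """Format an index entry such as 'HTML, 203-207'."""
    first, last = page_range
    locator = str(first) if first == last else f"{first}-{last}"
    return f"{term}, {locator}"


# Invented sample text: the string "HTML" occurs on page 203 only.
book = {
    203: ["HTML is the markup language of the Web."],
    204: ["Hypertext links connect one document to another."],
    205: ["Tags are written inside angle brackets."],
    206: ["Attributes refine what a tag does."],
    207: ["Nested elements form a tree."],
}

print(format_entry("HTML", naive_page_range(book, "HTML")))
```

On this invented sample the sketch prints `HTML, 203`, since the rule sees only the literal string and nothing of the surrounding discussion.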
                                 If, however, the author writes substantially about HTML for 5 pages, talking about hypertext links
                                 and tags, there may only be one appearance of the string “HTML.” A computer algorithm that was
                                 prepped with a list of words related to HTML might look for the last occurrence of the set of words
                                 in the list. But words are ambiguous. For instance, “tag” does not necessarily mean “HTML tag.” So
                                 the list of related words may contain words that give false results because they are also used in
                                 contexts unrelated to the subject.
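The related-word approach, and the way an ambiguous word can distort it, can be illustrated with a small Python sketch. The function name `range_from_word_list`, the word list, and the sample pages are hypothetical, chosen only to show the effect described above.

```python
import re


def range_from_word_list(pages, words):
    """Return (first_page, last_page) over pages containing any listed word.

    `pages` maps a page number to a list of paragraphs (an assumed layout).
    Matching is on whole lowercase words, with no sense of context.
    """
    wanted = {w.lower() for w in words}
    hits = []
    for page_no in sorted(pages):
        for paragraph in pages[page_no]:
            tokens = set(re.findall(r"[a-z]+", paragraph.lower()))
            if tokens & wanted:
                hits.append(page_no)
                break  # one matching paragraph is enough for this page
    return (hits[0], hits[-1]) if hits else None


# Hypothetical list of words "related to HTML".
related = ["HTML", "hypertext", "tag", "tags"]

# Invented sample: page 231 uses "tag" in an unrelated sense.
book = {
    203: ["HTML is the markup language of the Web."],
    204: ["Hypertext links connect one document to another."],
    205: ["Tags are written inside angle brackets."],
    231: ["Each crate in the warehouse carries a price tag."],
}

print(range_from_word_list(book, related))
```

With this sample the sketch reports a range running from page 203 to page 231, because "price tag" matches the list just as an HTML tag would. Dropping "tag" from the list removes the false hit here, but it would also miss genuine uses of the word elsewhere.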
                                 “The index has an error. The author stopped writing about HTML at page 205.” Humans could
                                 argue about such a statement, but what of machine indexers? When the humans are finished arguing,
                                 will there be a rule that lets the MI know when the next passage about HTML ends? Will the rule be
                                 easily transferable to passages about HTTP? To passages about RNA, or about vaguer concepts like
                                 “immune system”? Will RNA-based virus boundary rules work for texts about computer viruses
                                 written in C++?
                                 Hand-writing a computer program that will do a good job indexing a particular book is currently
                                 more expensive than paying a professional indexer for the service.
                                 Another problem with page ranges is that they can be subdivided. Unlike the subdivision algorithms
                                 required for fractal generation or most forms of analysis, it takes real, that is human, intelligence to
                                 appropriately subdivide an index entry with its page range. Some indexers use guidelines of the sort
                                 “if the page locators cover 6 or more pages, break down into sub-entries.” A more difficult question
                                 is whether to break a page range into two or more entries at the same level, or to keep it as one
                                 entry at that level with sub-entries at the next level. The only general answer (if the goal is a good, useful
                                 index) is that the indexer (machine or human) must understand the subject matter, the author’s
                                 intent, and the needs of the index users in order to make such a decision in a specific case.
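The mechanical half of such a guideline is easy to express in code, which makes the contrast with the judgment it leaves out plainer. The following Python sketch assumes inclusive page ranges. The function name `needs_subentries`, the default threshold of 6, and the sample entries are illustrative only.

```python
def needs_subentries(first_page, last_page, threshold=6):
    """True when an inclusive page range covers `threshold` or more pages."""
    return (last_page - first_page + 1) >= threshold


# Invented entries: heading mapped to an inclusive (first, last) page range.
entries = {
    "HTML": (203, 207),           # 5 pages
    "immune system": (88, 95),    # 8 pages
}

for heading, (first, last) in entries.items():
    if needs_subentries(first, last):
        print(f"{heading}, {first}-{last}: consider sub-entries")
```

The check only flags that a range is long. It says nothing about where to divide it or what to call the resulting sub-entries.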

                                 Repetition

                                 Some indexers create page references for all mentions, or substantial mentions, of a topic. There are
                                 some types of books where this may be appropriate. But have you ever looked in an index and
                                 spent time finding page after page, getting no useful information that was not in an earlier reference




            224                              LOVELY PROFESSIONAL UNIVERSITY