Page 191 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING

Information Analysis and Repackaging



                   Notes         Autonomous citation indexing was introduced in 1998 by Giles, Lawrence and Bollacker, and enabled
                                 automatic algorithmic extraction and grouping of citations for any digital academic and scientific
                                 document. Where previous citation extraction was a manual process, citation measures could now
                                 scale up and be computed for any scholarly and scientific field and document venue, not just those
                                 selected by organizations such as ISI. This led to the creation of new systems for public and automated
                                 citation indexing, the first being CiteSeer (now CiteSeerX), soon followed by Cora (recently reborn
                                 as Rexa), which focused primarily on the field of computer and information science.
                                 These were later followed by large-scale academic domain citation systems such as Google
                                 Scholar and, previously, Microsoft Academic. Such autonomous citation indexing is not yet perfect
                                 in citation extraction or citation clustering, with an error rate estimated by some at 10%, though a
                                 careful statistical sampling has yet to be done. This has resulted in such “authors” as Ann Arbor,
                                 Milton Keynes, and Walton Hall (place names mistakenly parsed as author names) being credited
                                 with extensive academic output. SCI claims to create automatic citation indexing through purely
                                 programmatic methods, yet even the older records have a similar magnitude of error.

                                 10.4 Development of Indexing Concept


                                 Indexing is a technique for improving database performance. The many types of indexes share the
                                 common property that they eliminate the need to examine every entry when running a query. In
                                 large databases, this can reduce query time/cost by orders of magnitude. The simplest form of index
                                 is a sorted list of values that can be searched using a binary search with an adjacent reference to the
                                 location of the entry, analogous to the index in the back of a book. The same data can have multiple
                                 indexes (an employee database could be indexed by last name and by hire date).
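The book-index analogy above can be sketched in a few lines. This is a minimal sketch, not how any particular database engine is implemented: the records sit in arbitrary order, and a separately maintained sorted list of (key, position) pairs lets a binary search find an entry without scanning every record.

```python
import bisect

# Records stored in arbitrary order, like rows in a table's data pages.
employees = ["Chandra", "Adams", "Baker", "Davis"]

# A sorted list of (key, position) pairs: the "index at the back of the book".
index = sorted((name, pos) for pos, name in enumerate(employees))
keys = [key for key, _ in index]

def lookup(name):
    """Binary-search the index instead of examining every entry."""
    i = bisect.bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return index[i][1]  # position of the record in the base data
    return None

print(lookup("Baker"))  # -> 2 (found via the index, not a full scan)
print(lookup("Evans"))  # -> None
```

A binary search touches about log2(n) entries, which is where the "orders of magnitude" saving in large databases comes from: a million records need roughly 20 comparisons instead of a million.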




                                          Indexes affect performance, but not results. Database designers can add or remove
                                         indexes without changing application logic, reducing maintenance costs as the database
                                         grows and database usage evolves.
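The note above, that indexes affect performance but not results, can be demonstrated with Python's built-in sqlite3 module (the table and index names here are illustrative, not from the text): the same query returns identical rows before and after the index is created; only the query plan changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (last_name TEXT, hire_date TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [("Baker", "2001-05-01"), ("Adams", "1999-03-15")])

query = "SELECT hire_date FROM employee WHERE last_name = 'Baker'"
before = conn.execute(query).fetchall()

# Adding an index requires no change to the application's query logic.
conn.execute("CREATE INDEX idx_last_name ON employee (last_name)")
after = conn.execute(query).fetchall()

assert before == after  # same results...
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan)             # ...but the plan now uses idx_last_name
```

Dropping the index would likewise leave every query's results unchanged, which is why designers can tune indexes freely as usage evolves.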

                                 All Three, Actually

                                 “Indexing is an arcane art whose time has not yet come,” according to Lise Kreps, an indexer who
                                 worked at Aldus and Microsoft long ago. This arcane art, in which a human designates the subject of a
                                 chunk of information and records where that chunk is, still hasn’t been completely mimicked by automation.
                                 Natural language search engines are providing some great results, and some very mixed results, and
                                 these results improve for people who are willing to learn the tricks for getting the best out of each
                                 engine.
                                 But natural language engines only solve one aspect of searching. There are vast holes that are
                                 not picked up by an engine, due to metaphor, or the way a document is titled or structured, or how
                                 the search is phrased. If we rely totally on automation to retrieve information, some will be lost.
                                 “Important information will no longer be made retrievable. Instead, information will become
                                 important simply because it is retrievable” (Richard Evans). If the information’s structure doesn’t
                                 work well with the engine, or if the user is not in a search-engine mode, the search may come up
                                 short and miss pieces of important data.
                                 Easy-answer searches, such as “What’s the shortcut for typing an em dash?”, are great for search
                                 engines. But for a more complex problem, natural language engines can come up short. Try this
                                 one, for instance: “How do you freeze the header columns in Excel so that they don’t scroll, and so
                                 that they appear on each printed page?” Searching for an answer to a complex question is an iterative
                                 process. The user switches search modes several times in the course of a complex search. A search
                                 like the Excel question may have a user starting by typing words or terminology in the search box.




            186                              LOVELY PROFESSIONAL UNIVERSITY