Autonomous citation indexing was introduced in 1998 by Giles, Lawrence and Bollacker and enabled automatic algorithmic extraction and grouping of citations for any digital academic or scientific document. Where previous citation extraction was a manual process, citation measures could now scale up and be computed for any scholarly or scientific field and document venue, not just those selected by organizations such as ISI. This led to the creation of new systems for public and automated citation indexing, the first being CiteSeer (now CiteSeerX), soon followed by Cora (recently reborn as Rexa), both of which focused primarily on the field of computer and information science.
These were later followed by large-scale academic citation systems such as Google Scholar and, previously, Microsoft Academic. Such autonomous citation indexing is not yet perfect in citation extraction or citation clustering, with an error rate estimated by some at 10 per cent, though careful statistical sampling has yet to be done. This has resulted in place names such as Ann Arbor, Milton Keynes, and Walton Hall being mistakenly parsed as authors and credited with extensive academic output. SCI claims to create automatic citation indexing through purely programmatic methods, yet even the older records show a similar magnitude of error.
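The grouping step can be pictured with a minimal Python sketch. Everything here is illustrative: the normalise heuristic is invented for this example, and real systems such as CiteSeer use far more sophisticated parsing and matching. Two differently formatted reference strings for the same paper are reduced to one clustering key:

    import re
    from collections import defaultdict

    # Two hypothetical raw reference strings for the same paper, as a
    # crawler might extract them from two different bibliographies.
    refs = [
        "Lawrence, S., Giles, C. L., Bollacker, K. (1999). Digital"
        " Libraries and Autonomous Citation Indexing. IEEE Computer 32(6).",
        "S. Lawrence; C.L. Giles; K. Bollacker. Digital libraries and"
        " autonomous citation indexing, IEEE Computer, 1999.",
    ]

    def normalise(ref):
        # Crude clustering key: lowercase words of four or more letters,
        # de-duplicated and sorted, so author order, initials, and
        # punctuation no longer matter.
        words = re.findall(r"[a-z]{4,}", ref.lower())
        return " ".join(sorted(set(words)))

    # Group variant strings under one key per cited work.
    groups = defaultdict(list)
    for ref in refs:
        groups[normalise(ref)].append(ref)

    print(len(groups))  # 1: both strings cluster as the same cited work

The brittleness of heuristics like this is precisely why stray address fragments such as Milton Keynes can end up credited as prolific authors.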
10.4 Development of Indexing Concept
Indexing is a technique for improving database performance. The many types of indexes share the common property that they eliminate the need to examine every entry when running a query. In large databases, this can reduce query time and cost by orders of magnitude. The simplest form of index is a sorted list of values that can be searched using a binary search, each value stored alongside a reference to the location of the full entry, analogous to the index in the back of a book. The same data can have multiple indexes (an employee database could be indexed by last name and by hire date).
Indexes affect performance, but not results: database designers can add or remove indexes without changing application logic, reducing maintenance costs as the database grows and its usage evolves.
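To make the book-index analogy concrete, here is a minimal Python sketch; the employee data and function names are invented for illustration. Two independent sorted-list indexes are built over the same records, and each can be binary-searched without touching the records themselves:

    import bisect

    # Hypothetical employee records (illustrative data only).
    employees = [
        {"last_name": "Novak", "hire_date": "2018-11-30"},
        {"last_name": "Adams", "hire_date": "2019-03-01"},
        {"last_name": "Chen", "hire_date": "2021-07-15"},
    ]

    # Two indexes over the same data: sorted lists of (key, position)
    # pairs, like the index at the back of a book.
    by_name = sorted((e["last_name"], i) for i, e in enumerate(employees))
    by_date = sorted((e["hire_date"], i) for i, e in enumerate(employees))

    def lookup(index, key):
        # Binary search examines O(log n) index entries rather than
        # scanning every record, which is where the speed-up comes from.
        pos = bisect.bisect_left(index, (key,))
        if pos < len(index) and index[pos][0] == key:
            return employees[index[pos][1]]
        return None

    print(lookup(by_name, "Chen"))        # found via the last-name index
    print(lookup(by_date, "2018-11-30"))  # found via the hire-date index

Dropping or adding an index here changes only which lookups are fast; the stored records, and the answers a full scan would return, stay exactly the same.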
All Three, Actually
“Indexing is an arcane art whose time has not yet come,” according to Lise Kreps, an indexer who worked at Aldus and Microsoft long ago. This arcane art, in which a human designates the subject of a chunk of information and records where that chunk is, still hasn’t been completely mimicked by automation. Natural language search engines provide some great results and some very mixed results, and those results improve for people who are willing to learn the tricks for getting the best out of each engine.
But natural language engines solve only one aspect of searching, and there are vast holes that an engine does not pick up, owing to metaphor, the way a document is titled or structured, or how the search is phrased. If we rely totally on automation to retrieve information, some of it will be lost. “Important information will no longer be made retrievable. Instead, information will become important simply because it is retrievable” (Richard Evans). If the information’s structure doesn’t work well with the engine, or if the user is not in a search-engine mode, the search may come up short and miss pieces of important data.
Easy-answer searches, such as “What’s the shortcut for typing an em dash?”, are great for search engines. But for a more complex problem, natural language engines can come up short. Try this one, for instance: “How do you freeze the header columns in Excel so that they don’t scroll, and so that they appear on each printed page?” Searching for an answer to a complex question is an iterative process, and the user switches search modes several times in the course of it. A search like the Excel question may have the user starting by typing words or terminology in the search box.