Information Analysis and Repackaging
Our skills include analysis of information sources, design of information structures
and the configuration, implementation and integration of information retrieval
applications.
i-logue maintains broad market awareness of the development of methods, techniques and tools to
support both structured and unstructured information retrieval. We are therefore well placed to
help organisations articulate their information needs, identify appropriate solutions and implement
capabilities.
Web Information Retrieval Projects
Relevance
How can we ask about the “speed of a jaguar” and not run into fine automobiles and football teams?
Popular keyword search engines are only a beginning to harnessing the information in large
hyperlinked text repositories. If we could embed large sections of the web in a structured directory,
such as Yahoo!, searches could be constructed using not only keywords but also the topic paths induced
by the directory. Another benefit of such automatic classification is that people can be characterized
very compactly by how often they visit pages embedded in various nodes of the directory, and this
“profile” can then be used for collaborative search and recommendation.
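A minimal sketch of the idea in Python, assuming an invented three-document directory: restricting keyword search to a topic subtree resolves the jaguar ambiguity above, and a compact visit-count profile falls out of the same directory structure. All paths, documents and names here are hypothetical.

from collections import Counter

# Hypothetical toy directory: each document is tagged with a topic path,
# the way a Yahoo!-style directory embeds pages under category nodes.
DOCS = {
    "doc1": {"path": "Science/Biology/Mammals", "text": "the top speed of a jaguar in the wild is about 80 km/h"},
    "doc2": {"path": "Recreation/Autos/Makes", "text": "the Jaguar XK combines speed and luxury"},
    "doc3": {"path": "Sports/Football/Teams", "text": "the Jaguars finished the season strongly"},
}

def search(keywords, topic_prefix=""):
    """Match all keywords, but only within the directory subtree topic_prefix."""
    results = []
    for doc_id, doc in DOCS.items():
        in_subtree = doc["path"].startswith(topic_prefix)
        matches = all(k.lower() in doc["text"].lower() for k in keywords)
        if in_subtree and matches:
            results.append(doc_id)
    return results

# "speed of a jaguar" restricted to the wildlife subtree: no cars, no teams.
print(search(["jaguar", "speed"], topic_prefix="Science/Biology"))   # ['doc1']

# A compact user profile: visit counts over directory nodes, usable for
# collaborative search and recommendation.
profile = Counter(["Science/Biology/Mammals", "Science/Biology/Mammals", "Recreation/Autos/Makes"])
print(profile.most_common(1))   # the dominant interest of this user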
Classifying web documents turns out to be much more difficult than standard Information Retrieval
benchmarks. First, to learn a domain as broad as the web, very many examples are needed, and
existing classification engines cannot handle gigabyte-sized corpora. Second, text alone is often
deceptive, and the topic of a web page is often better assessed from its link neighborhood. This
motivated HyperClass, a fast, scalable hypertext classification engine. It uses efficient out-of-core
data structures to deal with large corpora and a new algorithm for topical analysis of citations to
achieve high speed and accuracy.
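HyperClass itself is not publicly available, so what follows is only a toy sketch of the general idea described above: start from text-only class probabilities and repeatedly blend each page's score with the labels of its link neighborhood, a simplified form of relaxation labeling over citations. The graph, scores and blending weight are all invented.

import numpy as np

# text_score[i, c]: a hypothetical text-only classifier's probability that
# page i belongs to class c (4 pages, 2 classes).
text_score = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.4, 0.6],   # deceptive text: its neighbors will pull it toward class 0
    [0.1, 0.9],
])
links = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [2]}  # links[i]: i's link neighborhood

ALPHA = 0.5            # weight on a page's own text vs. its neighborhood
probs = text_score.copy()
for _ in range(20):    # relaxation iterations
    new = probs.copy()
    for i, nbrs in links.items():
        neighborhood = probs[nbrs].mean(axis=0)   # average neighbor label
        new[i] = ALPHA * text_score[i] + (1 - ALPHA) * neighborhood
        new[i] /= new[i].sum()                    # renormalize to probabilities
    probs = new

print(probs.argmax(axis=1))   # page 2 is corrected to class 0 by its links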
Popularity
Internet directories are popular not only because they are easier to search and navigate, but also
because they hand-pick sites and pages of high quality. The field of bibliometry is concerned with the
analysis of citation graphs, typically in academic publications. Jon Kleinberg designed a system called
HITS for hyperlink citation analysis on the web. HITS assigns two scores of merit to each web page
related to a topic: a hub score and an authority score. A good hub is a useful resource from which to
start browsing on a topic. A good authority is a well-cited, popular page on the topic.
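HITS is compact enough to sketch directly. The following runs the iteration on an invented four-page graph: each round, a page's authority score is recomputed from the hub scores of the pages citing it, and its hub score from the authority scores of the pages it cites, with normalization to keep the values bounded.

import numpy as np

# Invented toy graph: out_links[i] lists the pages that page i links to.
out_links = {0: [2, 3], 1: [2, 3], 2: [3], 3: []}

n = 4
A = np.zeros((n, n))
for i, targets in out_links.items():
    for j in targets:
        A[i, j] = 1.0          # adjacency matrix: A[i, j] = 1 iff i links to j

hubs = np.ones(n)
auths = np.ones(n)
for _ in range(50):            # power iteration until (approximate) convergence
    auths = A.T @ hubs         # good authorities are cited by good hubs
    hubs = A @ auths           # good hubs cite good authorities
    auths /= np.linalg.norm(auths)
    hubs /= np.linalg.norm(hubs)

print("hub scores:      ", np.round(hubs, 2))    # pages 0 and 1 are the hubs
print("authority scores:", np.round(auths, 2))   # page 3 is the top authority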
Web authorship is less regulated and more diverse than academic publications.
Consequently, the simple model of web pages as nodes and hyperlinks as edges can be significantly
improved upon. A page that covers both Information Retrieval and Parallel Computing, for example,
can be segmented into those two topics; assigning it a single common score of merit would mislead
the rating algorithm. The HITS model was therefore extended so that query-dependent keywords
near outlinks influence the notion of authority conferred from one page to another. The resulting
automatic resource compilation system, called Clever, outperformed Yahoo! as judged by two user
groups. This work has received some press.
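Clever's implementation is not public, but the extension described here can be sketched by giving each HITS edge a weight that grows with the number of query terms in the text near the outlink, so that authority flows preferentially along topically relevant links. The graph, anchor contexts and weighting rule below are illustrative assumptions, not Clever's actual formula.

import numpy as np

query = {"information", "retrieval"}

# Hypothetical edges: (source page, target page, text surrounding the outlink).
edges = [
    (0, 2, "a survey of information retrieval systems"),
    (1, 2, "notes on evaluating information retrieval engines"),
    (0, 3, "benchmarks for parallel computing clusters"),
    (1, 3, "parallel computing course material"),
]

n = 4
W = np.zeros((n, n))
for src, dst, context in edges:
    overlap = len(query & set(context.split()))
    W[src, dst] = 1.0 + overlap   # more query terms near the link => heavier edge

# The HITS iteration is unchanged; it simply runs over W instead of a 0/1 matrix.
hubs, auths = np.ones(n), np.ones(n)
for _ in range(50):
    auths = W.T @ hubs
    hubs = W @ auths
    auths /= np.linalg.norm(auths)
    hubs /= np.linalg.norm(hubs)

print("authority scores:", np.round(auths, 2))   # page 2 outranks page 3 for this query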
Information Retrieval System Parameters
To measure ad hoc information retrieval effectiveness in the standard way, we need a test collection
consisting of three things:
1. A document collection.
2. A test suite of information needs, expressible as queries.
3. A set of relevance judgments, standardly a binary assessment of either relevant or non-relevant
for each query-document pair.
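As a preview of how such a collection is used, here is a minimal sketch that scores one system's output against invented relevance judgments using set-based precision and recall; all identifiers are hypothetical.

docs = {"d1": "...", "d2": "...", "d3": "...", "d4": "..."}   # the document collection
queries = {"q1": "speed of a jaguar"}                         # the information needs
qrels = {"q1": {"d1", "d3"}}                                  # the relevance judgments

def precision_recall(retrieved, relevant):
    """Fraction of retrieved docs that are relevant, and of relevant docs retrieved."""
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

run = {"q1": ["d1", "d2"]}                     # one system's answer for q1
p, r = precision_recall(run["q1"], qrels["q1"])
print(f"precision={p:.2f} recall={r:.2f}")     # precision=0.50 recall=0.50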