Page 217 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING
Information Analysis and Repackaging
Automatic indexing may be related to particular views on semantics and on systems evaluation
that differ from the philosophies associated with “intellectual indexing”. Semantic relations such as
synonymy may be understood as a strong degree of co-occurrence.
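The idea of treating a semantic relation as a degree of co-occurrence can be illustrated with a minimal sketch. It assumes co-occurrence is measured at the document level and scored with the Jaccard coefficient; the function name `cooccurrence_scores` and the four sample documents are hypothetical, and real systems typically use larger corpora and more refined association measures.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(documents):
    """Score every pair of terms that share at least one document.

    The score is the Jaccard coefficient: documents containing both terms
    divided by documents containing either term (an assumed measure).
    """
    doc_freq = Counter()   # number of documents each term occurs in
    pair_freq = Counter()  # number of documents each term pair shares
    for text in documents:
        terms = sorted(set(text.lower().split()))
        doc_freq.update(terms)
        pair_freq.update(combinations(terms, 2))

    scores = {}
    for (a, b), both in pair_freq.items():
        either = doc_freq[a] + doc_freq[b] - both
        scores[(a, b)] = both / either
    return scores


# Hypothetical toy collection, for illustration only.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "library catalogue index",
]
scores = cooccurrence_scores(docs)
for pair, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(score, 2))
```

Pairs are keyed in alphabetical order, so the score for "car" and "automobile" is found under `("automobile", "car")`. Terms that never share a document, such as "car" and "library", do not appear in the result at all.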
Anderson and Pérez-Carballo write:
“Throughout the history of automatic indexing, two major theoretical models have emerged: the
“vector-space model” and the probabilistic model. Sparck Jones, Walker and Robertson (2000) have
provided a thorough review of the development, versions, results, and current status of the
probabilistic model. In comparing this model to others, they conclude that “by far the best-developed
non-probabilistic view of IR is the vector-space model (VSM), most famously embodied in the SMART
system (Salton, 1975; Salton & McGill, 1983a). In some respects the basic logic of the VSM is common
to many other approaches, including our own [i.e., the probabilistic model] . . . In practice the
difference [between these two models] has become somewhat blurred.
Each approach has borrowed ideas from the other, and to some extent the original motivations
have become disguised by the process. . . . This mutual learning is reflected in the results of successive
round[s] of TREC. . . . It may be argued that the performance differences that do appear have more
to do with choices of the device set used, and detailed matters of implementation, than with
foundational differences of approach”.
The focus of our discussion will be on the automatic indexing of language texts. The various tactics
and strategies are emphasized, rather than the underlying theoretical models”.
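The vector-space model named in the quotation can be sketched in a few lines. This is not the SMART system's implementation: it assumes whitespace tokenization, a plain tf-idf weight (term count times log(N / document frequency)), and cosine similarity. The function names `tfidf_vectors` and `cosine` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Represent each document as a sparse {term: tf-idf weight} vector."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection, for illustration only.
docs = [
    "automatic indexing of text",
    "automatic indexing of documents",
    "probabilistic retrieval model",
]
vectors = tfidf_vectors(docs)
sim_close = cosine(vectors[0], vectors[1])  # documents sharing three terms
sim_far = cosine(vectors[0], vectors[2])    # documents sharing no terms
print(sim_close > sim_far)
```

Terms that occur in most documents receive a low weight, so the similarity between the first two texts is driven less by their shared common words than a raw word count would suggest.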
Sparck Jones, Walker and Robertson (2000) compare their own probabilistic approach with other
“approaches, models, methods and techniques”:
The vector space model
Probabilistic indexing and a unified model
Dependency
Logical information retrieval
Networks
Regression
Other models (Hidden Markov Model)
Golub (2005) distinguishes between “text categorization” and “document clustering”. The latter
approach is rooted in the information retrieval tradition, while text categorization is rooted in
machine learning in the artificial intelligence tradition.
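The contrast between the two approaches can be made concrete with a minimal sketch. It assumes the usual reading of the terms: text categorization assigns a text to one of several predefined categories learned from labelled examples, whereas document clustering groups texts without any predefined categories. The word-overlap measure, the single-pass clustering strategy, and the names `categorize` and `cluster` are illustrative assumptions, not methods taken from Golub.

```python
from collections import Counter


def bag(text):
    """Bag-of-words representation of a text."""
    return Counter(text.lower().split())


def overlap(a, b):
    """Number of word occurrences two bags have in common."""
    return sum((a & b).values())


def categorize(text, labelled_examples):
    """Text categorization: choose among categories fixed in advance.

    Each category is profiled by pooling the words of its training examples.
    """
    profiles = {}
    for label, example in labelled_examples:
        profiles.setdefault(label, Counter()).update(bag(example))
    words = bag(text)
    return max(profiles, key=lambda label: overlap(words, profiles[label]))


def cluster(texts, threshold=1):
    """Document clustering: no categories are given in advance.

    Single pass: a text joins the first cluster sharing at least `threshold`
    words with it, otherwise it starts a new cluster.
    """
    clusters = []  # list of (word profile, member texts)
    for text in texts:
        words = bag(text)
        for profile, members in clusters:
            if overlap(words, profile) >= threshold:
                profile.update(words)
                members.append(text)
                break
        else:
            clusters.append((Counter(words), [text]))
    return [members for _, members in clusters]


# Hypothetical training data and texts, for illustration only.
training = [
    ("medicine", "heart disease treatment"),
    ("medicine", "cancer drug trial"),
    ("law", "court ruling appeal"),
    ("law", "contract law dispute"),
]
label = categorize("new drug for heart disease", training)

groups = cluster([
    "heart disease treatment",
    "court ruling appeal",
    "heart surgery",
    "appeal court decision",
])
print(label)
print(groups)
```

The categorizer can only return a label that appears in the training data, while the clusterer produces unnamed groups whose number depends on the texts and the threshold.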
Luckhardt (2006) presents the following approaches:
The general linguistic approach
The morpho-syntactic approach to automatic tagging
The sublanguage approach: How can different domains be dealt with?
The semantic relations approach: towards a semantic interlingua
The semantic (text) knowledge approach: classification and thesauri and their use in NLP.
As we can see, different authors writing on approaches to automatic indexing seem to disagree on
which approaches actually exist. One way to organize the approaches is by the level of language
each one addresses.
212 LOVELY PROFESSIONAL UNIVERSITY