Page 251 - DLIS402_INFORMATION_ANALYSIS_AND_REPACKAGING

Information Analysis and Repackaging



                                         •  Average distance between documents is the average distance to this “centroid”.

                                         If DV(k) > 0, k is a good discriminator (the whole collection without “k” is less specific
                                          than before).
                                    – by statistical distribution
                                         Trivial words obey a single-Poisson distribution.
                                         Therefore, a term’s “relevance” can be estimated by comparing its distribution with the
                                          single-Poisson one: the more it deviates, the less likely the term is to be trivial.
                                         This can be done via a chi-square test.
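As a rough illustration of the chi-square comparison, the sketch below fits a single Poisson distribution to the number of times a term occurs in each document and measures how far the observed counts depart from it. The function name `poisson_chi_square`, the binning scheme (0, 1, 2 and a pooled tail of "3 or more") and the two sample lists are assumptions made for this example, not part of the original method description.

```python
import math
from collections import Counter

def poisson_chi_square(counts, max_bin=3):
    """Chi-square statistic of per-document term counts against a single-Poisson fit.

    Bins are 0, 1, ..., max_bin - 1 plus one pooled tail bin (>= max_bin).
    """
    n = len(counts)
    lam = sum(counts) / n  # Poisson rate estimated as the mean count per document
    observed = Counter(min(c, max_bin) for c in counts)
    # Poisson probabilities for the bins 0 .. max_bin - 1
    probs = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(max_bin)]
    probs.append(1.0 - sum(probs))  # probability mass of the pooled tail
    stat = 0.0
    for k, p in enumerate(probs):
        expected = n * p
        if expected > 0:
            stat += (observed.get(k, 0) - expected) ** 2 / expected
    return stat

# Hypothetical data: occurrences of one term in each of 100 documents.
trivial = [0] * 37 + [1] * 37 + [2] * 18 + [3] * 6 + [4] * 2  # spread evenly
bursty = [0] * 90 + [10] * 10                                  # concentrated in a few documents

print(poisson_chi_square(trivial))  # small: close to the Poisson fit
print(poisson_chi_square(bursty))   # large: far from the Poisson fit
```

With four bins and one estimated parameter there are two degrees of freedom, for which the usual 5% critical value is 5.991; a statistic above that suggests the term does not behave like a trivial word.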


                                 Statistical Selection of Phrases
                                 A simple approach to phrase construction/selection is based on the following principles:
                                    – Phrase terms must occur frequently together (“with less than k words in between them”)
                                    – Phrase components must be relatively frequent
                                 The resulting algorithm reads:
                                         Compute pair-wise co-occurrence within the constraints above
                                         If it is greater than a threshold value, compute
                                         \[ \text{cohesion value}(t_i, t_j) = \frac{\text{Co-occurrence Freq.}(t_i, t_j)}{\text{Freq.}(t_i) \cdot \text{Freq.}(t_j)} \]
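The phrase-selection steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: documents are already tokenised into lists of terms, frequencies are raw occurrence counts over the whole collection, and the names `cohesion_values`, `k` and `threshold` are hypothetical choices for this example.

```python
from collections import Counter

def cohesion_values(docs, k=3, threshold=1):
    """Return {(t_i, t_j): cohesion} for pairs co-occurring more than `threshold` times.

    Two terms co-occur when fewer than `k` words lie between them in a document.
    """
    term_freq = Counter()
    cooc_freq = Counter()
    for tokens in docs:
        term_freq.update(tokens)
        for i, t in enumerate(tokens):
            # Following terms with fewer than k words in between: positions i+1 .. i+k
            for u in tokens[i + 1 : i + 1 + k]:
                if u != t:
                    cooc_freq[tuple(sorted((t, u)))] += 1
    return {
        pair: count / (term_freq[pair[0]] * term_freq[pair[1]])
        for pair, count in cooc_freq.items()
        if count > threshold
    }

# Hypothetical toy collection; k=1 means only adjacent terms are paired.
docs = [
    ["information", "retrieval", "system"],
    ["information", "retrieval", "engine"],
    ["information", "retrieval"],
    ["system", "engine"],
]
phrases = cohesion_values(docs, k=1, threshold=1)
print(phrases)
```

In this toy collection only the pair ("information", "retrieval") passes the threshold, so it is the only candidate phrase returned.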
                                 Problems of Automated Thesaurus Construction
                                 It must be noted that a purely statistical analysis cannot be expected to find the exact type of semantic
                                 relationship between terms. This has much to do with the problem of natural language processing
                                 (NLP), a promising field of research that involves both Artificial Intelligence and Linguistics.

                                         A simple modification that can provide non-statistical insight into semantics is the
                                         distinction of part-of-speech. This implies “tagging” each term with its corresponding
                                         type (verb, noun, adjective, …). A program capable of doing this is called a “tagger”.

                                 Although we have seen only the statistical approach, it is possible to either bypass it or complement
                                 it with external relevance judgements. For example, given a query that divides all documents
                                 into either “relevant” or “irrelevant”, a thesaural term that appears in only one of the classes would
                                 be better for that specific query than one that appears in both. Many such approaches have been
                                 proposed and tested, mostly with good results.
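One way to picture this idea is to score a candidate thesaurus term by how unevenly it is spread over the two classes. The measure below (the absolute difference between the fractions of relevant and irrelevant documents containing the term) and the name `class_separation` are assumptions chosen for illustration; the text does not prescribe a particular formula.

```python
def class_separation(term, relevant_docs, irrelevant_docs):
    """Absolute difference between the fractions of relevant and of irrelevant
    documents that contain `term`. Each document is a set of terms."""
    def fraction(docs):
        return sum(term in doc for doc in docs) / len(docs) if docs else 0.0
    return abs(fraction(relevant_docs) - fraction(irrelevant_docs))

# Hypothetical judgements for a single query.
relevant = [{"jaguar", "cat"}, {"cat", "habitat"}]
irrelevant = [{"jaguar", "car"}, {"car", "engine"}]

print(class_separation("cat", relevant, irrelevant))     # appears only in the relevant class
print(class_separation("jaguar", relevant, irrelevant))  # appears equally in both classes
```

A term found in only one class scores 1.0, while a term spread equally over both classes scores 0.0 and would add little for that query.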
                                 The problem with relevance judgements is their availability: as yet, only humans can produce them,
                                 and therefore they are scarce for most collections. A truly automatically generated thesaurus should
                                 not depend on such judgements.
                                 Another problem in automated thesaurus construction is verification: how can the performance of
                                 a thesaurus be tested? Usually, the ability of a thesaurus to retrieve more relevant documents during
                                 a search is used as an indicator of thesaurus quality.
                                 PhraseFinder
                                 PhraseFinder is an automatically generated thesaurus that is integrated within a retrieval engine,
                                 InQuery. InQuery is part of the TIPSTER project in the Information Retrieval Laboratory of the
                                 Computer Science Department, University of Massachusetts, Amherst.



            246                              LOVELY PROFESSIONAL UNIVERSITY