
Unit 6: Information Retrieval Model and Search Strategies




              •  A document collection
              •  A test suite of information needs, expressible as queries
              •  A set of relevance judgments, normally a binary assessment of either relevant or non-relevant
                 for each query-document pair.
            The standard approach to information retrieval system evaluation revolves around the notion of
            relevant and non-relevant documents. With respect to a user information need, a document in the
            test collection is given a binary classification as either relevant or non-relevant. This decision is
            referred to as the gold standard or ground truth judgment of relevance. The test document collection
            and suite of information needs have to be of a reasonable size: you need to average performance
            over fairly large test sets, as results are highly variable over different documents and information
            needs. As a rule of thumb, 50 information needs have usually been found to be a sufficient minimum.
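
            As a rough illustration of how these three components fit together, the sketch below (Python; all
            identifiers and data are hypothetical, not drawn from any real test collection) encodes a tiny document
            collection, one information need expressed as a query, and binary gold-standard judgments, and then
            computes a simple precision figure for a retrieved list.

            # Minimal sketch of a test collection: documents, information needs
            # (as queries), and binary relevance judgments. All data is hypothetical.

            documents = {
                "d1": "red wine may reduce the risk of heart attack",
                "d2": "white wine tasting notes from the vineyard",
                "d3": "the python programming language tutorial",
            }

            queries = {
                "q1": "wine red white heart attack effective",
            }

            # Gold-standard (ground truth) judgments: 1 = relevant, 0 = non-relevant
            # for each query-document pair.
            qrels = {
                ("q1", "d1"): 1,
                ("q1", "d2"): 0,
                ("q1", "d3"): 0,
            }

            def precision(query_id, retrieved_ids):
                """Fraction of the retrieved documents that are judged relevant."""
                if not retrieved_ids:
                    return 0.0
                relevant = sum(qrels.get((query_id, doc_id), 0) for doc_id in retrieved_ids)
                return relevant / len(retrieved_ids)

            print(precision("q1", ["d1", "d2"]))  # 0.5: one of the two retrieved documents is relevant

            In a real evaluation the document collection and query set would of course be far larger, for the
            reasons given above.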

            Relevance Assessment
            Relevance is judged relative to an information need, not a query. For example, an information need
            might be:
            Information on whether drinking red wine is more effective at reducing your risk of heart attacks
            than white wine.
            This might be translated into a query such as:
            wine AND red AND white AND heart AND attack AND effective
            A document is relevant if it addresses the stated information need, not because it just happens to
            contain all the words in the query. This distinction is often misunderstood in practice, because the
            information need is not overt; but, nevertheless, an information need is present. If a user types
            python into a web search engine, they might want to know where they can purchase a pet python,
            or they might want information on the programming language Python.
            From a one-word query, it is very difficult for a system to know what the information need is;
            nevertheless, the user has one, and can judge the returned results on the basis of their relevance to
            it. To evaluate a system, we require an overt expression of an information need, which can be used
            for judging returned documents as relevant or non-relevant. At this point, we make a simplification:
            although relevance can reasonably be thought of as a scale, with some documents highly relevant
            and others only marginally so, for the moment we will use just a binary decision of relevance.
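
            To make the distinction concrete, the short sketch below (Python, hypothetical text and labels) shows
            a document that contains every query term yet is judged non-relevant by the gold standard, because it
            does not address the wine and heart-attack information need.

            # A document can contain all the query terms without addressing the
            # information need; relevance comes from a human judgment, not from
            # term matching. The text and labels here are hypothetical.

            query_terms = {"wine", "red", "white", "heart", "attack", "effective"}

            doc_text = ("At the tasting, a red wine and a white wine were judged effective "
                        "crowd pleasers; the band played Heart and the event was an attack "
                        "on boredom.")

            contains_all_terms = query_terms.issubset(doc_text.lower().split())
            gold_judgment = 0   # human assessor: non-relevant to the stated need

            print(contains_all_terms)    # True: every query word occurs in the document
            print(bool(gold_judgment))   # False: it does not address the information need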
            Many systems contain various weights (often known as parameters) that can be adjusted to tune
            system performance. It is wrong to report results on a test collection that were obtained by tuning
            these parameters to maximize performance on that same collection: such tuning overstates the
            expected performance of the system, because the weights will have been set to maximize performance
            on one particular set of queries rather than on a random sample of queries. In such cases, the correct
            procedure is to have one or more development test collections, and to tune the parameters on the
            development test collection. The tester then runs the system with those weights on the test collection
            and reports the results on that collection as an unbiased estimate of performance.
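
            This procedure can be sketched in a few lines of Python; evaluate(), the weight grid, and the collection
            names below are hypothetical placeholders rather than part of any real system:

            # Sketch of tuning on a development collection and reporting on a
            # held-out test collection. evaluate() stands in for running the system
            # and scoring its output against a collection's relevance judgments.

            def evaluate(weight, collection):
                """Placeholder effectiveness score for the system run with this
                weight on the given query collection."""
                # A real implementation would run retrieval and compare its output
                # with the gold-standard judgments; here we fake a peak at 0.5.
                return 1.0 - abs(weight - 0.5)

            dev_collection = "development queries + relevance judgments"    # hypothetical
            test_collection = "held-out test queries + relevance judgments"  # hypothetical

            # 1. Tune: pick the weight that maximizes performance on the development set.
            candidate_weights = [0.1, 0.3, 0.5, 0.7, 0.9]
            best_weight = max(candidate_weights, key=lambda w: evaluate(w, dev_collection))

            # 2. Report: run once, with that fixed weight, on the unseen test collection.
            reported_score = evaluate(best_weight, test_collection)
            print(best_weight, reported_score)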




                     Make a study of web information retrieval projects.



            6.6  Summary

              •  Information retrieval (IR) is the area of study concerned with searching for documents, for
                 information within documents, and for metadata about documents, as well as that of searching
                 relational databases and the World Wide Web.




