Information Analysis and Repackaging

                                 An object is an entity that is represented by information in a database. User queries are matched
                                 against the database information. Depending on the application the data objects may be, for example,
                                 text documents, images, audio, mind maps or videos. Often the documents themselves are not kept
                                 or stored directly in the IR system, but are instead represented in the system by document surrogates
                                 or metadata.
                                 Most IR systems compute a numeric score for how well each object in the database matches the query,
                                 and rank the objects according to this value. The top-ranking objects are then shown to the user. The
                                 process may then be iterated if the user wishes to refine the query.
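The score-and-rank loop described above can be sketched in a few lines. This is only an illustration: the scoring rule (fraction of query terms found in a surrogate), the function names `score` and `rank`, and the three sample surrogates are assumptions made for the example, not a method taken from the text.

```python
def score(query: str, surrogate: str) -> float:
    """Toy score (an assumption): fraction of query terms found in the surrogate."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    surrogate_terms = set(surrogate.lower().split())
    return len(query_terms & surrogate_terms) / len(query_terms)


def rank(query: str, surrogates: dict[str, str], top_k: int = 3) -> list[tuple[str, float]]:
    """Score every object, sort by descending score, and keep the top_k."""
    scored = [(doc_id, score(query, text)) for doc_id, text in surrogates.items()]
    # Ties are broken by document id so the ordering is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical document surrogates (short descriptions standing in for full documents).
surrogates = {
    "d1": "introduction to information retrieval",
    "d2": "image retrieval with neural networks",
    "d3": "cooking with seasonal vegetables",
}

results = rank("information retrieval", surrogates, top_k=2)
print(results)
```

Refining the query amounts to calling `rank` again with the new query string; real systems replace the toy `score` with a proper weighting scheme.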

                                 Performance and Correctness Measures

                                 Many different measures for evaluating the performance of information retrieval systems have been
                                 proposed. The measures require a collection of documents and a query. All common measures
                                 described here assume a ground truth notion of relevancy: every document is known to be either
                                 relevant or non-relevant to a particular query. In practice queries may be ill-posed and there may be
                                 different shades of relevancy.

                                 Precision is the fraction of the documents retrieved that are relevant to the user’s information need.
                                 \[
                                 \text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}
                                 \]
                                 In binary classification, precision is analogous to positive predictive value. Precision takes all
                                 retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only
                                 the topmost results returned by the system. This measure is called precision at n or P@n.
                                 Note that the meaning and usage of “precision” in the field of Information Retrieval differs from
                                 the definition of accuracy and precision within other branches of science and technology.
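As a minimal sketch of the precision definition and of precision at n, the following treats the retrieved and relevant documents as sets of identifiers. The function names, the sample identifiers, and the choice to return 0.0 when nothing is retrieved are assumptions for illustration; when fewer than n results exist, this sketch divides by the number actually returned, whereas some evaluations divide by n.

```python
def precision(retrieved, relevant) -> float:
    """|relevant ∩ retrieved| / |retrieved|; 0.0 for an empty result (assumed convention)."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & set(relevant)) / len(retrieved_set)


def precision_at_n(ranked, relevant, n: int) -> float:
    """Precision computed over only the n topmost results of a ranked list."""
    return precision(ranked[:n], relevant)


# Hypothetical ranked result list (best first) and ground-truth relevant set.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

print(precision(ranked, relevant))          # 2 relevant among 5 retrieved
print(precision_at_n(ranked, relevant, 2))  # 1 relevant among the top 2
```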

                                 Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.
                                 \[
                                 \text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}
                                 \]
                                 In binary classification, recall is called sensitivity. It can be interpreted as the probability that a
                                 relevant document is retrieved by the query.
                                 It is trivial to achieve recall of 100% by returning all documents in response to any query. Recall
                                 alone is therefore not enough; one must also account for the non-relevant documents retrieved,
                                 for example by computing the precision.
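The point about trivially perfect recall can be checked with a small sketch. The ten-document collection, the identifiers, and the function names are made up for illustration; returning 0.0 when a set is empty is an assumed convention.

```python
def recall(retrieved, relevant) -> float:
    """|relevant ∩ retrieved| / |relevant|; 0.0 if nothing is relevant (assumed convention)."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(relevant_set & set(retrieved)) / len(relevant_set)


def precision(retrieved, relevant) -> float:
    """|relevant ∩ retrieved| / |retrieved|; 0.0 for an empty result (assumed convention)."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & set(relevant)) / len(retrieved_set)


# Hypothetical collection of ten documents, three of which are relevant.
collection = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d2", "d4"}

# An ordinary result list finds two of the three relevant documents.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
print(recall(retrieved, relevant))

# Returning the whole collection gives perfect recall but poor precision.
print(recall(collection, relevant), precision(collection, relevant))
```

Reporting the two numbers together is what exposes the "return everything" strategy: its recall is 1.0 while its precision drops to 3 out of 10.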

                                 Fall-out is the proportion of non-relevant documents that are retrieved, out of all non-relevant
                                 documents available.
                                 \[
                                 \text{fall-out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|}
                                 \]

                                 In binary classification, fall-out equals 1 − specificity, that is, the false positive rate. It can be
                                 interpreted as the probability that a non-relevant document is retrieved by the query.
                                 It is trivial to achieve fall-out of 0% by returning zero documents in response to any query.
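A short sketch of fall-out follows. Unlike precision and recall, it needs the whole collection, because the non-relevant set is everything in the collection that is not relevant. The collection, the identifiers, and the function name are assumptions for illustration.

```python
def fall_out(retrieved, relevant, collection) -> float:
    """|non-relevant ∩ retrieved| / |non-relevant|, with non-relevant = collection - relevant."""
    non_relevant = set(collection) - set(relevant)
    if not non_relevant:
        return 0.0  # assumed convention when every document is relevant
    return len(non_relevant & set(retrieved)) / len(non_relevant)


# Hypothetical collection of ten documents, three relevant and seven non-relevant.
collection = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d2", "d4"}
retrieved = ["d3", "d1", "d7", "d2", "d9"]

value = fall_out(retrieved, relevant, collection)
print(value)      # 3 of the 7 non-relevant documents were retrieved
print(1 - value)  # the corresponding specificity

# Returning nothing trivially drives fall-out to zero.
print(fall_out([], relevant, collection))
```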

