Page 75 - DLIS405_INFORMATION_STORAGE_AND_RETRIEVAL

Information Storage and Retrieval



                                Text Representation Approaches

                                We have tested two kinds of text representation approaches for TC: a bag of words model and a
                                concept indexing model. Given that the TC problem is more oriented to genre than to subject, we
                                have considered four possibilities for the bag of words model: using a stoplist and stemming,
                                using a stoplist only, using stemming only, and finally using the words as they are (without
                                stoplist or stemming). These approaches are coded below as BSS, BSN, BNS, and BNN respectively.
                                The motivation for considering these four approaches is that words occurring in a stoplist (e.g.
                                prepositions) and original word forms (e.g. past-tense suffixes such as “-ed”) can be good
                                indicators of different text genres.

                                The three concept indexing approaches considered in our experiments differ in the level of
                                disambiguation: using the correct word sense (CD), using the first sense for the given part of
                                speech (CF), and using all the senses for the given part of speech (CA). We have weighted terms
                                and synsets in documents using the popular TF.IDF formula from the Vector Space Model.
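The four bag of words settings and the TF.IDF weighting can be illustrated with a minimal sketch. It is not the authors' implementation: the stoplist, the suffix-stripping `crude_stem` function (a stand-in for a real stemmer such as Porter's), the helper names, and the particular weighting variant tf * log(N / df) are all assumptions chosen for brevity.

```python
import math
from collections import Counter

# Tiny illustrative stoplist (assumption); a real one would be far longer.
STOPLIST = {"the", "a", "an", "of", "in", "on", "to", "and", "was", "is"}

def crude_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def represent(text, use_stoplist, use_stemming):
    """Turn a text into a list of index terms under one of the four settings."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    if use_stoplist:
        tokens = [t for t in tokens if t not in STOPLIST]
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens

# Codes from the text: (use stoplist?, use stemming?)
VARIANTS = {
    "BSS": (True, True),
    "BSN": (True, False),
    "BNS": (False, True),
    "BNN": (False, False),
}

def tf_idf(docs_terms):
    """Weight each term in each document by tf * log(N / df)."""
    n_docs = len(docs_terms)
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))  # count each term once per document
    weights = []
    for terms in docs_terms:
        tf = Counter(terms)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["the reporter walked in the rain", "a report on walking"]
for code, (stop, stem) in VARIANTS.items():
    print(code, [represent(d, stop, stem) for d in docs])
```

Note that under BNN the function words and the “-ed” suffix survive as features, which is the property the text argues may help with genre. A term that occurs in every document receives weight zero under this TF.IDF variant.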

                                Conclusion

                                In general, we have not been able to prove that concept indexing is better than the bag of words
                                model for TC. However, our results must be examined with care, for at least two reasons: (1)
                                the lack of enough training data: stronger evidence is obtained as the number of documents
                                increases, and the behavior of some algorithms (Naive Bayes, kNN) is surprising; and (2) the
                                genre identification problem is such that the meaning of the words used is no more relevant than
                                other text features, including structure, capitalization, tense, punctuation, etc. In other
                                words, this study needs to be extended to a more populated, subject-oriented TC test collection
                                (e.g. Reuters-21578). The work by Petridis et al., which adds evidence of concept indexing
                                outperforming the bag of words model on the SemCor collection, especially with SVM, and the work
                                by Fukumoto and Suzuki with SVM on Reuters-21578 suggest that such a study is well motivated
                                and promising.

                                Concept Indexing for Production Databases

                                The study summarized here had two objectives. The first was to explore the feasibility of using
                                the National Library of Medicine’s Unified Medical Language System (UMLS) Metathesaurus as the
                                basis for a computational strategy to identify concepts in medical narrative text preparatory to
                                indexing. The second was to quantitatively evaluate this strategy in terms of true positives,
                                false positives (spuriously identified concepts) and false negatives (concepts missed by the
                                identification process).
                                Methods

                                Using the 1999 UMLS Metathesaurus, the authors processed a training set of 100 documents (50
                                discharge summaries, 50 surgical notes) with a concept-identification program, whose output
                                was manually analyzed. They flagged concepts that were erroneously identified and added new
                                concepts that were not identified by the program, recording the reason for failure in such cases.
                                After several refinements to both their algorithm and the UMLS subset on which it operated, they
                                deployed the program on a test set of 24 documents (12 of each kind).
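The passage does not describe how the concept-identification program works internally. As a rough illustration of what dictionary-based concept matching can look like, the sketch below uses a greedy longest-match lookup of token sequences in a small table. The table `CONCEPTS`, its identifiers (made-up placeholders, not real UMLS identifiers), the function `find_concepts`, and the sample note are all hypothetical.

```python
# Hypothetical term-to-concept table; the identifiers are made-up
# placeholders, not real UMLS concept identifiers.
CONCEPTS = {
    ("chest", "pain"): "X001",
    ("pain",): "X002",
    ("myocardial", "infarction"): "X003",
    ("aspirin",): "X004",
}
MAX_LEN = max(len(key) for key in CONCEPTS)

def find_concepts(text):
    """Greedy longest-match lookup of token sequences in CONCEPTS.

    Returns a list of (start_index, matched_phrase, concept_id) tuples.
    """
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i : i + n])
            if key in CONCEPTS:
                matches.append((i, " ".join(key), CONCEPTS[key]))
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return matches

note = "Patient reports chest pain; history of myocardial infarction. Given aspirin."
print(find_concepts(note))
```

Even this toy version hints at the failure modes a manual review would flag: a misspelled term or an abbreviation absent from the table is silently missed, and an ambiguous short term can match the wrong concept.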
                                Results
                                Of 8,745 matches in the training set, 7,227 (82.6 percent) were true positives, whereas of 1,701
                                matches in the test set, 1,298 (76.3 percent) were true positives. Matches other than true positives
                                indicated potential problems in production-mode concept indexing. Examples of causes of problems
                                were redundant concepts in the UMLS, homonyms, acronyms, abbreviations and elisions, concepts
                                that were missing from the UMLS, proper names, and spelling errors.
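The reported percentages follow directly from the counts given above, as the short check below shows. The function name `true_positive_share` is an illustrative choice. The passage does not break the remaining matches down by error type, so the sketch only computes how many matches were not true positives.

```python
def true_positive_share(true_positives, matches):
    """Fraction of reported matches that were judged correct."""
    return true_positives / matches

training = true_positive_share(7227, 8745)
test = true_positive_share(1298, 1701)
print(f"training: {training:.1%}, test: {test:.1%}")

# Matches that were not counted as true positives in each set.
print(8745 - 7227, 1701 - 1298)
```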
                                Conclusions
                                The error rate was too high for concept indexing to be the only production-mode means of
                                preprocessing medical narrative. Considerable curation needs to be performed to define a UMLS
                                subset that is suitable for concept matching.



          70                               LOVELY PROFESSIONAL UNIVERSITY