Page 75 - DLIS405_INFORMATION_STORAGE_AND_RETRIEVAL
Information Storage and Retrieval
Text Representation Approaches
We tested two kinds of text representation for text categorization (TC): a bag of words model and a
concept indexing model. Because this TC problem is oriented more to genre than to subject, we
considered four variants of the bag of words model: using a stoplist and stemming, using a stoplist
only, using stemming only, and using the raw words (neither stoplist nor stemming).
These variants are coded below as BSS, BSN, BNS, and BNN respectively. The motivation for
considering all four is that words occurring in a stoplist (e.g. prepositions) and
original word forms (e.g. past-tense suffixes such as “-ed”) can be good indicators of different text genres.
The three concept indexing approaches considered in our experiments differ in the level of
disambiguation: using the correct word sense (CD), using the first sense for the given part of
speech (CF), and using all the senses for the given part of speech (CA). We weighted terms and
synsets in documents using the popular TF.IDF formula from the Vector Space Model.
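The four bag of words variants and the TF.IDF weighting can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny stoplist, the crude suffix stripper (a stand-in for a real stemmer), the whitespace tokenizer, the example sentences, and the plain tf × log(N/df) form of TF.IDF. The source names the formula but not the exact variant or the resources used.

```python
import math
from collections import Counter

# Toy stoplist and suffix list: illustrative assumptions, not the authors' resources.
STOPLIST = {"the", "a", "of", "in", "and", "to", "was", "is"}
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text, use_stoplist, use_stemming):
    """Lowercase, split on whitespace, then optionally filter and stem."""
    tokens = text.lower().split()
    if use_stoplist:
        tokens = [t for t in tokens if t not in STOPLIST]
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens

# The four variants named in the text: (use_stoplist, use_stemming).
VARIANTS = {
    "BSS": (True, True),
    "BSN": (True, False),
    "BNS": (False, True),
    "BNN": (False, False),
}

def tf_idf(docs_tokens):
    """Weight each term as tf * log(N / df); one assumed TF.IDF variant."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    weights = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["The minister announced the budget", "She walked in the park and laughed"]
for name, (use_stoplist, use_stemming) in VARIANTS.items():
    vectors = tf_idf([tokenize(d, use_stoplist, use_stemming) for d in docs])
    print(name, sorted(vectors[0]))
```

In the BNN variant the word "the" is kept but receives weight zero here because it occurs in both toy documents, whereas BSS removes it and also reduces "announced" to a stem, discarding the "-ed" suffix that may signal genre.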
Conclusion
In general, we have not been able to prove that concept indexing is better than the bag of words
model for TC. However, our results must be examined with care, for at least two reasons: (1)
the lack of enough training data: stronger evidence is obtained as the number of documents increases,
and the behavior of some algorithms (Naive Bayes, kNN) is surprising; and (2) in the genre identification
problem, the meaning of the words used is no more relevant than other text features, including
structure, capitalization, tense, punctuation, etc. In other words, this study needs to be extended to a
more populated, subject-oriented TC test collection (e.g. Reuters-21578). The work by Petridis et al.
adds evidence of concept indexing outperforming the bag of words model on the SemCor collection,
especially with SVM, and the work of Fukumoto and Suzuki with SVM on Reuters-21578 suggests that
such a study is well motivated and promising.
Concept Indexing for Production Databases
The objectives were to explore the feasibility of using the National Library of Medicine’s Unified
Medical Language System (UMLS) Metathesaurus as the basis for a computational strategy to identify
concepts in medical narrative text preparatory to indexing, and to quantitatively evaluate this strategy
in terms of true positives, false positives (spuriously identified concepts) and false negatives
(concepts missed by the identification process).
Methods
Using the 1999 UMLS Metathesaurus, the authors processed a training set of 100 documents (50
discharge summaries, 50 surgical notes) with a concept-identification program, whose output
was manually analyzed. They flagged concepts that were erroneously identified and added new
concepts that were not identified by the program, recording the reason for failure in such cases.
After several refinements to both their algorithm and the UMLS subset on which it operated, they
deployed the program on a test set of 24 documents (12 of each kind).
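The source does not describe how the concept-identification program matched text against the Metathesaurus. As a rough illustration of the general idea only, the following Python sketch performs a greedy longest-match lookup of word sequences in a small term table. The table, its placeholder identifiers (which are not real UMLS concept identifiers), the function name, and the matching strategy are all assumptions.

```python
# Toy term-to-concept table; identifiers are made-up placeholders, not real UMLS entries.
TOY_CONCEPTS = {
    ("myocardial", "infarction"): "TOY001",
    ("infarction",): "TOY002",
    ("chest", "pain"): "TOY003",
    ("pain",): "TOY004",
}
MAX_PHRASE_LEN = max(len(term) for term in TOY_CONCEPTS)

def identify_concepts(text):
    """Greedy longest-match lookup of word sequences in the toy table."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    matches = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for length in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            phrase = tuple(words[i : i + length])
            if phrase in TOY_CONCEPTS:
                matches.append((" ".join(phrase), TOY_CONCEPTS[phrase]))
                i += length
                break
        else:
            # No phrase starting at this word is in the table; move on.
            i += 1
    return matches

print(identify_concepts("Patient admitted with chest pain, ruled out myocardial infarction."))
```

Preferring the longest phrase keeps "myocardial infarction" from also being reported as the shorter concept "infarction". A lookup this simple has no way to handle abbreviations, misspellings, or terms absent from the table.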
Results
Of 8,745 matches in the training set, 7,227 (82.6 percent) were true positives, whereas of 1,701
matches in the test set, 1,298 (76.3 percent) were true positives. Matches other than true positives
indicated potential problems in production-mode concept indexing. Examples of causes of problems
were redundant concepts in the UMLS, homonyms, acronyms, abbreviations and elisions, concepts
that were missing from the UMLS, proper names, and spelling errors.
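The reported percentages are the share of matches that were true positives, which can be checked with a few lines of Python. The function name is hypothetical; the counts are those given in the text.

```python
def percent_true_positives(true_positives, matches):
    """Share of matches that were true positives, as a percentage with one decimal."""
    return round(100 * true_positives / matches, 1)

training_pct = percent_true_positives(7227, 8745)
test_pct = percent_true_positives(1298, 1701)

# Matches that were not true positives, i.e. potential problems.
training_other = 8745 - 7227
test_other = 1701 - 1298

print(training_pct, test_pct, training_other, test_other)
```

Both percentages agree with the reported 82.6 and 76.3 percent, leaving 1,518 training-set matches and 403 test-set matches that were not true positives.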
Conclusions
The error rate was too high for concept indexing to be the only production-mode means of
preprocessing medical narrative. Considerable curation needs to be performed to define a UMLS
subset that is suitable for concept matching.
70 LOVELY PROFESSIONAL UNIVERSITY