Clustering acronyms in biomedical text for disambiguation

Naoaki Okazaki, Sophia Ananiadou

Research output: Contribution to conference › Paper

5 Citations (Scopus)

Abstract

Given the increasing number of neologisms in biomedicine (names of genes, diseases, molecules, etc.), the rate of acronym use in the literature is also increasing. Existing acronym dictionaries cannot keep up with the rate of new creations. Thus, discovering and disambiguating acronyms and their expanded forms are essential aspects of text mining and terminology management. We present a method for clustering long forms identified by an acronym recognition method. Applying the acronym recognition method to MEDLINE abstracts, we obtained a list of short/long forms. The recognized short/long forms were classified by a biologist to construct an evaluation set for clustering sets of similar long forms. We observed five types of term variation in the evaluation set (i.e., orthographic, morphological, syntactic, and lexico-semantic variants, and nested abbreviations) and defined four similarity measures to gather similar long forms. Complete-link clustering with the four similarity measures achieved 87.5% precision and 84.9% recall on the evaluation set.
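The clustering step described in the abstract can be illustrated with a minimal Python sketch of complete-link agglomerative clustering over long forms. This is not the authors' implementation: the abstract does not define the four similarity measures, so the `similarity` function here is a stand-in based on character-level matching after light normalization, and the names `normalize`, `similarity`, `complete_link_clusters`, and the `threshold` value of 0.8 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(long_form: str) -> str:
    """Lowercase, treat hyphens as spaces, and collapse whitespace (assumed normalization)."""
    return " ".join(long_form.lower().replace("-", " ").split())


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the paper's four measures are not reproduced here."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def complete_link_clusters(items, threshold=0.8):
    """Merge clusters greedily while the best complete-link score stays at or above threshold."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete link: two clusters are only as similar as their least similar pair.
            score = min(similarity(a, b) for a in clusters[i] for b in clusters[j])
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break
        i, j = best_pair  # i < j, so deleting index j leaves index i valid
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical long forms for the acronym "ER".
    forms = [
        "estrogen receptor",
        "oestrogen receptor",
        "estrogen receptors",
        "endoplasmic reticulum",
        "Endoplasmic-reticulum",
    ]
    for cluster in complete_link_clusters(forms):
        print(cluster)
```

Under these assumptions, the example forms should separate into an "estrogen receptor" group and an "endoplasmic reticulum" group. Complete link is the conservative choice here: a long form joins a cluster only if it is similar to every member, which helps keep distinct senses of the same acronym apart.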

Original language: English
Pages: 959-962
Number of pages: 4
Publication status: Published - 2006 Jan 1
Event: 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy
Duration: 2006 May 22 to 2006 May 28

Other

Other: 5th International Conference on Language Resources and Evaluation, LREC 2006
Country: Italy
City: Genoa
Period: 06/5/22 to 06/5/28

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics

Cite this

Okazaki, N., & Ananiadou, S. (2006). Clustering acronyms in biomedical text for disambiguation. 959-962. Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy.