Semi-supervised lexicon mining from parenthetical expressions in monolingual web pages

Xianchao Wu, Naoaki Okazaki, Jun'ichi Tsujii

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

This paper presents a semi-supervised learning framework for mining Chinese-English lexicons from a large number of Chinese Web pages. The work is motivated by the observation that many Chinese neologisms are accompanied by their English translations in parentheses. We classify parenthetical translations into bilingual abbreviations, transliterations, and translations. A frequency-based term recognition approach is applied to extract bilingual abbreviations. A self-training algorithm is proposed for mining transliteration and translation lexicons; it exploits available lexicons at the morpheme level, i.e., phoneme correspondences in transliteration and grapheme (e.g., suffix, stem, and prefix) correspondences in translation. The experimental results verified the effectiveness of our approaches.
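
As an illustration only, the parenthetical observation above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the regular expression, the function names `candidate_pairs` and `count_pairs`, the `max_len` parameter, and the sample sentences are all hypothetical. Because the left boundary of the Chinese term is not marked in the text, the sketch simply counts every suffix of the preceding Chinese run as a candidate.

```python
import re
from collections import Counter

# Assumed pattern (not from the paper): a run of CJK ideographs followed by
# an ASCII expression inside half-width or full-width parentheses.
PAREN = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z0-9 .\-]*?)\s*[)）]"
)

def candidate_pairs(text):
    """Yield (chinese_run, english) pairs for each parenthetical in text."""
    for match in PAREN.finditer(text):
        yield match.group(1), match.group(2)

def count_pairs(pages, max_len=4):
    """Count (chinese_suffix, english) candidates across pages.

    Every suffix of the Chinese run up to max_len characters is counted,
    since the true left boundary of the term is unknown.
    """
    counts = Counter()
    for page in pages:
        for zh, en in candidate_pairs(page):
            for k in range(1, min(max_len, len(zh)) + 1):
                counts[(zh[-k:], en)] += 1
    return counts

# Hypothetical sample sentences, for illustration only.
pages = [
    "我们使用支持向量机(SVM)进行分类",
    "支持向量机（SVM）是一种模型",
]
counts = count_pairs(pages, max_len=6)
```

Raw counts alone do not settle the term boundary, since shorter suffixes occur at least as often as the full term. A frequency-based term recognition score of the kind the abstract mentions would be needed to rank the candidates; that scoring step is not shown here.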

Original language: English
Title of host publication: NAACL HLT 2009 - Human Language Technologies
Subtitle of host publication: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 424-432
Number of pages: 9
ISBN (Print): 9781932432411
Publication status: Published - 2009
Event: Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2009 - Boulder, CO, United States
Duration: 2009 May 31 to 2009 Jun 5

Publication series

Name: NAACL HLT 2009 - Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Conference

Other

Other: Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2009
Country: United States
City: Boulder, CO
Period: 09/5/31 to 09/6/5

ASJC Scopus subject areas

  • Language and Linguistics
  • Social Sciences (miscellaneous)
