Noisy-channel spelling correction models for Estonian learner language corpus lemmatisation

Kairit Sirts

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review

4 Citations (Scopus)

Abstract

Morphological analysis is an important task in Estonian learner language studies because it provides information about the words and forms that learners use. Learner texts frequently contain spelling errors, and a conventional morphological analyser fails to find the correct analysis for misspelled words, so these texts should undergo an error correction step before the analyser is applied. In this paper we compare several spelling correction models with the aim of improving the lemmatisation accuracy of learner language texts. Experiments show that the simplest non-word noisy-channel spelling correction model, combined with a disambiguation model applied on top of the morphological analyser output, performs best, while some of the more complicated models fail to beat even the baseline that includes no spelling correction.
Original language: English
Title of host publication: Human language technologies
Subtitle of host publication: the Baltic perspective : proceedings of the Fifth International Conference Baltic HLT 2012
Editors: Arvi Tavast, Kadri Muischnek, Mare Koit
Place of Publication: Amsterdam
Publisher: IOS Press
Pages: 213-220
Number of pages: 8
ISBN (Print): 9781614991328
DOIs
Publication status: Published - 2012
Externally published: Yes
Event: Baltic Conference on Human Language Technologies (5th : 2012) - Tartu, Estonia
Duration: 4 Oct 2012 → 5 Oct 2012

Publication series

Name: Frontiers in artificial intelligence and applications
Publisher: IOS Press
Volume: 247
ISSN (Print): 0922-6389

Conference

Conference: Baltic Conference on Human Language Technologies (5th : 2012)
City: Tartu, Estonia
Period: 4/10/12 → 5/10/12

Keywords

  • spelling correction
  • learner languages analysis
  • lemmatisation
